Use parallelized Dask methods to find SMILES and InChIKeys for molecules
Published
September 18, 2020
A Dask implementation for retrieving CanonicalSMILES from PubChem through the PubChem API. At the end of the notebook there is a second Dask-based implementation that uses RDKit to get the InChIKey from each SMILES string. Dask is not strictly necessary for the InChIKeys, but that case makes for a much more elegant use of dask.dataframe and map_partitions.
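import time
import pubchempy as pcp
from pubchempy import Compound, get_compounds
import pandas as pd
import numpy as np
import re
import copy

# Dask imports needed by the code that follows
import dask.dataframe as dd
from dask import delayed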
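def get_smile(cmpd_name):
    """Look up the CanonicalSMILES for a compound name; return 'X' on failure."""
    try:
        # Call PubChem directly. map_partitions already runs each partition as
        # its own Dask task, and wrapping this call in delayed() would leave a
        # lazy Delayed object in the column instead of a SMILES string.
        props = pcp.get_properties(['CanonicalSMILES'], cmpd_name, 'name')
        time.sleep(5)  # pause between requests (presumably to throttle calls to PubChem)
        smile = props[0]['CanonicalSMILES']
    except Exception:
        smile = 'X'  # sentinel for names that could not be resolved
    print(cmpd_name, smile)  # was cta_name, which is undefined in this scope
    return smile


def dask_smiles(df):
    """Add a CanonicalSMILES column to one partition (a pandas DataFrame)."""
    df['CanonicalSMILES'] = df['CTA'].map(get_smile)
    return df

# map_partitions works here, but not with to_list() in the previous implementation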
The InChIKey implementation is, in my opinion, a more elegant use of Dask's apply, which wraps the conventional pandas apply. Here we also define the meta keyword for the output, since Dask does not seem to recognise on its own the type of entries we expect in the final result.
More information about meta here: https://docs.dask.org/en/latest/dataframe-api.html
inchi is a new pandas-style series that holds the delayed task graph for computing the InChIKeys. We can compute it directly in the results dataframe as a new column, which is slightly different from the SMILES implementation above.