Get SMILES from PubChem using DASK

Use parallelized DASK methods to find SMILES strings and InChIKeys for molecules
Published

September 18, 2020

A Dask implementation for acquiring CanonicalSMILES from PubChem through the PubChem API. At the end of the notebook there is a second Dask-based implementation that uses RDKit to get the InChIKey from each SMILES string. While Dask is not strictly required for the InChIKey step, it is a much more elegant use of dask.dataframe and map_partitions.

import time
import pubchempy as pcp
from pubchempy import Compound, get_compounds
import pandas as pd
import numpy as np
import re
import copy

Get SMILES from PubChem

Update: Parallelized using dask

df_100 = pd.read_csv('./DASK_SMILES/sample_chemical_names.csv', sep=',', header=0)
df_100.shape
(147, 1)
from dask.distributed import Client, progress
import dask.dataframe as dd
from dask import delayed, compute
from dask.multiprocessing import get
client = Client()
client

Client

Cluster

  • Workers: 4
  • Cores: 8
  • Memory: 39.85 GB
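def get_smile(cmpd_name):
    try:
        # delayed(f)(*args) defers the PubChem request until compute() is called
        name = delayed(pcp.get_properties)(['CanonicalSMILES'], cmpd_name, 'name')
        time.sleep(5)  # wait 5 s before moving on to the next compound
        smile = name[0]['CanonicalSMILES']  # indexing a Delayed object stays lazy
    except Exception:
        smile = 'X'
        print(cmpd_name, smile)
    return smile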

def dask_smiles(df):
    df['CanonicalSMILES'] = df['CTA'].map(get_smile)
    return df  # map_partitions works here, but not with to_list() in the previous implementation
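
Because `delayed` defers the PubChem call, an exception raised by the lookup itself surfaces only at compute time, outside the `try` block in `get_smile`. One possible way to keep the 'X' fallback is to put the `try`/`except` inside the function that gets delayed. The sketch below assumes a hypothetical `safe_smiles` helper and an offline stand-in, `offline_lookup`, in place of the real PubChem request so that it runs without network access; neither name is part of the original notebook.

```python
from dask import delayed, compute

def safe_smiles(cmpd_name, lookup):
    """Run the lookup inside the task so that failures are caught there."""
    try:
        return lookup(cmpd_name)[0]["CanonicalSMILES"]
    except Exception:
        return "X"

def offline_lookup(name):
    # Hypothetical stand-in for the PubChem request (assumption for this sketch).
    table = {"Methane": "C"}
    return [{"CanonicalSMILES": table[name]}]  # raises KeyError for unknown names

# One lazy task per compound; nothing runs until compute() is called.
tasks = [delayed(safe_smiles)(n, offline_lookup) for n in ["Methane", "not-a-compound"]]
smiles = compute(*tasks, scheduler="threads")
print(smiles)  # ('C', 'X')
```

With the real service, `lookup` could presumably be a small wrapper around `pcp.get_properties(['CanonicalSMILES'], name, 'name')`, as used in `get_smile` above.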
df_dask = dd.from_pandas(df_100, npartitions=10)
df_dask
Dask DataFrame Structure:
                   CTA
npartitions=10
0               object
15                 ...
...                ...
135                ...
146                ...
Dask Name: from_pandas, 10 tasks

df_dask.visualize()

%time ddf_out = df_dask.map_partitions(dask_smiles)
CPU times: user 567 ms, sys: 92.3 ms, total: 660 ms
Wall time: 10 s
ddf_out.iloc[:,0]
Dask Series Structure:
npartitions=10
0      object
15        ...
        ...  
135       ...
146       ...
Name: CTA, dtype: object
Dask Name: getitem, 30 tasks

ddf_out.visualize()

%time results = ddf_out.persist(scheduler=client).compute()
CPU times: user 9.42 s, sys: 1.27 s, total: 10.7 s
Wall time: 2min 43s
type(results)
pandas.core.frame.DataFrame
results.loc[0]
CTA                                                     Cyclopropane
CanonicalSMILES    Delayed('getitem-e98dc8d7261c3d694a3c944735b3c...
Name: 0, dtype: object
compute(results['CanonicalSMILES'].iloc[0])[0] #Compute result for one entry 
'C1CC1'
%time results['CanonicalSMILES'] = [value[0] for value in results['CanonicalSMILES'].map(compute)]
CPU times: user 3.73 s, sys: 443 ms, total: 4.17 s
Wall time: 31.1 s
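
Mapping `compute` over the column evaluates the Delayed objects one call at a time. A possible alternative is to hand all of them to a single `compute` call. This is a minimal sketch, assuming a hypothetical offline stand-in (`fake_get_properties`) for the PubChem request so that it runs without network access; it works on a plain list of Delayed objects rather than on the `results` dataframe.

```python
from dask import delayed, compute

# Hypothetical offline stand-in for the PubChem lookup (assumption for this sketch).
FAKE_PUBCHEM = {"Cyclopropane": "C1CC1", "Ethylene": "C=C", "Methane": "C"}

def fake_get_properties(name):
    return [{"CanonicalSMILES": FAKE_PUBCHEM[name]}]

names = ["Cyclopropane", "Ethylene", "Methane"]

# One lazy task per compound; indexing a Delayed object stays lazy.
lazy_smiles = [delayed(fake_get_properties)(n)[0]["CanonicalSMILES"] for n in names]

# A single compute call evaluates all tasks together and returns a tuple.
smiles = list(compute(*lazy_smiles, scheduler="threads"))
print(dict(zip(names, smiles)))
```

Whether this is faster than the row-by-row version for the real PubChem lookups has not been measured here.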
type(results)
pandas.core.frame.DataFrame
results[results['CanonicalSMILES'] == 'X']
CTA CanonicalSMILES
results
     CTA                             CanonicalSMILES
0    Cyclopropane                    C1CC1
1    Ethylene                        C=C
2    Methane                         C
3    t-Butanol                       CC(C)(C)O
4    ethane                          CC
...  ...                             ...
142  Cyclohexane-1,3-dicarbaldehyde  C1CC(CC(C1)C=O)C=O
143  isobutene                       CC(=C)C
144  propanal                        CCC=O
145  methyl methacrylate             CC(=C)C(=O)OC
146  vinyl acetate                   CC(=O)OC=C

147 rows × 2 columns

results.to_pickle("cta_smiles_table_100_less.pkl")

Dask to get InChIKey

In my opinion, this implementation is a more elegant use of Dask's apply, which wraps the conventional pandas apply. Here we also pass the meta argument to declare the name and dtype of the output, since Dask doesn't seem to recognise the type of entries we expect in the final result.

More information about meta here: https://docs.dask.org/en/latest/dataframe-api.html

import rdkit
from rdkit import Chem
from rdkit.Chem import PandasTools
from rdkit.Chem import Draw

Chem.WrapLogs()
lg = rdkit.RDLogger.logger() 
lg.setLevel(rdkit.RDLogger.CRITICAL)
def get_InChiKey(x):
    try:
        inchi_key = Chem.MolToInchiKey(Chem.MolFromSmiles(x))
    except Exception:
        # e.g. MolFromSmiles returned None for an invalid SMILES string
        inchi_key = 'X'
    return inchi_key

def dask_smiles(df):
    df['INCHI'] = df['CanonicalSMILES'].map(get_InChiKey)
    return df
results_dask = dd.from_pandas(results, npartitions=10)
inchi = results_dask['CanonicalSMILES'].apply(lambda x: Chem.MolToInchiKey(Chem.MolFromSmiles(x)), meta=('inchi_key',str))
inchi
Dask Series Structure:
npartitions=10
0      object
15        ...
        ...  
135       ...
146       ...
Name: inchi_key, dtype: object
Dask Name: apply, 30 tasks

inchi.visualize()

inchi is a new Dask Series that holds the lazy task graph for computing the InChIKeys. We can compute it and assign the result directly to the results dataframe as a new column. This is slightly different from the SMILES implementation above, where each entry was computed individually.

%time results['INCHI'] = compute(inchi, scheduler = client)[0]
CPU times: user 125 ms, sys: 17.3 ms, total: 142 ms
Wall time: 1.02 s
results
     CTA                             CanonicalSMILES     INCHI
0    Cyclopropane                    C1CC1               LVZWSLJZHVFIQJ-UHFFFAOYSA-N
1    Ethylene                        C=C                 VGGSQFUCUMXWEO-UHFFFAOYSA-N
2    Methane                         C                   VNWKTOKETHGBQD-UHFFFAOYSA-N
3    t-Butanol                       CC(C)(C)O           DKGAVHZHDRPRBM-UHFFFAOYSA-N
4    ethane                          CC                  OTMSDBZUPAUEDD-UHFFFAOYSA-N
...  ...                             ...                 ...
142  Cyclohexane-1,3-dicarbaldehyde  C1CC(CC(C1)C=O)C=O  WHKHKMGAZGBKCK-UHFFFAOYSA-N
143  isobutene                       CC(=C)C             VQTUBCCKSQIDNK-UHFFFAOYSA-N
144  propanal                        CCC=O               NBBJYMSMWIIQGU-UHFFFAOYSA-N
145  methyl methacrylate             CC(=C)C(=O)OC       VVQNEPGJFQJSBK-UHFFFAOYSA-N
146  vinyl acetate                   CC(=O)OC=C          XTXRWKRVRITETP-UHFFFAOYSA-N

147 rows × 3 columns