Get SMILES from PubChem using DASK

A Dask implementation for retrieving CanonicalSMILES from PubChem through the PubChem API. At the end of the notebook there is a second Dask-based implementation that uses RDKit to generate InChIKeys from the SMILES. While Dask is not strictly required for the InChIKeys, that part is a much more elegant use of dask.dataframe and map_partitions.

import time
import pubchempy as pcp
from pubchempy import Compound, get_compounds
import pandas as pd
import numpy as np
import re
import copy

## Get SMILES from PubChem

Update: parallelized using Dask

df_100 = pd.read_csv('./DASK_SMILES/sample_chemical_names.csv', sep=',', header=0)

df_100.shape

(147, 1)

from dask.distributed import Client, progress
import dask.dataframe as dd
from dask import delayed, compute
from dask.multiprocessing import get
client = Client()
client

Client Scheduler: tcp://127.0.0.1:45859 Dashboard: http://127.0.0.1:8787/status	Cluster Workers: 4 Cores: 8 Memory: 39.85 GB

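def get_smile(cmpd_name):
    try:
        # delayed(f)(*args) only builds a lazy task; the PubChem request runs later, on compute()
        name = delayed(pcp.get_properties)(['CanonicalSMILES'], cmpd_name, 'name')
        time.sleep(5)
        # Indexing a Delayed returns another Delayed, not the SMILES string itself
        smile = name[0]['CanonicalSMILES']
    except Exception:
        smile = 'X'
        print(cmpd_name, smile)
    return smile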

def dask_smiles(df):
    df['CanonicalSMILES'] = df['CTA'].map(get_smile)
    return df  # map_partitions works here, but not with to_list() in the previous implementation

df_dask = dd.from_pandas(df_100, npartitions=10)

df_dask

Dask DataFrame Structure:

	CTA
npartitions=10
0	object
15	...
...	...
135	...
146	...

Dask Name: from_pandas, 10 tasks

df_dask.visualize()

%time ddf_out  = df_dask.map_partitions(dask_smiles)

CPU times: user 567 ms, sys: 92.3 ms, total: 660 ms
Wall time: 10 s

ddf_out.iloc[:,0]

Dask Series Structure:
npartitions=10
0      object
15        ...
        ...  
135       ...
146       ...
Name: CTA, dtype: object
Dask Name: getitem, 30 tasks

ddf_out.visualize()

%time results = ddf_out.persist(scheduler=client).compute()

CPU times: user 9.42 s, sys: 1.27 s, total: 10.7 s
Wall time: 2min 43s

type(results)

pandas.core.frame.DataFrame

results.loc[0]

CTA                                                     Cyclopropane
CanonicalSMILES    Delayed('getitem-e98dc8d7261c3d694a3c944735b3c...
Name: 0, dtype: object

compute(results['CanonicalSMILES'].iloc[0])[0] #Compute result for one entry

'C1CC1'

%time results['CanonicalSMILES'] = [value[0] for value in results['CanonicalSMILES'].map(compute)]

CPU times: user 3.73 s, sys: 443 ms, total: 4.17 s
Wall time: 31.1 s

type(results)

pandas.core.frame.DataFrame

results[results['CanonicalSMILES'] == 'X']

	CTA	CanonicalSMILES

results

	CTA	CanonicalSMILES
0	Cyclopropane	C1CC1
1	Ethylene	C=C
2	Methane	C
3	t-Butanol	CC(C)(C)O
4	ethane	CC
...	...	...
142	Cyclohexane-1,3-dicarbaldehyde	C1CC(CC(C1)C=O)C=O
143	isobutene	CC(=C)C
144	propanal	CCC=O
145	methyl methacrylate	CC(=C)C(=O)OC
146	vinyl acetate	CC(=O)OC=C

147 rows × 2 columns

results.to_pickle("cta_smiles_table_100_less.pkl")

## Dask to get InChIKey

In my opinion, this implementation is a more elegant use of Dask's apply, which wraps the conventional pandas apply. Here we also pass the meta argument, since Dask does not seem to recognise on its own the type of the entries we expect in the final output.

More information about meta here: https://docs.dask.org/en/latest/dataframe-api.html

import rdkit
from rdkit import Chem
from rdkit.Chem import PandasTools
from rdkit.Chem import Draw

Chem.WrapLogs()
lg = rdkit.RDLogger.logger() 
lg.setLevel(rdkit.RDLogger.CRITICAL)

def get_InChiKey(x):
    try:
        inchi_key =  Chem.MolToInchiKey(Chem.MolFromSmiles(x))
    except:
        inchi_key = 'X'
    return inchi_key

def dask_smiles(df):
    # Partition-wise alternative: map the helper defined above over the SMILES column
    df['INCHI'] = df['CanonicalSMILES'].map(get_InChiKey)
    return df

results_dask = dd.from_pandas(results, npartitions=10)

inchi = results_dask['CanonicalSMILES'].apply(lambda x: Chem.MolToInchiKey(Chem.MolFromSmiles(x)), meta=('inchi_key',str))

inchi

Dask Series Structure:
npartitions=10
0      object
15        ...
        ...  
135       ...
146       ...
Name: inchi_key, dtype: object
Dask Name: apply, 30 tasks

inchi.visualize()

inchi is a new Dask Series that holds the lazy task graph for computing the InChIKeys. We can compute it and assign the result directly to the results dataframe as a new column. This differs slightly from the SMILES implementation above, where each entry had to be computed separately.

%time results['INCHI'] = compute(inchi, scheduler = client)[0]

CPU times: user 125 ms, sys: 17.3 ms, total: 142 ms
Wall time: 1.02 s

results

	CTA	CanonicalSMILES	INCHI
0	Cyclopropane	C1CC1	LVZWSLJZHVFIQJ-UHFFFAOYSA-N
1	Ethylene	C=C	VGGSQFUCUMXWEO-UHFFFAOYSA-N
2	Methane	C	VNWKTOKETHGBQD-UHFFFAOYSA-N
3	t-Butanol	CC(C)(C)O	DKGAVHZHDRPRBM-UHFFFAOYSA-N
4	ethane	CC	OTMSDBZUPAUEDD-UHFFFAOYSA-N
...	...	...	...
142	Cyclohexane-1,3-dicarbaldehyde	C1CC(CC(C1)C=O)C=O	WHKHKMGAZGBKCK-UHFFFAOYSA-N
143	isobutene	CC(=C)C	VQTUBCCKSQIDNK-UHFFFAOYSA-N
144	propanal	CCC=O	NBBJYMSMWIIQGU-UHFFFAOYSA-N
145	methyl methacrylate	CC(=C)C(=O)OC	VVQNEPGJFQJSBK-UHFFFAOYSA-N
146	vinyl acetate	CC(=O)OC=C	XTXRWKRVRITETP-UHFFFAOYSA-N

147 rows × 3 columns
