Dask implementation to acquire CanonicalSMILES from PubChem through the PubChem API (via `pubchempy`). At the end of the notebook there is another Dask-based implementation that uses RDKit to get InChIKeys from the SMILES. While Dask is not strictly required for the InChIKeys, that part is a much more elegant use of `dask.dataframe` and `map_partitions`.
import time
import pubchempy as pcp
from pubchempy import Compound, get_compounds
import pandas as pd
import numpy as np
import re
import copy
df_100 = pd.read_csv('./DASK_SMILES/sample_chemical_names.csv', sep=',', header=0)
df_100.shape
from dask.distributed import Client, progress
import dask.dataframe as dd
from dask import delayed, compute
from dask.multiprocessing import get
client = Client()
client
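def get_smile(cmpd_name):
    try:
        # delayed(f)(x, args=a): build a lazy PubChem lookup instead of calling it now
        name = delayed(pcp.get_properties)(['CanonicalSMILES'], cmpd_name, 'name')
        time.sleep(5)
        # Indexing a Delayed object is lazy as well, so `smile` is still a Delayed
        smile = name[0]['CanonicalSMILES']
    except Exception:
        smile = 'X'
    # The original printed the undefined name `cta_name`; the argument is `cmpd_name`
    print(cmpd_name, smile)
    return smile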
def dask_smiles(df):
    df['CanonicalSMILES'] = df['CTA'].map(get_smile)
    return df  # map_partitions works here, but not with to_list() in the previous implementation
df_dask = dd.from_pandas(df_100, npartitions=10)
df_dask
%time ddf_out = df_dask.map_partitions(dask_smiles)
ddf_out.iloc[:,0]
%time results = ddf_out.persist(scheduler=client).compute()
type(results)
results.loc[0]
compute(results['CanonicalSMILES'].iloc[0])[0] #Compute result for one entry
%time results['CanonicalSMILES'] = [value[0] for value in results['CanonicalSMILES'].map(compute)]
type(results)
results[results['CanonicalSMILES'] == 'X']
results
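The cells above evaluate the delayed SMILES one row at a time with `map(compute)`. As a sketch of an alternative, all the delayed objects can be passed to a single `compute` call so the scheduler sees every task graph at once. The function `fake_lookup` is a hypothetical stand-in for `pcp.get_properties`, mimicking its list-of-dicts return shape, and the synchronous scheduler is an assumption made to keep the example self-contained.

```python
from dask import delayed, compute


def fake_lookup(name):
    """Hypothetical stand-in for pcp.get_properties: one record per name."""
    table = {"water": "O", "ethanol": "CCO"}
    return [{"CanonicalSMILES": table.get(name, "X")}]


names = ["water", "ethanol", "unobtainium"]

# Each element is a Delayed; indexing it only extends the task graph
lazy = [delayed(fake_lookup)(n)[0]["CanonicalSMILES"] for n in names]

# One compute call evaluates every graph and returns a tuple of results
smiles = list(compute(*lazy, scheduler="synchronous"))
```

With a distributed client, a single call like this should let independent lookups run concurrently rather than sequentially, although the actual gain depends on how quickly the remote service responds.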
## Dask to get InChIKey
In my opinion, this implementation is a more elegant use of Dask's `apply`, which wraps the conventional pandas `apply`. Here we also define the `meta` argument, since Dask does not seem to recognise the type of entries we expect in the final output. More information about `meta` here: https://docs.dask.org/en/latest/dataframe-api.html
import rdkit
from rdkit import Chem
from rdkit.Chem import PandasTools
from rdkit.Chem import Draw
Chem.WrapLogs()
lg = rdkit.RDLogger.logger()
lg.setLevel(rdkit.RDLogger.CRITICAL)
def get_InChiKey(x):
    try:
        # MolFromSmiles returns None for an unparsable SMILES, which makes MolToInchiKey raise
        inchi_key = Chem.MolToInchiKey(Chem.MolFromSmiles(x))
    except Exception:
        inchi_key = 'X'
    return inchi_key
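def dask_smiles(df):
    # The original mapped the undefined `get_name` over a 'smiles' column;
    # the helper defined above is get_InChiKey and the SMILES live in 'CanonicalSMILES'
    df['INCHI'] = df['CanonicalSMILES'].map(get_InChiKey)
    return df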
results_dask = dd.from_pandas(results, npartitions=10)
# Use get_InChiKey rather than a bare lambda so unresolved entries ('X') yield 'X' instead of raising
inchi = results_dask['CanonicalSMILES'].apply(get_InChiKey, meta=('inchi_key', str))
inchi
`inchi` is a new Dask series that holds the delayed task graph for computing the InChIKeys. We can compute it directly into the `results` dataframe as a new column. This is slightly different from the SMILES implementation above.
%time results['INCHI'] = compute(inchi, scheduler = client)[0]
results