Rdkit quick tips

Rdkit code snippets and recipes that I revisit now and again.
chemical-science
exploratory-data-analysis
machine-learning
resources
Published

October 10, 2021

Rdkit code snippets and recipes that I revisit now and again. The snippets are adopted from different python scripts written over time, ignore the variable names.

Tutorials & Walkthroughs

*Mute warnings

from rdkit import RDLogger   
RDLogger.DisableLog('rdApp.*') 

Fingerprints

Quick ECFP fingerprint

from rdkit.Chem import AllChem
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

# Convert to Chem.Mol: 
mol = Chem.MolFromSmiles(smiles)

# Counts by default - unfolded 
rdMolDescriptors.GetMorganFingerprint(mol, radius) 

# Folded counts 
rdMolDescriptors.GetHashedMorganFingerprint(mol, radius, nBits=num_bits)

#Folded FP bit vectors as per the size of the bits 
morgan_fp_bit_vect = rdMolDescriptors.GetMorganFingerprintAsBitVect(mol, radius, nBits=num_bits)

# Convert to numpy 
fp = np.zeros((0,), dtype=np.int16)
DataStructs.ConvertToNumpyArray(morgan_fp_bit_vect, fp)

Loading data

Tanimoto similarity matrix

Adapted from Andrew White:

import itertools
def tanimoto_matrix(slist):
    '''
    Compute pair-wise Tanimoto similarity between a list of smiles with ECFP4 FPs
    '''
    fp = [ AllChem.GetMorganFingerprint( Chem.MolFromSmiles(s), 2 ) for s in slist ]
    ts = list(
    DataStructs.cDataStructs.TanimotoSimilarity(x,y) for x, y, in itertools.product(fp, repeat=2)
    )
    return np.array(ts).reshape(len(fp), len(fp))

Loading ZINC dataset

Adapted from Andrew White:

tranches = pd.read_csv('https://gist.githubusercontent.com/whitead/f47887e45bbd2f38332182d2d422da6b/raw/a3948beac9b9034dab432b697c5ec238503ac5d0/tranches.txt')
def get_mol_batch(batch_size = 32):
  for t in tranches.values:
    d = pd.read_csv(t[0], sep=' ')    
    for i in range(len(d) // batch_size):
      yield d.iloc[i * batch_size:(i + 1) * batch_size, 0].values

Viewing molecules

SDF

Simple implementation

inf = open('./example.sdf','rb')
#import gzip 
#inf = gzip.open('gzip_file')
fsuppl = Chem.ForwardSDMolSupplier(inf)
mol_list = []
for mol in fsuppl:
  if mol is None:
    continue
  print(mol.GetNumAtoms())
  mol_list.append(mol)

As a Pandas DataFrame

sdf_df = PandasTools.LoadSDF('./example.sdf')
sdf_df['NumHeavyAtoms']= sdf_df.apply(lambda x: x['ROMol'].GetNumHeavyAtoms(), axis=1)
PandasTools.WriteSDF(sdf_df, 'out.sdf', molColName='ROMol', idName='ID', properties=list(sdf_df.columns), allNumeric=False)

Viewing molecules inline

from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem.Draw import MolsToGridImage 
IPythonConsole.ipython_useSVG=True  #SVG's tend to look nicer than the png counterparts

sample_mol_list = [Chem.MolFromSmiles(x) for x  in smiles_list]
MolsToGridImage(sample_mol_list, molsPerRow=5)

Viewing molecules in a grid

import pandas as pd
from rdkit.Chem import PandasTools

>> PandasTools.AddMoleculeColumnToFrame(df, smilesCol='smiles')
>> esol_data.head(1)

>> PandasTools.FrameToGridImage(df.head(8), legendsCol="logSolubility", molsPerRow=4)

Adding new values as a column

df["n_Atoms"] = df['ROMol'].map(lambda x: x.GetNumAtoms())
df.head(1)

Molecules in a xlsx file

import pandas as pd
from rdkit import Chem
from rdkit.Chem import PandasTools

>> smiles = ['c1ccccc1', 'c1ccccc1O', 'c1cc(O)ccc1O']
>> df = pd.DataFrame({'ID':['Benzene', 'Phenol', 'Hydroquinone'], 'SMILES':smiles})
>> df['Mol Image'] = [Chem.MolFromSmiles(s) for s in df['SMILES']]
>> PandasTools.SaveXlsxFromFrame(df, 'test.xlsx', molCol='Mol Image')

Viewing substructures

def viz_substruct(main_smile, substructure_smarts):
    
    mol_file = Chem.MolFromSmiles(main_smile)
    sub_pattern = Chem.MolFromSmarts(substructure_smarts)
    
    hit_ats = list(mol_file.GetSubstructMatch(sub_pattern)) # Returns the indices of the molecule’s atoms that match a substructure query
    hit_bonds = []

    for bond in sub_pattern.GetBonds():
        aid1 = hit_ats[bond.GetBeginAtomIdx()]
        aid2 = hit_ats[bond.GetEndAtomIdx()]

        hit_bonds.append( mol_file.GetBondBetweenAtoms(aid1, aid2).GetIdx() )

    d2d = rdMolDraw2D.MolDraw2DSVG(400, 400) # or MolDraw2DCairo to get PNGs
    rdMolDraw2D.PrepareAndDrawMolecule(d2d, mol_file, highlightAtoms=hit_ats,  highlightBonds=hit_bonds)
    d2d.FinishDrawing()
    return SVG(d2d.GetDrawingText())

>> diclofenac = 'O=C(O)Cc1ccccc1Nc1c(Cl)cccc1Cl'
>> substruct_smarts = 'O=CCccN'
>> viz_substruct(diclofenac, substruct_smarts)

Quick substructure filter in-line to Pandas df

def NO_NO(smile):
    ''' Detects a Tau binder fragment'''
    mol = Chem.MolFromSmiles(smile)
    smt1= Chem.MolFromSmarts('[NR1][OR1]')
    if mol.HasSubstructMatch(smt1):
      return True
    else:
      return False

>> linker_2_stereos['Ring_NO'] = linker_2_stereos['SMILES'].apply(NO_NO)

Standardize molecules

Borrowed from here

from rdkit.Chem.MolStandardize import rdMolStandardize

def standardize(mol):
    # follows the steps in
    # https://github.com/greglandrum/RSC_OpenScience_Standardization_202104/blob/main/MolStandardize%20pieces.ipynb
    # as described **excellently** (by Greg) in
    # https://www.youtube.com/watch?v=eWTApNX8dJQ
    # removeHs, disconnect metal atoms, normalize the molecule, reionize the molecule
    clean_mol = rdMolStandardize.Cleanup(mol) 
     
    # if many fragments, get the "parent" (the actual mol we are interested in) 
    parent_clean_mol = rdMolStandardize.FragmentParent(clean_mol)
         
    # try to neutralize molecule
    uncharger = rdMolStandardize.Uncharger() # annoying, but necessary as no convenience method exists
    uncharged_parent_clean_mol = uncharger.uncharge(parent_clean_mol)
     
    # note that no attempt is made at reionization at this step
    # nor at ionization at some pH (rdkit has no pKa caculator)
    # the main aim to to represent all molecules from different sources
    # in a (single) standard way, for use in ML, catalogue, etc.
     
    te = rdMolStandardize.TautomerEnumerator() # idem
    taut_uncharged_parent_clean_mol = te.Canonicalize(uncharged_parent_clean_mol)
     
    return taut_uncharged_parent_clean_mol

Find and change atoms in the molecule

def find_UI_change_to_12(smile, attach_type='UI'):
  mol1 = Chem.MolFromSmiles(smile)
  sub_pattern = Chem.MolFromSmarts('[U]*.[I]*')
  hit_ats = list(mol1.GetSubstructMatch(sub_pattern))
  emol = Chem.EditableMol(mol1) #Chem.RWMol()
  emol.RemoveBond(hit_ats[0], hit_ats[1])
  emol.RemoveBond(hit_ats[2], hit_ats[3])
  mol2 = emol.GetMol()
  
  if attach_type == 'UI':
    # Attached to U 
    mol2.GetAtomWithIdx(hit_ats[1]).SetIsotope(2)
    # Attached to I 
    mol2.GetAtomWithIdx(hit_ats[3]).SetIsotope(1)
  
  else:
    # Attached to U 
    mol2.GetAtomWithIdx(hit_ats[1]).SetIsotope(1)
    # Attached to I
    mol2.GetAtomWithIdx(hit_ats[3]).SetIsotope(2)
    
  mol2 = standardize(mol2)
  return(mol2, Chem.MolToSmiles(mol2))

Reactions

View a reaction

rxn = AllChem.ReactionFromSmarts('[C:1]=[C:2].[C:3]=[*:4][*:5]=[C:6]>>[C:1]1[C:2][C:3][*:4]=[*:5][C:6]1')

Get changed atoms in a reaction: https://greglandrum.github.io/rdkit-blog/tutorial/reactions/2021/11/26/highlighting-changed-bonds-in-reactions.html

Run enumerations

Borrowed and inspired from Pat Walters gist

rxn = AllChem.ReactionFromSmarts("[#6:10]-[#7H2:9].[#7]-[c:4]1[c:5][c:6][c:7][c:8][c:3]1-[#6](-[OH])=O.[#6H:1](-[#6:2])=O>>[#6:10]-[#7:9]-c1n[c:1](-[#6:2])n[c:4]2[c:5][c:6][c:7][c:8][c:3]12")

react_dict = {'R1':[],'R2':[],'R3':[], 'product':[]}
for r1, r2, r3 in product( *[primary_amines_list, amino_benzoic_list, aldehyde_list]):
  react_dict['R1'].append(r1)
  react_dict['R2'].append(r2)
  react_dict['R3'].append(r3)
  r1_mol = Chem.MolFromSmiles(r1)
  r2_mol = Chem.MolFromSmiles(r2)
  r3_mol = Chem.MolFromSmiles(r3)
  for prod in rxn.RunReactants([r1_mol,r2_mol,r3_mol]):
    prod_mol = prod[0]
    Chem.SanitizeMol(prod_mol)
    react_dict['product'].append(Chem.MolToSmiles(prod_mol))

react_df = pd.DataFrame(react_dict)

Zip molecules

molzip lets you take a molecule containing multiple fragments and ‘zip’ them together. Atoms which should be bonded to the final molecule are labelled by connecting them to dummy atoms. The code looks at matching dummy atoms (with same isotopic labels) in the fragment and adds bonds between them.

sample = Chem.MolFromSmiles('[*:1]c1nc([*:2])nc([*:3])c1.CO[*:1].[*:2]N(CO)C.Cl[*:3]')
sample

Chem.molzip(sample)

RDKit documentation on Molzip and R-group decomposition

MolZip documentation and early version here

More examples here

Edit, merge, react molecules

Molecule tinkering using Rdkit: http://asteeves.github.io/blog/2015/01/14/editing-in-rdkit/

Using mol2grid

mols2grid is an interactive chemical viewer for 2D structures of small molecules, based on RDKit.

Useful set of functions in RDkit

Pat Walters has made a GitHub repo collecting useful RDKit functions here.