import os
import pandas as pd
import numpy as np
import tqdm.notebook as tqdm
Fingerprints are a low resolution representations of molecules. Historically they were devised, primarily, as a representation scheme to better estimate the similarity of molecules. Currently, besides the similairty and diversity estimations, fingerprints are the workhorses for most of the data-driven predictive models.
Key papers in this field: * Willett, Peter, John M. Barnard, and Geoffrey M. Downs. “Chemical similarity searching.” Journal of chemical information and computer sciences 38.6 (1998): 983-996 * PubChem Atom Environments
Blogs from Greg Landrum (RDKit creator): * Bit collisions in Rdkit * Simulating count fingerprints * Number of Fingerprint Bits * Number of unique fingerprint bits * RSC Open Science Standardization Talk
Resources to consider: * Depth-first on Fingerprints * Applied AI for materials online course by Logan Ward (Argonne National Lab) * Example scripts in GHOST implementation
# RDkit imports
import rdkit
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole #Needed to show molecules
print(rdkit.__version__)
Chem.WrapLogs()= rdkit.RDLogger.logger()
lg lg.setLevel(rdkit.RDLogger.CRITICAL)
2021.09.3
#----- PLOTTING PARAMS ----#
import matplotlib.pyplot as plt
from matplotlib.pyplot import cm
# High DPI rendering for mac
%config InlineBackend.figure_format = 'retina'
# "Infinite" DPI vector output -- overkill
#%config InlineBackend.figure_format = 'svg'
# Plot matplotlib plots with white background:
%config InlineBackend.print_figure_kwargs={'facecolor' : "w"}
# Load a test dataset
= 'https://raw.githubusercontent.com/pgg1610/data_files/main/delaney.smi'
url = pd.read_csv(url, header=0, names=['SMILES', 'logP']) x
3) x.head(
SMILES | logP | |
---|---|---|
0 | OCC3OC(OCC2OC(OC(C#N)c1ccccc1)C(O)C(O)C2O)C(O)... | -0.77 |
1 | Cc1occc1C(=O)Nc2ccccc2 | -3.30 |
2 | CC(C)=CCCC(C)=CC(=O) | -2.06 |
from rdkit.Chem import PandasTools
='SMILES') PandasTools.AddMoleculeColumnToFrame(x, smilesCol
2) x.sample(
SMILES | logP | ROMol | |
---|---|---|---|
1076 | CCCC(O)CCC | -1.40 | |
469 | Cc1cccc(C)c1O | -1.29 |
Types of fingerprints to consider:
https://www.rdkit.org/docs/GettingStartedInPython.html#list-of-available-fingerprints
The types of atom pairs and torsions are normal (default), hashed and bit vector (bv). The types of the Morgan fingerprint are bit vector (bv, default) and count vector (count).
Descriptor based fingerprints - more information here
Count or binary-based fingerprints
2.1. Circular Fingerprints (Morgan) - Extended Connectivity (ECFP) or Feature Morgan
2.2. Atom pair
2.3. Torsion
2.4. MACCS Keys
2.5. RDkit
2.6. Avalon Tools
Data-driven fingerprints
Many of the RDKit’s fingerprints are available as either bit vectors or count vectors.
Bit vectors track whether or not features appear in a molecule while count vectors track the number of times each feature appears. It seems intuitive that a count vector is a better representation of similarity than bit vectors, but we often use bit vector representations for computational expediency - bit vectors require less memory and are much faster to operate on.
In the newer version of Rdkit there is are functions to simulate counts from bit vector.
Broadly there are two ways of invoking the fingerprints:
#Instance
rdMolDescriptors.GetMorganFingerprint() AllChem.GetMorganFingerprint()
#Generator
= rdFingerprintGenerator.GetMorganGenerator()
fpgen fpgen.GetFingerprint(x)
1. Morgan fingerprints
Some discussion on Bit Collisions here: https://rdkit.blogspot.com/2014/02/colliding-bits.html
Unfolded type or Sparse type
from rdkit.Chem import AllChem
from rdkit import DataStructs
= x.iloc[42]['ROMol'] mol_obj
mol_obj
#Default class to generate FPs - this will make it unfolded by default
= AllChem.GetMorganFingerprint(mol_obj, radius = 2) # Counts are used by default
morgan_fp_default
#Check the use_counts in Get Morgan Fingerprint this is used by default
= AllChem.GetMorganFingerprint(mol_obj, radius = 2, useCounts=True) # This is used by default
morgan_fp_default_counts
#Feature-based invariants, similar to those used for the FCFP fingerprints, can also be used.
# https://www.rdkit.org/docs/GettingStartedInPython.html#feature-definitions-used-in-the-morgan-fingerprints
# Here the atoms are grouped into functional classes before substructures are enumerated
= AllChem.GetMorganFingerprint(mol_obj, 2, useFeatures=True) morgan_fp_features
morgan_fp_default
<rdkit.DataStructs.cDataStructs.UIntSparseIntVect at 0x2b5a80661ad0>
You can also use the rdMolDescriptor
module in Rdkit for generating the fingerprints. What is the advatange of either?
from rdkit.Chem import rdMolDescriptors
= rdMolDescriptors.GetMorganFingerprint(mol_obj, radius=2) morgan_fp_default_rdmoldesc
morgan_fp_default_rdmoldesc.GetNonzeroElements()
{10565946: 3,
123085743: 1,
354000409: 1,
864942730: 3,
1451062712: 1,
2004727027: 2,
2119439498: 2,
2132511834: 2,
2142032900: 1,
2296493092: 2,
2422851564: 1,
2809361097: 2,
2968968094: 3,
2976816164: 1,
3038691098: 1,
3113921857: 2,
3217380708: 3,
3255046070: 1}
To get a list of bits being ‘active’ in the FPs:
DataStructs.cDataStructs.UIntSparseIntVect.GetNonzeroElements(morgan_fp_default)
{10565946: 3,
123085743: 1,
354000409: 1,
864942730: 3,
1451062712: 1,
2004727027: 2,
2119439498: 2,
2132511834: 2,
2142032900: 1,
2296493092: 2,
2422851564: 1,
2809361097: 2,
2968968094: 3,
2976816164: 1,
3038691098: 1,
3113921857: 2,
3217380708: 3,
3255046070: 1}
== morgan_fp_default_rdmoldesc.GetNonzeroElements() DataStructs.cDataStructs.UIntSparseIntVect.GetNonzeroElements(morgan_fp_default)
True
DataStructs.cDataStructs.UIntSparseIntVect.GetLength(morgan_fp_default)
4294967295
2**32
4294967296
morgan_fp_default_counts
<rdkit.DataStructs.cDataStructs.UIntSparseIntVect at 0x2b5a8066e620>
DataStructs.cDataStructs.UIntSparseIntVect.GetNonzeroElements(morgan_fp_default_counts)
{10565946: 3,
123085743: 1,
354000409: 1,
864942730: 3,
1451062712: 1,
2004727027: 2,
2119439498: 2,
2132511834: 2,
2142032900: 1,
2296493092: 2,
2422851564: 1,
2809361097: 2,
2968968094: 3,
2976816164: 1,
3038691098: 1,
3113921857: 2,
3217380708: 3,
3255046070: 1}
morgan_fp_features
<rdkit.DataStructs.cDataStructs.UIntSparseIntVect at 0x2b5a8066e7b0>
DataStructs.cDataStructs.UIntSparseIntVect.GetNonzeroElements(morgan_fp_features)
{0: 7,
1: 2,
2: 3,
613660066: 1,
614173363: 2,
1250412363: 2,
2407443532: 1,
2419043133: 1,
3064147743: 1,
3205496824: 3,
3208849907: 1,
3664239091: 2,
3766528779: 2,
3766532888: 3,
4203103696: 1}
DataStructs.cDataStructs.UIntSparseIntVect.GetLength(morgan_fp_features)
4294967295
Folded type
#Counts Folded FPs
= AllChem.GetMorganFingerprint(mol_obj, 2, useCounts=True, nBits=1024) # This will throw an error - cannot fold the full FPs instead call them separately morgan_fp_counts_folded
ArgumentError: Python argument types in
rdkit.Chem.rdMolDescriptors.GetMorganFingerprint(Mol, int)
did not match C++ signature:
GetMorganFingerprint(RDKit::ROMol mol, unsigned int radius, boost::python::api::object invariants=[], boost::python::api::object fromAtoms=[], bool useChirality=False, bool useBondTypes=True, bool useFeatures=False, bool useCounts=True, boost::python::api::object bitInfo=None, bool includeRedundantEnvironments=False)
Count-based Morgan FPs
#Hashed FPs - using counts
= AllChem.GetHashedMorganFingerprint(mol_obj, 2, nBits=1024) #Default is 2048 and be used without folding here morgan_fp_hashed_folded
morgan_fp_hashed_folded
<rdkit.DataStructs.cDataStructs.UIntSparseIntVect at 0x2b5a8066ef30>
DataStructs.cDataStructs.UIntSparseIntVect.GetLength(morgan_fp_hashed_folded)
1024
DataStructs.cDataStructs.UIntSparseIntVect.GetNonzeroElements(morgan_fp_hashed_folded)
{4: 1,
36: 3,
90: 2,
138: 2,
243: 2,
314: 3,
321: 2,
356: 3,
440: 1,
537: 1,
650: 3,
713: 2,
794: 1,
926: 3,
943: 1,
950: 1,
1004: 1}
Bit-vector based Morgan FP
#Folded FP bit vectors as per the size of the bits
= AllChem.GetMorganFingerprintAsBitVect(mol_obj, 2, nBits=1024) morgan_fp_bit_vect
morgan_fp_bit_vect
<rdkit.DataStructs.cDataStructs.ExplicitBitVect at 0x2b5a8066e4e0>
DataStructs.cDataStructs.ExplicitBitVect.GetNumBits(morgan_fp_bit_vect)
1024
DataStructs.cDataStructs.ExplicitBitVect.GetNumOnBits(morgan_fp_bit_vect)
17
Converting to friendly data type
= np.zeros((1,))
morgan_fp_hashed_folded_array = np.zeros((1,)) morgan_fp_bit_vect_array
DataStructs.ConvertToNumpyArray(morgan_fp_hashed_folded, morgan_fp_hashed_folded_array)
np.nonzero(morgan_fp_hashed_folded_array)
(array([ 4, 36, 90, 138, 243, 314, 321, 356, 440, 537, 650,
713, 794, 926, 943, 950, 1004]),)
np.unique(morgan_fp_hashed_folded_array)
array([0., 1., 2., 3.])
morgan_fp_bit_vect
<rdkit.DataStructs.cDataStructs.ExplicitBitVect at 0x2b5a8066e4e0>
DataStructs.ConvertToNumpyArray(morgan_fp_bit_vect, morgan_fp_bit_vect_array)
np.nonzero(morgan_fp_bit_vect_array)
(array([ 4, 36, 90, 138, 243, 314, 321, 356, 440, 537, 650,
713, 794, 926, 943, 950, 1004]),)
np.unique(morgan_fp_bit_vect_array)
array([0., 1.])
Converting the Morgan FPs between counts and bit-wise representation: https://stackoverflow.com/questions/54809506/how-can-i-compute-a-count-morgan-fingerprint-as-numpy-array
= AllChem.GetMorganFingerprintAsBitVect(mol_obj, 2, nBits=1024)
fp = np.zeros((0, ), dtype=np.int16)
array DataStructs.ConvertToNumpyArray(fp, array)
array
array([0, 0, 0, ..., 0, 0, 0], dtype=int16)
= "".join(array.astype(str))
bitstring = DataStructs.cDataStructs.CreateFromBitString(bitstring) fp2
fp2
<rdkit.DataStructs.cDataStructs.ExplicitBitVect at 0x2b5a80688b20>
list(fp.GetOnBits()) == list(fp2.GetOnBits())
True
= AllChem.GetHashedMorganFingerprint(mol_obj, 2, nBits=1024)
fp3 = np.zeros((0,), dtype=np.int8)
array
DataStructs.ConvertToNumpyArray(fp3, array)print(array.nonzero())
(array([ 4, 36, 90, 138, 243, 314, 321, 356, 440, 537, 650,
713, 794, 926, 943, 950, 1004]),)
Alternate way:
def numpy_2_fp(array):
= DataStructs.cDataStructs.UIntSparseIntVect(len(array))
fp for ix, value in enumerate(array):
= int(value)
fp[ix] return fp
= numpy_2_fp(array) fp4
== fp4.GetNonzeroElements() fp3.GetNonzeroElements()
True
2. MACCS
There is a SMARTS-based implementation of the 166 public MACCS keys.
from rdkit.Chem import MACCSkeys
= MACCSkeys.GenMACCSKeys(mol_obj) maccs_fp
maccs_fp
<rdkit.DataStructs.cDataStructs.ExplicitBitVect at 0x2b5a99f013a0>
= np.zeros((0,)) maccs_fp_array
DataStructs.ConvertToNumpyArray(maccs_fp, maccs_fp_array)
maccs_fp_array.shape
(167,)
np.nonzero(maccs_fp_array)
(array([ 11, 37, 43, 53, 66, 77, 80, 89, 90, 91, 92, 97, 98,
105, 106, 110, 112, 117, 118, 120, 121, 127, 131, 136, 137, 142,
143, 146, 147, 151, 154, 156, 158, 159, 161, 163, 164, 165]),)
np.unique(maccs_fp_array)
array([0., 1.])
3. Atom Pairs
from rdkit.Chem.AtomPairs import Pairs, Torsions
= Pairs.GetAtomPairFingerprint(mol_obj) # Fully unfolded ones - The standard form is as fingerprint including counts for each bit instead of just zeros and ones
ap_full = Pairs.GetAtomPairFingerprintAsBitVect(mol_obj) # Bit vector
ap_bit = Pairs.GetHashedAtomPairFingerprint(mol_obj, nBits=1024) #These are count based AP, default = 2048 ap_counts
ap_full
<rdkit.DataStructs.cDataStructs.IntSparseIntVect at 0x2b5a9a0e82b0>
DataStructs.cDataStructs.IntSparseIntVect.GetLength(ap_full)
8388608
DataStructs.cDataStructs.IntSparseIntVect.GetNonzeroElements(ap_full)
{558145: 2,
558146: 1,
590913: 2,
590914: 1,
705602: 4,
705603: 2,
705604: 2,
705605: 1,
705665: 2,
705667: 1,
705890: 3,
1082435: 4,
1082436: 2,
1082498: 2,
1082721: 4,
1082723: 2,
1083458: 1,
1721411: 4,
1721412: 2,
1721413: 2,
1721414: 1,
1721474: 2,
1721476: 1,
1721697: 3,
1721699: 6,
1722434: 4,
1722436: 2,
1723684: 3}
ap_bit
<rdkit.DataStructs.cDataStructs.SparseBitVect at 0x2b5a99f2ca70>
= np.zeros((0,), dtype=np.int32) ap_bit_array
DataStructs.ConvertToNumpyArray(ap_bit, ap_bit_array)
ArgumentError: Python argument types in
rdkit.DataStructs.cDataStructs.ConvertToNumpyArray(SparseBitVect, numpy.ndarray)
did not match C++ signature:
ConvertToNumpyArray(RDKit::SparseIntVect<unsigned long> bv, boost::python::api::object destArray)
ConvertToNumpyArray(RDKit::SparseIntVect<unsigned int> bv, boost::python::api::object destArray)
ConvertToNumpyArray(RDKit::SparseIntVect<long> bv, boost::python::api::object destArray)
ConvertToNumpyArray(RDKit::SparseIntVect<int> bv, boost::python::api::object destArray)
ConvertToNumpyArray(RDKit::DiscreteValueVect bv, boost::python::api::object destArray)
ConvertToNumpyArray(ExplicitBitVect bv, boost::python::api::object destArray)
= np.array((1,)) ap_counts_array
DataStructs.ConvertToNumpyArray(ap_counts, ap_counts_array)
ap_counts_array.shape
(1024,)
4. RDKIT Fingerprint
a Daylight-like fingerprint based on hashing molecular subgraphs
= Chem.RDKFingerprint(mol_obj) rdkit_fp
rdkit_fp
<rdkit.DataStructs.cDataStructs.ExplicitBitVect at 0x2b5aa2582350>
= np.zeros((0,)) rdkit_fp_array
DataStructs.ConvertToNumpyArray(rdkit_fp, rdkit_fp_array)
np.unique(rdkit_fp_array)
array([0., 1.])
5. Descriptors
Compute the RDKit2D fingerprint (200 topological properties) using the descriptastorus library. Adopted from https://github.com/rinikerlab/GHOST/blob/main/notebooks/library_example.ipynb
from descriptastorus.descriptors import rdDescriptors, rdNormalizedDescriptors
Can be looped with different methods
= [Chem.RDKFingerprint(x, fpSize=1024, minPath=1, maxPath=4) for x in suppl]
fps1 = [Chem.GetHashedMorganFingerprint(x, radius=2, nBits=1024) for x in suppl]
fps2 = [Chem.GetMorganFingerprint(x, radius=2, useCounts= True) for x in suppl]
fps3 = [Pairs.GetAtomPairFingerprintAsIntVect(x) for x in suppl]
fps4 = np.zeros((4,1024), dtype = np.int8)
arr for i in range(0,len(suppl)):
DataStructs.ConvertToNumpyArray(fps2[i], arr[i])print(arr)
Combine it together
General purpose function to generate morgan fingerprints using BaseEstimator
and TransformerMixin
from Scikit-learn to use all the wonderful scikit-learn routines
from sklearn.base import BaseEstimator, TransformerMixin
from rdkit.Chem import rdFingerprintGenerator
from rdkit.Chem import AllChem
from rdkit import Chem, DataStructs
from rdkit.Chem import MACCSkeys
from rdkit.Chem.AtomPairs import Pairs, Torsions
from rdkit import DataStructs
from rdkit.Chem import rdMolDescriptors
def mol_to_smiles(mol: Chem.Mol, canonical: bool = True) -> str:
"""Generate Smiles from mol.
:param mol: the input molecule
:param canonical: whether to return the canonical Smiles or not
:return: The Smiles of the molecule (canonical by default). NAN for failed molecules."""
if mol is None:
return np.nan
try:
= Chem.MolToSmiles(mol, canonical=canonical)
smi return smi
except:
return np.nan
def smiles_to_mol(smiles: str) -> Chem.Mol:
"""Generate a RDKit Molecule from a Smiles.
:param smiles: the input string
:returns: The RDKit Molecule. If the Smiles parsing failed, NAN is returned instead.
"""
try:
= Chem.MolFromSmiles(smiles)
mol if mol is not None:
return mol
return np.nan
except:
return np.nan
def transform_to_mol(smiles: list, smiles_column : str = None, to_binary : bool = False) -> Chem.Mol:
"""
Converts a list of smiles to RDKit mol object.
Provision to preserve molecules in form a binary object
Refer: https://www.rdkit.org/docs/GettingStartedInPython.html#preserving-molecules
"""
try:
if isinstance(smiles, list):
= [Chem.MolFromSmiles(x) for x in smiles]
mol_ls return mol_ls
elif isinstance(smiles, pd.DataFrame):
= smiles.copy() # To ensure I am using another instance
smiles "Mol"] = smiles.apply(Chem.MolFromSmiles)
smiles[return smiles
elif isinstance(smiles, str):
= Chem.MolFromSmiles(smiles)
mol_ls return mol_ls
else:
return np.nan
except:
return np.nan
if to_binary:
if type(smiles) == list:
= [mol.ToBinary() for mol in mol_ls] # m2 = Chem.Mol(binStr) to convert back
binary_out else:
= mol_ls.ToBinary()
binary_out
return binary_out
# Descriptor generation from RDKIT
def compute_fingerprint(smiles: str, type_fp: str = 'Morgan', sub_type: str = 'bv', radius: int = 2, num_bits: int = 2048) -> np.ndarray:
"""
Generates a fingerprint for a smiles string or mol object.
fp_type = {'Morgan', 'Morgan_sparse', 'FeatMorgan', 'FeatMorgan_sparse', 'AtomPair', 'AtomPair_sparse', 'RDkit', 'TT'}
sub_type = {'sparse', 'bv', 'counts', 'object'}
:param smiles: A smiles string for a molecule.
:param radius: The radius of the fingerprint.
:param num_bits: The number of bits to use in the fingerprint.
:param use_counts: Whether to use counts or just a bit vector for the fingerprint
:return: A 1-D numpy array containing the morgan fingerprint.
"""
if isinstance(smiles, str):
= Chem.MolFromSmiles(smiles)
mol
else:
= smiles
mol
if type_fp == 'Morgan':
if sub_type == 'sparse' or sub_type == 'object':
= rdMolDescriptors.GetMorganFingerprint(mol, radius) # Counts by default
fp_vect if sub_type == 'counts':
= rdMolDescriptors.GetHashedMorganFingerprint(mol, radius, nBits=num_bits)
fp_vect if sub_type == 'bv':
= rdMolDescriptors.GetMorganFingerprintAsBitVect(mol, radius, nBits=num_bits)
fp_vect
if type_fp == 'FeatMorgan':
if sub_type == 'sparse' or sub_type == 'object':
= rdMolDescriptors.GetMorganFingerprint(mol, radius, useFeatures=True)
fp_vect if sub_type == 'counts':
= rdMolDescriptors.GetHashedMorganFingerprint(mol, radius, nBits=num_bits, useFeatures=True)
fp_vect if sub_type == 'bv':
= rdMolDescriptors.GetMorganFingerprintAsBitVect(mol, radius, nBits=num_bits, useFeatures=True)
fp_vect
if type_fp == 'AtomPair':
if sub_type == 'sparse' or sub_type == 'object':
= rdMolDescriptors.GetAtomPairFingerprint(mol)
fp_vect if sub_type == 'counts':
= rdMolDescriptors.GetHashedAtomPairFingerprint(mol, nBits=num_bits)
fp_vect if sub_type == 'bv':
= rdMolDescriptors.GetHashedAtomPairFingerprintAsBitVect(mol, nBits=num_bits)
fp_vect
if type_fp == 'MACCS':
= MACCSkeys.GenMACCSKeys(mol)
fp_vect
if type_fp == 'Rdkit':
= Chem.RDKFingerprint(mol)
fp_vect
if type_fp == 'TT':
if sub_type == 'counts':
= rdMolDescriptors.GetHashedTopologicalTorsionFingerprint(mol, nBits=num_bits) #2048 by default
fp_vect if sub_type == 'bv':
= rdMolDescriptors.GetHashedTopologicalTorsionFingerprintAsBitVect(mol, nBits=num_bits)
fp_vect
# Compute the RDKit2D fingerprint (200 topological properties) using the descriptastorus library
# Adopted from https://github.com/rinikerlab/GHOST/blob/main/notebooks/library_example.ipynb
if type_fp == 'Rdkit2D' or type_fp == 'Rdkit2D_norm':
try:
from descriptastorus.descriptors import rdDescriptors, rdNormalizedDescriptors
= rdNormalizedDescriptors.RDKit2DNormalized() if type_fp == 'Rdkit2D_norm' else rdDescriptors.RDKit2D()
generator
# process needs input as smiles
assert isinstance(smiles, str)
= smiles
smi = generator.process(smi)
data
if data[0] == True:
0)
data.pop(if data[0] == False:
0)
data.pop(
= np.float32(data)
data = np.finfo('float32').max
data[np.isposinf(data)] = np.finfo('float32').min
data[np.isneginf(data)] = np.nan_to_num(data)
fp_vect
except:
print('Failed to process the smiles using RDKit 2D features.')
= np.zeros(shape=(200))
fp_vect
if sub_type == 'object':
return fp_vect
elif sub_type == 'sparse':
try:
= fp_vect.GetNonzeroElements() , fp_vect.GetLength()
key_dict, len_arry = np.zeros(len(list(key_dict.keys())))
sparse_row = np.array(list(key_dict.keys()))
sparse_col = np.array(list(key_dict.values()), dtype=np.int16)
sparse_data = sp.sparse.csr_matrix((sparse_data, (sparse_row, sparse_col)), shape=(1, len_arry))
fp assert isinstance(fp, sp.sparse.csr_matrix)
except:
= fp_vect.GetNumOnBits(), fp_vect.GetNumBits()
key_dict, len_arry = fp
key_dict, len_arry = np.zeros(len(list(key_dict.keys())))
sparse_row = np.array(list(key_dict.keys()))
sparse_col = np.array(list(key_dict.values()), dtype=np.int16)
sparse_data = sp.sparse.csr_matrix((sparse_data, (sparse_row, sparse_col)), shape=(1, len_arry))
fp assert isinstance(fp, sp.sparse.csr_matrix)
else:
try:
= np.zeros((0,), dtype=np.int16)
fp
DataStructs.ConvertToNumpyArray(fp_vect, fp)except:
= fp_vect
fp return fp
## Have a dataframe output like Logan Ward does?
class FingerprintTransformer(BaseEstimator):
"""Class that converts SMILES strings to fingerprint vectors"""
def __init__(self, smiles_column: str = 'SMILES', type_fp: str = 'Morgan', sub_type: str = 'bv', radius: int = 2, num_bits: int = 1024):
self.smiles_column = smiles_column
self.type_fp = type_fp
self.sub_type = sub_type
self.num_bits = num_bits
self.radius = radius
def fit(self, X, y=None):
return self # Do need to do anything
def transform(self, X, y=None):
"""
Compute the fingerprints
:param X: List of SMILES strings
:return: Array of fingerprints
"""
if isinstance(X, list):
= transform_to_mol(X) if isinstance(X[0], Chem.Mol) else X
X_input = [ compute_fingerprint(m, type_fp = self.type_fp, sub_type = self.sub_type, radius = self.radius, num_bits = self.num_bits) for m in X_input ]
fing return np.vstack(fing)
elif isinstance(X, pd.DataFrame):
= X.copy() # To ensure I am changing a separate array instance
X = list(X[self.smiles_column])
X_mol_list = []
fing
for i, entry in enumerate(X_mol_list):
= self.type_fp, sub_type = self.sub_type, radius = self.radius, num_bits = self.num_bits) )
fing.append( compute_fingerprint(entry, type_fp
if self.type_fp == 'Rdkit2D' or self.type_fp == 'Rdkit2D_norm':
'{0}'.format(self.type_fp)] = fing
X[else:
'{0}-{1}'.format(self.type_fp, self.sub_type)] = fing
X[return X
else: #If it is a single entry
= X if isinstance(X, str) else transform_to_mol(X)
X_mol = compute_fingerprint(X_mol, type_fp = self.type_fp, sub_type = self.sub_type, radius = self.radius, num_bits = self.num_bits)
fing return fing
= FingerprintTransformer(num_bits=1024,
fp_transform =3,\
radius= 'SMILES',\
smiles_column = 'counts',\
sub_type = 'Morgan') type_fp
fp_transform.get_params()
{'num_bits': 1024,
'radius': 3,
'smiles_column': 'SMILES',
'sub_type': 'counts',
'type_fp': 'Morgan'}
= fp_transform.transform(x) x_fp
3) x_fp.sample(
SMILES | logP | ROMol | Morgan-counts | |
---|---|---|---|---|
846 | O=C1CNC(=O)N1 | -0.40 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | |
159 | Cc1ccc(Cl)cc1 | -3.08 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | |
502 | Cn2c(=O)on(c1ccc(Cl)c(Cl)c1)c2=O | -2.82 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
Similarity
In similarity you can use the unfolded fingerprints for comparing the molecules
= AllChem.GetMorganFingerprintAsBitVect(m1,2,nBits=1024)
fp1 >>> fp1
<rdkit.DataStructs.cDataStructs.ExplicitBitVect object at 0x...>
>>> fp2 = AllChem.GetMorganFingerprintAsBitVect(m2,2,nBits=1024)
>>> DataStructs.DiceSimilarity(fp1,fp2)
Usually unfolded
fingerprints are used for comparing molecules. Also if the FCFP
, called by useFeatures
tag is used in the comparison we can get different result.
>>> m1 = Chem.MolFromSmiles('c1ccccn1')
>>> m2 = Chem.MolFromSmiles('c1ccco1')
>>> fp1 = AllChem.GetMorganFingerprint(m1,2)
>>> fp2 = AllChem.GetMorganFingerprint(m2,2)
>>> ffp1 = AllChem.GetMorganFingerprint(m1,2,useFeatures=True)
>>> ffp2 = AllChem.GetMorganFingerprint(m2,2,useFeatures=True)
>>> DataStructs.DiceSimilarity(fp1,fp2)
0.36...
>>> DataStructs.DiceSimilarity(ffp1,ffp2)
0.90...
= x['ROMol'].iloc[0] ref_mol
ref_mol
= x['ROMol'].iloc[1] compared_to_mol
compared_to_mol
From the DNA-Encoded Library paper from Google
Quantification of molecular structure similarity used Tanimoto similarity on extended-connectivity fingerprints with radius 3 (ECFP6). We use a count-based representation (to better capture differences in molecular size with repeated substructures compared to binary fingerprints) and unhashed fingerprints (to avoid hash collisions).
ECFP6-count vectors were generated with RDKIT GetMorganFingerprint() method with useCounts=True argument.Functional-Class Fingerprints (FCFP) are related to ECFP, but atoms are grouped into functional classes such as “acidic”, “basic”, “aromatic”, etc. before substructures are enumerated.
Molecules that are similar structurally but have substitutions of similar atoms will look much more similar using FCFP than ECFP.
FCFP6 counts (also with radius 3) were generated with GetMorganFingerprint() with useCounts=True and useFeatures=True arguments.
= AllChem.GetMorganFingerprint(ref_mol, radius=3, useCounts=True) ref_mol_ECFP_counts
ref_mol_ECFP_counts
<rdkit.DataStructs.cDataStructs.UIntSparseIntVect at 0x2b5a9a06f3a0>
= AllChem.GetMorganFingerprint(compared_to_mol, radius=3, useCounts=True) compared_to_mol_ECFP_counts
compared_to_mol_ECFP_counts
<rdkit.DataStructs.cDataStructs.UIntSparseIntVect at 0x2b5a9a091760>
rdkit.DataStructs.cDataStructs.TanimotoSimilarity(ref_mol_ECFP_counts, compared_to_mol_ECFP_counts)
0.1111111111111111
import itertools
import seaborn as sns
def tanimoto_matrix(slist, smiles_columns : str = None, return_array : bool = True, show_plot : bool = False):
'''
Compute pair-wise Tanimoto similarity between a list of smiles with fingerprints
By default we are looking at the radius of 3 and Morgan ECFP
'''
= slist if isinstance(slist, list) else slist[smiles_columns].to_list()
slist = [ compute_fingerprint( s, type_fp = 'Morgan', sub_type = 'object', radius = 3 ) for s in slist ]
fp
= list(
ts for x, y, in itertools.product(fp, repeat=2)
DataStructs.cDataStructs.TanimotoSimilarity(x,y)
)
= np.array(ts).reshape(len(fp), len(fp))
tanimoto_array
if show_plot:
= plt.subplots(1,1, figsize=(10,10))
fig, ax = pd.DataFrame(tanimoto_array, index=slist, columns=slist)
df round(2).corr(), cmap="YlGnBu", ax=ax)
sns.heatmap(df.
if return_array:
return tanimoto_array
10)['SMILES'].tolist(), show_plot=True) tanimoto_matrix( x.sample(
array([[1. , 0.096 , 0.03100775, 0.07079646, 0.14723926,
0.08490566, 0.10638298, 0.10852713, 0.08490566, 0.08547009],
[0.096 , 1. , 0.04 , 0.17857143, 0.17857143,
0. , 0.1744186 , 0.1 , 0. , 0.16129032],
[0.03100775, 0.04 , 1. , 0.03333333, 0.04065041,
0. , 0.27631579, 0.05 , 0.01818182, 0.04615385],
[0.07079646, 0.17857143, 0.03333333, 1. , 0.09433962,
0. , 0.1038961 , 0.14285714, 0. , 0.24444444],
[0.14723926, 0.17857143, 0.04065041, 0.09433962, 1. ,
0.04761905, 0.16153846, 0.13114754, 0.06796117, 0.10909091],
[0.08490566, 0. , 0. , 0. , 0.04761905,
1. , 0.01282051, 0.06451613, 0.1875 , 0. ],
[0.10638298, 0.1744186 , 0.27631579, 0.1038961 , 0.16153846,
0.01282051, 1. , 0.09183673, 0.01282051, 0.12345679],
[0.10852713, 0.1 , 0.05 , 0.14285714, 0.13114754,
0.06451613, 0.09183673, 1. , 0.1 , 0.1641791 ],
[0.08490566, 0. , 0.01818182, 0. , 0.06796117,
0.1875 , 0.01282051, 0.1 , 1. , 0.02040816],
[0.08547009, 0.16129032, 0.04615385, 0.24444444, 0.10909091,
0. , 0.12345679, 0.1641791 , 0.02040816, 1. ]])