The code in this notebook is inspired from:
Pattern matching
In molecules
The SMART way
SMARTS (SMiles ARbitrary Target Specification) is a lanuguage used to search, select, and match a substructure pattern in a molecule SMILE. The idea of SMARTS is reminiscent of regular expressions (regex) for texts. Following some pre-defined rules for searching, SMARTS offer a powerful way to systematically search though a large corpus of molecules for a particular chemical phenotype.
More details on the SMARTS and its 'grammar' can be found on the Daylight's official page
Few words of caution:
- SMILES represent whole molecule (graph), SMARTS identify a substructure (subgraph).
- All SMILES are valid SMARTS
- It is better to be precise with your query than be general (you dont know what it might hit if not meticulous)
For instance, the SMILES O means an aliphatic oxygen with zero charge and two hydrogens, i.e. water. In SMARTS, the same expression means any aliphatic oxygen regardless of charge, hydrogen count, etc, e.g. it will match the oxygen in water, but also those in ethanol, acetone, molecular oxygen, hydroxy and hydronium ions, etc. Specifying [OH2] limits the pattern to match only water (this is also the fully specified SMILES for water).
Symbol | Symbol name | Atomic property requirements | Default |
---|---|---|---|
* | wildcard | any atom | (no default) |
a | aromatic | aromatic | (no default) |
A | aliphatic | aliphatic | (no default) |
D<n> | degree | <n> explicit connections | exactly one |
H<n> | total-H-count | <n> attached hydrogens | exactly one1 |
h<n> | implicit-H-count | <n> implicit hydrogens | at least one |
R<n> | ring membership | in <n> SSSR rings | any ring atom |
r<n> | ring size | in smallest SSSR ring of size <n> | any ring atom2 |
v<n> | valence | total bond order <n> | exactly one2 |
X<n> | connectivity | <n> total connections | exactly one2 |
x<n> | ring connectivity | <n> total ring connections | at least one2 |
- <n> | negative charge | -<n> charge | -1 charge (-- is -2, etc) |
+<n> | positive charge | +<n> formal charge | +1 charge (++ is +2, etc) |
#n | atomic number | atomic number <n> | (no default)2 |
@ | chirality | anticlockwise | anticlockwise, default class2 |
@@ | chirality | clockwise | clockwise, default class2 |
@<c><n> | chirality | chiral class <c> chirality <n> | (nodefault) |
@<c><n>? | chiral or unspec | chirality <c><n> or unspecified | (no default) |
<n> | atomic mass | explicit atomic mass | unspecified mass |
C | aliphatic carbon atom | |
c | aromatic carbon atom | |
a | aromatic atom | |
[#6] | carbon atom | |
[Ca] | calcium atom | |
[++] | atom with a +2 charge | |
[R] | atom in any ring | |
[D3] | atom with 3 explicit bonds (implicit H's don't count) | |
[X3] | atom with 3 total bonds (includes implicit H's) | |
[v3] | atom with bond orders totaling 3 (includes implicit H's) | |
C[C@H](F)O | match chirality (H-F-O anticlockwise viewed from C) | |
C[C@?H](F)O | matches if chirality is as specified or is not specified |
cc | any pair of attached aromatic carbons |
c:c | aromatic carbons joined by an aromatic bond |
c-c | aromatic carbons joined by a single bond (e.g. biphenyl). |
O | any aliphatic oxygen |
[O;H1] | simple hydroxy oxygen |
[O;D1] | 1-connected (hydroxy or hydroxide) oxygen |
[O;D2] | 2-connected (etheric) oxygen |
[C,c] | any carbon |
F,Cl,Br,I] | the 1st four halogens. |
[N;R] | must be aliphatic nitrogen AND in a ring |
[!C;R] | ( NOTaliphatic carbon ) AND in a ring |
[n;H1] | H-pyrrole nitrogen |
[n&H1] | same as above |
[c,n&H1] | any arom carbon OR H-pyrrole nitrogen |
[c,n;H1] | (arom carbon OR arom nitrogen) and exactly one H |
*!@* | two atoms connected by a non-ringbond |
*@;!:* | two atoms connected by a non-aromatic ringbond |
[C,c]=,#[C,c] | two carbons connected by a double or triple bond |
Example Reaction SMARTS: | |||
Query: | Target: | Matches: | Comment: |
C>>C | CC>>CC | 4 | No maps, normal match. |
C>>C | [CH3:7][CH3:8]>> [CH3:7][CH3:8] | 4 | No maps in query, maps in target are ignored. |
[C:1]>>C | [CH3:7][CH3:8]>> [CH3:7][CH3:8] | 4 | Unpaired map in query ignored. |
[C:1]>>[C:1] | CC>>CC | 0 | No maps in target, hence no matches. |
[C:?1]>>[C:?1] | CC>>CC | 4 | Query says mapped as shown or not present. |
[C:1]>>[C:1] | [CH3:7][CH3:8]>>[CH3:7][CH3:8] | 2 | Matches for target 7,7 and 8,8 atom pairs. |
[C:1]>>[C:2] | [CH3:7][CH3:8]>> [CH3:7][CH3:8] | 4 | When a query class is not found on both sides of the query, it is ignored; this query does NOT say that the atoms are in different classes. |
[C:1][C:1]>>[C:1] | [CH3:7][CH3:7]>> [CH3:7][CH3:7] | 4 | Atom maps match with "or" logic. All atoms get bound to class 7. |
[C:1][C:1]>>[C:1] | [CH3:7][CH3:8]>> [CH3:7][CH3:8] | 4 | The reactant atoms are bound to classes 7 and 8. Note that having the first query atom bound to class 7 does not preclude binding the second atom. Next, the product atom can bind to classes 7 or 8. |
[C:1][C:1]>>[C:1] | [CH3:7][CH3:7]>> [CH3:7][CH3:8] | 2 | The reactants are bound to class 7. The product atom can bind to class 7 only. |
# Install requirements for the tutorial
!pip install pandas rdkit-pypi mols2grid matplotlib scikit-learn ipywidgets
import os
import pandas as pd
import numpy as np
The majority of the basic molecular functionality is found in module rdkit.Chem
import rdkit
from rdkit import Chem #This gives us most of RDkits's functionality
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole #Needed to show molecules
IPythonConsole.ipython_useSVG=True #SVG's tend to look nicer than the png counterparts
print(rdkit.__version__)
# Mute all errors except critical
Chem.WrapLogs()
lg = rdkit.RDLogger.logger()
lg.setLevel(rdkit.RDLogger.CRITICAL)
from collections import defaultdict
from rdkit.Chem.Draw import rdMolDraw2D
from IPython.display import SVG
import matplotlib.pyplot as plt
from matplotlib.pyplot import cm
# High DPI rendering for mac
%config InlineBackend.figure_format = 'retina'
# Plot matplotlib plots with white background:
%config InlineBackend.print_figure_kwargs={'facecolor' : "w"}
plot_params = {
'font.size' : 15,
'axes.titlesize' : 15,
'axes.labelsize' : 15,
'axes.labelweight' : 'bold',
'xtick.labelsize' : 12,
'ytick.labelsize' : 12,
}
plt.rcParams.update(plot_params)
diclofenac = Chem.MolFromSmiles('O=C(O)Cc1ccccc1Nc1c(Cl)cccc1Cl')
diclofenac
sub_pattern = Chem.MolFromSmarts('O=CCccN')
hit_ats = list(diclofenac.GetSubstructMatch(sub_pattern))
hit_bonds = []
for bond in sub_pattern.GetBonds():
aid1 = hit_ats[bond.GetBeginAtomIdx()]
aid2 = hit_ats[bond.GetEndAtomIdx()]
hit_bonds.append( diclofenac.GetBondBetweenAtoms(aid1, aid2).GetIdx() )
d2d = rdMolDraw2D.MolDraw2DSVG(400, 400) # or MolDraw2DCairo to get PNGs
rdMolDraw2D.PrepareAndDrawMolecule(d2d, diclofenac, highlightAtoms=hit_ats, highlightBonds=hit_bonds)
d2d.FinishDrawing()
SVG(d2d.GetDrawingText())
Defining a function to make it easy and reproducible:
def viz_substruct(main_smile, substructure_smarts):
mol_file = Chem.MolFromSmiles(main_smile)
sub_pattern = Chem.MolFromSmarts(substructure_smarts)
hit_ats = list(mol_file.GetSubstructMatch(sub_pattern)) # Returns the indices of the molecule’s atoms that match a substructure query
hit_bonds = []
for bond in sub_pattern.GetBonds():
aid1 = hit_ats[bond.GetBeginAtomIdx()]
aid2 = hit_ats[bond.GetEndAtomIdx()]
hit_bonds.append( mol_file.GetBondBetweenAtoms(aid1, aid2).GetIdx() )
d2d = rdMolDraw2D.MolDraw2DSVG(400, 400) # or MolDraw2DCairo to get PNGs
rdMolDraw2D.PrepareAndDrawMolecule(d2d, mol_file, highlightAtoms=hit_ats, highlightBonds=hit_bonds)
d2d.FinishDrawing()
return SVG(d2d.GetDrawingText())
viz_substruct('C1=NC(=NC(=C1)C)C2=CC=CC=N2','[ar]!@[ar]')
Online resource to visualize the SMARTS.
This code is inspired from Pen Taka's blogpost
SMARTS.plus is a nice web GUI to visualize different SMART queries real-time.
from rdkit import Chem
from IPython.display import Image
import requests
import urllib
from time import time
baseurl = "https://smarts.plus/smartsview/download_rest?"
def get_img(query):
url = baseurl+query
start = time()
res = requests.get(url)
_img = Image(res.content, embed=True, retina=True)
print('Time taken: {0:0.2f} secs'.format(time() - start))
return _img
im0 = get_img("smarts=[ar]!@[ar]")
im0
im1 = get_img("smarts=[CX3](=[OX1])[OX2][CX3](=[OX1])")
im1
im0 = get_img("smarts=[CH1:1][OH:2].[OH][C:3]=[O:4]>>[C:1][O:2][C:3]=[O:4]")
im0