Cheminformatics Literature and Resources
Compendium of recent articles, resources, and blogs in the area of Cheminformatics
- Noteworthy blogs to follow:
- Online resources
- Books
- Reviews
- Industry-focused drug discovery reviews
- Special Journal Issues
- Meeting notes
- Specific areas of interest
- Code / Packages:
- Datasets & Chemical libraries
- Helpful utilities:
Last update: 11th May 2022
Noteworthy blogs to follow:
Online resources
- Pat Walters’ RSC CICAG Open Source Tools for Chemistry. Video. Github
- Andrea Volkamer, TeachOpenCADD: a teaching platform for computer-aided drug design (CADD)
- Chem LibreText collection from ACS Division of Chemical Education
Books
- Bajorath, 2011. Chemoinformatics and Computational Chemical Biology. Methods in Molecular Biology.
- Heifetz, Alexander. (Ed.) (2022). “Artificial Intelligence in Drug Design.”
Reviews
Pedagogical account of various machine learning techniques, models, and representation schemes from the perspective of synthetic chemistry. Covers different applications of machine learning in synthesis planning, property prediction, molecular design, and reactivity prediction.
- Mariia Matveieva & Pavel Polishchuk. Benchmarks for interpretation of QSAR models. Github. Patrick Walters’ blog on the paper.
Paper outlining good practices for interpreting QSAR (Quantitative Structure-Activity Relationship) models. Offers a good set of heuristics and comparisons in terms of model interpretability. The authors create six synthetic datasets of varying complexity for QSAR tasks and compare the interpretability of graph-based methods to that of conventional QSAR methods. Graph-based models show lower interpretation performance than conventional QSAR methods.
Recent review summarising the state of molecular property prediction and structure generation research. In spite of exciting recent advances in the modeling efforts, there is a need to generate better (realistic) training data, to assess model prediction confidence, and to define metrics that quantify molecular generation performance.
- Navigating through the Maze of Homogeneous Catalyst Design with Machine Learning
- Coley, C. W. Defining and Exploring Chemical Spaces. Trends in Chemistry 2020
- Applications of Deep learning in molecular generation and molecular property prediction
- Utilising Graph Machine Learning within Drug Discovery and Development
Review from the Aspuru-Guzik and Allen groups discussing how ML can be leveraged for various drug formulation tasks.
Industry-focused drug discovery reviews
Overview of methods and scope of computational methods used in the drug development process.
Special Journal Issues
- Nice collection of recent papers in Nature Communications on ML application and modeling
- Journal of Medicinal Chemistry compendium of AI in Drug discovery issue
- Accounts of Chemical Research Special Issue on advances in data-driven chemistry research
Meeting notes
- Warr, W. (2021). National Institutes of Health (NIH) Workshop on Reaction Informatics
- Warr, W. (2021). Report on an NIH Workshop on Ultralarge Chemistry Databases.
Specific areas of interest
Catalog of recent reviews and manuscripts I have found useful when learning more about the state-of-the-art in Cheminformatics. I’ve tried to categorize them roughly based on their area of application:
Representation
Reviews
Articles
Comparative study of descriptor-based and graph-based models using public data sets. Descriptor-based models (XGBoost, RF, SVM, using ECFP) are compared to graph-based models (GCN, GAT, AttentiveFP, MPNN). The authors show that descriptor-based models outperform the graph-based models in terms of prediction accuracy and computational efficiency, with SVM giving the best predictions. Graph-based methods are good for multi-task learning.
Predictive modeling
Self-supervised learning using special type of GNN architecture (GeoGNN) that includes molecule geometric / spatial information. Geometry-enhanced molecular representation learning method (GEM). The model achieves SOTA performance on 14 of 15 public classification and regression datasets.
Benchmark property prediction models on 19 public and 16 proprietary industrial data sets spanning a wide variety of chemical end points. Introduce a modeling framework (Chemprop) that consistently matches or outperforms models using fixed molecular descriptors as well as previous graph neural architectures on both public and proprietary data sets.
Combine structure-based (graph network) and descriptor-based (QM-derived) features to predict activation energies (E2/SN2 barrier height prediction) and regioselectivity. Incorporating QM and structure leads to better overall accuracy and generalizability even in low-data regions. Atom- and bond-level features are derived using QM and used for model building with a smaller dataset.
QSAR benchmarks
Enumeration of chemical space
Looks at bioisosteric replacements for phenyl rings in the lead optimization phase. Phenyl rings result in improved potency but have poor solubility and lipophilicity. The authors find that bioisosteres can be used to improve these properties.
Analyze the nature of rings which appear in bioactive compounds. Ring systems are systematically extracted from one billion molecules and analyzed to discover a structure or correlation between bioactivity and type of ring. No simple set of structural descriptors separating active and inactive rings could be identified; the separation is best described by a neural network model that takes into account a complex combination of many substructure features.
The authors propose an algorithmic approach called SpaceCompare to calculate the overlap and diversity of ultra-large combinatorial chemical libraries. The tool uses topological fragment spaces to capture the subtleties of reactions that give the same product from different reactant substructures.
Organizes the chemical space of ChEMBL and ZINC to compare its overlap with natural products from COCONUT. Generative Topographic Mapping is used for the clustering and analysis. Helpful overview of the method with its application to drug discovery can be found here.
Explainable/Interpretable Machine Learning
Reviews/Perspectives
Articles
- Matveieva, Mariia, and Pavel Polishchuk. “Benchmarks for interpretation of QSAR models.” Journal of Cheminformatics 13.1 (2021): 1-20. Patrick Walters’ blog
Uncertainty quantification
Benchmark different models and uncertainty metrics for molecular property prediction.
- Evidential Deep Learning for Guided Molecular Property Prediction and Discovery. Ava Soleimany, Connor Coley, et al. Slides
Train a network to output the parameters of an evidential distribution. The uncertainty is obtained in one forward pass, as opposed to dropout or ensembles, giving a principled incorporation of uncertainties.
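As a rough illustration of how a single forward pass can yield uncertainties, the sketch below assumes the network outputs the four parameters of a Normal-Inverse-Gamma distribution (`gamma`, `nu`, `alpha`, `beta`), as in deep evidential regression. The function name and the example values are hypothetical, and this is not the authors' code.

```python
def evidential_uncertainty(gamma, nu, alpha, beta):
    """Split predictive uncertainty given Normal-Inverse-Gamma parameters.

    gamma is the predicted mean; the formulas need nu > 0, alpha > 1, beta > 0.
    """
    if nu <= 0 or alpha <= 1 or beta <= 0:
        raise ValueError("requires nu > 0, alpha > 1 and beta > 0")
    aleatoric = beta / (alpha - 1.0)         # E[sigma^2]: noise in the data
    epistemic = beta / (nu * (alpha - 1.0))  # Var[mu]: uncertainty of the model
    return gamma, aleatoric, epistemic


# Hypothetical network outputs for one molecule.
mean, aleatoric, epistemic = evidential_uncertainty(gamma=1.5, nu=2.0, alpha=3.0, beta=4.0)
print(mean, aleatoric, epistemic)
```

No sampling or ensembling is needed: both variance terms are closed-form functions of one set of outputs.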
- Differentiable sampling of molecular geometries with uncertainty-based adversarial attacks
- J. P. Janet, S. Ramesh, C. Duan, H. J. Kulik, ACS Cent. Sci. 2020
Conduct a global multi-objective optimization with the expected improvement criterion. Find transition metal complex redox couples for redox flow batteries that address stability, solubility, and redox potential metrics. Use the distance of a point from the training data in latent space as a metric to quantify uncertainty.
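For readers unfamiliar with the expected improvement criterion, here is a minimal single-objective sketch, assuming a Gaussian prediction (mean and standard deviation) and a maximization problem. The paper itself works with multiple objectives, so this is only the basic building block, and the function name is illustrative.

```python
import math


def expected_improvement(mean, std, best_so_far):
    """Expected improvement over best_so_far for a Gaussian prediction (maximization)."""
    if std <= 0.0:
        # No predictive uncertainty: the improvement is deterministic.
        return max(mean - best_so_far, 0.0)
    z = (mean - best_so_far) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mean - best_so_far) * cdf + std * pdf


# A candidate predicted at the current best still has positive EI if it is uncertain.
ei_uncertain = expected_improvement(mean=0.0, std=1.0, best_so_far=0.0)
ei_certain = expected_improvement(mean=0.0, std=0.0, best_so_far=0.0)
```

The second term rewards uncertain candidates, which is what lets the search explore rather than only exploit the current model.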
Distance from available data in the NN latent space is used as a low-cost, quantitative uncertainty metric that works for both inorganic and organic chemistry. Introduce a technique to calibrate latent distances, enabling conversion of the distance-based metric to error estimates in units of the predicted property.
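A minimal sketch of a latent-distance uncertainty score, assuming latent vectors are already available as plain tuples and taking the mean Euclidean distance to the `k` nearest training points as the score. The choice of `k`, the averaging, and the function name are assumptions made for illustration; the calibration step that converts distances into error estimates is not reproduced here.

```python
import math


def latent_distance_uncertainty(query, train_latents, k=3):
    """Mean Euclidean distance from query to its k nearest training latent vectors."""
    if not train_latents:
        raise ValueError("need at least one training point")
    distances = sorted(math.dist(query, point) for point in train_latents)
    nearest = distances[:k]
    return sum(nearest) / len(nearest)


# Toy 2D latent space: three clustered training points and one outlier.
train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
near = latent_distance_uncertainty((0.0, 0.0), train)
far = latent_distance_uncertainty((10.0, 10.0), train)
```

A query far from all training data receives a larger score, flagging its prediction as less trustworthy.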
Active Learning
Active learning provides strategies for efficient screening of subsets of the library. In many cases, we can identify a large portion of the most promising molecules with a fraction of the compute cost.
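The loop below is a toy sketch of this idea, not any specific published method. It assumes a hypothetical library of 20 integer-indexed candidates, an `expensive_score` oracle standing in for docking or an experiment, and a crude acquisition function that takes the value of the nearest labelled neighbour and adds a distance bonus, with ties broken pessimistically.

```python
def ucb_nearest_neighbour(candidate, labelled, kappa=10.0):
    """Toy acquisition: value of the nearest labelled point plus a distance bonus.

    Ties in distance are broken pessimistically (lower known value wins).
    """
    nearest = min(labelled, key=lambda known: (abs(known - candidate), labelled[known]))
    return labelled[nearest] + kappa * abs(nearest - candidate)


def active_learning(candidates, oracle, acquisition, initial, n_rounds, batch_size):
    """Label a few candidates per round, chosen by the acquisition score."""
    labelled = {c: oracle(c) for c in initial}
    for _ in range(n_rounds):
        pool = [c for c in candidates if c not in labelled]
        if not pool:
            break
        # Rank the whole pool with the current labels, then evaluate the top batch.
        ranked = sorted(pool, key=lambda c: acquisition(c, labelled), reverse=True)
        for c in ranked[:batch_size]:
            labelled[c] = oracle(c)
    return labelled


library = list(range(20))


def expensive_score(x):
    """Hypothetical oracle; the best candidate is x = 14."""
    return -(x - 14) ** 2


results = active_learning(library, expensive_score, ucb_nearest_neighbour,
                          initial=[0, 19], n_rounds=2, batch_size=2)
best = max(results, key=results.get)
print(best, len(results))
```

In this toy run the best candidate is found after scoring 6 of the 20 library members. In practice the acquisition function would come from a trained surrogate model with a proper uncertainty estimate.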
- B. J. Shields et al., “Bayesian reaction optimization as a tool for chemical synthesis,” Nature, vol. 590, no. June 2020, p. 89, 2021. Github
Experimental design using Bayesian Optimization.
- A. P. Soleimany, A. Amini, S. Goldman, D. Rus, S. N. Bhatia, and C. W. Coley, “Evidential Deep Learning for Guided Molecular Property Prediction and Discovery,” ACS Cent. Sci., Jul. 2021. Slideshare
Train a property prediction model to output, in a single pass, the distribution statistics that describe the uncertainty. This is in contrast to sampling-based approaches such as ensembles or MC dropout. Interesting way to estimate the epistemic (model-derived) uncertainty in the prediction. The approach is applied to the antibiotic search problem of Stokes et al. Chemprop and SchNet models are compared on different tasks.
Transfer Learning
Reviews
Articles
Transfer learning by training a network on DFT data and then retraining it on a dataset of gold-standard QM calculations (CCSD(T)/CBS) that optimally spans chemical space. The resulting potential is broadly applicable to materials science, biology, and chemistry, and billions of times faster than CCSD(T)/CBS calculations.
Meta Learning
Use the ChEMBL dataset to train a gated graph neural network (GGNN) for prediction and classification tasks using meta-learning protocols. Show appreciable model performance even with only about 256 data points.
Federated Learning
Consortium comprising leading research labs and companies working on decentralized datasets and predictive modeling of biochemical and cellular activity.
Generative models
Reviews
Benchmarks
Test SOTA language models and representations against graph-based methods (CGVAE, JTVAE) on ‘challenging’ generative modeling tasks: generating a molecule-property distribution as a function of synthetic feasibility. Graph models faced challenges in generating large molecules (> 100 heavy atoms); SELFIES provided an advantage here. All of the models seem to generate novel molecules; how practical each of these novel molecules is remains an open question.
Propose a platform to deploy and compare state-of-the-art generative models for exploring molecular space on same dataset. In addition the authors also propose list of metrics to evaluate the quality and diversity of the generated structures.
Evaluation framework from BenevolentAI to compare different de-novo design models.
Interesting analysis from a team at AstraZeneca R&D. They look at the chemical space coverage achieved by SOTA generative models. They propose a metric for evaluating space coverage, and thereby comparing different SOTA models, using a reference dataset (GDB-13 in this case). The new metric computes how much of the GDB-13 dataset can be recovered by a model that is trained on a small GDB subset. Generative models were trained on the same 1M data points, and 1B molecules were then sampled from each model. At most 39% of the molecules in the parent dataset were sampled / generated by the model. Most models sampled the same compounds at least twice. The graph-based model sampled much more diverse molecules than the string-based methods. In addition, the coverage of GAN-based models was worse than that of language and graph models.
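The coverage idea can be sketched in a few lines, assuming molecules are already represented as canonical SMILES strings so that string equality means molecular identity. The function name and the toy data are hypothetical, and this is not the authors' exact metric.

```python
from collections import Counter


def coverage_stats(sampled, reference):
    """Summarize how well a sample of generated molecules covers a reference set."""
    if not sampled or not reference:
        raise ValueError("sampled and reference must be non-empty")
    counts = Counter(sampled)
    reference_set = set(reference)
    recovered = reference_set.intersection(counts)
    repeated = sum(1 for n in counts.values() if n >= 2)
    return {
        "coverage": len(recovered) / len(reference_set),  # share of reference recovered
        "uniqueness": len(counts) / len(sampled),         # share of distinct samples
        "repeated_fraction": repeated / len(counts),      # distinct molecules sampled 2+ times
    }


reference = ["C", "CC", "CCC", "CCO"]  # stand-in for the reference database
sampled = ["C", "CC", "CC", "CN", "C"]  # stand-in for model output
stats = coverage_stats(sampled, reference)
print(stats)
```

With billions of samples, the same bookkeeping would need a streaming or on-disk set, but the definitions stay the same.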
This paper looks at different ways of integrating synthesizability criteria into generative models.
Benchmark work from the AstraZeneca/MIT AI team to document different graph architecture schemes and algorithms for generative models.
Language models:
One of the first implementations of a variational autoencoder for molecule generation.
Uses the SELFIES representation, proposed to make the approach much more powerful.
- Reproducibility study of the STONED work from Jablonka et al.
- LSTM based (RNN) approaches to small molecule generation. Github
- SMILES-based deep generative scaffold decorator for de-novo drug design. Github
SMILES-based language model that generates molecules from scaffolds and can be trained from any arbitrary molecular set. Uses randomized SMILES to improve final prediction validity.
Graph-based
Reinforcement learning-based generative model which is an update on the point cloud approach by the same group, now incorporating a ‘grammar’ for building molecules in the form of functional groups in 3D space.
Junction tree based decoding. Define a grammar for the small molecule and find sub-units based on that grammar to construct a molecule. The molecule is generated in two steps: first the scaffold or backbone of the molecule is generated, then nodes are added with molecular substructures as identified from the ‘molecular grammar’.
Introduce a graph generation model by building a Message Passing Neural Network (MPNN) into the encoder and decoder of a VAE (MPGVAE).
Algorithm to predict 3D conformers from molecular graphs.
- GraphINVENT: R. Mercado, T. Rastemo, E. Lindelöf, G. Klambauer and O. Engkvist, “Graph networks for molecular design,” Mach. Learn. Sci. Technol., vol. 2, no. 2, p. 25023, 2021. Github. Blogpost
GraphINVENT uses a tiered deep neural network architecture to probabilistically generate new molecules a single bond at a time.
GANs
Generative adversarial network for finding small molecules using graph networks, quite interesting. Avoids the node-ordering issues associated with likelihood-based methods by using an adversarial loss instead (GAN).
A molecular generation strategy is described which combines an autoencoder and a GAN. The generator and discriminator networks do not use SMILES strings as input, but instead n-dimensional vectors derived from the code layer of an autoencoder trained as a SMILES heteroencoder. That way, syntax issues are expected to be addressed.
Scaffold-retained
Team at Novartis and Microsoft proposes MoLeR, a graph-based model that generates molecules using a scaffold as a seed. A speed-up of scaffold-based SAR exploration is shown.
Scoring functions
Extension to the fragment-based reinforcement learning methods for generating novel compounds. Comparison of 3D molecular fragments to aid in identifying bioactive conformations.
Using quantum chemistry attributes calculated on-the-fly as scoring functions for sampling the generative model chemical space. Active learning strategy is deployed to explore the area of space where the properties of the molecules are unknown.
Computer Aided Synthesis Planning (CASP)
Reviews:
Perspective article summarising the authors’ position on the current state of research and future considerations for developing better reaction network models. The analysis of reaction networks is broken down into three classes: (1) Forward Open End: exploration of products from reactants; (2) Backward Open Start: the product is known and potential reactants are explored; (3) Start to End: product and reactant are known, and the likely intermediates are explored.
Nice summary of potential challenges in the field:
- Validating exploration algorithms on a consistent set of reaction systems.
- Need to generate a comparative metric to benchmark different algorithms.
- Considering the effect of solvents and/or protein embeddings in the analysis.
- Previous review article by same group: Exploration of Reaction Pathways and Chemical Transformation Networks
Technical details of various algorithms being implemented for reaction mechanism discovery at the time of writing the review.
Best practices
Article from Varnek group on best practices on processing data for reaction informatics.
Benchmarking
- Genheden S, Bjerrum E. PaRoutes: a framework for benchmarking retrosynthesis route predictions. ChemRxiv. Cambridge: Cambridge Open Engage; 2022. Github
Benchmarking framework for comparing different multi-step retrosynthesis methods from researchers at AstraZeneca R&D. Provides 10k synthetic routes which can be used as a validation set for different methodologies, providing a platform for systematic comparison of different methods being proposed in the community.
Classifying chemical reactions:
Using scraped US patent data to classify chemical reactions, deploying various fingerprints and ML models for classification.
- Schwaller, Philippe, et al. “Mapping the space of chemical reactions using attention-based neural networks.” Nature Machine Intelligence 3.2 (2021): 144-152. rxnfp - Github. Preprint. News Article.
Transformer-based model for reaction classification. Compared it with BERT. Besides classification, the work also formalizes reaction fingerprint generation using the learned representations. The reaction fingerprints are visualized using TMAP.
Reaction classification using atom-mapped reactions, which are used to generate condensed reaction graphs that are passed through a GCN variant as implemented in chemprop.
Atom mapping:
Comparative analysis of different atom-mapping schemes for generating atom-mapped reaction features. Comments on the state of the art methods and their performance on a curated reaction database.
Data-driven atom mapping scheme which uses transformers to learn the context of the chemical reaction. Researchers at IBM trained a language model based on the Transformer architecture and used it to find reaction centers and map atoms. Shown to be robust compared to other SOTA methods.
Predicting reaction outcomes:
Template-free prediction of organic reaction outcomes using graph convolutional neural networks
Retrosynthetic routes:
Interesting work on de-novo design of molecules in which the molecules being created are built from fragments that are known to exist and are available to the user. New molecules are generated from the fragments (synthons) made available in the dataset.
Generating reaction networks:
The newest version of RMG (v3) is updated to Python v3. It can generate heterogeneous catalyst models and offers uncertainty analysis for conducting first-order sensitivity analysis. The RMG dataset of thermochemical and kinetic parameters has been expanded.
Presents an algorithmic improvement to the reaction network prediction task through their YARP (Yet Another Reaction Program) methodology. Shown to reduce computational cost of optimization while improving the diversity of identified products and reaction pathways.
- Molecular Transformer: A Model for Uncertainty-Calibrated Chemical Reaction Prediction
-
Automatic discovery of chemical reactions using imposed activation
- Machine learning in chemical reaction space
Looks at exploration of reaction space rather than compound space. Uses the SOAP kernel to represent the molecules and ML to estimate their atomization energies (AE). Calculate the d(AE) for different ML-estimated AEs. Reaction energies (RE) are estimated, and uncertainty propagation (uncorrelated, constant error) is used to estimate the errors. The Rad-6-RE network of 30,000 bond-breaking reaction steps is used. RE prediction is not as good as AE prediction.
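To make the uncorrelated, constant error propagation concrete, the sketch below assumes that every atomization energy carries the same independent error `sigma_ae`, and that the reaction energy is computed as the sum of reactant AEs minus the sum of product AEs. The sign convention, the function name, and the numbers are assumptions for illustration, not values from the paper.

```python
import math


def reaction_energy_with_error(reactant_aes, product_aes, sigma_ae):
    """Reaction energy from atomization energies, with propagated uncertainty.

    Assumes each AE has the same, uncorrelated error sigma_ae, so the errors
    add in quadrature: sigma_RE = sigma_ae * sqrt(number of species).
    """
    reaction_energy = sum(reactant_aes) - sum(product_aes)
    n_species = len(reactant_aes) + len(product_aes)
    sigma_re = sigma_ae * math.sqrt(n_species)
    return reaction_energy, sigma_re


# Hypothetical bond-breaking step A -> B + C (energies in arbitrary units).
re_value, re_sigma = reaction_energy_with_error([10.0], [6.0, 3.0], sigma_ae=0.1)
print(re_value, re_sigma)
```

Under these assumptions the absolute error grows with the number of species while the reaction energy itself is a small difference of large numbers, which is one plausible reading of why RE predictions look worse than AE predictions.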
Databases
DNA-encoded Libraries
New form of storing huge amounts of molecule-related data using DNA. Made partially possible by the low cost of DNA sequencing. Each molecule in the storage is attached to a DNA strand which encodes information about its recipe.
DNA encodings for discovery of novel small-molecule protein inhibitors. Outline a process for building an ML model using DEL. Compare graph convolutions to random forest for classification tasks with application to protein target binding. Graph models seemed to achieve a higher hit rate compared to random forest. Apply diversity, logistical, and structural filtering to search for novel candidates. First work to use GCN for hit searching.
Code / Packages:
Automates the selection of the decision threshold for imbalanced classification tasks. The method assumes that the training and test data have similar characteristics (such as the imbalance ratio).
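A hedged sketch of threshold selection for an imbalanced binary classifier: scan a list of candidate thresholds and keep the one that maximizes Cohen's kappa on labelled predictions. Using Cohen's kappa as the objective, the function names, and the toy data are assumptions made for this example and do not reproduce the package's API.

```python
def cohen_kappa(y_true, y_pred):
    """Cohen's kappa for binary labels (0/1)."""
    n = len(y_true)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    if expected == 1.0:
        return 0.0
    return (observed - expected) / (1.0 - expected)


def select_threshold(y_true, probabilities, thresholds):
    """Return the candidate threshold with the highest Cohen's kappa."""
    def kappa_at(threshold):
        predictions = [int(p >= threshold) for p in probabilities]
        return cohen_kappa(y_true, predictions)
    return max(thresholds, key=kappa_at)


# Toy imbalanced set: 2 actives out of 10, with predicted probabilities all below 0.5.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
probs = [0.05, 0.1, 0.1, 0.15, 0.2, 0.2, 0.25, 0.3, 0.35, 0.45]
best_threshold = select_threshold(y_true, probs, [0.1, 0.2, 0.3, 0.4, 0.5])
print(best_threshold)
```

In this toy case the default threshold of 0.5 labels every compound inactive (kappa of 0), while a shifted threshold of 0.3 recovers both actives at the cost of one false positive.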
Benchmarking platform to implement molecular generative models. It also provides a set of metrics to evaluate the quality and diversity of the generated molecules. A benchmark dataset (subset of ZINC) is provided for training the models.
Production-ready tool for de novo design from AstraZeneca. It can be effectively applied to drug discovery projects that are striving to resolve either exploration or exploitation problems while navigating chemical space. Language model with SMILES output, trained by “randomizing” the SMILES representation of the input data. Implements reinforcement learning for directing the model towards relevant areas of interest.
DeepChem aims to provide a high quality open-source toolchain that democratizes the use of deep-learning in drug discovery, materials science, quantum chemistry, and biology - from Github
Github repository for implementing message passing neural networks for molecular property prediction as described in the paper Analyzing Learned Molecular Representations for Property Prediction by Yang et al.
“Chainer Chemistry is a deep learning framework (based on Chainer) with applications in Biology and Chemistry. It supports various state-of-the-art models (especially GCNN - Graph Convolutional Neural Network) for chemical property prediction” - from their Github repo introduction
- DimeNet++ - extension of the directional message passing work (DimeNet). Github
- BondNet - Graph neural network model for predicting bond dissociation energies, considers both homolytic and heterolytic bond breaking. Github
Tool to generate chemical reaction networks. Includes Arkane, a package for calculating thermodynamics from quantum mechanical calculations.
Active learning approach to efficiently and confidently identify the Pareto front with any regression model that can output a mean and a standard deviation.
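The core object here, the Pareto front, can be sketched with a simple non-dominated filter, assuming every objective is to be maximized and working on point predictions only. The package described above additionally uses the predicted standard deviations to decide which points to sample next; that part is not shown, and the function name is illustrative.

```python
def pareto_front(points):
    """Return the non-dominated points, assuming all objectives are maximized."""
    def dominates(a, b):
        # a dominates b if it is at least as good everywhere and better somewhere.
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical predicted (property_1, property_2) pairs for six candidates.
predictions = [(1, 5), (2, 4), (3, 3), (2, 2), (1, 1), (3, 1)]
front = pareto_front(predictions)
print(front)
```

This filter is quadratic in the number of points, which is fine for a sketch but would need a smarter algorithm for large libraries.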
Github repository to generate chemical reaction fingerprints from reaction SMILES.
Interactive chemical viewer for small molecules (RDKit wrapper)
Brings Spotfire-like capabilities to Jupyter notebooks.
Datasets & Chemical libraries
Molecule datasets
- PubChem: public sourced molecules
- ChEMBL: bioactive molecules (mostly synthetic)
- ZINC: collection of synthetic molecules (not all are bioactive)
- QM 7/8/9: small molecules having not more than 7/8/9 heavy atoms
- COCONUT: ~400k natural products (NP); some entries are not NPs
- Mcule: Used in DEL enumerations
Commercial (building block) vendors
- eMolecules building blocks
- Enamine REAL Space
- WuXi GalaXi space
- Otava’s CHEMriya
Helpful utilities:
- RDKit
- Therapeutics Data Commons “Therapeutics Data Commons is an open-science platform with AI/ML-ready datasets and learning tasks for therapeutics, spanning the discovery and development of safe and effective medicines. TDC also provides an ecosystem of tools, libraries, leaderboards, and community resources, including data functions, strategies for systematic model evaluation, meaning”