Last update: 8th February 2023

Noteworthy blogs to follow:

Online resources

Books

Bajorath, 2011. Chemoinformatics and Computational Chemical Biology. Methods in Molecular Biology.
Heifetz, Alexander. (Ed.) (2022). “Artificial Intelligence in Drug Design.”

Best practices

Bender, Andreas, et al. “Evaluation guidelines for machine learning tools in the chemical sciences.” Nature Reviews Chemistry (2022): 1-15. Temporary SharedIt Link

Nice account outlining guidelines for evaluating different AI/ML methodologies in molecular science. The authors propose a checklist of tests and best practices to assess the practicality and importance of different methodologies, thereby providing a framework for evaluating the plethora of ML workflows being proposed in different areas of chemical science. They argue for not overlooking older non-ML methods when evaluating a ‘new’ learning-based method, and emphasize model interpretation to translate correlation into chemical causality.

Artrith, Nongnuch, et al. “Best practices in machine learning for chemistry.” Nature chemistry 13.6 (2021): 505-508.

Set of rules, considerations, and caveats to keep in mind when designing ML models for chemical science. The authors propose a checklist for evaluating ML models. While it seems intuitive at first, scanning many of the new ML papers through that lens helps identify the shortcomings of the proposed models. This checklist is especially helpful for those just entering the field.

Reviews

F. Strieth-Kalthoff, F. Sandfort, M. H. S. Segler, and F. Glorius, Machine learning the ropes: principles, applications and directions in synthetic chemistry, Chem. Soc. Rev

Pedagogical account of various machine learning techniques, models, and representation schemes from the perspective of synthetic chemistry. Covers different applications of machine learning in synthesis planning, property prediction, molecular design, and reactivity prediction.

Paper outlining good practices for interpreting QSAR (Quantitative Structure-Activity Relationship) models. Offers a good set of heuristics and comparisons in terms of model interpretability. The authors create 6 synthetic datasets of varying complexity for QSAR tasks and compare the interpretability of graph-based methods to conventional QSAR methods. Graph-based models show lower interpretation performance than conventional QSAR methods.

W. Patrick Walters & Regina Barzilay. Applications of Deep Learning in Molecule Generation and Molecular Property Prediction

Recent review summarising the state of molecular property prediction and structure generation research. In spite of exciting recent advances in modeling, there is a need to generate better (realistic) training data, assess model prediction confidence, and develop metrics to quantify molecular generation performance.

Keith, John A., et al. “Combining machine learning and computational chemistry for predictive insights into chemical systems.” Chemical reviews 121.16 (2021): 9816-9872.

In-depth account of the machine learning and computational methods used in material science and small molecules. Nice introduction to the mathematics and theory behind first-principles based methods.

Review from Aspuru-Guzik and Allen’s group discussing how ML can be leveraged for various drug formulation tasks.

Industry-focused drug discovery reviews

Overview of methods and scope of computational methods used in the drug development process.

Special Journal Issues

This issue includes contributions that demonstrate the profound impact data science techniques have had in chemistry including chemical and materials synthesis, catalyst and materials design, and overhauling the models used in traditional theoretical or computational chemistry.

Meeting notes

Chemical modalities

Blanco, Maria-Jesus, and Kevin M. Gardinier. “New chemical modalities and strategic thinking in early drug discovery.” ACS medicinal chemistry letters 11.3 (2020): 228-231.

Overview of the different chemical modalities currently used to address different disease targets. The article addresses small molecule medicinal chemists and how they can expand their outlook beyond small molecules to include other molecular entities when considering the angle of attack for different target engagement strategies. The authors offer a nice set of tools and thought processes for selecting possible drug modalities for different target classes, and the questions to ask when zeroing in on a possible mode of action.

Meta themes on optimizing small molecules

Retrospective analysis of factors influencing the oral bioavailability of drug candidates. The authors find that rotatable bond count and polar surface area (or total hydrogen bond count, the sum of donors and acceptors) are important predictors of good oral bioavailability. Compounds with ≤10 rotatable bonds and a polar surface area ≤140 Å² (or ≤12 hydrogen bond donors and acceptors) have a good chance of being orally bioavailable.
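As a rough illustration, the two criteria above can be written as a simple filter. This is a minimal sketch, assuming the descriptors have already been computed elsewhere (for example with a cheminformatics toolkit) and treating the cutoffs as inclusive; the `Descriptors` container and the function name are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed descriptors for one compound (hypothetical container)."""
    rotatable_bonds: int
    polar_surface_area: float  # in square angstroms
    hbond_donors: int
    hbond_acceptors: int


def passes_bioavailability_filter(d: Descriptors) -> bool:
    """Apply the flexibility and polarity criteria described above."""
    # Flexibility criterion: at most 10 rotatable bonds.
    flexible_ok = d.rotatable_bonds <= 10
    # Polarity criterion: PSA at most 140 A^2, or alternatively
    # at most 12 hydrogen bond donors plus acceptors.
    polarity_ok = (
        d.polar_surface_area <= 140.0
        or d.hbond_donors + d.hbond_acceptors <= 12
    )
    return flexible_ok and polarity_ok
```

A compound with 5 rotatable bonds and a PSA of 90 Å² passes, while one with 12 rotatable bonds fails regardless of its polarity.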

DeGoey, David A., et al. “Beyond the rule of 5: lessons learned from AbbVie’s drugs and compound collection: miniperspective.” Journal of Medicinal Chemistry 61.7 (2017): 2636-2651.

AB-MPS is calculated from cLogD, the number of aromatic rings (nAr), and the number of rotatable bonds (nRotB) according to the formula AB-MPS = Abs(cLogD − 3) + nAr + nRotB. The lower the AB-MPS score, the more likely the compound is to be absorbed, and a value of ≤14 is reported to predict a higher probability of oral absorption.
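The formula is simple enough to write down directly. The sketch below assumes cLogD, nAr, and nRotB are supplied by the caller; the function names and the default cutoff argument are illustrative choices rather than anything defined in the paper.

```python
def ab_mps(clogd: float, n_aromatic_rings: int, n_rotatable_bonds: int) -> float:
    """AB-MPS = |cLogD - 3| + nAr + nRotB, as stated in the text above."""
    return abs(clogd - 3.0) + n_aromatic_rings + n_rotatable_bonds


def likely_absorbed(score: float, cutoff: float = 14.0) -> bool:
    """Lower scores are better; the text reports <= 14 as favourable."""
    return score <= cutoff
```

For example, a compound with cLogD = 1.0, three aromatic rings, and eight rotatable bonds scores 2 + 3 + 8 = 13 and falls under the reported cutoff.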

Poongavanam, Vasanthanathan, Bradley C. Doak, and Jan Kihlberg. “Opportunities and guidelines for discovery of orally absorbed drugs in beyond rule of 5 space.” Current Opinion in Chemical Biology 44 (2018): 23-29.

Heuristics for oral bioavailability of molecules that violate the rule of 5. MW may reach up to approximately 1000 Da provided that TPSA increases proportionally up to 250 Å². In contrast, cLogP and HBDs must be carefully controlled at high MW. Our lack of ability to predict compound conformations and flexibility is currently a hurdle that is critical to overcome to enable further prospective design in oral bRo5 space.

Synthesis Chemistry

Catalog of recent research articles that look at synthesis chemistry from a point of view of computational workflows, how traditional synthetic chemistry methods can be combined with informatics to augment drug discovery and synthesis processes.

Curated set of substrates to quickly assess the practicality of synthetic methods with the complete capture of success and failure, that can optimize reaction conditions with a broader scope with respect to relevant applications.

Large chemical libraries

Over the past few years, several entities have emerged offering ultra-large chemical libraries that can be made on demand or purchased immediately. The existence of such services has reinvigorated the field of virtual screening and combinatorial library design. In addition, research groups have devised novel ways to navigate these libraries more efficiently and to understand the differences in the chemical space these libraries cover. Following are some of the key papers in the field.

Warr, W. (2021). Report on an NIH Workshop on Ultralarge Chemistry Databases.
Warr, Wendy A., et al. “Exploration of ultralarge compound collections for drug discovery.” Journal of Chemical Information and Modeling 62.9 (2022): 2021-2034.
SpaceCompare: calculates the overlap of large, nonenumerable combinatorial fragment spaces using topological fingerprints and the combinatorial character of these chemical spaces. Applied to Enamine’s REAL Space, WuXi’s GalaXi Space, and Otava’s CHEMriya, it shows that the overlap of the commercial make-on-demand catalogs is only in the low single-digit percent range, despite their large overall size.
PathFinder uses retrosynthetic analysis followed by combinatorial synthesis to generate novel compounds in synthetically accessible chemical space. https://pubs.acs.org/doi/10.1021/acs.jcim.9b00367

Binding free energetic calculations

Cheminformatics-focus

Catalog of recent reviews and manuscripts I have found useful when learning more about the state-of-the-art in Cheminformatics. I’ve tried to categorize them roughly based on their area of application:

Representation

Reviews

Articles

Comparative study of descriptor-based and graph-based models using public data set. Used descriptor-based models (XGBoost, RF, SVM, using ECFP) and compared them to graph-based models (GCN, GAT, AttentiveFP, MPNN). They show descriptor-based models outperform the graph-based models in terms of prediction accuracy and computational efficiency with SVM having best predictions. Graph-based methods are good for multi-task learning.

Predictive modeling

Fang, Xiaomin, et al. “Geometry-enhanced molecular representation learning for property prediction.” Nature Machine Intelligence (2022): 1-8.

Self-supervised learning using special type of GNN architecture (GeoGNN) that includes molecule geometric / spatial information. Geometry-enhanced molecular representation learning method (GEM). The model achieves SOTA performance on 14 of 15 public classification and regression datasets.

Yang, K., Swanson, K., Jin, W., Coley, C., Eiden, P., Gao, H., Guzman-Perez, A., Hopper, T., Kelley, B., Mathea, M. and Palmer, A., 2019. Analyzing learned molecular representations for property prediction. Journal of chemical information and modeling, 59(8), pp.3370-3388

Benchmark property prediction models on 19 public and 16 proprietary industrial data sets spanning a wide variety of chemical end points. Introduce a modeling framework (Chemprop) that consistently matches or outperforms models using fixed molecular descriptors as well as previous graph neural architectures on both public and proprietary data sets.

Stuyver, T. and Coley, C.W., 2021. Quantum chemistry-augmented neural networks for reactivity prediction: Performance, generalizability and interpretability. arXiv preprint arXiv:2107.10402

Combine structure-based (graph network) and descriptor-based (QM-derived) features to predict activation energies (E2/SN2 barrier heights) and regioselectivity. Incorporating QM descriptors alongside structure leads to better overall accuracy and generalizability, even in low-data regimes. Atom- and bond-level features are derived using QM and used to build the model with a smaller dataset.

QSAR benchmarks

Exposing the Limitations of Molecular Machine Learning with Activity Cliffs

Account on how to treat and analyze activity cliffs in context of developing a predictive model. The authors outline best practices to probe activity cliffs. They show, using 24 DL and ML models and 30 targets, ML approaches based on molecular descriptors outperformed more complex deep learning methods. Activity cliff pairs were defined on similarity of the molecule SMILES and the bioactivity difference. Compared to most traditional machine learning approaches, deep neural networks seem to fall short at picking up subtle structural differences (and the corresponding property change) that give rise to activity cliffs.
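The pair definition above (high structural similarity combined with a large bioactivity difference) can be sketched as follows. This is only an illustration: it assumes each compound is already described by a set of structural features (for example, the on-bits of a fingerprint) and a positive potency value, and the 0.9 similarity and 10-fold thresholds are placeholder assumptions rather than values quoted in this summary.

```python
from itertools import combinations


def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity between two feature sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def find_activity_cliffs(compounds, min_similarity=0.9, min_fold_change=10.0):
    """Return name pairs that are similar in structure but differ in potency.

    `compounds` maps a name to (feature_set, potency), with potency > 0
    and expressed in the same units for all compounds.
    """
    cliffs = []
    for (name_a, (feat_a, pot_a)), (name_b, (feat_b, pot_b)) in combinations(
        compounds.items(), 2
    ):
        fold_change = max(pot_a, pot_b) / min(pot_a, pot_b)
        if tanimoto(feat_a, feat_b) >= min_similarity and fold_change >= min_fold_change:
            cliffs.append((name_a, name_b))
    return cliffs
```

Pairs flagged this way can be held out or tracked separately when evaluating a model, so that performance on cliffs is reported alongside overall performance.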

Enumeration of chemical space

Subbaiah, Murugaiah AM, and Nicholas A. Meanwell. “Bioisosteres of the phenyl ring: Recent strategic applications in lead optimization and drug design.” Journal of Medicinal Chemistry 64.19 (2021): 14046-14128.

Looks at bioisosteric replacements for phenyl rings in the lead optimization phase. Phenyl rings can improve potency but often come with poor solubility and high lipophilicity. The authors show that bioisosteres can be used to improve these properties.

Ertl, Peter. “Magic Rings: Navigation in the Ring Chemical Space Guided by the Bioactive Rings.” Journal of Chemical Information and Modeling (2021).

Analyzes the nature of rings that appear in bioactive compounds. Ring systems are systematically extracted from one billion molecules and analyzed to discover any structure or correlation between bioactivity and ring type. No simple set of structural descriptors separating active and inactive rings could be identified; the separation is best described by a neural network model that takes into account a complex combination of many substructure features.

Bellmann, Louis, et al. “Comparison of Combinatorial Fragment Spaces and Its Application to Ultralarge Make-on-Demand Compound Catalogs.” Journal of Chemical Information and Modeling (2022).

The authors propose an algorithmic approach called SpaceCompare to calculate the overlap and diversity of ultra-large combinatorial chemical libraries. The tool uses topological fragment spaces to capture the subtleties of reactions that give the same product from different reactant substructures.

Zabolotna, Yuliana, et al. “NP navigator: a new look at the natural product chemical space.” Molecular informatics 40.9 (2021): 2100068.

Organizes the chemical space of ChEMBL and ZINC to compare its overlap with natural products from COCONUT. Generative Topographic Mapping is used for the clustering and analysis. A helpful overview of the method and its application to drug discovery can be found here

Explainable/Interpretable Machine Learning

Reviews/Perspectives

Rodríguez-Pérez, Raquel, and Jürgen Bajorath. “Explainable Machine Learning for Property Predictions in Compound Optimization.” Journal of medicinal chemistry 64.24 (2021): 17744-17752

Articles

Uncertainty quantification

Benchmark different models and uncertainty metrics for molecular property prediction.

Evidential Deep learning for guided molecular property prediction and discovery. Ava Soleimany, Connor Coley, et al. Slides

Train network to output the parameters of an evidential distribution. One forward-pass to find the uncertainty as opposed to dropout or ensemble - principled incorporation of uncertainties
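To make the single-pass idea concrete, here is a minimal sketch of how uncertainties can be read off the predicted parameters. It assumes the common regression formulation in which the network outputs the four parameters (gamma, nu, alpha, beta) of a Normal-Inverse-Gamma distribution; that choice of distribution and the function name are assumptions on my part, since the note above does not spell them out.

```python
def evidential_summary(gamma: float, nu: float, alpha: float, beta: float):
    """Turn Normal-Inverse-Gamma parameters into a prediction and uncertainties.

    Returns (prediction, aleatoric_variance, epistemic_variance).
    Assumes nu > 0, alpha > 1, and beta > 0 so that both variances exist.
    """
    if nu <= 0 or alpha <= 1 or beta <= 0:
        raise ValueError("need nu > 0, alpha > 1, beta > 0")
    prediction = gamma                       # expected mean
    aleatoric = beta / (alpha - 1.0)         # expected data noise, E[sigma^2]
    epistemic = beta / (nu * (alpha - 1.0))  # variance of the mean, Var[mu]
    return prediction, aleatoric, epistemic
```

No sampling is involved: one set of outputs gives both the prediction and its uncertainty, which is the contrast with dropout or ensembles drawn above.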

Conduct a global multi-objective optimization with expected improvement criterion. Find transition metal complex redox couples for Redox flow batteries that address stability, solubility, and redox potential metric. Use distance of a point from a training data in latent space as a metric to quantify uncertainty.

J. P. Janet, C. Duan, T. Yang, A. Nandy, H. J. Kulik, Chem. Sci. 2019, 10, 7913–7922

Distance from available data in NN latent space is used as a variable for low-cost, quantitative uncertainty metric that works for both inorganic and organic chemistry. Introduce a technique to calibrate latent distances enabling conversion of distance-based metric to error estimates in units of predicted property
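A distance-based uncertainty score of this kind can be sketched in a few lines. The version below is an assumption-laden illustration: it takes latent vectors as plain sequences of floats, uses Euclidean distance, and averages over the k nearest training points; the choice of k and the function name are hypothetical, and the calibration step that converts distances to error estimates is not shown.

```python
import math


def latent_distance_uncertainty(query, training_latents, k=3):
    """Mean Euclidean distance from `query` to its k nearest training points.

    Larger values mean the query lies further from the training data in
    latent space, so the prediction should be trusted less.
    """
    if not training_latents:
        raise ValueError("training_latents must not be empty")
    distances = sorted(math.dist(query, t) for t in training_latents)
    nearest = distances[:k]
    return sum(nearest) / len(nearest)
```

Points close to the training data get a score near zero, while points far from every training example get a large score and can be flagged before their predictions are used.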

Active Learning

Active learning provides strategies for efficient screening of subsets of the library. In many cases, we can identify a large portion of the most promising molecules with a fraction of the compute cost.
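The loop below is a toy sketch of that idea, not a method from any of the papers listed here. It assumes higher scores are better, that `oracle` is the expensive scoring step (docking, an assay, a QM calculation), and that `featurize` maps a molecule to a single number; the surrogate is a deliberately simple nearest-neighbour lookup and the acquisition is purely greedy. All names are hypothetical.

```python
import random


def greedy_active_screen(library, oracle, featurize, rounds=3, batch_size=2, seed=0):
    """Score only part of `library`, choosing each batch with a cheap surrogate."""
    rng = random.Random(seed)
    pool = list(library)
    rng.shuffle(pool)
    # Seed the loop with a small random batch scored by the expensive oracle.
    scored = {mol: oracle(mol) for mol in pool[:batch_size]}

    def predict(mol):
        # Surrogate: reuse the score of the closest already-scored molecule.
        nearest = min(scored, key=lambda s: abs(featurize(s) - featurize(mol)))
        return scored[nearest]

    for _ in range(rounds):
        unscored = [mol for mol in pool if mol not in scored]
        if not unscored:
            break
        # Greedy acquisition: pick the molecules the surrogate ranks highest.
        batch = sorted(unscored, key=predict, reverse=True)[:batch_size]
        for mol in batch:
            scored[mol] = oracle(mol)
    return scored
```

With the defaults, only 8 oracle calls are made regardless of library size. A practical setup would swap in a real model and an acquisition function that also accounts for uncertainty.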

Train a property prediction model to output, in a single pass, distribution statistics that describe the uncertainty. This is in contrast to sampling-based approaches such as ensembles or MC dropout. Interesting way to estimate the epistemic (model-derived) uncertainty in the prediction. The approach is applied to the antibiotic search problem of Stokes et al., and Chemprop and SchNet models are compared on different tasks.

Transfer Learning

Reviews

Cai, Chenjing, et al. “Transfer learning for drug discovery.” Journal of Medicinal Chemistry 63.16 (2020): 8683-8694.

Articles

Approaching coupled cluster accuracy with a general-purpose neural network potential through transfer learning

Transfer learning by training a network to DFT data and then retrain on a dataset of gold standard QM calculations (CCSD(T)/CBS) that optimally spans chemical space. The resulting potential is broadly applicable to materials science, biology, and chemistry, and billions of times faster than CCSD(T)/CBS calculations.

Improving the generative performance of chemical autoencoders through transfer learning

Meta Learning

Altae-Tran, H., Ramsundar, B., Pappu, A. S., & Pande, V. (2017). Low data drug discovery with one-shot learning. ACS central science, 3(4), 283-293.

Authors demonstrate how one-shot learning can be used to significantly lower the amount of data required to make predictions in drug discovery tasks. LSTM combined with GCNNs is shown to improve learning capabilities of the model. Molecules are embedded as continuous vectors; in the simplest one-shot learning formalism these vectors are then fed into a simple nearest-neighbor classifier that labels new examples by a distance-weighted combination of support set labels.
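The nearest-neighbor step can be illustrated with a short sketch. It assumes the molecule embeddings are already available as plain vectors, uses cosine similarity as the closeness measure, and turns similarities into weights with a softmax; these specific choices and the function names are assumptions for illustration, not a reproduction of the paper's model.

```python
import math


def cosine(u, v) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def predict_active_probability(query, support_vectors, support_labels) -> float:
    """Similarity-weighted combination of support labels (0 = inactive, 1 = active)."""
    sims = [cosine(query, s) for s in support_vectors]
    peak = max(sims)
    # Softmax over similarities; subtracting the peak keeps exp() stable.
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    return sum(e / total * y for e, y in zip(exps, support_labels))
```

With one active and one inactive support molecule, a query embedded close to the active example gets a score above 0.5, and a query equally similar to both gets exactly 0.5.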

Nguyen, C. Q., Kreatsoulas, C., & Branson, K. M. (2020). Meta-learning GNN initializations for low-resource molecular property prediction. arXiv preprint arXiv:2003.05996.

Use the ChEMBL dataset to train a gated graph neural network (GGNN) for prediction and classification tasks using meta-learning protocols. Show appreciable model performance even with just approx. 256 datapoints.

Federated Learning

Consortia comprising leading research labs and companies working on decentralized datasets and predictive modeling of biochemical and cellular activity.

Generative models

Reviews

Correspondence on assessing the impact of AI on medicinal chemistry. It is a well-written account of the practical implications of generative design for pharmaceutical research. The authors outline two recent cases of ‘success’ of AI generative design in drug discovery, give more context, and propose best practices for furthering the development of algorithms and drug discovery pipelines.

Very nice review of different atom-based, reaction-based, and fragment-based generative design workflows proposed by the community.

Benchmarks

Flam-Shepherd, Daniel, Kevin Zhu, and Alán Aspuru-Guzik. “Keeping it Simple: Language Models can learn Complex Molecular Distributions.” arXiv preprint arXiv:2112.03041 (2021). Nature Comms Link

Tests SOTA language models and representations against graph-based methods (CGVAE, JTVAE) on ‘challenging’ generative modeling tasks, such as generating a molecule-property distribution as a function of synthetic feasibility. Graph models faced challenges in generating large molecules (> 100 HAs); SELFIES provided an advantage here. All of the models seem to generate novel molecules; how practical each of these novel molecules is remains an open question.

MOSES - Benchmarking platform for generative models.

Propose a platform to deploy and compare state-of-the-art generative models for exploring molecular space on same dataset. In addition the authors also propose list of metrics to evaluate the quality and diversity of the generated structures.

GuacaMol: Benchmarking models for De Novo Molecular Design. Blogpost

Evaluation framework from BenevolentAI to compare different de-novo design models.

J. Zhang, R. Mercado, O. Engkvist, and H. Chen, “Comparative Study of Deep Generative Models on Chemical Space Coverage,” J. Chem. Inf. Model., vol. 61, no. 6, pp. 2572–2581, Jun. 2021.

Interesting analysis from a team at AstraZeneca R&D. They look at the chemical space coverage achieved by SOTA generative models. The paper proposes a metric for evaluating space coverage, and thereby comparing different SOTA models, using a reference dataset (GDB-13 in this case). The new metric computes how much of the GDB-13 dataset can be recovered by a model that is trained on a small GDB subset. Generative models were trained on the same 1M data points, and 1B molecules were then sampled from each model. At most 39% of the molecules in the parent dataset were sampled / generated by a model. Most models sampled the same compounds at least twice. Graph-based models sampled much more diverse molecules than string-based methods. In addition, the coverage of GAN-based models was worse than that of language and graph models.
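The coverage idea reduces to a set comparison once molecules are written in a canonical form. The sketch below assumes both the sampled molecules and the reference set are given as canonical SMILES strings, so that string equality means molecule equality; the function names are hypothetical and the paper's exact bookkeeping may differ.

```python
from collections import Counter


def coverage(sampled_smiles, reference_smiles) -> float:
    """Fraction of the reference set that appears among the sampled molecules."""
    reference = set(reference_smiles)
    if not reference:
        raise ValueError("reference set must not be empty")
    recovered = set(sampled_smiles) & reference
    return len(recovered) / len(reference)


def fraction_sampled_repeatedly(sampled_smiles) -> float:
    """Fraction of unique sampled molecules that were generated at least twice."""
    counts = Counter(sampled_smiles)
    if not counts:
        return 0.0
    repeated = sum(1 for c in counts.values() if c >= 2)
    return repeated / len(counts)
```

For instance, sampling `["C", "C", "CC", "CN"]` against a reference of `["C", "CC", "CCC", "CCO"]` recovers two of four reference molecules, and one of the three unique samples was drawn more than once.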

Gao, W.; Coley, C. W. The Synthesizability of Molecules Proposed by Generative Models. J. Chem. Inf. Model. 2020

This paper looks at different ways of integrating synthesizability criteria into generative models.

Comparative analysis of graph traversal schemes for GraphINVENT

Benchmark work from the AstraZeneca/MIT AI team documenting different graph architecture schemes and algorithms for generative models.

Language models:

R. Gómez-Bombarelli et al., “Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules,” ACS Cent. Sci., vol. 4, no. 2, pp. 268–276, 2018

One of the first implementations of a variational autoencoder for molecule generation

Representation using SELFIES proposed to make it much more powerful

SMILES-based language model that generates molecules from scaffolds and can be trained from any arbitrary molecular set. Uses randomized SMILES to improve final prediction validity.

Iovanac, Nicolae C., Robert MacKnight, and Brett Savoie. “Actively Searching: Inverse Design of Novel Molecules with Simultaneously Optimized Properties.” ChemRxiv (2021)

Using quantum chemistry attributes calculated on-the-fly as scoring functions for sampling the generative model chemical space. Active learning strategy is deployed to explore the area of space where the properties of the molecules are unknown.

Graph-based

Flam-Shepherd, Daniel, Alexander Zhigalin, and Alán Aspuru-Guzik. “Scalable Fragment-Based 3D Molecular Design with Reinforcement Learning.” arXiv preprint arXiv:2202.00658 (2022)

Reinforcement learning-based generative model which is an update on the point cloud approach by the same group, now incorporating a ‘grammar’ for building molecules in the form of functional groups in 3D space.

W. Jin, R. Barzilay, and T. Jaakkola, “Junction tree variational autoencoder for molecular graph generation,” 35th Int. Conf. Mach. Learn. ICML 2018, vol. 5, pp. 3632–3648, 2018

Junction tree based decoding. Defines a grammar for small molecules and finds sub-units based on that grammar to construct a molecule. The molecule is generated in two steps: first the scaffold or backbone of the molecule is generated, then nodes are added with molecular substructures identified from the ‘molecular grammar’.

MPGVAE: Message passing graph networks for molecular generation, Daniel Flam-Shepherd et al 2021 Mach. Learn.: Sci. Technol.

Introduce a graph generation model by building Message Passing Neural Networks (MPNNs) into the encoder and decoder of a VAE (MPGVAE).

ConfVAE: End-to-end framework for molecular conformation generation via bilevel programming

Algorithm to predict 3D conformers from molecular graphs.

GraphINVENT: R. Mercado, T. Rastemo, E. Lindelöf, G. Klambauer and O. Engkvist, “Graph networks for molecular design,” Mach. Learn. Sci. Technol., vol. 2, no. 2, p. 25023, 2021. Github. Blogpost

GraphINVENT uses a tiered deep neural network architecture to probabilistically generate new molecules a single bond at a time.

RL-GraphINVENT: Reinforcement learning-based variant of the above code.

GANs

MolGAN: An implicit generative model for small molecular graphs, N. De Cao and T. Kipf, 2018

Generative adversarial network for finding small molecules using graph networks, quite interesting. Avoids issues arising from node ordering that are associated with likelihood based methods by using an adversarial loss instead (GAN)

LatentGAN: A de novo molecular generation method using latent vector based generative adversarial network

Describes a molecular generation strategy that combines an autoencoder and a GAN. The generator and discriminator networks do not use SMILES strings as input, but instead n-dimensional vectors derived from the code layer of an autoencoder trained as a SMILES heteroencoder. That way, syntax issues are expected to be addressed.

Scaffold-retained

Team at Novartis and Microsoft propose MoLeR, a graph-based model that generates molecules using a scaffold as a seed. A speed-up of scaffold-based SAR is shown.

Reaction transformation-based

Here the idea is to constrain the generated molecules to transformations amenable to a particular platform, like an automated synthesis workflow.

Authors propose a generative model to generate molecules via multi-step chemical reaction trees. Each campaign first generates a reaction tree with template transformations as breaking points.

Bradshaw, John, et al. “A model to search for synthesizable molecules.” Advances in Neural Information Processing Systems 32 (2019).

3D conformations-aware

Bolcato, Giovanni, Esther Heid, and Jonas Boström. “On the Value of Using 3D Shape and Electrostatic Similarities in Deep Generative Methods.” Journal of chemical information and modeling 62.6 (2022): 1388-1398.

Extension to the fragment-based generative design model (DeepFMPO) using reinforcement learning, now incorporating 3D electrostatic similarity in the analysis. This gives the ability to replace a fragment with one of similar 3D shape and electrostatics. ESP_sim tutorial for comparison of electrostatic potential and molecule shape is used for this purpose. The authors find scaffold-hopping bioisosteres for CDK2.

Imrie, Fergus, et al. “Deep generative design with 3D pharmacophoric constraints.” Chemical science 12.43 (2021): 14577-14589.

Method that combines GNNs with CNNs to incorporate 3D pharmacophoric constraints into molecular generation.

Imrie, Fergus, et al. “Deep generative models for 3D linker design.” Journal of chemical information and modeling 60.4 (2020): 1983-1995.

Interesting work on designing linkers using conformation aware generative design algorithm. Think of it like fragment-growing.

Protein-ligand interactions aware

Zhang, Jie, and Hongming Chen. “De novo molecule design using molecular generative models constrained by ligand–protein interactions.” Journal of Chemical Information and Modeling 62.14 (2022): 3291-3306.

Linker design

Computer Aided Synthesis Planning (CASP)

Reviews:

Thakkar, Amol, et al. “Artificial intelligence and automation in computer aided synthesis planning.” Reaction chemistry & engineering 6.1 (2021): 27-51.

Perspective on the current SOTA of synthesis planning, automation, and reaction optimization in drug discovery and development phases using AI and ML.

Perspective on ML for organic chemistry reactivity prediction. Group uses DFT-derived physical features of the reaction molecules and conditions for representation. Small data set plus HTE experimentation dataset for yield estimation.

The Exploration of Chemical Reaction Networks

Perspective article summarising the authors’ position on the current state of research and future considerations for developing better reaction network models. They break down the analysis of reaction networks into 3 classes: (1) Front Open End: exploration of products from reactants; (2) Backward Open Start: the product is known and potential reactants are explored; (3) Start to End: product and reactant are known and the likely intermediates are explored.

Nice summary of potential challenges in the field:

Validating exploration algorithms on a consistent set of reaction systems.
Need to generate a comparative metric to benchmark different algorithms.
Considering the effect of solvents and/or protein embeddings in the analysis.

Previous review article by same group: Exploration of Reaction Pathways and Chemical Transformation Networks

Technical details of various algorithms being implemented for reaction mechanism discovery at the time of writing the review.

Best practices

Gimadiev, T. R., Lin, A., Afonina, V. A., Batyrshin, D., Nugmanov, R. I., Akhmetshin, T., … & Varnek, A. (2021). Reaction Data Curation I: Chemical Structures and Transformations Standardization. Molecular Informatics, 2100119.

Article from Varnek group on best practices on processing data for reaction informatics.

Benchmarking

Genheden S, Bjerrum E. PaRoutes: a framework for benchmarking retrosynthesis route predictions. ChemRxiv. Cambridge: Cambridge Open Engage; 2022. Github

Benchmarking framework for comparing different multi-step retrosynthesis methods from researchers at AstraZeneca R&D. Provides 10k synthetic routes which can be used as a validation set for different methodologies, providing a platform for systematic comparison of different methods being proposed in the community.

Classifying chemical reactions:

Schneider, N., et al. (2015). “Development of a Novel Fingerprint for Chemical Reactions and Its Application to Large-Scale Reaction Classification and Similarity.” Journal of Chemical Information and Modeling 55(1): 39-53.

Uses scraped US patent data to classify chemical reactions, deploying various fingerprints and ML models for classification.

Schwaller, Philippe, et al. “Mapping the space of chemical reactions using attention-based neural networks.” Nature Machine Intelligence 3.2 (2021): 144-152. rxnfp - Github. Preprint. News Article.

Transformer-based model for reaction classification. Compared it with BERT. Besides classification, the work also formalizes the reaction fingerprint generation using the learned representations. The reaction fingerprints are visualized using TMAP.

Reaction classification using atom-mapped reactions, which are used to generate condensed reaction graphs that are passed through a GCN variant as implemented in chemprop.

Atom mapping:

Lin, A., et al. (2021). “Atom-to-atom Mapping: A Benchmarking Study of Popular Mapping Algorithms and Consensus Strategies.”

Comparative analysis of different atom-mapping schemes for generating atom-mapped reaction features. Comments on the state of the art methods and their performance on a curated reaction database.

Extraction of organic chemistry grammar from unsupervised learning of chemical reactions. RXNMapper

Data-driven atom mapping scheme which uses transformers to learn the context of the chemical reaction. Researchers at IBM trained a language model based on the Transformer architecture and used it to find reaction centers and map atoms. Shown to be robust compared to other SOTA methods.

Automatic mapping of atoms across both simple and complex chemical reactions

Predicting reaction outcomes:

Template-free prediction of organic reaction outcomes using graph convolutional neural networks

Retrosynthetic routes:

Zabolotna, Y., et al. (2021). “SynthI: A New Open-Source Tool for Synthon-Based Library Design.” Journal of Chemical Information and Modeling.

Interesting work on de novo design of molecules, wherein the molecules being created are made up from fragments that are known to exist and are available to the user. New molecules are generated based on the fragments (synthons) made available in the dataset.

Hybrid neural-symbolic approach for both retrosynthesis and reaction prediction that can be trained with large reaction sets from databases. Template extraction from known reaction datasets to classify new reaction to known reaction classes.

Fortunato, Michael E., et al. “Data augmentation and pretraining for template-based retrosynthetic prediction in computer-aided synthesis planning.” Journal of chemical information and modeling 60.7 (2020): 3398-3407.

In template-based retrosynthesis prediction, templates with few examples are excluded from training. This work discusses methods to augment the current data to account for cases where training examples are few.

Seidl, Philipp, et al. “Improving Few-and Zero-Shot Reaction Template Prediction Using Modern Hopfield Networks.” Journal of chemical information and modeling 62.9 (2022): 2111-2120.

Introduce a template-based single-step retrosynthesis model based on Modern Hopfield Networks, which learn an encoding of both molecules and reaction templates in order to predict the relevance of templates for a given molecule. The model does not consider templates as distinct categories, but can leverage structural information about the template. The retrieval approach enables generalization across templates, which makes zero-shot learning possible and improves few-shot learning. On the single-step retrosynthesis benchmark USPTO-50k, the MHN model reaches state-of-the-art top-k accuracy for k ≥ 3.
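Since results in this area are usually quoted as top-k accuracy, a short sketch of that metric may help. It assumes each prediction is a ranked list of candidate answers (template identifiers or reactant sets, best first) and that there is exactly one recorded answer per target; the function name and data layout are illustrative.

```python
def top_k_accuracy(ranked_predictions, true_answers, k: int) -> float:
    """Fraction of targets whose true answer is among the first k candidates."""
    if not true_answers:
        raise ValueError("need at least one target")
    hits = sum(
        truth in ranked[:k]
        for ranked, truth in zip(ranked_predictions, true_answers)
    )
    return hits / len(true_answers)
```

Top-1 accuracy rewards only an exact first guess, while larger k credits a model for placing the recorded answer anywhere in its shortlist, which is why a model can lead at k ≥ 3 without leading at k = 1.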

Tu, Zhengkai, and Connor W. Coley. “Permutation invariant graph-to-sequence model for template-free retrosynthesis and reaction prediction.” Journal of Chemical Information and Modeling (2021).

Graph2SMILES, a template-free retrosynthesis model to predict reaction outcomes and retrosynthesis routes. This model eliminates the need for any input-side SMILES augmentation, while achieving noticeable improvements over Transformer baselines (especially for top-1 accuracy).
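For reference, top-k accuracy as quoted in these retrosynthesis papers can be computed as below. A minimal sketch, assuming predictions and ground truth are already canonicalized SMILES strings so that exact string matching is valid; the example molecules are made up.

```python
def top_k_accuracy(ranked_predictions, ground_truth, k):
    """Fraction of targets whose true answer appears in the first k predictions."""
    hits = sum(
        1
        for preds, truth in zip(ranked_predictions, ground_truth)
        if truth in preds[:k]
    )
    return hits / len(ground_truth)

# Hypothetical ranked predictions for three targets (best guess first).
predictions = [
    ["CCO", "CO"],
    ["CC(=O)O", "CCN"],
    ["c1ccccc1", "CCC"],
]
truth = ["CCO", "CCN", "CCCC"]

top1 = top_k_accuracy(predictions, truth, k=1)
top2 = top_k_accuracy(predictions, truth, k=2)
```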

Generate reaction networks:

M. Liu et al., “Reaction Mechanism Generator v3.0: Advances in Automatic Mechanism Generation,” J. Chem. Inf. Model., May 2021

The newest version of RMG (v3) has been updated to Python 3. It adds the ability to generate heterogeneous catalysis models and to carry out uncertainty analysis via first-order sensitivity analysis. The RMG database of thermochemical and kinetic parameters has been expanded.

More and Faster: Simultaneously Improving Reaction Coverage and Computational Cost in Automated Reaction Prediction Tasks

Presents an algorithmic improvement to the reaction network prediction task through their YARP (Yet Another Reaction Program) methodology. Shown to reduce computational cost of optimization while improving the diversity of identified products and reaction pathways.

Looks at exploration of reaction space rather than compound space. Uses a SOAP kernel to represent the molecules. Atomization energies (AE) of the molecules are estimated using ML. Calculate the d(AE) for different ML-estimated AEs. Reaction energies (RE) are then estimated, and uncertainty propagation (assuming uncorrelated, constant errors) is used to estimate their errors. Uses the Rad-6-RE network of 30,000 bond-breaking reaction steps. RE prediction is not as good as AE prediction.
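The error propagation step can be illustrated with the textbook formula for uncorrelated errors. A minimal sketch, not taken from the paper: it assumes AE is defined as E(atoms) - E(molecule), so RE = sum of reactant AEs minus sum of product AEs, and that every ML-predicted AE carries the same standard error `sigma_ae`. The energies below are invented.

```python
import math

def reaction_energy(ae_reactants, ae_products):
    """RE from atomization energies, with AE = E(atoms) - E(molecule)."""
    return sum(ae_reactants) - sum(ae_products)

def reaction_energy_error(n_reactants, n_products, sigma_ae):
    """Propagated error for uncorrelated, constant per-molecule AE errors.

    Variances add, so sigma_RE = sigma_AE * sqrt(number of species).
    """
    return sigma_ae * math.sqrt(n_reactants + n_products)

# Hypothetical bond-breaking step A -> B + C, energies in eV.
ae_a, ae_b, ae_c = 10.0, 6.0, 2.5
re = reaction_energy([ae_a], [ae_b, ae_c])
re_err = reaction_energy_error(1, 2, sigma_ae=0.1)
```

Under these assumptions the RE error is always larger than the single-molecule AE error, which is consistent with the note above that RE prediction is harder than AE prediction.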

Estimate molecular synthesizability

Estimating whether a molecule is ‘synthesizable’ can be approached in two ways:

Complexity based - compare the fragments in the molecule to the known fragments in the chemical space
Full retrosynthesis based - entire route is considered for molecule generation. Reactant complexity drives route complexity.

Ertl, Peter, and Ansgar Schuffenhauer. “Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions.” Journal of cheminformatics 1.1 (2009): 1-11. RDKit implementation

Synthetic Accessibility score (SA_Score) is a popular heuristic score for quantifying synthesizability. It computes a score using a fragment-contribution approach, where rarer fragments (as judged by their abundance in a set of 1 million representative compounds from the PubChem database) are taken as an indication of lower synthesizability.
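The fragment-contribution idea can be shown with a toy score: average the log relative frequency of a molecule's fragments, so rarer fragments pull the score down. This is an illustration only and not the published SA_Score formula (which also includes a complexity penalty, omitted here). The fragment strings, counts, and the pseudo-count of 1 for unseen fragments are all assumptions.

```python
import math

def fragment_score(fragments, fragment_counts, total):
    """Mean log10 relative frequency of the fragments in one molecule.

    Fragments missing from the reference counts get a pseudo-count of 1,
    i.e. they are treated as extremely rare.
    """
    logs = [math.log10(fragment_counts.get(f, 1) / total) for f in fragments]
    return sum(logs) / len(logs)

# Hypothetical fragment counts from a reference set of 1,000,000 molecules.
counts = {"c1ccccc1": 500_000, "C(=O)N": 200_000, "C1CC1": 1_000}
total = 1_000_000

common_score = fragment_score(["c1ccccc1", "C(=O)N"], counts, total)
rare_score = fragment_score(["C1CC1", "[Se]"], counts, total)
```

A molecule made of common fragments scores higher (closer to zero) than one containing rare or unseen fragments.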

Coley, Connor W., et al. “SCScore: synthetic complexity learned from a reaction corpus.” Journal of chemical information and modeling 58.2 (2018): 252-261. DeepChem implementation

SCScore is a learned synthetic complexity score computed as a neural network model trained on reaction data from the Reaxys database. It was designed with synthesis planning in mind to operate on molecules resembling not just drug-like products but intermediates and simpler building blocks as well.

Liu, Cheng-Hao, et al. “RetroGNN: Fast Estimation of Synthesizability for Virtual Screening and De Novo Design by Learning from Slow Retrosynthesis Software.” Journal of Chemical Information and Modeling 62.10 (2022): 2293-2300.

RetroGNN is a graph neural network model that predicts the outcome of a synthesis planner for a given target molecule. Shown to perform better than SAScore. Code is yet to be released.

Data-driven chemistry modeling and reaction optimization

Review

Williams, Wendy L., et al. “The evolution of data-driven modeling in organic chemistry.” ACS central science 7.10 (2021): 1622-1637.

Articles

B. J. Shields et al., “Bayesian reaction optimization as a tool for chemical synthesis,” Nature, vol. 590, no. June 2020, p. 89, 2021. Github

Experimental design using Bayesian optimization. Looks at 3 reaction classes with multiple reaction parameters. Variables considered: ligands, bases, solvents, temperatures, concentrations. The algorithm identifies the optimal conditions and consistently arrived at 99% yields, which was possible by using an unusual ligand not known to work well (avoiding human cognitive bias).
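To see how Bayesian optimization can favour an unexpected choice, here is expected improvement (EI), one common acquisition function. A minimal sketch: the choice of EI, the candidate names, and the surrogate means and standard deviations are assumptions for illustration, not values from the paper.

```python
import math

def expected_improvement(mu, sigma, best, xi=0.0):
    """Expected improvement over `best` for a maximization problem.

    mu and sigma are the surrogate model's predicted mean and standard
    deviation for a candidate; xi is an optional exploration margin.
    """
    if sigma <= 0.0:
        return max(mu - best - xi, 0.0)
    z = (mu - best - xi) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (mu - best - xi) * cdf + sigma * pdf

# Hypothetical predicted yields (%) for two candidate conditions.
best_yield_so_far = 80.0
candidates = {
    "familiar_ligand": (82.0, 1.0),   # slightly better, well understood
    "unusual_ligand": (75.0, 15.0),   # lower mean, very uncertain
}
scores = {
    name: expected_improvement(mu, sigma, best_yield_so_far)
    for name, (mu, sigma) in candidates.items()
}
best_candidate = max(scores, key=scores.get)
```

With these numbers the uncertain candidate has the larger EI, so the optimizer would test it next even though its predicted mean is lower.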

Multi-objective optimization of catalytic reactions that employ chiral bisphosphine ligands, applied to 2 sequential reactions in the asymmetric synthesis of an API. A classification method identifies active catalysts, using a user-provided 5% yield cutoff for binary classification. Linear regression is used to model reaction selectivity. Uses a DFT-derived descriptor dataset of >550 bisphosphine ligands. Develops an interpretable chemical space mapping tool using PCA. Assesses the domain of applicability using the Euclidean distance in chemical space.
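A distance-based domain-of-applicability check can be as simple as a nearest-neighbour test. A minimal sketch, assuming ligands are already projected into a low-dimensional descriptor space; the coordinates and the cutoff are hypothetical and the paper's exact criterion may differ.

```python
import math

def nearest_distance(query, training_points):
    """Euclidean distance from the query to its closest training point."""
    return min(math.dist(query, point) for point in training_points)

def in_domain(query, training_points, cutoff):
    """True if the query lies within `cutoff` of some training point."""
    return nearest_distance(query, training_points) <= cutoff

# Hypothetical 2D principal-component coordinates of training ligands.
training_ligands = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
cutoff = 0.5  # assumed threshold

close_ligand_ok = in_domain((0.2, 0.1), training_ligands, cutoff)
far_ligand_ok = in_domain((3.0, 3.0), training_ligands, cutoff)
```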

Zhang, Ying, et al. “Descriptor-Free Design of Multicomponent Catalysts.” ACS Catalysis 12 (2022): 10562-10571.

Uses Bayesian optimization (BO) to improve the experimentally measured activity as a direct function of compositional variables, without supplying physical knowledge to the model. BO is applied to screening spinel Cr_aMn_bFe_cCo_dNi_eCu_fZn_{3–a–b–c–d–e–f}O₄ for the decomposition of nitric oxide into environmentally friendly nitrogen.

Databases

Automated chemistry workflows

Account of Eli Lilly and Company’s ASL (Automated Synthesis Lab)

DNA-encoded Libraries

Matthew Clark, et al. DNA-encoded small-molecule libraries (DEL). C&EN article on the topic

A new way of storing huge amounts of molecule-related data using DNA, made possible in part by the low cost of DNA sequencing. Each molecule in the library is attached to a DNA strand that encodes information about its synthesis recipe.

Follow up to the work with Machine Learning for hit finding.

Uses DNA-encoded libraries for the discovery of novel small-molecule protein inhibitors. Outlines a process for building an ML model using DEL data. Compares graph convolutions to random forests for classification tasks, applied to protein target binding. Graph models appeared to achieve a higher hit rate than random forests. Applies diversity, logistical, and structural filtering to search for novel candidates. First work to use GCN for hit searching.

Code / Packages:

Automates the selection of the decision threshold for imbalanced classification tasks. The method assumes that the training and test data have similar characteristics (such as the imbalance ratio).
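The general idea of threshold selection can be sketched as a scan over candidate thresholds, keeping the one that maximizes a balance-aware metric on the training predictions. This is a generic illustration, not the package's own algorithm; the choice of Cohen's kappa as the metric, the threshold grid, and the data are assumptions.

```python
def cohen_kappa(y_true, y_pred):
    """Cohen's kappa for binary labels (0/1)."""
    n = len(y_true)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    if expected == 1.0:
        return 0.0  # degenerate case: only one class present and predicted
    return (observed - expected) / (1.0 - expected)

def select_threshold(y_true, probs, thresholds):
    """Return the threshold whose hard predictions give the highest kappa."""
    def kappa_at(threshold):
        preds = [int(p >= threshold) for p in probs]
        return cohen_kappa(y_true, preds)
    return max(thresholds, key=kappa_at)

# Hypothetical imbalanced training set: 8 inactives, 2 actives.
y_train = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
p_train = [0.05, 0.1, 0.1, 0.15, 0.2, 0.2, 0.25, 0.3, 0.35, 0.45]
grid = [0.1, 0.2, 0.3, 0.35, 0.4, 0.5]

default_kappa = cohen_kappa(y_train, [int(p >= 0.5) for p in p_train])
chosen = select_threshold(y_train, p_train, grid)
```

Here the default threshold of 0.5 predicts every compound as inactive (kappa of 0), while the scan picks a lower threshold that separates the two classes. The chosen value only transfers to test data if the imbalance ratio is similar, as noted above.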

MOSES - Benchmarking platform for generative models (PyTorch Implementation). Github

Benchmarking platform to implement molecular generative models. It also provides a set of metrics to evaluate the quality and diversity of the generated molecules. A benchmark dataset (subset of ZINC) is provided for training the models.

Reinvent 2.0 - an AI tool for de novo drug design. Github

Production-ready tool for de novo design from AstraZeneca. It can be effectively applied to drug discovery projects that are striving to resolve either exploration or exploitation problems while navigating the chemical space. It is a language model with SMILES output, trained by “randomizing” the SMILES representation of the input data. Implements reinforcement learning to direct the model towards the relevant area of interest.

DeepChem aims to provide a high quality open-source toolchain that democratizes the use of deep-learning in drug discovery, materials science, quantum chemistry, and biology - from Github

ChemProp (Pytorch)

Github repository for implementing message passing neural networks for molecular property prediction as described in the paper Analyzing Learned Molecular Representations for Property Prediction by Yang et al.

Chainer-Chemistry

“Chainer Chemistry is a deep learning framework (based on Chainer) with applications in Biology and Chemistry. It supports various state-of-the-art models (especially GCNN - Graph Convolutional Neural Network) for chemical property prediction” - from their Github repo introduction

Tool to generate chemical reaction networks. Includes Arkane, a package for calculating thermodynamics from quantum mechanical calculations.

PyePAL

Active learning approach to efficiently and confidently identify the Pareto front with any regression model that can output a mean and a standard deviation.
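For readers new to the term, the Pareto front is the set of points that no other point beats in every objective. The sketch below shows only that deterministic non-dominated filter (maximizing all objectives) on made-up data; PyePAL itself additionally uses the model's predicted uncertainty, which is not reproduced here.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and better somewhere."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def pareto_front(points):
    """Return the points that are not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical candidates scored on two objectives (higher is better).
candidates = [(1, 5), (2, 4), (3, 3), (2, 2), (1, 1), (3, 1)]
front = pareto_front(candidates)
```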

rxnfp

Github repository to generate chemical reaction fingerprints from reaction SMILES.

mols2grid

Interactive chemical viewer for small molecules (RDKit wrapper)

molplotly

Brings Spotfire-like capabilities to Jupyter notebooks.

Datasets & Chemical libraries

Molecule datasets

PubChem: public sourced molecules
ChEMBL: bioactive molecules (mostly synthetic)
ZINC: collection of synthetic molecules (not all are bioactive)
QM 7/8/9: small molecules having not more than 7/8/9 heavy atoms
Papyrus
COCONUT: ~400k natural products (NP); some entries are not NP
Mcule: Used in DEL enumerations
DrugBank
QMugs

QMugs (Quantum mechanical properties of drug-like molecules) collection comprises quantum mechanical properties of more than 665 k biologically and pharmacologically relevant molecules extracted from the ChEMBL database, totaling 2M conformers.

Reaction Datasets

USPTO
Pistachio
Reaxys
Open Reaction Database

Commercial (building block) vendors

eMolecules building blocks
Enamine REAL Space
WuXi GalaXi space
Otava’s CHEMriya

Helpful utilities:

RDKit
- Get atom indices in the SMILES
- Datamol for manipulating RDKit molecules
Papers with code benchmark for QM9 energy predictions
MOSES: Molecular generation models benchmark
Therapeutics Data Commons “Therapeutics Data Commons is an open-science platform with AI/ML-ready datasets and learning tasks for therapeutics, spanning the discovery and development of safe and effective medicines. TDC also provides an ecosystem of tools, libraries, leaderboards, and community resources, including data functions, strategies for systematic model evaluation, meaning”