Omnis tempus datum - Medicine drug discovery resources

Last update: December 2024

Noteworthy blogs to follow:

Cheminformatics

Fragment-based drug discovery

Practical Fragments

Medicinal chemistry

Computational chemistry 1. Gilles Ouvry

General field

Online resources

Books

Bajorath, 2011. Chemoinformatics and Computational Chemical Biology. Methods in Molecular Biology.
Heifetz, Alexander. (Ed.) (2022). “Artificial Intelligence in Drug Design.”

Best practices

Bender, Andreas, et al. “Evaluation guidelines for machine learning tools in the chemical sciences.” Nature Reviews Chemistry (2022): 1-15. Temporary SharedIt Link

Nice account outlining guidelines for evaluating different AI/ML methodologies in molecular science. The authors propose a checklist of tests and best practices to assess the practicality and importance of a methodology, thereby providing a framework for evaluating the plethora of ML workflows being proposed in different areas of chemical science. They make the case for not overlooking older non-ML methods when evaluating a ‘new’ learning-based method, and they emphasize model interpretation to translate correlation into chemical causality.

Artrith, Nongnuch, et al. “Best practices in machine learning for chemistry.” Nature chemistry 13.6 (2021): 505-508.

Set of rules, considerations, and caveats to keep in mind when designing an ML model for chemical science. The authors propose a checklist for evaluating ML models. It seems intuitive at first, but when many of the new ML papers are scanned through that lens, you can identify the shortcomings of the proposed models. This checklist is especially helpful for those just entering the field.

Pharma R&D Business

Overviews and Reviews

F. Strieth-Kalthoff, F. Sandfort, M. H. S. Segler, and F. Glorius, Machine learning the ropes: principles, applications and directions in synthetic chemistry, Chem. Soc. Rev

Pedagogical account of various machine learning techniques, models, and representation schemes from the perspective of synthetic chemistry. Covers different applications of machine learning in synthesis planning, property prediction, molecular design, and reactivity prediction.

Paper outlining good practices for interpreting QSAR (Quantitative Structure-Activity Relationship) models. It offers a good set of heuristics and comparisons in terms of model interpretability. The authors create 6 synthetic datasets of varying complexity for QSAR tasks and compare the interpretability of graph-based methods with that of conventional QSAR methods. In terms of interpretation performance, graph-based models score lower than conventional QSAR methods.

W. Patrick Walters & Regina Barzilay. Applications of Deep Learning in Molecule Generation and Molecular Property Prediction

Recent review summarising the state of molecular property prediction and structure generation research. In spite of exciting recent advances in modeling efforts, there is a need to generate better (realistic) training data, assess model prediction confidence, and develop metrics to quantify molecular generation performance.

Review from the Aspuru-Guzik and Allen groups discussing how ML can be leveraged for various tasks in drug formulation.

Commentary

Correspondence on assessing the impact of AI on medicinal chemistry. It is a well-written account of the practical implications of generative design for pharmaceutical research. The authors outline two recent cases of ‘success’ of AI generative design in drug discovery, give more context, and propose best practices for furthering the development of algorithms and drug discovery pipelines.

We need better benchmarking for machine learning in drug discovery

Very good post on good benchmarking practices and the lack of consistent datasets for comparing different ML algorithms.

Industry-focused drug discovery reviews

Goldman, B., Kearnes, S., Kramer, T., Riley, P., & Walters, W. P. (2022). Defining levels of automated chemical design. Journal of medicinal chemistry, 65(10), 7073-7087.

Group at Relay Therapeutics proposes a framework to categorize the automated chemical design paradigm, splitting it into generator and decision axes. They give good examples of model systems where the machine generates and a human chemist selects and, more recently, of machine-generated and machine-chosen designs. From this discussion, it is evident that we haven't yet achieved fully automated execution of a design cycle.

Hasselgren, Catrin, and Tudor I. Oprea. “Artificial Intelligence for Drug Discovery: Are We There Yet?.” Annual Review of Pharmacology and Toxicology 64 (2024). ArXiv

Latest review of the field and of the application of AI technologies to various functions of drug discovery. In addition to providing a quick review of the main technologies, the authors list different case studies where AI has proposed clinical candidates across different therapeutic areas. Yet the need for better data, novelty estimation, and wet-lab validation limits the full-scale deployment and accuracy of AI pipelines in drug discovery. In closing they also hint at a limitation of QSAR-style ML models, which are trained to map a single structure to a single property while in reality the molecule can exist as different conformers, something multi-instance learning (MIL) can address. ‘It is reasonable to assume that user expertise, bias, and time constraints play a significant role in early drug discovery, often more so than AI.’

Overview of methods and scope of computational methods used in the drug development process.

Special Journal Issues

Data Science Meets Chemistry

This issue includes contributions that demonstrate the profound impact data science techniques have had in chemistry including chemical and materials synthesis, catalyst and materials design, and overhauling the models used in traditional theoretical or computational chemistry.

Meeting notes

Chemical modalities

Blanco, Maria-Jesus, and Kevin M. Gardinier. “New chemical modalities and strategic thinking in early drug discovery.” ACS medicinal chemistry letters 11.3 (2020): 228-231.

Overview of different chemical modalities currently at work to address different disease targets. The article addresses small-molecule medicinal chemists and how they can expand their outlook beyond small molecules to include other molecular entities when considering the angle of attack for different target engagement strategies. The authors offer a nice set of tools and a thought process for selecting possible drug modalities for different target classes, and the questions to ask when zeroing in on a possible mode of action.

Targeted Protein Degradation: Advances, Challenges, and Prospects for Computational Methods

Meta themes on optimizing small molecules

Retrospective analysis of factors influencing the oral bioavailability of drug candidates. The authors find rotatable bond count and polar surface area or hydrogen bond count (sum of donors and acceptors) to be important predictors of good oral bioavailability. Compounds with 10 or fewer rotatable bonds and a polar surface area of 140 Å² or less (or 12 or fewer hydrogen bond donors and acceptors) have a good chance of being orally bioavailable.
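The rule of thumb above can be written as a small filter. This is a minimal sketch, not code from the paper: the `MolProps` container and the function name are hypothetical, the property values are assumed to be precomputed by a cheminformatics toolkit of your choice, and the cutoffs are treated as inclusive.

```python
from dataclasses import dataclass


@dataclass
class MolProps:
    """Precomputed properties for one molecule (hypothetical container)."""
    rotatable_bonds: int
    tpsa: float  # polar surface area in square angstroms
    hbd: int     # hydrogen bond donors
    hba: int     # hydrogen bond acceptors


def passes_bioavailability_heuristic(p, max_rotb=10, max_tpsa=140.0, max_hbond=12):
    """Return True if the molecule satisfies both criteria.

    Criterion 1: few rotatable bonds.
    Criterion 2: low polar surface area OR low total H-bond count.
    """
    flexibility_ok = p.rotatable_bonds <= max_rotb
    polarity_ok = p.tpsa <= max_tpsa or (p.hbd + p.hba) <= max_hbond
    return flexibility_ok and polarity_ok


# Hypothetical example values, not taken from any real compound.
candidate = MolProps(rotatable_bonds=5, tpsa=90.0, hbd=2, hba=5)
print(passes_bioavailability_heuristic(candidate))
```

A heuristic like this is best used as a soft flag during triage rather than a hard cutoff.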

DeGoey, David A., et al. “Beyond the rule of 5: lessons learned from AbbVie’s drugs and compound collection: miniperspective.” Journal of Medicinal Chemistry 61.7 (2018): 2636-2651.

AB-MPS is calculated from cLogD, the number of aromatic rings (nAr), and the number of rotatable bonds (nRotB) according to the formula AB-MPS = Abs(cLogD − 3) + nAr + nRotB. The lower the AB-MPS score, the more likely the compound is to be absorbed, and a value of ≤14 is reported to predict a higher probability of oral absorption.
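As a worked example of the formula, here is a minimal sketch. The function names and the input values are hypothetical, and cLogD, nAr, and nRotB are assumed to be computed elsewhere.

```python
def ab_mps(clogd, n_aromatic_rings, n_rotatable_bonds):
    """AB-MPS = |cLogD - 3| + nAr + nRotB, as given in the text above."""
    return abs(clogd - 3.0) + n_aromatic_rings + n_rotatable_bonds


def likely_absorbed(score, cutoff=14.0):
    """Scores at or below the reported cutoff suggest a higher chance of oral absorption."""
    return score <= cutoff


# Hypothetical compound: cLogD = 4.5, 3 aromatic rings, 8 rotatable bonds.
score = ab_mps(4.5, 3, 8)  # |4.5 - 3| + 3 + 8 = 12.5
print(score, likely_absorbed(score))
```

Note that the cLogD term penalizes deviation from 3 in either direction, so both very polar and very lipophilic compounds score worse.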

Poongavanam, Vasanthanathan, Bradley C. Doak, and Jan Kihlberg. “Opportunities and guidelines for discovery of orally absorbed drugs in beyond rule of 5 space.” Current Opinion in Chemical Biology 44 (2018): 23-29.

Heuristics for oral bioavailability of molecules that violate the rule of 5. MW may reach up to approximately 1000 Da provided that TPSA increases proportionally up to 250 Å². In contrast, cLogP and HBDs must be carefully controlled at high MW. Our lack of ability to predict compound conformations and flexibility is currently a hurdle that is critical to overcome to enable further prospective design in oral bRo5 space.

Looks at bioisosteric replacements for phenyl rings in the lead optimization phase. Phenyl rings can improve potency but often bring poor solubility and unfavorable lipophilicity. The authors find that bioisosteres can be used to improve these properties.

Ertl, Peter. “Magic Rings: Navigation in the Ring Chemical Space Guided by the Bioactive Rings.” Journal of Chemical Information and Modeling (2021).

Analyze the nature of rings which appear in bioactive compounds. Ring systems are systematically extracted from one billion molecules and are analyzed to discover a structure or correlation in the bioactivity and type of rings. No simple set of structural descriptors separating active and inactive rings could be identified, the separation is best described by a neural network model taking into account a complex combination of many substructure features.

Hartung, Ingo V., Bayard R. Huck, and Alejandro Crespo. “Rules were made to be broken.” Nature Reviews Chemistry 7.1 (2023): 3-4.

Longitudinal analysis of physicochemical properties of approved drugs. The authors show that most of the drugs flout most of Lipinski’s rule of 5 criteria except HBD, which is consistently less than 4. In addition, by grouping the drugs into time-bound classes, they show that in recent times the mean MW and HBA have increased while the mean HBD has consistently stayed below 2.
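For reference, the rule of 5 itself is easy to encode. The sketch below uses the standard Lipinski cutoffs (MW > 500, cLogP > 5, HBD > 5, HBA > 10), which are not spelled out in the text above; the function name and the example values are hypothetical.

```python
def ro5_violations(mw, clogp, hbd, hba):
    """Return the list of Lipinski rule-of-5 criteria that a molecule violates."""
    checks = {
        "MW > 500": mw > 500,
        "cLogP > 5": clogp > 5,
        "HBD > 5": hbd > 5,
        "HBA > 10": hba > 10,
    }
    return [name for name, violated in checks.items() if violated]


# Hypothetical large, lipophilic compound that still keeps HBD low.
print(ro5_violations(mw=812.0, clogp=6.3, hbd=3, hba=13))
```

Returning the violated criteria, rather than a single pass/fail flag, makes it easier to see patterns such as the one described above, where HBD stays within bounds while other properties drift.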

Young, Robert J., et al. “The time and place for nature in drug discovery.” JACS Au 2.11 (2022): 2400-2416.

Critique on the paper, interesting take

Property-Based Drug Design Merits a Nobel Prize

Probably contentious topic but a good short review of the property-based optimization thought process for medicine discovery.

Pennington, Lewis D., and Demetri T. Moustakas. “The necessary nitrogen atom: a versatile high-impact design element for multiparameter optimization.” Journal of Medicinal Chemistry 60.9 (2017): 3552-3579.

Good perspective highlighting the impact of replacing a CH group with N in aromatic and heteroaromatic ring systems on molecular and physicochemical properties that translate to improved pharmacological properties.

Synthesis Chemistry

Catalog of recent research articles that look at synthesis chemistry from a point of view of computational workflows, how traditional synthetic chemistry methods can be combined with informatics to augment drug discovery and synthesis processes.

Curated set of substrates to quickly assess the practicality of synthetic methods with the complete capture of success and failure, that can optimize reaction conditions with a broader scope with respect to relevant applications.

Large chemical libraries

Over the past few years, several entities have emerged that offer ultra-large chemical libraries which can be made on demand or purchased immediately. The existence of such services has reinvigorated the field of virtual screening and combinatorial library design. In addition, research groups have devised novel ways to navigate these libraries more efficiently and to understand the differences in the chemical space these libraries cover. Following are some of the key papers in the field.

SpaceCompare calculates the overlap of large, nonenumerable combinatorial fragment spaces by utilizing topological fingerprints and the combinatorial character of these chemical spaces. The spaces compared are Enamine’s REAL Space, WuXi’s GalaXi Space, and Otava’s CHEMriya. The overlap of these commercial make-on-demand catalogs is only in the low single-digit percent range, despite their large overall size.

Konze, Kyle D., et al. “Reaction-based enumeration, active learning, and free energy calculations to rapidly explore synthetically tractable chemical space and optimize potency of cyclin-dependent kinase 2 inhibitors.” Journal of chemical information and modeling 59.9 (2019): 3782-3793.

PathFinder uses retrosynthetic analysis followed by combinatorial synthesis to generate novel compounds in synthetically accessible chemical space.

Irwin, John J., et al. “ZINC20—a free ultralarge-scale chemical database for ligand discovery.” Journal of chemical information and modeling 60.12 (2020): 6065-6073.

New version of ZINC with two major new features: billions of new molecules and new methods to search them. As a fully enumerated database, ZINC can be searched precisely using explicit atomic-level graph-based methods. Over 97% of the core Bemis–Murcko scaffolds in make-on-demand libraries are unavailable from “in-stock” collections. Correspondingly, the number of new Bemis–Murcko scaffolds is rising almost as a linear fraction of the elaborated molecules. Thus, an 88-fold increase in the number of molecules in the make-on-demand versus the in-stock sets is built upon a 16-fold increase in the number of Bemis–Murcko scaffolds. The make-on-demand library is also more structurally diverse than physical libraries

Neumann, Alexander, Lester Marrison, and Raphael Klein. “Relevance of the Trillion-Sized Chemical Space “eXplore” as a Source for Drug Discovery.” ACS Medicinal Chemistry Letters (2023).

The authors examine the composition of the recently published and, so far, biggest chemical space, “eXplore”, which comprises approximately 2.8 trillion virtual product molecules. The utility of eXplore to retrieve interesting chemistry around approved drugs and common Bemis Murcko scaffolds has been assessed with several methods (FTrees, SpaceLight, SpaceMACS). Further, the overlap between several vendor chemical spaces and a physicochemical property distribution analysis has been performed. Despite the straightforward chemical reactions underlying its setup, eXplore is demonstrated to provide relevant and, most importantly, easily accessible molecules for drug discovery campaigns.

Medina, Jorge, and Andrew D. White. “Bloom filters for molecules.” arXiv preprint arXiv:2304.05386 (2023).

This paper proposes and studies Bloom filters for testing if a molecule is present in a set using either string or fingerprint representations. Bloom filters are small enough to hold billions of molecules in just a few GB of memory and check membership in sub-milliseconds. The authors found string representations can have a false positive rate below 1% and require significantly less storage than using fingerprints. Canonical SMILES with Bloom filters with the simple FNV hashing function provide fast and accurate membership tests with small memory requirements. They provide a general implementation and specific filters for detecting if a molecule is purchasable, patented, or a natural product according to existing databases at https://github.com/whitead/molbloom.
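To make the idea concrete, here is a minimal Bloom filter over SMILES strings using the 64-bit FNV-1a hash. It is a toy sketch and not the molbloom implementation: the class, the seeding trick used to derive several hash positions, and the filter size are all assumptions, and the input strings are assumed to be canonicalized beforehand.

```python
FNV_OFFSET_64 = 0xCBF29CE484222325
FNV_PRIME_64 = 0x100000001B3
MASK_64 = 0xFFFFFFFFFFFFFFFF


def fnv1a_64(data, seed=0):
    """64-bit FNV-1a hash; the seed is XORed into the offset basis (an assumption)."""
    h = (FNV_OFFSET_64 ^ seed) & MASK_64
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME_64) & MASK_64
    return h


class BloomFilter:
    """Set-membership filter: no false negatives, tunable false positive rate."""

    def __init__(self, n_bits, n_hashes):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        data = item.encode("utf-8")
        for seed in range(self.n_hashes):
            yield fnv1a_64(data, seed) % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


# Hypothetical "catalog" of two canonical SMILES strings.
catalog = BloomFilter(n_bits=1024, n_hashes=3)
for smiles in ["CCO", "c1ccccc1"]:
    catalog.add(smiles)
print("CCO" in catalog)
```

A membership hit means "probably present", while a miss means "definitely absent", which is why the false positive rate is the number to tune against memory.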

Virtual screening

Dodds, Michael, et al. “Sample efficient reinforcement learning with active learning for molecular design.” Chemical Science 15.11 (2024): 4146.

An active learning system linked with an RL model (RL–AL) for molecular design, which aims to improve the sample-efficiency of the optimization process.

AZ group looking at generative design + RL + Virtual screening campaign

Vakili, Mohammad Ghazi, et al. “Quantum Computing-Enhanced Algorithm Unveils Novel Inhibitors for KRAS.” arXiv preprint arXiv:2402.08210 (2024).

First paper to showcase deployment of quantum-based generative models with a virtual screening workflow for target-based compound discovery. Chemistry42 is used to implement the reward function. The authors show that molecules generated from this combined effort are ‘better’ quality-wise than those from a traditional workflow (LSTM and genetic algorithm), and that downstream modeling success is roughly correlated with the number of qubits employed. This is exciting more from the technical standpoint of combining quantum and traditional workflows.

Sadybekov, Anastasiia V., and Vsevolod Katritch. “Computational approaches streamlining drug discovery.” Nature 616.7958 (2023): 673-685.

Nice review on virtual screening workflow for streamlining drug discovery

Gorgulla, Christoph, et al. “An open-source drug discovery platform enables ultra-large virtual screens.” Nature 580.7805 (2020): 663-668.

VirtualFlow as a tool for conducting virtual screening. The authors use VirtualFlow to prepare one of the largest freely available ready-to-dock ligand libraries, with more than 1.4 billion commercially available molecules. They screened ~1 billion compounds and identified a set of structurally diverse molecules that bind to KEAP1 with submicromolar affinity.

Researchers at UCSF looking at large scale docking for making ultra-large libraries accessible. They dock 170 million make-on-demand compounds from 130 well characterized reactions. Found new chemotypes that have interaction with 2 targets.

Fragment-based drug discovery

What is a fragment?

In fragment-based drug discovery (FBDD), a fragment is a small, low-molecular-weight chemical entity typically ranging from 120-250 Daltons, with properties such as cLogP < 2.5, heavy atom count (HAC) of 9-18, hydrogen bond acceptors (HBA) < 7, and hydrogen bond donors (HBD) < 4, as specified by Asinex. These fragments serve as starting points for drug development. They bind to target proteins with low affinity but high efficiency, enabling the identification of key interactions. By optimizing and linking fragments, researchers can develop potent lead compounds, enhancing the efficiency of the drug discovery process and improving hit finding.
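The property window above translates directly into a filter. This is a minimal sketch with a hypothetical function name; HAC is read here as heavy atom count, and all properties are assumed to be precomputed.

```python
def is_fragment_like(mw, clogp, heavy_atoms, hba, hbd):
    """Check a molecule against the fragment property window quoted above."""
    return (
        120 <= mw <= 250
        and clogp < 2.5
        and 9 <= heavy_atoms <= 18
        and hba < 7
        and hbd < 4
    )


# Hypothetical property values for two molecules.
print(is_fragment_like(mw=151.2, clogp=1.1, heavy_atoms=11, hba=2, hbd=1))
print(is_fragment_like(mw=412.5, clogp=3.8, heavy_atoms=29, hba=6, hbd=2))
```

Other vendors and groups use slightly different windows, so the thresholds are worth exposing as parameters if you compare fragment libraries.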

Known collection

Blogs

Practical Fragments blog

Book Chapters

Rees, D. C.; Congreve, M.; Murray, C. W.; Carr, R. Fragment-Based Lead Discovery. Nat Rev Drug Discov 2004, 3 (8), 660–672

The paper by Rees, D. C.; Congreve, M.; Murray, C. W.; Carr, R. discusses the concept of fragment-based lead discovery in drug development. The authors highlight the challenges in the drug discovery pipeline, particularly the ‘target-rich, lead-poor’ issue and the high attrition rate of chemical compounds in preclinical development. They discuss the use of fragment-based approaches as a solution, which involves the selection, screening, and optimization of fragments (also referred to as needles, shapes, binding elements, or seed templates) for lead identification. This approach requires significantly fewer compounds to be screened and synthesized, and has a high success rate in generating chemical series with lead-like properties. The authors also discuss different strategies for developing fragments into high-affinity leads, such as fragment evolution and fragment linking. The paper includes examples from 25 different protein targets to illustrate these strategies.

Articles

Fragmenstein: predicting protein-ligand structures of compounds derived from known crystallographic fragment hits using a strict conserved-binding–based methodology

Fragmenstein stitches ligand atoms from known fragment hits to predict protein-ligand complex conformations more accurately. It is a Python package that merges or places compounds by stitching together atoms from fragment hits and then energy minimizing them under strong constraints.

The authors use a fragment screening approach to look for hits against a protein kinase target and, instead of using a biophysical assay for fragment screening, use crystallographic data directly to learn the conformation of the fragments. They find 4 ‘seed’ substructures which fit nicely in the protein (judged by fit, not affinity) and use those to inform the later expansion, which is done through the Enamine REAL dataset and known reaction classes. What I liked the most and found interesting is the high-throughput binding pose and docking workflow of 200k compounds, the large-scale crystallographic fragment hit analysis, and the focused, curated library generation using the Enamine REAL dataset. I was curious to know what seasoned experts had to say about this.

BROOD

Commercial software solution from OpenEye for fragment exchange

Protein engineering

PeSTo: parameter-free geometric deep learning for accurate prediction of protein binding interfaces

Protein-ligand interactions

Yim, Jason, et al. “Diffusion models in protein structure and docking.” Wiley Interdisciplinary Reviews: Computational Molecular Science 14.2 (2024): e1711.

Nice review of the field that looks at computational models for predicting protein-ligand interactions and molecular docking. In recent times, diffusion-based models have shown promising performance. This review documents the current state of the field and next steps. The survey covers DMs primarily from the point of view of backbone generation, both unconditional and conditional. It will be interesting to see whether jointly modeling the backbone, sequence, and side chains brings further benefit beyond the current strategy of modeling them one after the other.

Iambic Therapeutics’ AI model that predicts the combined shape of proteins and small molecules outperforms Google DeepMind’s AlphaFold. Iambic’s newest model, called NeuralPLexer2, had a 75% success rate in predicting these protein-ligand structures. AlphaFold’s latest version, as described last October in a blog post, was 74% successful. But Iambic’s model jumps to a 93% success rate when the model receives additional info on amino acids near the small molecule.

RFdiffusion All-Atom Github

RoseTTAFold All-Atom (RFAA) was trained to represent amino acids and DNA bases at the residue level and all other chemical groups at the atomic level, allowing it to accurately model proteins and the other chemicals they so often interact with.

RFdiffusion All-Atom: build bespoke protein structures around small molecules. The team designed and experimentally validated, through crystallography and binding measurements, proteins that bind the cardiac disease therapeutic digoxigenin, the enzymatic cofactor heme, and the light-harvesting molecule bilin. Although there is still room for improvement in prediction accuracy, we anticipate that RFAA should be broadly useful for modeling full biological assemblies and RFdiffusionAA for designing small molecule–binding proteins and sensors.

DynamicBind

DynamicBind is a deep learning method that employs equivariant geometric diffusion networks to construct a smooth energy landscape, promoting efficient transitions between different equilibrium states. DynamicBind accurately recovers ligand-specific conformations from unbound protein structures without the need for holo-structures or extensive sampling.

Yu, Jie, et al. “Computing the relative binding affinity of ligands based on a pairwise binding comparison network.” Nature Computational Science 3.10 (2023): 860-872.

Conformer generators

Structure Quality Assessment

Buttenschoen, Martin, Garrett M. Morris, and Charlotte M. Deane. “PoseBusters: AI-based docking methods fail to generate physically valid poses or generalise to novel sequences.” arXiv preprint arXiv:2308.05777 (2023).

Python package for evaluating the quality of docked poses. PoseBusters performs a series of geometry checks on docked poses and also evaluates intra and inter-molecular interactions. The authors used the Astex Diverse Set and a newly developed PoseBusters benchmark set to evaluate five popular deep learning docking programs and two conventional docking approaches. The conventional docking programs dramatically outperformed the deep learning methods on both datasets. In most cases, more than half of the solutions generated by the DL docking programs failed the PoseBusters validity tests. In contrast, with the conventional docking programs, only 2-3% of the docked poses failed to validate.

Morehead, Alex, et al. “Deep Learning for Protein-Ligand Docking: Are We There Yet?.” arXiv preprint arXiv:2405.14108 (2024). Github

Suite of tools and workflows to benchmark the ability of deep learning methods to model protein-ligand interactions, from apo to holo configurations for P/L pairs. The authors find that all but one DL method fail to generalize to multi-ligand protein targets.

Benchmarking Generated Poses: How Rational is Structure-based Drug Design with Generative Models

PoseCheck evaluates steric clashes, ligand strain energy, and intramolecular interactions to identify problematic structures. In addition, structures are redocked with AutoDock Vina to confirm the validity of the proposed binding mode. In evaluating several recently published generative models, the authors identify failure modes that will hopefully influence future work on structure-based generative design.
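To illustrate the simplest of these checks, the sketch below counts ligand-protein atom pairs that sit closer than a distance cutoff. It is a toy version and not the PoseCheck or PoseBusters implementation: the function name is hypothetical, and the flat 2.0 Å cutoff is an assumption (real tools typically use element-dependent radii).

```python
import math


def count_clashes(ligand_xyz, protein_xyz, min_distance=2.0):
    """Count ligand-protein atom pairs closer than min_distance (in angstroms)."""
    clashes = 0
    for lig_atom in ligand_xyz:
        for prot_atom in protein_xyz:
            if math.dist(lig_atom, prot_atom) < min_distance:
                clashes += 1
    return clashes


# Hypothetical coordinates: two ligand atoms, two protein atoms.
ligand = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
protein = [(0.0, 0.0, 1.0), (10.0, 0.0, 0.0)]
print(count_clashes(ligand, protein))
```

The double loop is quadratic in the number of atoms, which is fine for a single pose; screening many poses would call for a spatial index or a vectorized distance matrix.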

Binding energetic prediction

Warren, M., Deane, C., Magarkar, A., Morris, G., & Biggin, P. (2024). How to make machine learning scoring functions competitive with FEP.

Current SOTA models fail on out-of-distribution datasets, implying overfitting or memorization of ligand-specific features. This paper introduces AEV-PLIG, atomic environment vectors with protein-ligand interaction graphs. The authors propose new benchmark metrics and data augmentation strategies. A multi-head attention graph NN is trained with the node features of the P-L interaction. They report comparable performance to FEP+ on standard benchmarks.

Koh, Huan Yee, et al. “Physicochemical graph neural network for learning protein–ligand interaction fingerprints from sequence data.” Nature Machine Intelligence (2024): 1-15. Github

PSICHIC (PhySIcoCHemICal graph neural network), a framework incorporating physicochemical constraints to decode interaction fingerprints directly from sequence data alone.

Molecular-dynamics

MDANCE

Molecular Dynamics Analysis with N-ary Clustering Ensembles (MDANCE) is a flexible n-ary clustering package that provides a set of tools for clustering Molecular Dynamics trajectories.

Cheminformatics-focus

Catalog of recent reviews and manuscripts I have found useful when learning more about the state-of-the-art in Cheminformatics. I’ve tried to categorize them roughly based on their area of application:

Representation

For small molecules to be understood by computers and used for model training, they have to be represented in a form amenable to optimization. In addition, this form of abstraction must capture an appropriate level of chemical properties so as to imbue the data-driven models with the chemistry and physics necessary for modeling. Often, properties of the molecules are ‘lost in translation’ or obfuscated when converting them into machine-ready forms. Formally, the process of converting molecules from one form to another is called featurization. There are different forms, methods, and theories for encoding molecules. Broadly, they are as follows:

* Fingerprints
* Descriptors
* Pharmacophores
* Graph-based
* Natural language-based
* Shape-based
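As a toy illustration of featurization, the sketch below hashes character n-grams of a SMILES string into a set of "on" bits and compares two molecules with the Tanimoto coefficient. This is deliberately not a real chemical fingerprint such as ECFP (it sees text, not atoms and bonds); the function names, the n-gram scheme, and the bit length are all assumptions made for illustration.

```python
import zlib


def ngram_fingerprint(smiles, max_n=3, n_bits=2048):
    """Hash every substring of length 1..max_n into a set of bit positions."""
    on_bits = set()
    for size in range(1, max_n + 1):
        for start in range(len(smiles) - size + 1):
            chunk = smiles[start:start + size].encode("utf-8")
            on_bits.add(zlib.crc32(chunk) % n_bits)
    return on_bits


def tanimoto(bits_a, bits_b):
    """Tanimoto similarity: shared on-bits divided by total distinct on-bits."""
    union = bits_a | bits_b
    if not union:
        return 0.0
    return len(bits_a & bits_b) / len(union)


ethanol = ngram_fingerprint("CCO")
ethylamine = ngram_fingerprint("CCN")
print(tanimoto(ethanol, ethanol), tanimoto(ethanol, ethylamine))
```

The example also shows how information gets lost in translation: two different SMILES strings for the same molecule would produce different bit sets here, which is exactly the kind of artifact a chemically aware representation is meant to avoid.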

Reviews

Articles

Boulougouri, Maria, Pierre Vandergheynst, and Daniel Probst. “Molecular set representation learning.” (2023).

The authors propose a new way to represent molecules: not through chemical bonds, but rather as set representations. They show that the set representation scheme can be an alternative to SOTA graph models and performs on par on predictive tasks such as reaction yields and protein-ligand affinities.

Comparative study of descriptor-based and graph-based models using public data set. Used descriptor-based models (XGBoost, RF, SVM, using ECFP) and compared them to graph-based models (GCN, GAT, AttentiveFP, MPNN). They show descriptor-based models outperform the graph-based models in terms of prediction accuracy and computational efficiency with SVM having best predictions. Graph-based methods are good for multi-task learning.

He, Jiazhen, et al. “Molecular optimization by capturing chemist’s intuition using deep neural networks.” Journal of cheminformatics 13.1 (2021): 1-17.

Descriptor generation

Morfeus Machine learning focused descriptors for small molecules with emphasis on 3D information.
Mordred
MolFeat
DeepChem
PMapper

Pmapper is a Python module to generate 3D pharmacophore signatures and fingerprints. Signatures uniquely encode 3D pharmacophores with hashes suitable for fast identification of identical pharmacophores.

SteriMol

Generate conformationally sampled descriptors for a molecule. This workflow provides Boltzmann-averaged Sterimol parameters. These descriptors might be useful for problems where spatial orientation informs the selectivity or properties being trained for. Github
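Boltzmann averaging of a per-conformer descriptor can be sketched in a few lines. This is a generic illustration and not the workflow's own code: the function names are hypothetical, energies are assumed to be relative conformer energies in kcal/mol, and the descriptor values in the example are made up.

```python
import math

R_KCAL_PER_MOL_K = 0.0019872041  # gas constant in kcal/(mol*K)


def boltzmann_weights(energies_kcal, temperature=298.15):
    """Normalized Boltzmann populations for a list of conformer energies."""
    lowest = min(energies_kcal)
    factors = [
        math.exp(-(energy - lowest) / (R_KCAL_PER_MOL_K * temperature))
        for energy in energies_kcal
    ]
    total = sum(factors)
    return [factor / total for factor in factors]


def boltzmann_average(values, energies_kcal, temperature=298.15):
    """Population-weighted average of one descriptor over all conformers."""
    weights = boltzmann_weights(energies_kcal, temperature)
    return sum(w * v for w, v in zip(weights, values))


# Hypothetical Sterimol-like values for three conformers and their energies.
descriptor_values = [3.2, 4.1, 5.0]
conformer_energies = [0.0, 0.5, 2.0]
print(boltzmann_average(descriptor_values, conformer_energies))
```

Subtracting the lowest energy before exponentiating keeps the weights numerically stable without changing the result.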

Predictive modeling

Reviews

Chemical complexity challenge: Is multi-instance machine learning a solution

Traditional ML uses the relationship between a single instance (a chemical structure) and a single label (a property). It doesn’t provide a facility for mapping multiple instances (an ensemble of conformers) to a label. There has recently been renewed interest in multiple instance learning (MIL), a technique developed over 30 years ago. MIL provides a framework that enables the mapping of conformational ensembles to properties. A recent review by Zankov from Hokkaido University and coworkers at other institutions provides an excellent overview of the challenges and opportunities associated with MIL in QSAR, genomics, and several other areas. The paper also provides links to several software packages for building MIL models.
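One of the simplest ways to map a bag of instances to a single label is to pool the per-conformer descriptor vectors into one vector and then train any ordinary model on it. The sketch below shows only that pooling step; it is one basic MIL strategy among several, not the method from the review, and the function name and numbers are hypothetical.

```python
from statistics import mean


def pool_bag(bag, how="mean"):
    """Collapse a bag of equal-length descriptor vectors into one vector.

    bag: list of per-conformer descriptor vectors for a single molecule.
    how: "mean" averages each descriptor, "max" keeps its largest value.
    """
    columns = list(zip(*bag))
    if how == "mean":
        return [mean(column) for column in columns]
    if how == "max":
        return [max(column) for column in columns]
    raise ValueError(f"unknown pooling mode: {how}")


# Hypothetical bag: two conformers, two descriptors each.
conformer_descriptors = [[1.0, 4.0], [3.0, 2.0]]
print(pool_bag(conformer_descriptors, how="mean"))
print(pool_bag(conformer_descriptors, how="max"))
```

Pooling discards which conformer contributed what, so approaches that learn instance weights are the natural next step when a single bioactive conformer drives the property.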

Current Methods for Drug Property Prediction in the Real World

Overview of the field and some factors that complicate current benchmarking efforts. The authors compared several molecular representations and ML algorithms in evaluating model accuracy and uncertainty. These evaluations highlighted the strengths of different QSAR modeling and ADME prediction methods. Consistent with other papers published in 2023, 2D descriptors performed best for ADME prediction, while Gaussian Process Regression with fingerprints was the method of choice when predicting biological activity.

Rationalizing general limitations in assessing and comparing methods for compound potency prediction

A paper by Janela and Bajorath outlines several limitations in current benchmarking strategies. The authors used sound statistical methodologies to examine the impact of compound potency value distributions on performance metrics associated with regression models. They found that across several different ML algorithms, there was a consistent relationship between model performance and the activity range of the dataset. These findings enabled the authors to define bounds for prediction accuracy. The method used in this paper should be informative to those designing future benchmarks.

Yang, K., Swanson, K., Jin, W., Coley, C., Eiden, P., Gao, H., Guzman-Perez, A., Hopper, T., Kelley, B., Mathea, M. and Palmer, A., 2019. Analyzing learned molecular representations for property prediction. Journal of chemical information and modeling, 59(8), pp.3370-3388

Benchmark property prediction models on 19 public and 16 proprietary industrial data sets spanning a wide variety of chemical end points. Introduce a modeling framework (Chemprop) that consistently matches or outperforms models using fixed molecular descriptors as well as previous graph neural architectures on both public and proprietary data sets.

Stuyver, T. and Coley, C.W., 2021. Quantum chemistry-augmented neural networks for reactivity prediction: Performance, generalizability and interpretability. arXiv preprint arXiv:2107.10402

Combine structure (Graph-networks) and descriptor based features (QM-derived) to predict activation energies (E₂/SN₂ barrier height prediction) and regioselectivity. Incorporating QM and structure leads to better overall accuracy and generalizability even in low data regions. Atom and bond level features derived using QM and used in the model generation with a smaller dataset.

pKa prediction

Abarbanel, Omri, and Geoffrey Hutchison. “QupKake: Integrating Machine Learning and Quantum Chemistry for micro-pKa Predictions.” (2023). Github

QupKake combines GFN2-xTB calculations with graph-neural-networks to accurately predict micro-pKa values of organic molecules.

Hydrogen bond energy

Jazzy: Fast calculation of hydrogen-bond strengths and free energy of hydration of small molecules

Pharmacophore / shape searching

PheSA is a CPU bound algorithmic improvement over ROCS shape/color alignment that gives you an option to incorporate binding site knowledge.

ROSHAMBO is a GPU accelerated implementation of ROCS

QSAR benchmarks

Llompart, P., et al. “Will we ever be able to accurately predict solubility?” Scientific Data 11.1 (2024): 303.

The paper discusses challenges in predicting thermodynamic solubility with machine learning. It reviews historical data, analyzes solubility datasets, and assesses model reliability. The authors propose a workflow for data curation and present new models and quality-assessed datasets for public use.

Deng and coworkers from Stony Brook University compared many popular ML algorithms and representations, curated new datasets, and performed statistical analysis on the results. This paper provides one of the best comparisons of ML methods published to date. The authors compare fixed representations, such as molecular fingerprints, with representations learned from SMILES strings and molecular graphs and conclude that, in most cases, the fixed representations provide the best performance. Another interesting aspect of this paper was an attempt to establish a relationship between dataset size and the performance of different molecular representations. While fixed representations performed well on smaller datasets, learned representations didn’t become competitive until between 6K and 100K datapoints were available.

Fang, Cheng, et al. “Prospective Validation of Machine Learning Algorithms for Absorption, Distribution, Metabolism, and Excretion Prediction: An Industrial Perspective.” Journal of Chemical Information and Modeling (2023).

A paper from Fang and coworkers at Biogen introduced several new ADME datasets. Unlike most literature benchmarks, which contain data collected from dozens of papers, these experiments were consistently performed by the same people in the same lab. The authors provided prospective comparisons of several widely used ML methods, including random forest, SVM, XGBoost, LightGBM, and message-passing neural networks (MPNNs) on several relevant endpoints, including aqueous solubility, metabolic stability, membrane permeability, and plasma protein binding.

Beckers, Maximilian, et al. “Prediction of Small-Molecule Developability Using Large-Scale In Silico ADMET Models.” Journal of Medicinal Chemistry (2023).

The paper presents a novel deep learning approach to predict the developability of small molecules based on their predicted ADMET properties. The authors use a large-scale data set of compounds from the Novartis pipeline and train a neural network to rank compounds according to their potential to progress beyond in vivo PK studies. The resulting score, called bPK, outperforms other compound scoring methods and shows strong generalization ability on public data. The authors demonstrate the usefulness of bPK for series prioritization and optimization in drug discovery projects.

van Tilborg, Derek, Alisa Alenicheva, and Francesca Grisoni. “Exposing the limitations of molecular machine learning with activity cliffs.” Journal of Chemical Information and Modeling 62.23 (2022): 5938-5951.

Account on how to treat and analyze activity cliffs in the context of developing a predictive model. The authors outline best practices to probe activity cliffs. They show, using 24 DL and ML models and 30 targets, that ML approaches based on molecular descriptors outperformed more complex deep learning methods. Activity cliff pairs were defined by the similarity of the molecule SMILES and the bioactivity difference. Compared to most traditional machine learning approaches, deep neural networks seem to fall short at picking up subtle structural differences (and the corresponding property change) that give rise to activity cliffs.
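As a rough illustration of the pair-based definition above, the sketch below flags activity cliff pairs from precomputed fingerprints and potencies. The 0.9 similarity cutoff, the 1 log unit potency cutoff, the function names, and the toy bit sets are all assumptions for illustration and are not taken from the paper.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto similarity between two sets of 'on' fingerprint bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def find_activity_cliffs(fps, pactivity, sim_cutoff=0.9, gap_cutoff=1.0):
    """Return index pairs that are structurally similar but differ in potency.

    fps: list of sets of fingerprint bits (hypothetical, precomputed).
    pactivity: list of potencies on a log scale (e.g. pKi).
    Both cutoffs are illustrative assumptions, not values from the paper.
    """
    cliffs = []
    for i, j in combinations(range(len(fps)), 2):
        similar = tanimoto(fps[i], fps[j]) >= sim_cutoff
        potency_gap = abs(pactivity[i] - pactivity[j]) >= gap_cutoff
        if similar and potency_gap:
            cliffs.append((i, j))
    return cliffs


# Toy data: molecules 0 and 1 share 9 of 10 bits but differ by 2 log units.
fps = [set(range(10)), set(range(9)), {20, 21, 22}]
pactivity = [8.0, 6.0, 6.1]
cliffs = find_activity_cliffs(fps, pactivity)
```

In practice the fingerprints would come from a cheminformatics toolkit, and the paper combines several similarity notions rather than a single Tanimoto cutoff.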

Authors provide an evaluation of global and local models for ADME endpoint prediction. They compare the performance of global models and domain-specific local models. 10 different assays and 112 drug discovery projects were analyzed. The results showed consistently superior performance of global ADME models for property prediction. Performance improvement of global models over project-wise local models ranged from 3% to 25% in MAE. Local model improvements higher than 20% were achieved for only 7% of the assay-project pairs.

Swanson, Kyle, et al. “ADMET-AI: A machine learning ADMET platform for evaluation of large-scale chemical libraries.” BioRxiv (2023): 2023-12. Web interface Code

Online web interface to quickly predict ADMET properties using AI models trained on the Therapeutics Data Commons datasets.

Datasets

QMugs

QMugs (Quantum mechanical properties of drug-like molecules) collection comprises quantum mechanical properties of more than 665 k biologically and pharmacologically relevant molecules extracted from the ChEMBL database, totaling 2M conformers.

Matched molecular-pair

Enumeration of chemical space

Bellmann, Louis, et al. “Comparison of Combinatorial Fragment Spaces and Its Application to Ultralarge Make-on-Demand Compound Catalogs.” Journal of Chemical Information and Modeling (2022).

Authors propose an algorithmic approach called SpaceCompare to calculate the overlap and diversity of ultra-large combinatorial chemical libraries. The tool uses topological fragment spaces to capture the subtleties of reactions having the same product but different reactant substructures.

Interesting work on de-novo design of molecules wherein the molecules being created are made up of fragments that are known to exist and are available to the user. New molecules are generated based on the fragments (synthons) made available in the dataset.

Fully Automated Creation of Virtual Chemical Fragment Spaces Using the Open-Source Library OpenChemLib

Open-source tool to generate synthetically accessible chemical spaces using reaction definitions and building blocks. Virtual fragments are generated using one-step reactions and real-world building blocks - the workflow also supports 2-3 step creation.

Zabolotna, Yuliana, et al. “NP navigator: a new look at the natural product chemical space.” Molecular informatics 40.9 (2021): 2100068.

Organizing the chemical space of ChEMBL and ZINC to compare its overlap with natural products through COCONUT. Generative Topographic Mapping is used for the clustering and analysis. Helpful overview of the method with its application to drug discovery can be found here

Explainable/Interpretable Machine Learning

Reviews/Perspectives

Articles

Uncertainty quantification

Benchmark different models and uncertainty metrics for molecular property prediction.

Evidential Deep learning for guided molecular property prediction and discovery. Ava Soleimany, Connor Coley, et al. Slides

Train network to output the parameters of an evidential distribution. One forward-pass to find the uncertainty as opposed to dropout or ensemble - principled incorporation of uncertainties
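To make the single forward-pass idea concrete, here is a minimal sketch of how uncertainties can be read off the network outputs. It assumes the Normal-Inverse-Gamma parameterization (gamma, nu, alpha, beta) commonly used for evidential regression; the function name and the example numbers are hypothetical.

```python
def evidential_uncertainty(gamma, nu, alpha, beta):
    """Split the output of an evidential regression head into parts.

    Assumes a Normal-Inverse-Gamma evidential distribution with
    nu > 0, alpha > 1 and beta > 0 (an assumption of this sketch).
    Returns (prediction, aleatoric variance, epistemic variance).
    """
    if nu <= 0 or alpha <= 1 or beta <= 0:
        raise ValueError("need nu > 0, alpha > 1, beta > 0")
    prediction = gamma                       # E[mu]
    aleatoric = beta / (alpha - 1.0)         # E[sigma^2], data noise
    epistemic = beta / (nu * (alpha - 1.0))  # Var[mu], model uncertainty
    return prediction, aleatoric, epistemic


# Hypothetical outputs of one forward pass for a single molecule.
pred, aleatoric, epistemic = evidential_uncertainty(
    gamma=5.2, nu=4.0, alpha=3.0, beta=1.0
)
```

Larger nu (more "virtual evidence") shrinks the epistemic term while leaving the aleatoric term unchanged, which is what lets a single pass separate the two.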

Conduct a global multi-objective optimization with expected improvement criterion. Find transition metal complex redox couples for Redox flow batteries that address stability, solubility, and redox potential metric. Use distance of a point from a training data in latent space as a metric to quantify uncertainty.
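A single-objective sketch of the two ingredients mentioned above: a latent-space distance used as an uncertainty proxy and the expected improvement criterion for maximization. The paper works in a multi-objective setting; the function names, the linear distance-to-sigma scale factor, and the toy numbers here are assumptions.

```python
import math


def nearest_training_distance(z, train_latents):
    """Euclidean distance from latent point z to its closest training point."""
    return min(math.dist(z, t) for t in train_latents)


def expected_improvement(mu, sigma, best):
    """Expected improvement over the incumbent `best` (maximization)."""
    if sigma <= 0.0:
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mu - best) * cdf + sigma * pdf


# Hypothetical candidate: predicted property 1.2, incumbent best 1.0.
train_latents = [(3.0, 4.0), (1.0, 0.0)]
distance = nearest_training_distance((0.0, 0.0), train_latents)
sigma = 0.5 * distance  # assumed linear map from distance to sigma
ei = expected_improvement(mu=1.2, sigma=sigma, best=1.0)
```

Candidates far from the training data get a larger sigma and therefore a larger exploration bonus in the expected improvement.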

Active Learning

Active learning provides strategies for efficient screening of subsets of the library. In many cases, we can identify a large portion of the most promising molecules with a fraction of the compute cost.

Comparisons

Traversing Chemical Space with Active Deep Learning

The authors compared six active learning approaches on three benchmark datasets and concluded that the acquisition function is critical to AL performance. When comparing molecular representations, they found that fingerprints generalized better than graph neural networks. Consistent with previous studies, they found that the choice of an initial training set had little impact on the outcome of an AL model.

Articles

Correy, Galen J., Moira M. Rachman, Takaya Togo, Stefan Gahbauer, Yagmur U. Doruk, Maisie GV Stevens, Priyadarshini Jaishankar et al. “Extensive exploration of structure activity relationships for the SARS-CoV-2 macrodomain from shape-based fragment merging and active learning.” bioRxiv (2024): 2024-08.

Very nice work from Shoichet (ZINC, UCSF) + Relay folks (Pat Walters et al.) extending their active learning method to the SARS-CoV-2 macrodomain. They use FrankenROCS and Thompson sampling to screen millions of compounds, identifying hits with IC50 values as low as 130 μM + ~100 X-ray crystal structures with binding data. Have been a big fan of active learning, esp. multi-armed bandits.
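For readers new to multi-armed bandits, a generic Beta-Bernoulli Thompson sampling loop is sketched below. It is not the authors' implementation: the arms, hit rates, and function name are hypothetical, and each "arm" can be read as a reagent or library subset whose hit rate is unknown.

```python
import random


def thompson_sampling(true_hit_rates, n_rounds, seed=0):
    """Run Beta-Bernoulli Thompson sampling and return pulls per arm.

    true_hit_rates are hidden from the sampler and only used to simulate
    the outcome of each "experiment" (a toy assumption).
    """
    rng = random.Random(seed)
    n_arms = len(true_hit_rates)
    successes = [0] * n_arms
    failures = [0] * n_arms
    pulls = [0] * n_arms
    for _ in range(n_rounds):
        # Draw one plausible hit rate per arm from its Beta posterior.
        samples = [
            rng.betavariate(successes[a] + 1, failures[a] + 1)
            for a in range(n_arms)
        ]
        arm = max(range(n_arms), key=samples.__getitem__)
        # Simulate the experiment and update that arm's posterior.
        if rng.random() < true_hit_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
        pulls[arm] += 1
    return pulls


pulls = thompson_sampling([0.05, 0.2, 0.6], n_rounds=2000)
```

Because arms are chosen by sampling from the posterior rather than taking its mean, poorly explored arms still get tried occasionally, and the budget drifts toward the most productive arm.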

Gusev, Filipp, et al. “Active learning guided drug design lead optimization based on relative binding free energy modeling.” Journal of Chemical Information and Modeling 63.2 (2023): 583-594.

Active learning on BDE.

Article talks about MolPAL as an active learning methodology. The team explores the application of these techniques to computational docking datasets and assesses the impact of surrogate model architecture, acquisition function, and acquisition batch size on optimization performance. They observe significant reductions in computational costs; for example, using a directed-message passing neural network they identify 94.8% or 89.3% of the top-50,000 ligands in a 100M member library after testing only 2.4% of candidate ligands using an upper confidence bound or greedy acquisition strategy, respectively.
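The difference between the greedy and upper confidence bound strategies mentioned above can be sketched in a few lines. This is an illustrative helper, not MolPAL's API: the function name, the `beta` exploration weight, and the convention that higher scores are better (e.g. negated docking scores) are assumptions.

```python
def select_batch(means, stds, batch_size, strategy="greedy", beta=2.0):
    """Pick the indices of the next molecules to evaluate.

    means/stds: surrogate model predictions and uncertainties, where a
    higher mean is better (an assumption of this sketch).
    """
    if strategy == "greedy":
        scores = list(means)
    elif strategy == "ucb":
        scores = [m + beta * s for m, s in zip(means, stds)]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:batch_size]


# Toy predictions for four candidate ligands.
means = [1.0, 0.9, 0.2, 0.5]
stds = [0.0, 0.3, 0.6, 0.1]
greedy_batch = select_batch(means, stds, batch_size=2, strategy="greedy")
ucb_batch = select_batch(means, stds, batch_size=2, strategy="ucb")
```

Greedy picks the two highest predicted means, while UCB swaps in a lower-scoring but more uncertain candidate, which is the exploration behavior the acquisition function is meant to add.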

Thompson, James, et al. “Optimizing active learning for free energy calculations.” Artificial Intelligence in the Life Sciences 2 (2022): 100050.

Article exploring different active learning strategies for looking at sampling the congeneric RBFE calculations. The paper explores the impact of several AL design choices. They show that in their case, the overall AL performance is largely insensitive to the specific ML method and acquisition functions used. The significant factor affecting the performance was the number of molecules sampled at each iteration.

Train a property prediction model to output, in a single pass, distribution statistics that describe the uncertainty. This is in contrast to using ensembles or MC dropout. Interesting way to estimate the epistemic (due to / from model) uncertainty in the prediction. Use this approach on the antibiotic search problem of Stokes et al. Compare Chemprop and SchNet models on different tasks.

Multi-parameter optimization

Transfer Learning

Reviews

Cai, Chenjing, et al. “Transfer learning for drug discovery.” Journal of Medicinal Chemistry 63.16 (2020): 8683-8694.

Articles

Approaching coupled cluster accuracy with a general-purpose neural network potential through transfer learning

Transfer learning by training a network to DFT data and then retrain on a dataset of gold standard QM calculations (CCSD(T)/CBS) that optimally spans chemical space. The resulting potential is broadly applicable to materials science, biology, and chemistry, and billions of times faster than CCSD(T)/CBS calculations.

Improving the generative performance of chemical autoencoders through transfer learning

Meta Learning

Altae-Tran, H., Ramsundar, B., Pappu, A. S., & Pande, V. (2017). Low data drug discovery with one-shot learning. ACS central science, 3(4), 283-293.

Authors demonstrate how one-shot learning can be used to significantly lower the amount of data required to make predictions in drug discovery tasks. LSTM combined with GCNNs is shown to improve learning capabilities of the model. In the simplest one-shot learning formalism, these continuous vectors are then fed into a simple nearest-neighbor classifier that labels new examples by a distance-weighted combination of support set labels.
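The distance-weighted labeling step can be sketched as a softmax over similarities between the query embedding and the support set embeddings. The cosine similarity, the softmax weighting, and the toy two-dimensional embeddings are assumptions for illustration; in the paper the embeddings come from the learned graph networks.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def one_shot_predict(query, support_vectors, support_labels):
    """Predict the probability of the active class for a query embedding.

    The prediction is a similarity-weighted average of support labels
    (0 = inactive, 1 = active), with softmax-normalized weights.
    """
    sims = [cosine(query, s) for s in support_vectors]
    shift = max(sims)  # subtract the max for numerical stability
    weights = [math.exp(s - shift) for s in sims]
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, support_labels)) / total


# Toy support set: one active and one inactive embedding.
support_vectors = [(1.0, 0.0), (0.0, 1.0)]
support_labels = [1, 0]
p_active = one_shot_predict((0.9, 0.1), support_vectors, support_labels)
```

A query that sits closer to the active support example receives a probability above 0.5, and a query equidistant from both gets exactly 0.5.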

Nguyen, C. Q., Kreatsoulas, C., & Branson, K. M. (2020). Meta-learning GNN initializations for low-resource molecular property prediction. arXiv preprint arXiv:2003.05996.

Use ChEMBL dataset to train a gated graph neural network (GGNN) for prediction and classification tasks using meta learning protocols. Show appreciable model performance even with just approx. 256 datapoints.

Federated Learning

Consortia comprising leading research labs and companies working on decentralized datasets and predictive modeling of biochemical and cellular activity.

Generative design

Reviews

Benchmarks

Thomas, Morgan, et al. “MolScore: a scoring, evaluation and benchmarking framework for generative models in de novo drug design.” Journal of Cheminformatics 16.1 (2024): 1-20. Github

MolScore contains code to score and benchmark de novo compounds produced by generative models via the subpackage molscore, as well as to facilitate downstream evaluation via the subpackage moleval. An objective is defined via a JSON file which can be shared to propose new benchmark objectives, or to conduct multi-parameter objectives for drug design.

Flam-Shepherd, Daniel, Kevin Zhu, and Alán Aspuru-Guzik. “Keeping it Simple: Language Models can learn Complex Molecular Distributions.” arXiv preprint arXiv:2112.03041 (2021). Nature Comms Link

Test SOTA language models and representation performance against graph-based methods (CGVAE, JTVAE) for ‘challenging’ generative modeling tasks - generate a molecule - property distribution as a function of synthetic feasibility. Graph models faced challenges in generating large molecules (> 100 HAs). SELFIES provided an advantage here. All of the models seem to generate novel molecules - how practical each of these novel molecules is remains an open question.

MOSES - Benchmarking platform for generative models.

Propose a platform to deploy and compare state-of-the-art generative models for exploring molecular space on same dataset. In addition the authors also propose list of metrics to evaluate the quality and diversity of the generated structures.

GuacaMol: Benchmarking models for De Novo Molecular Design. Blogpost

Evaluation framework from BenevolentAI to compare different de-novo design models.

J. Zhang, R. Mercado, O. Engkvist, and H. Chen, “Comparative Study of Deep Generative Models on Chemical Space Coverage,” J. Chem. Inf. Model., vol. 61, no. 6, pp. 2572–2581, Jun. 2021.

Interesting analysis from a team at AstraZeneca R&D. They look at the chemical space coverage achieved by SOTA generative models. Proposes a metric for evaluating space coverage, and thereby comparing different SOTA models, using a reference dataset (GDB-13 in this case). The new metric computes how much of the GDB-13 dataset can be recovered by a model that is trained on a small GDB subset. Generative models were trained on the same 1M data points and 1B molecules were then sampled from each model. It was seen that at most 39% of the molecules in the parent dataset were sampled / generated by the model. Most models sampled the same compounds at least twice. It was observed that graph-based models sampled much more diverse molecules than string-based methods. Besides, the coverage of GAN-based models was worse compared to language and graph models.
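The coverage idea reduces to a set intersection, sketched below under the assumption that both the sampled and the reference molecules are already canonical SMILES strings (so that string equality means molecule equality). The function names and toy SMILES are hypothetical.

```python
def coverage(sampled_smiles, reference_smiles):
    """Fraction of the reference set recovered by the sampled molecules."""
    reference = set(reference_smiles)
    recovered = set(sampled_smiles) & reference
    return len(recovered) / len(reference)


def duplicate_fraction(sampled_smiles):
    """Fraction of samples that repeat an already-sampled molecule."""
    return 1.0 - len(set(sampled_smiles)) / len(sampled_smiles)


# Toy example: 4 reference molecules, 4 samples with one repeat.
reference = ["C", "CC", "CCC", "CCO"]
sampled = ["C", "C", "CC", "CCN"]
cov = coverage(sampled, reference)
dup = duplicate_fraction(sampled)
```

Here two of the four reference molecules are recovered, one sample is a repeat, and "CCN" falls outside the reference set, so it adds nothing to coverage.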

Gao, W.; Coley, C. W. The Synthesizability of Molecules Proposed by Generative Models. J. Chem. Inf. Model. 2020

This paper looks at different ways of integrating synthesizability criteria into generative models.

REINVENT4: Modern AI–Driven Generative Molecule Design [Supported with PyTorch 2.0]

REINVENT is a molecular design tool for de novo design, scaffold hopping, R-group replacement, linker design, molecule optimization, and other small molecule design tasks. At its heart, REINVENT uses a Reinforcement Learning (RL) algorithm to generate optimized molecules compliant with a user defined property profile defined as a multi-component score. Transfer Learning (TL) can be used to create or pre-train a model that generates molecules closer to a set of input molecules.

Language models:

Collaboration with Microsoft AI and Global Health Drug Discovery Institute. TamGen is a method that employs a GPT-like chemical language model and enables target-aware molecule generation and compound refinement. The authors identified 7 compounds showing compelling inhibitory activity against the Tuberculosis ClpP protease. The model considers 1-D information of the protein and molecule.

Maziarz, Krzysztof, et al. “Learning to extend molecular scaffolds with structural motifs.” arXiv preprint arXiv:2103.03864 (2021). Github

Team at Novartis and Microsoft propose MoLeR, a graph-based model that generates molecules using a scaffold as a seed. Scaffold-based SAR speed-up shown.

Ross, Jerret, et al. “Large-scale chemical language representations capture molecular structure and properties.” Nature Machine Intelligence 4.12 (2022): 1256-1264. Github: https://github.com/IBM/molformer?tab=readme-ov-file

SELFIES and generative models using STONED

Reproducibility study of the STONED work from Jablonka et al.

Representation using SELFIES proposed to make it much more powerful

Iovanac, Nicolae C., Robert MacKnight, and Brett Savoie. “Actively Searching: Inverse Design of Novel Molecules with Simultaneously Optimized Properties.” ChemRxiv (2021)

Using quantum chemistry attributes calculated on-the-fly as scoring functions for sampling the generative model chemical space. Active learning strategy is deployed to explore the area of space where the properties of the molecules are unknown.

R. Gómez-Bombarelli et al., “Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules,” ACS Cent. Sci., vol. 4, no. 2, pp. 268–276, 2018

One of the first implementations of a variational auto-encoder for molecule generation.

Graph-based

Flam-Shepherd, Daniel, Alexander Zhigalin, and Alán Aspuru-Guzik. “Scalable Fragment-Based 3D Molecular Design with Reinforcement Learning.” arXiv preprint arXiv:2202.00658 (2022)

Reinforcement learning-based generative model which is an update on the point cloud approach by the same group, now incorporating a ‘grammar’ for building molecules in the form of functional groups in 3D space.

W. Jin, R. Barzilay, and T. Jaakkola, “Junction tree variational autoencoder for molecular graph generation,” 35th Int. Conf. Mach. Learn. ICML 2018, vol. 5, pp. 3632–3648, 2018

Junction tree based decoding. Define a grammar for the small molecule and find sub-units based on that grammar to construct a molecule. The molecule is generated in two steps: first the scaffold or backbone of the molecule is generated, then the nodes are filled in with molecular substructures identified from the ‘molecular grammar’.

GANs

MolGAN: An implicit generative model for small molecular graphs, N. De Cao and T. Kipf, 2018

Generative adversarial network for finding small molecules using graph networks, quite interesting. Avoids issues arising from node ordering that are associated with likelihood based methods by using an adversarial loss instead (GAN)

Scaffold-retained

Team at Novartis and Microsoft propose MoLeR, a graph-based model that generates molecules using a scaffold as a seed. Scaffold-based SAR speed-up shown.

Reaction transformation-based

Here the idea is to constrain the generated molecules to transformations amenable to a particular platform, like an automated synthesis workflow.

Authors propose a generative model to generate molecules via multi-step chemical reaction trees, each campaign first generates a reaction-tree with template transformations as breaking points.

Bradshaw, John, et al. “A model to search for synthesizable molecules.” Advances in Neural Information Processing Systems 32 (2019).

Diffusion models

Adams, K., Abeywardane, K., Fromer, J., & Coley, C. W. (2024). ShEPhERD: Diffusing shape, electrostatics, and pharmacophores for bioisosteric drug design. arXiv preprint arXiv:2411.04130.

A SE(3)-equivariant diffusion model for 3D molecule structures and their interaction profile with the target of choice. The authors show their model application for typical drug design tasks including hit diversification, bioisosteric replacement and fragment merging, and ligand hopping. Shepherd is a joint denoising diffusion probabilistic model (DDPM) that learns the joint distribution over 3D molecules (atom types, bond types, coordinates) and their 3D shapes, ESP surfaces, and pharmacophores.

Sako, Masami, Nobuaki Yasuo, and Masakazu Sekijima. “DiffInt: A Pharmacophore-Aware Diffusion Model for Structure-Based Drug Design with Explicit Hydrogen Bond Interaction Guidance.” (2024). Github

DiffInt is a novel structure-based approach that explicitly addresses interactions. The model naturally incorporates hydrogen bonds between the protein and ligand by treating them as pseudoparticles.

Mixed Continuous and Categorical Flow Matching for 3D De Novo Molecule Generation

This model extends beyond traditional diffusion models by learning to map samples directly from arbitrary distributions, allowing for greater flexibility and application-specific model design. The authors report a >10-fold reduction in inference time along with accurate molecule generation.

3D conformations-aware

Extension to the fragment-based generative design model (DeepFMPO) using reinforcement learning, now incorporating 3D electrostatic similarity in the analysis. Ability to replace a fragment with one of similar 3D shape and electrostatics. ESP_sim tutorial for comparison of electrostatic potential and molecule shape is used for this purpose. The authors find scaffold-hopping bioisosteres for CDK2.

Protein-ligand interactions aware

Zhang, Jie, and Hongming Chen. “De novo molecule design using molecular generative models constrained by ligand–protein interactions.” Journal of Chemical Information and Modeling 62.14 (2022): 3291-3306.

Linker design

Interesting work on designing linkers using conformation aware generative design algorithm. Think of it like fragment-growing.

Nori, Divya, Connor W. Coley, and Rocío Mercado. “De novo PROTAC design using graph-based deep generative models.” arXiv preprint arXiv:2211.02660 (2022).

Synthetic-cost aware

Fromer, Jenna C., and Connor W. Coley. “An algorithmic framework for synthetic cost-aware decision making in molecular design.” (SPARROW) Nature Computational Science (2024): 1-11. Preprint. Medium blogpost

SPARROW prioritizes molecules that both have high rewards and can be synthesized in a few steps from cheap starting materials. It is also shown to combine the library-focused and generative design based compounds in one setting depending on the harmony of the proposed synthetic routes and costs.

Computer Aided Synthesis Planning (CASP)

Reviews:

Perspective on the current SOTA of synthesis planning, automation, and reaction optimization in drug discovery and development phases using AI and ML.

Perspective article summarising their position on the current state of research and future considerations on developing better reaction network models. Break down the analysis of reaction networks into 3 classes: (1) Front Open End: exploration of products from reactants (2) Backward Open Start: know the product and explore potential reactants (3) Start to End: product and reactant known, explore the likely intermediates.

Nice summary of potential challenges in the field:

Validating exploration algorithms on a consistent set of reaction system.
Need to generate a comparative metric to benchmark different algorithms.
Considering effect of solvents and/or protein embeddings in the analysis

Best practices

Gimadiev, T. R., Lin, A., Afonina, V. A., Batyrshin, D., Nugmanov, R. I., Akhmetshin, T., … & Varnek, A. (2021). Reaction Data Curation I: Chemical Structures and Transformations Standardization. Molecular Informatics, 2100119.

Article from Varnek group on best practices on processing data for reaction informatics.

Benchmarking

Benchmarking framework for comparing different multi-step retrosynthesis methods from researchers at AstraZeneca R&D. Provides 10k synthetic routes which can be used as a validation set for different methodologies, providing a platform for systematic comparison of different methods being proposed in the community.

Classifying chemical reactions:

Schwaller, Philippe, et al. “Mapping the space of chemical reactions using attention-based neural networks.” Nature Machine Intelligence 3.2 (2021): 144-152. rxnfp - Github. Preprint. News Article.

Transformer-based model for reaction classification. Compared it with BERT. Besides classification, the work also formalizes the reaction fingerprint generation using the learned representations. The reaction fingerprints are visualized using TMAPS.

Heid, E; Green, W; Machine learning of reaction properties via learned representations of the condensed graph of reaction. ChemRxiv (2021)

Reaction classification using atom-mapped reactions that are used to generate condensed reaction graphs, which are passed through a GCN variant as implemented in chemprop.

Reaction-specific features

Using scraped US Patent data to classify chemical reactions and deploy various fingerprints and ML models for classification.

Atom mapping:

Comparative analysis of different atom-mapping schemes for generating atom-mapped reaction features. Comments on the state of the art methods and their performance on a curated reaction database.

Extraction of organic chemistry grammar from unsupervised learning of chemical reactions. RXNMapper

Data-driven atom mapping scheme which uses transformers for learning the context of the chemical reaction. Researchers at IBM trained a flavor of language model based on the Transformer architecture and used it to find reaction centers and map atoms. Shown to be robust compared to other SOTA methods.

Automatic mapping of atoms across both simple and complex chemical reactions

Predicting reaction outcomes:

Data-Efficient, Chemistry-Aware Machine Learning Predictions of Diels–Alder Reaction Outcomes

Researchers use NERF, which models electron flow in the reaction alongside the bond environment, to propose Diels–Alder chemistry products.

C. W. Coley et al., “A graph-convolutional neural network model for the prediction of chemical reactivity,” Chem. Sci., vol. 10, no. 2, pp. 370–377, 2019.

Template-free prediction of organic reaction outcomes using graph convolutional neural networks

Conformational sampling and designing of chiral organic catalysts.

Yield prediction

Raghavan, Priyanka, et al. “Incorporating Synthetic Accessibility in Drug Design: Predicting Reaction Yields of Suzuki Cross-Couplings by Leveraging AbbVie’s 15-Year Parallel Library Data Set.” Journal of the American Chemical Society (2024).

Evaluation of AbbVie’s medicinal chemistry library data set to build machine learning models for the prediction of Suzuki coupling reaction yields.

Ma, Yihong, et al. “Are we making much progress? Revisiting chemical reaction yield prediction from an imbalanced regression perspective.” Companion Proceedings of the ACM on Web Conference 2024. 2024.

Through experiments on real-world datasets, they demonstrate that treating reaction yield prediction as an imbalanced regression problem and incorporating cost-sensitive reweighting methods can significantly improve predictions for underrepresented high-yield reactions.
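One simple form of cost-sensitive reweighting is to weight each reaction by the inverse frequency of its yield bin, so rare high-yield examples count for more in the loss. The sketch below shows that generic scheme and is not necessarily the one used in the paper; the 10% bin width, the function names, and the toy yields are assumptions.

```python
from collections import Counter


def inverse_frequency_weights(yields, bin_width=10.0):
    """Weight each sample by the inverse count of its yield bin.

    yields are percentages in [0, 100]; weights are rescaled so that
    they sum to the number of samples.
    """
    last_bin = int(100 // bin_width) - 1
    bins = [min(int(y // bin_width), last_bin) for y in yields]
    counts = Counter(bins)
    raw = [1.0 / counts[b] for b in bins]
    scale = len(raw) / sum(raw)
    return [w * scale for w in raw]


def weighted_mae(y_true, y_pred, weights):
    """Mean absolute error with per-sample weights."""
    total = sum(w * abs(t - p) for t, p, w in zip(y_true, y_pred, weights))
    return total / sum(weights)


# Toy data set dominated by low-yield reactions.
yields = [5.0, 8.0, 12.0, 95.0]
weights = inverse_frequency_weights(yields)
error = weighted_mae(yields, [6.0, 8.0, 10.0, 75.0], weights)
```

With these weights the single high-yield reaction contributes twice as much to the error as each of the two reactions sharing the 0-10% bin.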

Classic paper, one of the first to model reaction yields using ML. A random forest algorithm was used to predict synthetic reaction performance in multidimensional chemical space with high-throughput experimentation data. Descriptors for components in a palladium-catalyzed Buchwald-Hartwig cross-coupling were computed and used as inputs. The random forest model outperformed linear regression in predictive accuracy, even with sparse training sets and out-of-sample predictions, highlighting its potential for synthetic methodology adoption.

Retrosynthetic routes:

Do Chemformers Dream of Organic Matter? Evaluating a Transformer Model for Multistep Retrosynthesis

Transformer model evaluation for retrosynthesis from AZ folks. Template-free method.

Westerlund, Annie, et al. “Data-driven approaches for identifying hyperparameters in multi-step retrosynthesis.” (2023).

Meta analysis on the best set of hyperparameters for retrosynthesis routines. Here the authors explore different parameters of the retrosynthesis workflow and their impact on the performance of the route scoping. They propose a new set of parameters, other than the default, to improve the odds of the software finding a route for a diverse set of molecules. First of its kind look into an approach to identify such a set.

Hybrid neural-symbolic approach for both retrosynthesis and reaction prediction that can be trained with large reaction sets from databases. Template extraction from known reaction datasets to classify new reaction to known reaction classes.

Seidl, Philipp, et al. “Improving Few-and Zero-Shot Reaction Template Prediction Using Modern Hopfield Networks.” Journal of chemical information and modeling 62.9 (2022): 2111-2120.

Introduce a template-based single-step retrosynthesis model based on Modern Hopfield Networks, which learn an encoding of both molecules and reaction templates in order to predict the relevance of templates for a given molecule. The model does not consider templates as distinct categories, but can leverage structural information about the template. The retrieval approach enables generalization across templates, which makes zero-shot learning possible and improves few-shot learning. On the single-step retrosynthesis benchmark USPTO-50k, the MHN model reaches state-of-the-art top-k accuracy for k ≥ 3.
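For reference, top-k accuracy as used in such single-step retrosynthesis benchmarks can be computed as below. The function name and the toy template identifiers are hypothetical; the only assumption is that each prediction list is ranked from most to least likely.

```python
def top_k_accuracy(ranked_predictions, ground_truth, k):
    """Fraction of targets whose true answer is among the top k guesses."""
    hits = sum(
        1
        for preds, truth in zip(ranked_predictions, ground_truth)
        if truth in preds[:k]
    )
    return hits / len(ground_truth)


# Toy ranked template predictions for three target molecules.
ranked = [
    ["t1", "t2", "t3"],
    ["t5", "t4", "t9"],
    ["t7", "t8", "t6"],
]
truth = ["t1", "t4", "t6"]
top1 = top_k_accuracy(ranked, truth, k=1)
top3 = top_k_accuracy(ranked, truth, k=3)
```

A model can trail on top-1 yet lead at larger k if it reliably places the correct template somewhere near the top of its ranking.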

Tu, Zhengkai, and Connor W. Coley. “Permutation invariant graph-to-sequence model for template-free retrosynthesis and reaction prediction.” Journal of Chemical Information and Modeling (2021).

Graph2SMILES, a template-free retrosynthesis model to predict reaction outcomes and retrosynthesis routes. This model eliminates the need for any input-side SMILES augmentation, while achieving noticeable improvements over Transformer baselines (especially for top-1 accuracy).

Generate reaction networks:

Newest version of RMG (v3) is updated to Python v3. It has ability to generate heterogeneous catalyst models, uncertainty analysis to conduct first order sensitivity analysis. RMG dataset for the thermochemical and kinetic parameters have been expanded.

More and Faster: Simultaneously Improving Reaction Coverage and Computational Cost in Automated Reaction Prediction Tasks

Presents an algorithmic improvement to the reaction network prediction task through their YARP (Yet Another Reaction Program) methodology. Shown to reduce computational cost of optimization while improving the diversity of identified products and reaction pathways.

Look at exploration of reaction space rather than compound space. SOAP kernel for representing the molecules. Estimate atomization energy for the molecules using ML. Calculate the d(AE) for different ML-estimated AEs. Reaction energies (RE) are estimated and uncertainty propagation is used to estimate the errors. Uncorrelated constant error propagation. 30,000 bond breaking reaction steps Rad-6-RE network used. RE prediction is not as good as AE.
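A small sketch of why RE errors end up larger than AE errors under uncorrelated constant error propagation. It assumes AE values are stored as signed energies (so RE is products minus reactants; flip the sign for the opposite convention) and that every species carries the same independent error sigma. The names and numbers are hypothetical.

```python
import math


def reaction_energy(reactant_aes, product_aes):
    """Reaction energy from per-species atomization energies (signed)."""
    return sum(product_aes) - sum(reactant_aes)


def propagated_error(n_species, sigma_ae):
    """Error on RE when each species has the same uncorrelated error."""
    return sigma_ae * math.sqrt(n_species)


# Toy bond-breaking step A-B -> A + B, energies in eV (made-up values).
reactants = [-4.5]
products = [-1.0, -2.0]
re_value = reaction_energy(reactants, products)
re_error = propagated_error(len(reactants) + len(products), sigma_ae=0.1)
```

With three species involved, the RE error is sqrt(3) times the per-molecule AE error, so a model that looks fine on AE can be noticeably worse on RE.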

Estimate molecular synthesizability

The idea of estimating whether a molecule is ‘synthesizable’ can be approached from two angles:

1. Complexity based - compare the fragments in the molecule to the known fragments in the chemical space
2. Full retrosynthesis based - entire route is considered for molecule generation. Reactant complexity drives route complexity.

Li, Junren, Lei Fang, and Jian-Guang Lou. “Retro-BLEU: quantifying chemical plausibility of retrosynthesis routes through reaction template sequence analysis.” Digital Discovery (2024).

Retro-BLEU is a statistical metric, adapted from the well-established BLEU score in machine translation, for evaluating the plausibility of retrosynthesis routes based on reaction template sequence analysis. The authors use PaRoutes to validate this approach.

Ertl, Peter, and Ansgar Schuffenhauer. “Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions.” Journal of cheminformatics 1.1 (2009): 1-11. RDKit implementation

The Synthetic Accessibility score (SA_Score) is a popular heuristic score for quantifying synthesizability. It computes a score using a fragment-contribution approach, where rarer fragments (as judged by their abundance in a set of 1 million representative compounds from the PubChem database) are taken as an indication of lower synthesizability.

Coley, Connor W., et al. “SCScore: synthetic complexity learned from a reaction corpus.” Journal of chemical information and modeling 58.2 (2018): 252-261. DeepChem implementation

SCScore is a learned synthetic complexity score computed as a neural network model trained on reaction data from the Reaxys database. It was designed with synthesis planning in mind to operate on molecules resembling not just drug-like products but intermediates and simpler building blocks as well.

Liu, Cheng-Hao, et al. “RetroGNN: Fast Estimation of Synthesizability for Virtual Screening and De Novo Design by Learning from Slow Retrosynthesis Software.” Journal of Chemical Information and Modeling 62.10 (2022): 2293-2300.

RetroGNN is a graph neural network model that predicts the outcome of a synthesis planner given the target molecule. It is shown to perform better than SAScore. Code is yet to be released.

Chen, Shuan, and Yousung Jung. “Estimating the synthetic accessibility of molecules with building block and reaction-aware SAScore.” Journal of Cheminformatics 16 (2024).

The authors introduce BR-SAScore, an enhanced version of SAScore that integrates the available building block information (B) and reaction knowledge (R) from synthesis planning programs into the scoring process. The score can also identify the fragments contributing to synthetic infeasibility.

Parrot, Maud, et al. “Integrating synthetic accessibility with AI-based generative drug design.” Journal of Cheminformatics 15.1 (2023): 83.

From the team at Iktos, for triaging molecule designs. The group introduces RScore (a retro-score) and RSPred, a score derived from RScore using a neural network. RScore is computed through a full retrosynthesis analysis. The R2 value for RSPred is 0.75.

Data-driven chemistry modeling and reaction optimization

Review / Perspectives

Raghavan, Priyanka, et al. “Dataset design for building models of chemical reactivity.” ACS Central Science 9.12 (2023): 2196-2204.

Authors discuss the design of reaction datasets in ways that are conducive to data-driven modeling, emphasizing the idea that training set diversity and model generalizability rely on the choice of molecular or reaction representation. They lay down the experimental constraints associated with generating common types of chemistry datasets and how these considerations should influence dataset design and model building.

Maloney, Michael P., et al. “Negative Data in Data Sets for Machine Learning Training.” Organic Letters (2023).

Thoughts from industry practitioners on how to label low- or no-yield reactions in electronic lab notebooks (ELNs). This is important when building ML models for reaction outcomes.

Industrial reactions commentary

Substrate Scoping

This area aims to understand how much of chemical space a specific reaction transformation covers. Knowing which substrates can be used for a specific type of reaction can accelerate the generation of HTE datasets and reduce waste and failures in the search for the right substrates. Every newly proposed reaction protocol would have a corresponding amenable ‘action-space’ for the ligands.

Rana, D., Pflüger, P. M., Hölter, N. P., Tan, G., & Glorius, F. (2024). Standardizing Substrate Selection: A Strategy toward Unbiased Evaluation of Reaction Generality. ACS Central Science, 10(4), 899-906.

The authors report a standardized substrate selection strategy which mitigates biases found in traditional substrate scoping tables. This way the chemists can showcase unbiased applicability of novel methodologies facilitating their practical applications.

Integration of data science techniques, including DFT featurization, dimensionality reduction, and hierarchical clustering, to delineate a diverse and succinct collection of aryl bromides that is representative of the chemical space of the substrate class

On the Topic of Substrate Scope

Articles

Pomberger, Alexander, et al. “The effect of chemical representation on active machine learning towards closed-loop optimization.” Reaction Chemistry & Engineering 7.6 (2022): 1368-1379.

Lapkin and co-workers look at the effect of chemical representation on reaction performance and condition prediction tasks. They examine datasets generated by high-throughput experimentation and the impact of calculated chemical descriptors on the prediction of reaction yields. They show that tailored descriptors did not outperform the traditional ones, but a larger initial dataset accelerated the optimization of reaction performance.

Haas, Brittany, et al. “Rapid Prediction of Conformationally-Dependent DFT-Level Descriptors using Graph Neural Networks for Carboxylic Acids and Alkyl Amines.” (2024).

2D and 3D-aware GNNs to predict DFT-level descriptors for conformationally flexible molecules, focusing on carboxylic acids and alkyl amines in particular.

Wang, J.Y., Stevens, J.M., Kariofillis, S.K. et al. Identifying general reaction conditions by bandit optimization. Nature 626, 1025–1033 (2024).

Latest from Abigail Doyle’s group, where they use a bandit optimization routine, related to Thompson sampling, to find general reaction conditions.
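As a rough illustration of the bandit framing, the sketch below runs Beta-Bernoulli Thompson sampling where each arm is a candidate reaction condition and the reward is a binary "reaction worked" outcome. This is a generic textbook version, not the authors' implementation; the success rates are invented and simulate what would be real experiments.

```python
import random

def thompson_sampling(success_rates, n_rounds, seed=0):
    """Return how often each arm (reaction condition) was tried.

    success_rates are hidden ground-truth probabilities used only to
    simulate experiments; the sampler itself never reads them directly.
    """
    rng = random.Random(seed)
    n_arms = len(success_rates)
    wins = [0] * n_arms
    losses = [0] * n_arms
    for _ in range(n_rounds):
        # Draw one plausible success rate per arm from its Beta posterior.
        draws = [rng.betavariate(wins[a] + 1, losses[a] + 1) for a in range(n_arms)]
        arm = max(range(n_arms), key=draws.__getitem__)
        # "Run the experiment" for the chosen condition.
        if rng.random() < success_rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return [w + l for w, l in zip(wins, losses)]

# Three hypothetical conditions with 20%, 50% and 90% success rates.
pulls = thompson_sampling([0.2, 0.5, 0.9], n_rounds=500)
print(pulls)
```

Most of the experimental budget ends up on the most general condition, while the weaker conditions are still sampled occasionally early on.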

Götz, Julian, et al. “High-throughput synthesis provides data for predicting molecular properties and reaction success.” Science advances 9.43 (2023). Github

The authors propose a platform for photocatalytic N-heterocycle synthesis that connects HTE, automated purification, and physicochemical assays. They implement three different train-test split strategies to minimize ligand overlap.

Experimental design using Bayesian optimization. The authors look at 3 reaction classes with multiple reaction parameters: ligands, bases, solvents, temperatures, and concentrations. The algorithm identifies the optimal conditions and consistently arrived at 99% yields, which was possible by using an unusual ligand not known to work well (a choice that human cognitive bias might overlook).

Also called Kraken - a discovery platform covering monodentate organophosphorus(III) ligands providing comprehensive physicochemical descriptors based on representative conformer ensembles. Using quantum-mechanical methods, the authors calculated descriptors for 1558 ligands, including commercially available examples, and trained machine learning models to predict properties of over 300000 new ligands.

Multi-objective optimization of catalytic reactions that employ chiral bisphosphine ligands, applied to 2 sequential reactions in the asymmetric synthesis of an API. A classification method identifies active catalysts using a user-provided 5% yield cutoff for binary classification, and linear regression models the reaction selectivity. The work uses a DFT-derived descriptor dataset of >550 bisphosphine ligands and develops an interpretable chemical space mapping tool using PCA. The domain of applicability is assessed with the Euclidean distance in chemical space.
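Two of the ingredients above are simple enough to sketch: the yield cutoff for labeling catalysts as active, and a Euclidean-distance check for the domain of applicability. The helper names and the 2D coordinates (standing in for PCA components of ligand descriptors) are hypothetical; only the 5% cutoff comes from the description above.

```python
import math

YIELD_CUTOFF = 5.0  # percent; user-provided cutoff mentioned above

def label_active(yields, cutoff=YIELD_CUTOFF):
    """Binary activity labels: True when the yield reaches the cutoff."""
    return [y >= cutoff for y in yields]

def distance_to_training_set(query, training_points):
    """Euclidean distance from a query ligand to its nearest training ligand."""
    return min(math.dist(query, point) for point in training_points)

# Hypothetical yields and 2D chemical-space coordinates.
labels = label_active([0.0, 4.9, 5.0, 62.0])
training = [(0.0, 0.0), (1.0, 0.0)]
near = distance_to_training_set((0.5, 0.0), training)
far = distance_to_training_set((4.0, 4.0), training)
print(labels, near, far)  # [False, False, True, True] 0.5 5.0
```

A query ligand far from every training ligand (large distance) sits outside the region where the model's predictions can be trusted.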

Generate catalysts

Schilter, Oliver, et al. “Designing catalysts with deep generative models and computational data. A case study for Suzuki cross coupling reactions.” Digital Discovery (2023).

Uses a VAE and an RNN to propose new catalysts for the Suzuki cross-coupling reaction. The trained models are used to predict a catalyst’s binding energy, and they generate a high percentage of novel and valid designs.

Databases

Avila, Claudio, et al. “Chemistry in a graph: modern insights into commercial organic synthesis planning.” Digital Discovery (2024).

A team from Pfizer uses graph databases and network visualization to show how process chemistry data (for the GLP-1 receptor agonist lotiglipron in this case) can be stored, queried, and used for illustration purposes. They demonstrate the utility of a knowledge graph for optimizing the route selection process. Neo4j is used for querying the dataset.

Reaction sanitization

Reaction data extraction

Automated chemistry workflows

Reviews

Self-Driving Laboratories for Chemistry and Materials Science

This review article provides an in-depth analysis of the state-of-the-art in SDL technology, its applications across various scientific disciplines, and the potential implications for research, and industry. This review additionally provides an overview of the enabling technologies for SDLs, including their hardware, software, and integration with laboratory infrastructure. Most importantly, this review explores the diverse range of scientific domains where SDLs have made significant contributions, from drug discovery and materials science to genomics and chemistry.

Account of Eli Lilly and Company’s ASL (Automated Synthesis Lab)

Articles

DNA-encoded Libraries

Matthew Clark, et al. DNA-encoded small-molecule libraries (DEL). C&EN article on the topic

A new way of storing huge amounts of molecule-related data using DNA, made possible in part by the low cost of DNA sequencing. Each molecule in the library is attached to a DNA strand that encodes information about its recipe.

Follow up to the work with Machine Learning for hit finding.

Uses DNA encodings for the discovery of novel small-molecule protein inhibitors. The authors outline a process for building an ML model using DEL data and compare graph convolutions to random forest for classification tasks, with application to protein target binding. Graph models seemed to achieve a higher hit rate compared to random forest. Diversity, logistical, and structural filters are applied to search for novel candidates. First work to use GCN for hit searching.

Proposes a way to incorporate 3D spatial information in the DEL readouts to denoise the data.

Zhang, Chris, et al. “Building Block-Based Binding Predictions for DNA-Encoded Libraries.” (2023). Github

A set of informatics tools to look at BB productivity in DEL screens and guide designs for new DELs. The authors calculate joint probabilities of the BBs for activity and find that increasing the binding metric for individual BBs also increases the overall binding energy. They then cluster these BBs using 2D and 3D Tanimoto FPs (3D Tanimoto Combo) and HDBSCAN clustering. A good workflow for implementing 3D-based ROCS filtering.
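For readers new to the similarity measure behind the clustering step, here is a minimal sketch of the 2D Tanimoto coefficient on fingerprints represented as sets of on-bit indices. The bit sets are invented, and the 3D shape-based variant used in the paper is not reproduced here.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient: shared on-bits divided by total distinct on-bits."""
    union = len(bits_a | bits_b)
    if union == 0:
        return 0.0
    return len(bits_a & bits_b) / union

# Hypothetical fingerprints of three building blocks.
bb1 = {1, 2, 3, 4}
bb2 = {3, 4, 5, 6}
bb3 = {1, 2, 3, 4}

sim_12 = tanimoto(bb1, bb2)  # 2 shared bits out of 6 distinct bits
sim_13 = tanimoto(bb1, bb3)  # identical fingerprints
print(sim_12, sim_13)
```

A clustering algorithm such as HDBSCAN would then typically operate on the distance `1 - tanimoto(a, b)` between every pair of building blocks.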

Large Language Models (LLMs)

It’s a stretch to say that GPT-4 or any other LLM understands Chemistry.

At this point, LLMs seem to have two general use cases. First, summarization and information retrieval. LLMs can parse vast collections of text, which can be queried using natural language. These information retrieval capabilities have many applications, from writing computer code and collating clinical trial results to summarizing papers on a specific topic.

While there are still issues with LLMs hallucinating and providing incorrect information, tools and strategies are being developed to ensure the validity of LLM responses.

The other area where LLMs appear to be making inroads is workflow management or tool orchestration. Many activities in drug discovery, whether computational or experimental, require long sequences of steps, which can be tedious to orchestrate. These include asking questions about data, analyzing results, and doing routine post-processing to compare against the known state of the project.

While it is often possible to script the execution of these steps, scripting requires a detailed knowledge of each step. LLMs have the potential to simplify this process and carry out multi-step procedures given only a set of initial conditions and a final objective. While the amount of progress the field has made in a short time is impressive, I don’t see LLMs replacing scientists any time soon.

Previously the field has proposed assistants for this job (here), which comprised a pre-scripted set of rules and processes. While tedious, they seem to add a lot of value to project teams for quickly analyzing the SAR. The hope is that LLMs might make the dream of an all-encompassing assistant a reality.

Reviews

Agents

PaperQA: Retrieval-Augmented Generative Agent for Scientific Research

PaperQA, a Retrieval-Augmented Generation (RAG) agent for the scientific literature. PaperQA begins by constructing LLM search queries from a set of keywords. The results of these searches are aggregated into a vector database and combined with a pre-trained LLM to create a summary of the search results. In benchmark comparisons, the differences between answers provided by PaperQA and human evaluators were similar to differences between individual human evaluators. Encouragingly, unlike many other LLMs, PaperQA didn’t hallucinate citations.
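The retrieve-then-summarize pattern can be sketched in a few lines. This toy version is an assumption-laden stand-in: it uses bag-of-words cosine similarity instead of learned embeddings and a vector database, and it only builds the prompt rather than calling an LLM. All function names and text chunks are hypothetical and unrelated to PaperQA's actual code.

```python
import math
from collections import Counter

def embed(text):
    """Toy 'embedding': word counts (a real system would use a learned model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query, chunks, k=2):
    """Assemble retrieved context and the question into one LLM prompt."""
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(retrieve(query, chunks, k)))
    return f"Answer using only the numbered context.\n{context}\nQuestion: {query}"

chunks = [
    "SAScore estimates synthetic accessibility from fragment contributions",
    "Thompson sampling balances exploration and exploitation",
    "DNA encoded libraries attach a DNA barcode to each molecule",
]
query = "how is synthetic accessibility estimated"
top = retrieve(query, chunks, k=1)
prompt = build_prompt(query, chunks, k=1)
print(prompt)
```

Numbering the context passages is what lets the answer cite its sources, which is one way such systems keep citations grounded in retrieved text.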

Bran, Andres M., et al. “ChemCrow: Augmenting large-language models with chemistry tools.” arXiv preprint arXiv:2304.05376 (2023).

ChemCrow provides software tools for performing domain-specific tasks, including web searches, file format conversions, and similarity searches. Compared with GPT-4, ChemCrow provided superior performance on tasks like synthetic route planning. The authors also point to potential misuse of LLMs and suggest mitigation strategies.

Boiko, Daniil A., Robert MacKnight, and Gabe Gomes. “Emergent autonomous scientific research capabilities of large language models.” arXiv preprint arXiv:2304.05332 (2023). Peer-review

Coscientist, a set of LLMs for designing and executing organic syntheses. Coscientist consists of four components designed to search the web, write Python code, extract information from documentation, and program laboratory robotics. The authors test Coscientist using several open and closed-source LLMs and present examples of the system’s ability to plan and execute simple organic syntheses.

Framework that optimizes an LLM agent to use the provided tools. This framework is integrated in DSPy

Swanson, Kyle, Wesley Wu, Nash L. Bulaong, John E. Pak, and James Zou. “The Virtual Lab: AI Agents Design New SARS-CoV-2 Nanobodies with Experimental Validation.” bioRxiv (2024): 2024-11.

A joint paper from James Zou (Stanford) and the Chan-Zuckerberg foundation showcases a virtual lab comprising AI agent personas of a typical research group collaborating to design nanobodies. Interesting idea.

Generative Design

Wang, Haorui, et al. “Efficient Evolutionary Search Over Chemical Space with Large Language Models.” arXiv preprint arXiv:2406.16976 (2024).

Introduce LLMs for conducting evolutionary algorithm searches.

Predictive modeling

The authors show that GPT-3-based predictive models perform on par with SOTA with fewer data points. A caution is that the models are purely text-based and extremely black box, so the reminder that correlation doesn't mean causation, while trite, might become important here. Finally, the fine-tuning doesn't do regression on the data in the same sense as a linear regression or random forest would.

The team is looking at creating an LLM-based predictive model (regression and classification). They show that one large model can be used to predict multiple end points (think of one model used for all ADME endpoints), and it indicates that training on a variety of tasks can improve overall performance (positive transfer learning).

I am glad to see this work as it shows how much information and feature-richness can be encoded within the transformer model, especially in the low-data regime; that said, one caution with this approach is that the models are purely text-based and extremely black box and correlation doesn’t mean causation, more so here since we don’t have good control over features being used to train the model.

Sirumalla, S. K., Farina Jr, D. S., Qiao, Z., Di Cesare, D. A., Farias, F. C., O’Connor, M. B., … & Miller, T. Multi-Modal and Multi-Task Transformer for Small Molecule Drug Discovery. In ICML’24 Workshop ML for Life and Material Science: From Theory to Industry Applications.

From the Iambic team: a 1B-parameter transformer model pre-trained on 2.25 trillion tokens from diverse datasets focused on drug discovery. The paper details a comprehensive data pipeline for standardizing and processing data from various sources. The model architecture is based on LLaMA-2 and includes advanced features like SwiGLU and Rotary Positional Encoding. The fine-tuned model outperforms strong baselines in assay prediction tasks.

Small molecule related tasks

BioT5+ incorporates several novel features: integration of IUPAC names for molecular understanding, inclusion of extensive bio-text and molecule data from sources like bioRxiv and PubChem, multi-task instruction tuning for generality across tasks, and a novel numerical tokenization technique for improved processing of numerical data.

Protein design and mechanics

Queen, Owen, Yepeng Huang, Robert Calef, Valentina Giunchiglia, Tianlong Chen, George Dasoulas, LeAnn Tai et al. “ProCyon: A multimodal foundation model for protein phenotypes.” bioRxiv (2024): 2024-12.

ProCyon integrates phenotypic and protein data. The authors show its use for identifying protein domains that bind small molecule drugs, predicting peptide binding with enzymes, and assessing the functional impact of Alzheimer’s disease mutations. ProCyon enables conditional retrieval of proteins linked to small molecules through complementary mechanisms of action.

Mixture of LoRA Experts: Leverage the power of fine-tuned LoRA experts by employing a mixture of experts, or MoE technique.

Ghafarollahi, Alireza, and Markus J. Buehler. “ProtAgents: Protein discovery via large language model multi-agent collaborations combining physics and machine learning.” arXiv preprint arXiv:2402.04268 (2024).

Clinical text

Van Veen, D., Van Uden, C., Blankemeier, L. et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat Med (2024). Github

The authors look at clinical summarization and implement quantitative assessments with syntactic, semantic, and conceptual NLP metrics. A clinical reader study with 10 physicians evaluated summary completeness, correctness and conciseness; in most cases, summaries from the best-adapted LLMs were deemed either equivalent (45%) or superior (36%) compared with summaries from medical experts. The research provides evidence of LLMs outperforming medical experts in clinical text summarization across multiple tasks. This suggests that integrating LLMs into clinical workflows could alleviate documentation burden, allowing clinicians to focus more on patient care.

Saab, K.; et al. Capabilities of Gemini Models in Medicine. arXiv May 1, 2024.

Google’s team shows Med-Gemini’s real-world utility by surpassing human experts on tasks such as medical text summarization, alongside demonstrations of promising potential for multimodal medical dialogue, medical research and education.

Data curation

Extracting Structured Data from Free-form Organic Synthesis Text

Hackathon to quickly fine-tune GPT to parse synthesis data and extract relevant chemistry-related information.

Material science

Gruver, Nate, et al. “Fine-Tuned Language Models Generate Stable Inorganic Materials as Text.” arXiv preprint arXiv:2402.04379 (2024).

Agent knowledge

Zelikman, E., Harik, G., Shao, Y., Jayasiri, V., Haber, N., & Goodman, N. D. (2024). Quiet-STaR: Language models can teach themselves to think before speaking. arXiv preprint arXiv:2403.09629.

Data extraction

Rule-based tool

LeadMine - NextMove

AI-based tool

Despite so much progress in computer vision and optical character recognition (OCR), the state of the art for converting molecule images to structures remains manual curation. Some interesting tools have been proposed for automating this using different flavors of computer-vision algorithms.

One of the core reasons this area has been underexplored seems to be that molecule patents are MADE to be tough to decipher. The format is non-standard, and Markush enumerations, along with their actual chemical space coverage, are ill-defined.

DECIMER

DECIMER Image Transformer: Deep Learning for Chemical Image Recognition using EfficientNet V2 + Transformer. V1. Extraction of chemical structures through OCSR (optical chemical structure recognition) from Steinbeck’s group.

Fan, Vincent, et al. “OpenChemIE: An information extraction toolkit for chemistry literature.” Journal of Chemical Information and Modeling (2024).

Focused on the extraction of reaction data from journals. OpenChemIE is most suited for information extraction on organic chemistry literature, where molecules are generally depicted as planar graphs or written in text and can be consolidated into a SMILES format.

Ai, Q., Meng, F., Shi, J., Pelkie, B. G., & Coley, C. W. (2024). Extracting structured data from organic synthesis procedures using a fine-tuned large language model. Digital Discovery.

Uses Llama-2 7B to extract entities from the synthesis procedures of reactions.

Code / Packages:

Automates the selection of the decision threshold for imbalanced classification tasks. The method assumes that the training and test data share similar characteristics (like the imbalance ratio).
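The general idea of picking a decision threshold from training-set predictions can be sketched as follows. This is a generic illustration, not the package's actual procedure: it assumes Cohen's kappa as the optimization metric and a simple scan over candidate thresholds, and all names and numbers are made up.

```python
def cohen_kappa(y_true, y_pred):
    """Cohen's kappa for binary labels (agreement corrected for chance)."""
    n = len(y_true)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    p_obs = (tp + tn) / n
    p_exp = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    if p_exp == 1.0:
        return 0.0
    return (p_obs - p_exp) / (1.0 - p_exp)

def select_threshold(y_true, probabilities, candidates):
    """Pick the candidate threshold that maximizes kappa on the given data."""
    return max(
        candidates,
        key=lambda t: cohen_kappa(y_true, [p >= t for p in probabilities]),
    )

# Hypothetical imbalanced training set: 2 actives out of 10 compounds.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
probabilities = [0.05, 0.1, 0.1, 0.15, 0.2, 0.2, 0.25, 0.3, 0.35, 0.45]
best = select_threshold(y_true, probabilities, [0.1, 0.2, 0.3, 0.4, 0.5])
print(best)  # 0.3
```

With the default 0.5 threshold no compound would be predicted active here; lowering the threshold recovers both actives at the cost of one false positive. The chosen value only transfers to new data if the imbalance ratio is similar, which is the assumption stated above.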

MOSES - Benchmarking platform for generative models (PyTorch Implementation). Github

Benchmarking platform to implement molecular generative models. It also provides a set of metrics to evaluate the quality and diversity of the generated molecules. A benchmark dataset (subset of ZINC) is provided for training the models.

Reinvent 4.0 - an AI tool for de novo drug design. Github

Production-ready tool for de novo design from AstraZeneca. It can be effectively applied to drug discovery projects that are striving to resolve either exploration or exploitation problems while navigating the chemical space. It is a language model with SMILES output, trained by “randomizing” the SMILES representation of the input data. Reinforcement learning is implemented to direct the model towards relevant areas of interest. Now uses PyTorch 2.0!

DeepChem aims to provide a high quality open-source toolchain that democratizes the use of deep-learning in drug discovery, materials science, quantum chemistry, and biology - from Github

ChemProp (Pytorch)

Github repository for implementing message passing neural networks for molecular property prediction as described in the paper Analyzing Learned Molecular Representations for Property Prediction by Yang et al.

Tool to generate chemical reaction networks. Includes Arkane, package for calculating thermodynamics from quantum mechanical calculations.

PyePAL

Active learning approach to efficiently and confidently identify the Pareto front with any regression model that can output a mean and a standard deviation.
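As background for what "identifying the Pareto front" means, here is a minimal sketch that extracts the non-dominated points from a set of fully known objective values, with both objectives maximized. It deliberately ignores the predictive uncertainty (mean and standard deviation) that the active-learning method relies on; the helper names and data are hypothetical.

```python
def dominates(a, b):
    """True if point a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(points):
    """Points that are not dominated by any other point (maximization)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical candidates scored on two objectives, e.g. (potency, solubility).
candidates = [(1, 5), (2, 4), (3, 3), (2, 2), (1, 1)]
front = pareto_front(candidates)
print(front)  # [(1, 5), (2, 4), (3, 3)]
```

In the active-learning setting, the model's uncertainty determines which candidates can already be classified as Pareto-optimal or discarded, and which still need to be measured.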

rxnfp

Github repository to generate chemical reaction fingerprints from reaction SMILES.

mols2grid

Interactive chemical viewer for small molecules (RDKit wrapper)

molplotly

Brings Spotfire-like capabilities to Jupyter notebooks. Medium article explaining MolPlotly. Link

ESPsim

Calculates similarities of shapes and electrostatic potentials between molecules. Pen has a nice blogpost on using it to estimate electronic similarities of common bioisosteres. blog

HotSpots: Curran, Peter R., et al. “Hotspots api: a python package for the detection of small molecule binding hotspots and application to structure-based drug design.” Journal of chemical information and modeling 60.4 (2020): 1911-1916.

Surveying protein surfaces for binding hotspots can help to evaluate target tractability and guide exploration of potential ligand binding regions.

MolPal

Active learning methodology for sampling the chemical space

Chemprop version that combines Jazzy (AZ’s workflow for predicting H-bond strength)

Datasets & Chemical libraries

Molecule datasets

PubChem: public sourced molecules
ChEMBL: bioactive molecules (mostly synthetic)
SureChEMBL: small molecules appearing in patents
ZINC: collection of synthetic molecules (not all are bioactive)
QM 7/8/9: small molecules having not more than 7/8/9 heavy atoms
GDB-17
Papyrus
COCONUT: ~400k natural products (NP); some entries are not NP
Mcule: Used in DEL enumerations
DrugBank

Reaction Datasets

USPTO
Pistachio
Reaxys
Open Reaction Database

Commercial (building block) vendors

eMolecules
Enamine
WuXi
Chembridge
Asinex
Molport
Pharmablock
Otava’s CHEMriya

Helpful utilities:

Therapeutics Data Commons is an open-science platform with AI/ML-ready datasets and learning tasks for therapeutics, spanning the discovery and development of safe and effective medicines. TDC also provides an ecosystem of tools, libraries, leaderboards, and community resources, including data functions, strategies for systematic model evaluation, meaning