Last update: 4th July 2021

Material Informatics is the solid-state, inorganic-chemistry-focused cousin of Cheminformatics, its organic chemistry counterpart. In spirit, the aims of the two are similar: Material Informatics offers a promising avenue to augment traditional material R&D processes. The idea is to amplify the conventional material discovery task with data and analytics, identify chemical spaces and structure in the data that look interesting, and then probe those rigorously using first-principles techniques and/or experimentation.

Potential applications of material informatics span microelectronics, aerospace, automotive, defense, clean energy, and health services: wherever there is demand for new advanced materials at ever greater rates and lower costs.

Application of material informatics in atomic-scale modeling:

For molecular-level modeling of material properties, concepts developed in material informatics, statistics, and ML can be used for:

  1. Descriptor-driven screening of computational models

  2. Discovering new science and relations from large computational datasets

  3. Applying surrogate models to enable fast materials development

  4. Running global optimization routines with surrogate models for composition and property predictions

Machine learning in atomic-scale modeling is often used to replace expensive ab initio methods with cheaper approximations. While this is certainly lucrative, another use case for ML is as a surrogate model that helps researchers identify interesting regions in the material space. It also helps to decode the ‘intuition’ and serendipity involved in material development, and hopefully provides a rigorous, data-driven basis for design decisions.

Below are a few reviews, articles, and resources I’ve found that document the state of the art in material informatics. It goes without saying that this is a highly biased and non-exhaustive listing, covering only the articles I’ve read. The idea of this document is to provide a starting point for understanding the general status of the field.

Special Issues and Collections:


  1. C. Chen, Y. Zuo, W. Ye, X. Li, Z. Deng, and S. P. Ong, “A Critical Review of Machine Learning of Energy Materials,” Adv. Energy Mater., p. 1903242, Jan. 2020.

  2. J. Schmidt, M. R. G. Marques, S. Botti, and M. A. L. Marques, “Recent advances and applications of machine learning in solid-state materials science,” npj Comput. Mater., vol. 5, no. 1, p. 83, Dec. 2019.

  3. J. Noh, G. H. Gu, S. Kim, and Y. Jung, “Machine-enabled inverse design of inorganic solid materials: promises and challenges,” Chem. Sci., vol. 11, no. 19, pp. 4871–4881, 2020.

  4. S. M. Moosavi, K. M. Jablonka, and B. Smit, “The Role of Machine Learning in the Understanding and Design of Materials,” J. Am. Chem. Soc., p. jacs.0c09105, Nov. 2020.

  5. F. Häse, L. M. Roch, P. Friederich, and A. Aspuru-Guzik, “Designing and understanding light-harvesting devices with machine learning,” Nat. Commun., vol. 11, no. 1, pp. 1–11, 2020.

  6. M. Moliner, Y. Román-Leshkov, and A. Corma, “Machine Learning Applied to Zeolite Synthesis: The Missing Link for Realizing High-Throughput Discovery,” Acc. Chem. Res., vol. 52, no. 10, pp. 2971–2980, 2019.

  7. H. Tao, T. Wu, M. Aldeghi, et al., “Nanoparticle synthesis assisted by machine learning,” Nat. Rev. Mater., 2021.

Best practices in material informatics:

A. Y. T. Wang et al., “Machine Learning for Materials Scientists: An Introductory Guide toward Best Practices,” Chem. Mater., vol. 32, no. 12, pp. 4954–4965, 2020.

Featurizations possible:

As in other machine-learning efforts, the featurization (the descriptors used to convert material entries into a machine-readable format) is crucial for the eventual performance of any statistical model. Over the years there has been tremendous progress in describing periodic crystal structures. Some of the key articles I’ve liked are listed below:


  • A. P. Bartók, R. Kondor, and G. Csányi, “On representing chemical environments,” Phys. Rev. B - Condens. Matter Mater. Phys., vol. 87, no. 18, pp. 1–16, 2013.

  • A. Seko, H. Hayashi, K. Nakayama, A. Takahashi, and I. Tanaka, “Representation of compounds for machine-learning prediction of physical properties,” Phys. Rev. B, vol. 95, no. 14, pp. 1–11, 2017.

  • K. T. Schütt, H. Glawe, F. Brockherde, A. Sanna, K. R. Müller, and E. K. U. Gross, “How to represent crystal structures for machine learning: Towards fast prediction of electronic properties,” Phys. Rev. B - Condens. Matter Mater. Phys., vol. 89, no. 20, pp. 1–5, 2014.


1. Composition based:

Predicting properties of crystalline compounds using a representation built from attributes derived from the Voronoi tessellation of the structure together with composition-based features. The authors report that this is twice as accurate as existing methods and scales to large training set sizes. The representation is also insensitive to changes in the volume of a crystal, which makes it possible to predict the properties of the crystal without needing the DFT-relaxed geometry as input. A random forest algorithm is used for the prediction.
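To make the idea of composition-based features concrete, here is a minimal sketch of fraction-weighted statistics over elemental properties. It is not the feature set used in the article (the Voronoi-derived attributes are not shown), and the names `ELEMENT_DATA`, `parse_formula`, and `composition_features` are my own placeholders. The small lookup table only holds Pauling electronegativities and atomic numbers for four elements.

```python
import re

# Tiny illustrative lookup table (Pauling electronegativity, atomic number).
# A real featurizer would cover the whole periodic table and many more properties.
ELEMENT_DATA = {
    "Na": {"electronegativity": 0.93, "atomic_number": 11},
    "Cl": {"electronegativity": 3.16, "atomic_number": 17},
    "Ti": {"electronegativity": 1.54, "atomic_number": 22},
    "O": {"electronegativity": 3.44, "atomic_number": 8},
}


def parse_formula(formula):
    """Parse a simple formula such as 'TiO2' into {'Ti': 1.0, 'O': 2.0}.

    Brackets and nested groups are not handled in this sketch.
    """
    counts = {}
    for symbol, amount in re.findall(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?", formula):
        counts[symbol] = counts.get(symbol, 0.0) + (float(amount) if amount else 1.0)
    return counts


def composition_features(formula, prop):
    """Return fraction-weighted mean, average deviation, and range of one property."""
    counts = parse_formula(formula)
    total = sum(counts.values())
    fractions = {el: n / total for el, n in counts.items()}
    values = {el: ELEMENT_DATA[el][prop] for el in fractions}
    mean = sum(fractions[el] * values[el] for el in fractions)
    avg_dev = sum(fractions[el] * abs(values[el] - mean) for el in fractions)
    return {
        "mean": mean,
        "avg_dev": avg_dev,
        "range": max(values.values()) - min(values.values()),
    }


if __name__ == "__main__":
    print(composition_features("NaCl", "electronegativity"))
    print(composition_features("TiO2", "atomic_number"))
```

Feature vectors like these depend only on the formula, so they can be computed without any relaxed geometry; a tree-based regressor such as a random forest could then be trained on them.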

Using attention-based graph networks on material composition to predict material properties.

Similar in spirit to the previous article: here the authors use material composition to generate weighted graphs and predict material properties. The work also considers ensemble-based uncertainty estimates.
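As a rough illustration of what an ensemble-based uncertainty estimate means, the sketch below takes the spread of predictions across several models as the uncertainty. The "models" are hypothetical stand-in callables (simple linear functions), not trained graph networks, and `ensemble_predict` is a name I made up for this example.

```python
from statistics import mean, stdev


def make_linear_model(slope, intercept):
    """Hypothetical stand-in for one independently trained ensemble member."""
    def model(x):
        return slope * x + intercept
    return model


def ensemble_predict(models, x):
    """Return (mean prediction, sample standard deviation) across ensemble members."""
    predictions = [model(x) for model in models]
    return mean(predictions), stdev(predictions)


if __name__ == "__main__":
    ensemble = [
        make_linear_model(2.0, 0.0),
        make_linear_model(2.0, 0.2),
        make_linear_model(2.0, -0.2),
    ]
    mu, sigma = ensemble_predict(ensemble, 1.0)
    print(f"prediction = {mu:.3f} +/- {sigma:.3f}")
```

In practice each member would be the same architecture trained from a different random initialization or data split; large disagreement between members flags inputs where the prediction should be trusted less.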

2. Structure based:

  • T. Xie and J. C. Grossman, “Crystal Graph Convolutional Neural Networks for an Accurate and Interpretable Prediction of Material Properties,” Phys. Rev. Lett., vol. 120, no. 14, p. 145301, 2018.

Material modeling benchmark studies:

Investigates whether ML models can distinguish materials with respect to thermodynamic stability and not just formation energies. Learning formation energy from composition alone looks fine when judged by MAE and RMSE. Reports that graph-based methods reduce the MAE by roughly 50% compared with the best-performing compositional model, and shows that including structural information is advantageous when predicting formation energies.
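The distinction between regression error and stability classification can be sketched in a few lines. The function names and the energy-above-hull numbers below are made up for illustration and are not taken from the benchmark; "stable" is assumed to mean an energy above the hull at or below a tolerance.

```python
from math import sqrt


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def stability_agreement(e_hull_true, e_hull_pred, tol=0.0):
    """Fraction of entries whose predicted stable/unstable label matches the reference.

    An entry is labeled stable when its energy above the hull is <= tol.
    """
    matches = sum((t <= tol) == (p <= tol) for t, p in zip(e_hull_true, e_hull_pred))
    return matches / len(e_hull_true)


if __name__ == "__main__":
    # Made-up energies above the hull in eV/atom, for illustration only.
    reference = [0.00, 0.02, 0.00, 0.05]
    predicted = [0.03, -0.01, 0.00, 0.06]
    print("MAE :", mae(reference, predicted))
    print("RMSE:", rmse(reference, predicted))
    print("stability agreement:", stability_agreement(reference, predicted))
```

With these toy numbers the errors are only a few tens of meV/atom, yet half of the stable/unstable labels are wrong, because stability depends on small energy differences near the hull.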

Considers various encoding schemes and machine learning models to predict single-adsorbate binding energies for carbon-based adsorbates on transition metal surfaces. The authors show that linear methods and scaling relationships hold up well compared to ML methods. They find that for ML models to succeed, it is not necessary to use advanced (geometric) coordinate-based descriptors; simple descriptors, such as bond count, can provide satisfactory results. As many catalysis and materials science problems require significant time to generate each data point, in many cases the ML models would need to work with a relatively small dataset.


There is a rich and long history of using statistical models and data mining for predicting bulk inorganic crystal properties. The review articles mentioned in the above section discuss those areas quite nicely.

This section focuses in particular on work applying informatics to encode surfaces for modeling heterogeneous catalysts, which is a fairly new and very active research direction:

  • Ma, X., Li, Z., Achenie, L.E.K., and Xin, H. (2015). Machine-learning-augmented chemisorption model for CO2 electroreduction catalyst screening. J. Phys. Chem. Lett. 6, 3528–3533.

  • F. Liu, S. Yang, and A. J. Medford, “Scalable approach to high coverages on oxides via iterative training of a machine-learning algorithm,” ChemCatChem, vol. 12, no. 17, pp. 4317–4330, 2020.

  • C. S. Praveen and A. Comas-Vives, “Design of an Accurate Machine Learning Algorithm to Predict the Binding Energies of Several Adsorbates on Multiple Sites of Metal Surfaces,” ChemCatChem, 2020.

  • Z. Li, L. E. K. Achenie, and H. Xin, “An Adaptive Machine Learning Strategy for Accelerating Discovery of Perovskite Electrocatalysts,” ACS Catal., vol. 10, no. 7, pp. 4377–4384, 2020.

  • R. García-Muelas and N. López, “Statistical learning goes beyond the d-band model providing the thermochemistry of adsorbates on transition metals,” Nat. Commun., vol. 10, no. 1, p. 4687, Dec. 2019.

  • M. Rueck, B. Garlyyev, F. Mayr, A. S. Bandarenka, and A. Gagliardi, “Oxygen Reduction Activities of Strained Platinum Core-Shell Electrocatalysts Predicted by Machine Learning,” J. Phys. Chem. Lett., 2020.

  • W. Xu, M. Andersen, and K. Reuter, “Data-Driven Descriptor Engineering and Refined Scaling Relations for Predicting Transition Metal Oxide Reactivity,” ACS Catal., vol. 11, no. 2, pp. 734–742, Jan. 2021.

Graph-network based approaches for encoding and predicting surface binding energies:

  • Back, S. et al. Convolutional Neural Network of Atomic Surface Structures to Predict Binding Energies for High-Throughput Screening of Catalysts. J. Phys. Chem. Lett. 10, 4401–4408 (2019)

  • Lym, J., Gu, G. H., Jung, Y. & Vlachos, D. G. Lattice convolutional neural network modeling of adsorbate coverage effects. J. Phys. Chem. C 123, 18951–18959 (2019).

Adsorbate binding predictions have been recently extended to cover high-entropy alloy surfaces as well:

  • T. A. A. Batchelor et al., “Complex solid solution electrocatalyst discovery by computational prediction and high‐throughput experimentation,” Angew. Chemie Int. Ed., p. anie.202014374, Dec. 2020.

  • J. K. Pedersen, T. A. A. Batchelor, D. Yan, L. E. J. Skjegstad, and J. Rossmeisl, “Surface electrocatalysis on high-entropy alloys,” Curr. Opin. Electrochem., vol. 26, p. 100651, Apr. 2021.

  • Z. Lu, Z. W. Chen, and C. V. Singh, “Neural Network-Assisted Development of High-Entropy Alloy Catalysts: Decoupling Ligand and Coordination Effects,” Matter, vol. 3, no. 4, pp. 1318–1333, 2020.


CGCNN used as a binary classifier for synthesizability. The label is known only for the positive cases (that is, experimental data), and these are used as a proxy to train the model to learn what makes a material positive.

Global optimization methods:

  • M. K. Bisbo and B. Hammer, “Efficient global structure optimization with a machine learned surrogate model,” Phys. Rev. Lett., vol. 124, no. 8, p. 86102, 2019.

  • J. Dean, M. G. Taylor, and G. Mpourmpakis, “Unfolding adsorption on metal nanoparticles: Connecting stability with catalysis,” Sci. Adv., vol. 5, no. 9, 2019.

Uncertainty quantification (UQ):

A method to quantify the uncertainty of DFT errors that accounts for two sources of uncertainty: experimental data and model parameters. Energy corrections are fitted using a set of 222 binary and ternary compounds for which both experimental and computed values are available. Quantifying this uncertainty can help reveal cases where empirically corrected DFT calculations are of limited use for differentiating between stable and unstable phases. The approach is validated on a Sc-W-O phase diagram analysis.

Proposes Bayesian networks, a type of probabilistic graphical model, to integrate physics- and chemistry-based data and uncertainty. The framework is demonstrated by searching for the optimal reaction rate and oxygen binding energy for the oxygen reduction reaction (ORR) using the volcano model. The model is able to identify the sources of uncertainty in its predictions.

Helpful overview and benchmark of various model flavors and metrics for reporting the confidence in model predictions of material properties. It looks into an interesting Convolution-Fed Gaussian Process (CFGP) framework, a combination of CGCNN and GP in which the pooled outputs of the network's convolutional layers are used as features in a new GP. This was also the best model in their collection. Nice overview of the different metrics used for comparing UQ methods.
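One common way to check whether predicted uncertainties are trustworthy is a coverage (calibration) check: compare how often the true value falls within k predicted standard deviations against what a Gaussian would imply. The sketch below is a generic version of that idea under a Gaussian assumption; I am not claiming it is the exact metric used in the benchmark, and the function names are my own.

```python
from math import erf, sqrt


def expected_coverage(k):
    """Probability that a Gaussian variable lies within k standard deviations of its mean."""
    return erf(k / sqrt(2))


def observed_coverage(y_true, mu, sigma, k):
    """Fraction of true values that fall within k predicted standard deviations."""
    inside = sum(abs(t - m) <= k * s for t, m, s in zip(y_true, mu, sigma))
    return inside / len(y_true)


def calibration_table(y_true, mu, sigma, ks=(0.5, 1.0, 2.0)):
    """List of (k, expected coverage, observed coverage) rows."""
    return [
        (k, expected_coverage(k), observed_coverage(y_true, mu, sigma, k))
        for k in ks
    ]


if __name__ == "__main__":
    # Made-up targets, predicted means, and predicted standard deviations.
    y_true = [0.0, 1.0, 2.0, 3.0]
    mu = [0.5, 1.0, 2.0, 5.0]
    sigma = [1.0, 1.0, 1.0, 1.0]
    for k, expected, observed in calibration_table(y_true, mu, sigma):
        print(f"k={k}: expected {expected:.3f}, observed {observed:.3f}")
```

Observed coverage well below the expected value suggests overconfident uncertainties, while coverage well above it suggests the model is underconfident.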

Active learning:

Active learning algorithm to find the Pareto front in multi-objective optimization, applied to de novo polymer design. Ranking materials in a multi-objective optimization task is sometimes biased, so instead of ranking the candidates, the authors aim to identify an approximate Pareto front. Candidates are selected based on their proximity to the Pareto front, which is itself defined by geometric rules.
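For readers new to the term, the Pareto front is the set of candidates that no other candidate beats in every objective. The sketch below computes it exactly for a list of already-evaluated points, assuming all objectives are to be maximized. It does not implement the active learning loop from the article, which works with model predictions and their uncertainties; the function names and toy points are my own.

```python
def dominates(a, b):
    """True if a is at least as good as b in all objectives and strictly better in one.

    All objectives are assumed to be maximized.
    """
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Return the non-dominated points, preserving input order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


if __name__ == "__main__":
    # Toy two-objective values, for illustration only.
    candidates = [(1, 5), (2, 4), (3, 3), (2, 2), (1, 1), (3, 1)]
    print(pareto_front(candidates))
```

This brute-force check scales quadratically with the number of points, which is fine for small candidate pools; the point of an active learning scheme is to avoid evaluating most candidates in the first place.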

Surrogate optimizer and accelerating TS searches:

  • O.-P. Koistinen, F. B. Dagbjartsdóttir, V. Ásgeirsson, A. Vehtari, and H. Jónsson, “Nudged elastic band calculations accelerated with Gaussian process regression,” J. Chem. Phys., vol. 147, no. 15, p. 152720, Oct. 2017.

  • J. A. Garrido Torres, P. C. Jennings, M. H. Hansen, J. R. Boes, and T. Bligaard, “Low-Scaling Algorithm for Nudged Elastic Band Calculations Using a Surrogate Machine Learning Model,” Phys. Rev. Lett., vol. 122, no. 15, pp. 1–6, 2019.

  • E. Garijo del Río, J. J. Mortensen, and K. W. Jacobsen, “Local Bayesian optimizer for atomic structures,” Phys. Rev. B, vol. 100, no. 10, pp. 1–9, 2019.

Combining experiments + theory:

Reaction Network Predictions:

The newest version of RMG (v3) has been updated to Python 3. It can generate heterogeneous catalysis models and perform uncertainty analysis, including first-order sensitivity analysis. The RMG dataset of thermochemical and kinetic parameters has been expanded.

Develops a multi-reactant representation scheme to handle arbitrary reactant-product pairs, and applies it to understand the electrochemical reaction network of the Li-ion solid electrolyte interphase.

Chemical reaction network model to predict synthesis pathways for exotic oxides. Solid-state synthesis procedures for YMnO3, Y2Mn2O7, Fe2SiS4, and YBa2Cu3O6.5 are proposed and compared to literature pathways. Finally, the algorithm is applied to search for a probable synthesis route to MgMo3(PO4)3O, a battery cathode material that has yet to be synthesized.

Generative Models:


J. Noh et al., “Inverse Design of Solid-State Materials via a Continuous Representation,” Matter, vol. 1, no. 5, pp. 1370–1384, 2019.


Semantically constrained graph-based code for representing MOFs, enabling target-property-directed optimization. MOFs are encoded as edges, vertices, and topologies: edges are molecular fragments with two connecting points, vertices contain node information, and topologies indicate a definite framework. A Supramolecular Variational Autoencoder (SmVAE), with components that oversee encoding and decoding of each part of the MOF, maps the frameworks' discrete representations (RFcodes) into continuous vectors (z) and back.


While we can attribute the recent interest in material informatics to the democratization of data analytics and ML packages, the growing set of benchmark materials datasets from multiple research institutions has been crucial for developing new methods and algorithms and for providing a consistent basis for comparison.

Datasets of adsorbates on heterogeneous catalyst surfaces:

  • Catalysis Hub from SUNCAT

The Surface Reactions database contains thousands of reaction energies and barriers from density functional theory (DFT) calculations on surface systems.

Besides providing a collection of over 130,000 inorganic compounds and 49,000 molecules (and counting) with calculated phase diagrams and structural, thermodynamic, electronic, magnetic, and topological properties, it also provides analysis tools for post-processing.

815,000+ materials with calculated thermodynamic and structural properties.

210,000+ inorganic crystal structures from literature. Requires subscription.

Includes calculated materials properties, 2D materials, and tools for ML and high-throughput tight-binding.

Structural, thermodynamic, elastic, electronic, magnetic, and optical properties of around 4000 two-dimensional (2D) materials distributed over more than 40 different crystal structures.

Millions of materials and calculated properties, focusing on alloys.

  • Citrination

Contributed and curated datasets from Citrine Informatics

Fascinating resource linking scientific publications using the Pauling File database (relational database of published literature for material scientists)


Active learning approach to efficiently and confidently identify the Pareto front with any regression model that can output a mean and a standard deviation.