Material Informatics Literature and Resources
List of resources and the state of the art for Material Informatics
 Special Issues and Collections:
 Reviews:
 Featurizations possible:
 Material modeling benchmark studies:
 Articles:
 Global optimization methods:
 Uncertainty quantification (UQ):
 Active learning:
 Surrogate optimizer and accelerating TS searches:
 Combining experiments + theory:
 Reaction Network Predictions:
 Generative Models:
 Datasets:
 Packages:
Last update: 4th July 2021
Material Informatics is the solid-state, inorganic-chemistry-focused cousin of its organic chemistry contemporary, Cheminformatics. In spirit, the aim of Material Informatics is similar to that of Cheminformatics: it offers a promising avenue to augment traditional material R&D processes. The idea is to amplify the conventional material discovery task using data and analytics, identify chemical spaces and structure in the data that are interesting, and probe those rigorously using first-principles techniques and/or experimentation.
Potential applications of material informatics span microelectronics, aerospace, automotive, defense, clean energy, and health services: wherever there’s a demand for new advanced materials at ever greater rates and lower costs.
Application of material informatics in atomic-scale modeling:
In the case of molecular-level modeling of material properties, concepts developed in material informatics, statistics, and ML can be used for:

Descriptor-driven screening of computational models

Discover new science and relations from large computational datasets

Applying surrogate models to enable fast materials development

Undertake global optimization routines using surrogate models for composition and property predictions.
Machine learning in atomic-scale modeling is often used to replace expensive ab initio methods with cheaper approximations. While that is certainly lucrative, an additional use case for ML is as a surrogate model that helps researchers identify interesting regions in the material space. It also helps to decode the ‘intuition’ and serendipity involved in material development and, hopefully, provide a rigorous data-driven basis for design decisions.
Below are a few reviews, articles, and resources I’ve found that document the state of the art in material informatics. It goes without saying that this is a highly biased and non-exhaustive listing, covering only the articles I’ve read. The idea of this document is to provide a starting point for understanding the general status of the field.
Special Issues and Collections:

Matter journal’s Material prediction using data and ML prediction

Nature Communications compendium on ML for material modelling
Reviews:

C. Chen, Y. Zuo, W. Ye, X. Li, Z. Deng, and S. P. Ong, “A Critical Review of Machine Learning of Energy Materials,” Adv. Energy Mater., vol. 10, p. 1903242, Jan. 2020.

J. Schmidt, M. R. G. Marques, S. Botti, and M. A. L. Marques, “Recent advances and applications of machine learning in solid-state materials science,” npj Comput. Mater., vol. 5, no. 1, p. 83, Dec. 2019.

J. Noh, G. H. Gu, S. Kim, and Y. Jung, “Machine-enabled inverse design of inorganic solid materials: promises and challenges,” Chem. Sci., vol. 11, no. 19, pp. 4871–4881, 2020.

S. M. Moosavi, K. M. Jablonka, and B. Smit, “The Role of Machine Learning in the Understanding and Design of Materials,” J. Am. Chem. Soc., p. jacs.0c09105, Nov. 2020.

F. Häse, L. M. Roch, P. Friederich, and A. Aspuru-Guzik, “Designing and understanding light-harvesting devices with machine learning,” Nat. Commun., vol. 11, no. 1, pp. 1–11, 2020.

M. Moliner, Y. Román-Leshkov, and A. Corma, “Machine Learning Applied to Zeolite Synthesis: The Missing Link for Realizing High-Throughput Discovery,” Acc. Chem. Res., vol. 52, no. 10, pp. 2971–2980, 2019.
Best practices in material informatics:
A. Y. T. Wang et al., “Machine Learning for Materials Scientists: An Introductory Guide toward Best Practices,” Chem. Mater., vol. 32, no. 12, pp. 4954–4965, 2020.
Featurizations possible:
As in other machine-learning development efforts, the featurization (the descriptors used to convert material entries into a machine-readable format) is crucial for the eventual performance of any statistical model. Over the years there has been tremendous progress in describing periodic solid crystal structures. Some of the key articles I’ve liked are mentioned below:
Reviews:

A. P. Bartók, R. Kondor, and G. Csányi, “On representing chemical environments,” Phys. Rev. B - Condens. Matter Mater. Phys., vol. 87, no. 18, pp. 1–16, 2013.

A. Seko, H. Hayashi, K. Nakayama, A. Takahashi, and I. Tanaka, “Representation of compounds for machine-learning prediction of physical properties,” Phys. Rev. B, vol. 95, no. 14, pp. 1–11, 2017.

K. T. Schütt, H. Glawe, F. Brockherde, A. Sanna, K. R. Müller, and E. K. U. Gross, “How to represent crystal structures for machine learning: Towards fast prediction of electronic properties,” Phys. Rev. B - Condens. Matter Mater. Phys., vol. 89, no. 20, pp. 1–5, 2014.
Articles:
1. Composition based:
Predict properties of crystalline compounds using a representation that combines attributes derived from the Voronoi tessellation of the structure with composition-based features. The authors report that it is twice as accurate as existing methods and scales to large training set sizes. The representation is also insensitive to changes in the volume of a crystal, which makes it possible to predict the properties of the crystal without needing the DFT-relaxed geometry as input. A random forest algorithm is used for the prediction.
Using attention-based graph networks on material composition to predict material properties.
Similar in spirit to the previous article: here the authors use material composition to generate weighted graphs and predict material properties. They also consider ensemble-based uncertainty estimates.
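To make the idea of composition-based features concrete, here is a minimal sketch that turns a chemical formula into fraction-weighted statistics of elemental properties. It is not the featurization used in any of the articles above: the function names, the tiny property table, and the choice of statistics (mean and range) are assumptions made for illustration, and the Voronoi-tessellation attributes are left out entirely.

```python
import re

# Small illustrative table (Pauling electronegativity, atomic mass in u).
# A real featurizer would draw on a much larger set of elemental properties.
ELEMENT_PROPERTIES = {
    "Na": {"electronegativity": 0.93, "atomic_mass": 22.990},
    "Cl": {"electronegativity": 3.16, "atomic_mass": 35.45},
    "Ti": {"electronegativity": 1.54, "atomic_mass": 47.867},
    "O": {"electronegativity": 3.44, "atomic_mass": 15.999},
}


def parse_formula(formula):
    """Parse a simple formula such as 'TiO2' into {'Ti': 1.0, 'O': 2.0}.

    Hypothetical helper: brackets and hydrates are not supported.
    """
    counts = {}
    for symbol, amount in re.findall(r"([A-Z][a-z]?)(\d*\.?\d*)", formula):
        counts[symbol] = counts.get(symbol, 0.0) + (float(amount) if amount else 1.0)
    return counts


def composition_features(formula):
    """Return fraction-weighted mean and range of each elemental property."""
    counts = parse_formula(formula)
    total = sum(counts.values())
    fractions = {el: n / total for el, n in counts.items()}
    features = {}
    for prop in ("electronegativity", "atomic_mass"):
        values = [ELEMENT_PROPERTIES[el][prop] for el in fractions]
        features[f"mean_{prop}"] = sum(
            fractions[el] * ELEMENT_PROPERTIES[el][prop] for el in fractions
        )
        features[f"range_{prop}"] = max(values) - min(values)
    return features


tio2_features = composition_features("TiO2")
```

Because the vector depends only on the formula, it is unchanged by any rescaling of the cell volume. A vector like this could then be passed to any regressor, such as a random forest.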
2. Structural based:
 T. Xie and J. C. Grossman, “Crystal Graph Convolutional Neural Networks for an Accurate and Interpretable Prediction of Material Properties,” Phys. Rev. Lett., vol. 120, no. 14, p. 145301, 2018.
Material modeling benchmark studies:
Investigate whether ML models can distinguish materials with respect to thermodynamic stability and not just formation energies. Learning formation energy from composition alone looks fine when judged by MAE and RMSE. The authors report that graph-based methods reduce the MAE by roughly 50% compared with the best-performing compositional model, and show that including structural information is advantageous when predicting formation energies.
Consider various encoding schemes and machine learning models to predict single-adsorbate binding energies for carbon-based adsorbates on transition metal surfaces. The authors show that linear methods and scaling relationships hold up well compared to ML methods. They found that for ML models to succeed, it is not necessary to use advanced (geometric) coordinate-based descriptors; simple descriptors, such as bond count, can provide satisfactory results. Since many catalysis and materials science problems require significant time to generate each data point, in many cases the ML models would need to work with a relatively small dataset.
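The kind of linear baseline this benchmark argues for can be sketched in a few lines: an ordinary least-squares fit of binding energy against a single simple descriptor, scored by MAE. This is only an illustration of the baseline idea, not the authors' workflow. The function names are my own, and the numbers are placeholders rather than DFT results.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_absolute_error(y_true, y_pred):
    """Average absolute deviation between reference and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Placeholder numbers for illustration only; these are NOT DFT results.
bond_counts = [1, 2, 3, 4]                   # simple descriptor (hypothetical)
binding_energies = [-0.5, -1.1, -1.4, -2.0]  # hypothetical values in eV

slope, intercept = fit_line(bond_counts, binding_energies)
predictions = [slope * b + intercept for b in bond_counts]
baseline_mae = mean_absolute_error(binding_energies, predictions)
```

A baseline like this gives a reference MAE that any more elaborate model trained on the same small dataset has to beat.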
Articles:
There is a rich and long history of using statistical models and data mining to predict bulk inorganic crystal properties. The review articles mentioned in the section above discuss those areas quite nicely.
This section focuses in particular on works applying informatics to encode surfaces for modeling heterogeneous catalysts, which is a fairly new and very active research direction:

Ma, X., Li, Z., Achenie, L.E.K., and Xin, H. (2015). Machine-learning-augmented chemisorption model for CO2 electroreduction catalyst screening. J. Phys. Chem. Lett. 6, 3528–3533.

F. Liu, S. Yang, and A. J. Medford, “Scalable approach to high coverages on oxides via iterative training of a machine-learning algorithm,” ChemCatChem, vol. 12, no. 17, pp. 4317–4330, 2020.

C. S. Praveen and A. Comas-Vives, “Design of an Accurate Machine Learning Algorithm to Predict the Binding Energies of Several Adsorbates on Multiple Sites of Metal Surfaces,” ChemCatChem, 2020.

Z. Li, L. E. K. Achenie, and H. Xin, “An Adaptive Machine Learning Strategy for Accelerating Discovery of Perovskite Electrocatalysts,” ACS Catal., vol. 10, no. 7, pp. 4377–4384, 2020.

R. García-Muelas and N. López, “Statistical learning goes beyond the d-band model providing the thermochemistry of adsorbates on transition metals,” Nat. Commun., vol. 10, no. 1, p. 4687, Dec. 2019.

M. Rueck, B. Garlyyev, F. Mayr, A. S. Bandarenka, and A. Gagliardi, “Oxygen Reduction Activities of Strained Platinum Core-Shell Electrocatalysts Predicted by Machine Learning,” J. Phys. Chem. Lett., 2020.

W. Xu, M. Andersen, and K. Reuter, “Data-Driven Descriptor Engineering and Refined Scaling Relations for Predicting Transition Metal Oxide Reactivity,” ACS Catal., vol. 11, no. 2, pp. 734–742, Jan. 2021.

Graph-network-based approaches for encoding and predicting surface binding energies:

Back, S. et al. Convolutional Neural Network of Atomic Surface Structures to Predict Binding Energies for High-Throughput Screening of Catalysts. J. Phys. Chem. Lett. 10, 4401–4408 (2019).

Lym, J., Gu, G. H., Jung, Y. & Vlachos, D. G. Lattice convolutional neural network modeling of adsorbate coverage effects. J. Phys. Chem. C 123, 18951–18959 (2019).
Adsorbate binding predictions have recently been extended to cover high-entropy alloy surfaces as well:

T. A. A. Batchelor et al., “Complex solid solution electrocatalyst discovery by computational prediction and high‐throughput experimentation,” Angew. Chemie Int. Ed., p. anie.202014374, Dec. 2020.

J. K. Pedersen, T. A. A. Batchelor, D. Yan, L. E. J. Skjegstad, and J. Rossmeisl, “Surface electrocatalysis on high-entropy alloys,” Curr. Opin. Electrochem., vol. 26, p. 100651, Apr. 2021.

Z. Lu, Z. W. Chen, and C. V. Singh, “Neural Network-Assisted Development of High-Entropy Alloy Catalysts: Decoupling Ligand and Coordination Effects,” Matter, vol. 3, no. 4, pp. 1318–1333, 2020.
Miscellaneous
CGCNN used as a binary classifier for synthesizability. Labels are available only for the positive cases (that is, experimental data), and these are used as a proxy to train the model to learn what makes a material positive.
Global optimization methods:

M. K. Bisbo and B. Hammer, “Efficient global structure optimization with a machine learned surrogate model,” Phys. Rev. Lett., vol. 124, no. 8, p. 86102, 2019.

J. Dean, M. G. Taylor, and G. Mpourmpakis, “Unfolding adsorption on metal nanoparticles: Connecting stability with catalysis,” Sci. Adv., vol. 5, no. 9, 2019.
Uncertainty quantification (UQ):
A method to quantify the uncertainty of empirical DFT energy corrections that accounts for both sources of uncertainty: experimental data and model parameters. Energy corrections are fit using a set of 222 binary and ternary compounds for which both experimental and computed values are available. Quantifying this uncertainty can help reveal cases where empirically corrected DFT calculations are too limited to differentiate between stable and unstable phases. The approach is validated on a Sc-W-O phase diagram analysis.
Propose Bayesian networks, a type of probabilistic graphical model, to integrate physics- and chemistry-based data and uncertainty. The framework is demonstrated by searching for the optimal reaction rate and oxygen binding energy for the oxygen reduction reaction (ORR) using the volcano model. The model is able to attribute the uncertainty to its source.
Helpful overview and benchmark of various model flavors and metrics for reporting the confidence in model predictions of material properties. One interesting framework examined is the Convolution-Fed Gaussian Process (CFGP), a combination of CGCNN and GP in which the pooled outputs of the convolutional layers of the network are used as features in a new GP. This was also the best model in their collection. Nice overview of the different metrics used for comparing UQ methods.
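One common way to check whether predicted uncertainties are trustworthy is a calibration check. For each nominal confidence level, count how often the true value falls inside the corresponding predicted interval. The sketch below assumes Gaussian predictive distributions described by a mean and a standard deviation. The function names are my own, and this is not necessarily the exact metric definition used in the benchmark above.

```python
from statistics import NormalDist


def observed_coverage(y_true, y_mean, y_std, confidence):
    """Fraction of true values inside the central `confidence` interval.

    Assumes each prediction is Gaussian with the given mean and standard deviation.
    """
    # Half-width of the central interval in units of the standard deviation.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    inside = sum(abs(t - m) <= z * s for t, m, s in zip(y_true, y_mean, y_std))
    return inside / len(y_true)


def calibration_curve(y_true, y_mean, y_std, levels=(0.5, 0.68, 0.9, 0.95)):
    """Pairs of (nominal confidence, observed coverage) for each level."""
    return [(c, observed_coverage(y_true, y_mean, y_std, c)) for c in levels]
```

For a well-calibrated model the observed coverage tracks the nominal confidence. Coverage well below nominal suggests overconfident uncertainties, and coverage well above suggests overly wide ones.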
Active learning:

A. Seko and S. Ishiwata, “Prediction of perovskite-related structures in ACuO3-x (A = Ca, Sr, Ba, Sc, Y, La) using density functional theory and Ba,” Phys. Rev. B, vol. 101, no. 13, p. 134101, Apr. 2020.

K. Tran and Z. W. Ulissi, Active learning across intermetallics to guide discovery of electrocatalysts for CO2 reduction and H2 evolution, vol. 1, no. 9. Springer US, 2018.

D. Xue, P. V. Balachandran, J. Hogden, J. Theiler, D. Xue, and T. Lookman, “Accelerated search for materials with targeted properties by adaptive design,” Nat. Commun., vol. 7, pp. 1–9, 2016.

Deshwal A, Simon C, Doppa JR. Bayesian optimization of nanoporous materials. ChemRxiv. 2021
An active learning algorithm to find the Pareto front in multi-objective optimization, applied to de novo polymer design. Ranking materials in a multi-objective optimization task is sometimes biased. Instead of ranking the candidates, the authors identify an approximate Pareto front. Candidates are selected based on their proximity to the Pareto front, which itself is defined by geometric rules.
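For reference, the Pareto front itself is easy to state in code once all objective values are known: keep every candidate that no other candidate dominates. The sketch below assumes all objectives are to be maximized, and the function names are my own. It shows only the exact definition; the article's contribution is approximating this front with few evaluations, which is not reproduced here.

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` in every objective and strictly
    better in at least one (all objectives are maximized)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Return the non-dominated subset of `points`, preserving input order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical candidates scored on two objectives (higher is better for both).
candidates = [(1, 5), (2, 4), (3, 3), (2, 2), (1, 1), (3, 1)]
front = pareto_front(candidates)
```

This brute-force filter scales quadratically with the number of candidates, which is fine for a screening list but is not a substitute for the active learning strategy when each evaluation is expensive.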
Surrogate optimizer and accelerating TS searches:

O.-P. Koistinen, F. B. Dagbjartsdóttir, V. Ásgeirsson, A. Vehtari, and H. Jónsson, “Nudged elastic band calculations accelerated with Gaussian process regression,” J. Chem. Phys., vol. 147, no. 15, p. 152720, Oct. 2017.

J. A. Garrido Torres, P. C. Jennings, M. H. Hansen, J. R. Boes, and T. Bligaard, “Low-Scaling Algorithm for Nudged Elastic Band Calculations Using a Surrogate Machine Learning Model,” Phys. Rev. Lett., vol. 122, no. 15, pp. 1–6, 2019.

E. Garijo del Río, J. J. Mortensen, and K. W. Jacobsen, “Local Bayesian optimizer for atomic structures,” Phys. Rev. B, vol. 100, no. 10, pp. 1–9, 2019.
Combining experiments + theory:

E. O. Ebikade, Y. Wang, N. Samulewicz, B. Hasa, and D. Vlachos, “Active learning-driven quantitative synthesis–structure–property relations for improving performance and revealing active sites of nitrogen-doped carbon for the hydrogen evolution reaction,” React. Chem. Eng., 2020.

A. Smith, A. Keane, J. A. Dumesic, G. W. Huber, and V. M. Zavala, “A machine learning framework for the analysis and prediction of catalytic activity from experimental data,” Appl. Catal. B Environ., vol. 263, no. October 2019, p. 118257, 2020.

M. Zhong et al., Accelerated discovery of CO2 electrocatalysts using active machine learning, vol. 581, no. 7807. 2020.

A. J. Saadun et al., “Performance of Metal-Catalyzed Hydrodebromination of Dibromomethane Analyzed by Descriptors Derived from Statistical Learning,” ACS Catal., vol. 10, no. 11, pp. 6129–6143, Jun. 2020.

Materials genes of heterogeneous catalysis from clean experiments and artificial intelligence

N. Artrith, Z. Lin, and J. G. Chen, “Predicting the Activity and Selectivity of Bimetallic Metal Catalysts for Ethanol Reforming using Machine Learning,” ACS Catal., vol. 10, no. 16, pp. 9438–9444, Aug. 2020.

S. Nellaiappan et al., “High-Entropy Alloys as Catalysts for the CO2 and CO Reduction Reactions: Experimental Realization,” ACS Catal., vol. 10, no. 6, pp. 3658–3663, 2020.
Reaction Network Predictions:
The newest version of RMG (v3) has been updated to Python 3. It can generate heterogeneous catalyst models and run uncertainty analysis via first-order sensitivity analysis. The RMG database of thermochemical and kinetic parameters has been expanded.
Develop a multi-reactant representation scheme to look at arbitrary reactant-product pairs. The technique is applied to understand the electrochemical reaction network of the Li-ion solid electrolyte interphase.
A chemical reaction network model to predict synthesis pathways for exotic oxides. Solid-state synthesis procedures for YMnO_{3}, Y_{2}Mn_{2}O_{7}, Fe_{2}SiS_{4}, and YBa_{2}Cu_{3}O_{6.5} are proposed and compared to literature pathways. Finally, the algorithm is applied to search for a probable synthesis route to MgMo_{3}(PO_{4})_{3}O, a battery cathode material that has yet to be synthesized.
Generative Models:
Review:
J. Noh et al., “Inverse Design of Solid-State Materials via a Continuous Representation,” Matter, vol. 1, no. 5, pp. 1370–1384, 2019.
Articles:

S. Kim, J. Noh, G. H. Gu, A. Aspuru-Guzik, and Y. Jung, “Generative Adversarial Networks for Crystal Structure Prediction,” pp. 1–37, 2020.

B. Kim, S. Lee, and J. Kim, “Inverse design of porous materials using artificial neural networks,” Sci. Adv., vol. 6, no. 1, 2020
A semantically constrained graph-based code for representing MOFs, enabling target-property-directed optimization. MOFs are encoded as edges, vertices, and topologies: edges are molecular fragments with two connecting points, vertices contain node information, and topologies indicate a definite framework. A Supramolecular Variational Autoencoder (SmVAE), with several corresponding components that oversee encoding and decoding each part of the MOF, maps the frameworks from discrete representations (RFcodes) into continuous vectors (z) and then back.
Datasets:
While the recent interest in material informatics can be attributed to the democratization of data analytics and ML packages, the growing set of benchmark materials datasets from multiple research institutions has been crucial for developing new methods and algorithms and for providing a consistent basis for comparison.
Dataset comprising adsorbates on heterogeneous surfaces.
 Catalysis Hub from SUNCAT. Website
The Surface Reactions database contains thousands of reaction energies and barriers from density functional theory (DFT) calculations on surface systems.
Besides providing a collection of over 130,000 inorganic compounds and 49,000 molecules (and counting) with calculated phase diagrams and structural, thermodynamic, electronic, magnetic, and topological properties, it also provides analysis tools for post-processing.
815,000+ materials with calculated thermodynamic and structural properties.
210,000+ inorganic crystal structures from literature. Requires subscription.
Includes calculated materials properties, 2D materials, and tools for ML and high-throughput tight-binding.
Structural, thermodynamic, elastic, electronic, magnetic, and optical properties of around 4000 two-dimensional (2D) materials distributed over more than 40 different crystal structures.
Millions of materials and calculated properties, focusing on alloys.
 Citrination
Contributed and curated datasets from Citrine Informatics
Fascinating resource linking scientific publications using the Pauling File database (relational database of published literature for material scientists)
Packages:
Active learning approach to efficiently and confidently identify the Pareto front with any regression model that can output a mean and a standard deviation.