In chemistry, machine learning can be most fruitfully applied to three broad areas: revealing patterns, overcoming limitations, and accelerating analysis.
Revealing Patterns Where Few or None Are Known
Much of a modern scientist’s or engineer’s work involves the development of structure–property relationships. The researcher in the chemical sciences carries out experiments or computational modeling across a range of molecules or materials and then aims to understand which attributes of the chemical composition lead to the observed outcome (e.g., reaction yield) or property (e.g., fluorescence). The scientist may then attempt to simplify the mapping between the set of molecules studied and the set of observed properties using a few quantities computed for the molecules, sometimes called “descriptors” or “features.” Machine learning models and techniques can complement and assist this traditional form of inquiry. For instance, linear regression that avoids overfitting (e.g., through cross-validation) can be used to identify the most informative descriptors for property prediction.
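As a minimal sketch of this idea, the toy example below (all data and descriptor choices are invented) uses k-fold cross-validation to score single-descriptor linear fits and identify which of several candidate descriptors actually predicts a synthetic property:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented data set: 60 "molecules", 4 candidate descriptors; only
# descriptor 0 actually controls the (noisy) synthetic property.
X = rng.normal(size=(60, 4))
y = 2.5 * X[:, 0] + 0.1 * rng.normal(size=60)

def cv_r2(x, y, k=5):
    """Out-of-fold R^2 for a one-descriptor linear fit under k-fold CV."""
    idx = np.arange(len(y))
    folds = np.array_split(idx, k)
    ss_res = 0.0
    ss_tot = np.sum((y - y.mean()) ** 2)
    for test in folds:
        train = np.setdiff1d(idx, test)
        slope, intercept = np.polyfit(x[train], y[train], 1)
        ss_res += np.sum((y[test] - (slope * x[test] + intercept)) ** 2)
    return 1.0 - ss_res / ss_tot

# Rank descriptors by how well they generalize to held-out folds.
scores = [cv_r2(X[:, j], y) for j in range(X.shape[1])]
best = int(np.argmax(scores))
```

Because each fit is scored only on data it never saw, a descriptor that merely fits noise scores poorly, while the truly informative one stands out.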
Collecting a large set of candidate features or descriptors and evaluating their predictive performance, even in simple linear or kernel models on set-aside test data, can help reveal previously unrecognized patterns in large data sets. Beyond set-aside test data drawn from the original training source, the generality of a relationship can also be tested by applying trained models to more distinct data sets. Furthermore, when feature selection or dimensionality-reduction techniques are applied across multiple properties, statistical analysis of the most important features for each property can reveal the extent to which distinct properties can be tuned independently. For large data sets, statistical techniques can provide an excellent starting point for a structure–property relationship when no patterns are known.
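One widely used dimensionality-reduction technique for this purpose is principal component analysis (PCA). The sketch below, on invented synthetic descriptors whose two-dimensional latent structure is an assumption of the toy data, uses the singular value decomposition to estimate how many independent directions of variation a descriptor set really contains:

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented data: 100 "molecules" described by 6 descriptors that in fact
# vary along only 2 independent latent directions (plus small noise).
latent = rng.normal(size=(100, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.05 * rng.normal(size=(100, 6))

# PCA via SVD of the mean-centered descriptor matrix.
Xc = X - X.mean(axis=0)
singular_values = np.linalg.svd(Xc, compute_uv=False)
explained = singular_values**2 / np.sum(singular_values**2)

# Number of principal components needed to capture 99% of the variance.
n_effective = int(np.searchsorted(np.cumsum(explained), 0.99)) + 1
```

Here the variance-explained spectrum collapses onto the first two components, signaling that the six nominal descriptors encode only two independently tunable directions.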
Overcoming the Limits of Simple Models and Human Experience
As part of scientific inquiry, chemists frequently develop heuristics and theories to explain newly observed phenomena. However, people can be overconfident both in these heuristics and in the extent of their prior knowledge. Truly nonlinear or more complex relationships can be difficult to conceptualize yet readily predicted by a neural network with precision beyond that of an expert. Even when models such as neural networks are opaque, their rapid evaluation makes it possible for the scientist to test predictions and reveal previously unknown outcomes. Human intuition and experience can be essential in guiding model development, but expert knowledge transfers imperfectly among researchers: each new graduate student must consult and interpret the same literature or coursework to reach a given level of expertise. Interaction between people and models can help avoid duplicating this knowledge-acquisition effort. The models need not replace traditional scientific inquiry, but they can support it.
Accelerating Computations and Analysis to Enable Rapid Discovery in Challenging Materials Spaces
Evaluation of a trained machine learning model is likely to be orders of magnitude faster than the method, experimental or computational, used to generate its training data. Such a model can then be used, for example, to evaluate lead compounds, provided those compounds are similar enough to the training data for the model to be reliable. A key challenge is quantifying the uncertainty in a model’s prediction to know when the limits of reliability have been reached. Alternatively, this problem can be turned on its head: high-promise but low-certainty compounds can be exploited in an approach known as “active learning,” in which new experiments or calculations are carried out to acquire the data that will most enrich the models. In all cases, we should ask, “What can we learn now that we did not know directly from the experiment or calculation?”
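A minimal sketch of such an active-learning loop follows, with an invented one-dimensional design space, an `oracle` function standing in for the expensive experiment or calculation, and distance-to-nearest-measurement as a crude stand-in for real uncertainty quantification:

```python
import numpy as np

# Hypothetical 1-D design space; `oracle` stands in for the expensive
# experiment or calculation that labels a candidate.
pool = np.linspace(0.0, 10.0, 201)

def oracle(x):
    return np.sin(x) + 0.3 * x

labeled_x = list(pool[::40])               # a few initial measurements
labeled_y = [oracle(x) for x in labeled_x]

def predict(xq):
    """Piecewise-linear surrogate through the measured points."""
    xs, ys = np.array(labeled_x), np.array(labeled_y)
    order = np.argsort(xs)
    return np.interp(xq, xs[order], ys[order])

def uncertainty(xq):
    """Distance to the nearest measurement: a crude uncertainty proxy."""
    return np.min(np.abs(xq[:, None] - np.array(labeled_x)[None, :]), axis=1)

rmse_before = np.sqrt(np.mean((predict(pool) - oracle(pool)) ** 2))

for _ in range(10):                        # acquire where we know least
    pick = pool[int(np.argmax(uncertainty(pool)))]
    labeled_x.append(pick)
    labeled_y.append(oracle(pick))

rmse_after = np.sqrt(np.mean((predict(pool) - oracle(pool)) ** 2))
```

Each iteration spends the "experiment budget" on the candidate the surrogate knows least about, so the model improves faster than it would with randomly chosen measurements.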