ACS In Focus recently held a virtual event on “Machine Learning in Chemistry: Now and in the Future” with Jon Paul Janet, Senior Scientist at AstraZeneca and co-author of the ACS In Focus Machine Learning in Chemistry e-book.
The event included a brief discussion of Dr. Janet’s ACS In Focus e-book, a conversation about the future of machine learning, and a presentation on the exciting research Dr. Janet and his colleagues have recently done using machine learning to accelerate the search for new materials.
Below you can watch the recording of the webinar and view some questions your colleagues asked.
View the Webinar Recording:
Read Dr. Janet’s Answers to Community Questions
As a beginner without prior knowledge of any programming language, how can one get into this field? What prerequisites does one need to acquire?
I think learning some basic Python scripting is the best way to get started, because there is a great community and tons of tools that make trying machine learning on chemical problems easy – sklearn and RDKit are amazing and get you quite far. But these do need at least some familiarity with scripting.
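As a small illustration of how far those tools get you, here is a hedged sketch of a fingerprint-based regression model. The data are entirely synthetic – random bit vectors stand in for the Morgan fingerprints RDKit would compute from real molecules – so only the workflow, not the numbers, is meaningful:

```python
# Toy "QSAR" workflow: random-forest regression on mock fingerprint data.
# In practice RDKit would turn SMILES into real fingerprints; here random
# bit vectors and a synthetic "activity" stand in, purely for illustration.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(400, 16))               # 400 "molecules", 16-bit fingerprints
y = X[:, :4].sum(axis=1) + rng.normal(0, 0.1, 400)   # synthetic "activity"

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
print(round(model.score(X_test, y_test), 2))         # R^2 on held-out "molecules"
```

The same three steps (featurize, fit, score on held-out data) carry over directly once real fingerprints replace the random bits.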
How do you compare machine learning with quantitative structure-activity relationship (QSAR) modeling, which has been around for 30 years?
I would say “QSAR modeling” is a label for a specific type of machine learning (activity prediction based on molecular structure). Many QSAR techniques, such as SVR/SVM and random forests, are staples of traditional/shallow machine learning, so we have been doing “machine learning in chemistry” for decades, and in my experience, traditional QSAR methods are mostly competitive with deep learning approaches, especially for affinity prediction. But I don’t think that is the whole story – new machine learning methods are letting us solve new types of problems, such as generative design, retrosynthesis prediction, and massive multitask prediction, that don’t fit neatly into the QSAR label. In these cases, they are not really competing with other methods as much as expanding the type of data we can use and using it in new ways. Whether deep learning methods beat canonical QSAR approaches depends on who you ask, but in my experience, one is almost never worse off with ChemProp instead of a fingerprint method (though one might not be as much better off as one hopes). Neural networks also let us do interesting things in QSAR space, such as multitask learning or even federated learning, and I think these approaches will be the standard in the future.
Can we simulate the effect of magnetized water clusterization through machine learning tools?
I don’t know much about water magnetization, but I know there has been some work in simulating the behavior of water clusters with neural networks. I am not sure if anyone has used these methods to infer magnetic properties though!
What are the job prospects for someone with a background in chemistry and computer science if he/she wants expertise in machine learning in the context of computer science?
Good question. There is definitely interest in industry, both in pharmaceuticals and increasingly in materials design, including lots of startups in the last few years. I think there are roles for people with more computer science experience as well as for people with more chemistry experience. Ideally, a team would be composed of people with both backgrounds.
What would be more appropriate machine learning techniques in de novo drug discovery?
This is what I work on now and it is an open question. SMILES-based generative models appear performant enough that more complex graph-generating methods don’t seem to pay off. I think the biggest issue is honestly getting useful scoring functions, so incorporating more multitask QSAR and physics-based/assisted approaches.
Do you envision a phase of disappointment after the current hype of machine learning, similar to what has happened with other promising technologies?
If you haven’t already, you can read about the “AI winter” of the 1980s: machine learning has been through this hype-crash cycle before. I think the current hype around general machine learning comes from a few high-profile technologies of the last decade or so: convolutional neural networks for images, recurrent networks (and now transformers) for text, and advances in reinforcement learning for game playing. Each of these arrived in rapid succession, made a big difference in its respective domain, and helped build and maintain the hype. Without another flashy advance, I think we will see another slowdown. I am not sure we will see such a large ‘crash’ per se; I foresee more of a quieting down of interest, and in some ways that might already be happening with, for example, generative models.
Does machine learning have applications in the design-of-experiment optimization process?
Machine learning is a big umbrella, so yes, although I wouldn’t put general optimization itself under machine learning. Many DOE methods, at least the ones I use, depend on an explicit surrogate model to predict how the objective function behaves at unseen points. Training these surrogates, which are usually Gaussian processes or related models, is a machine learning task. Maybe the DOE people would say the algorithmic choice of how you use this information is something else (i.e., not machine learning), but I don’t think the label really matters.
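As a sketch of that surrogate idea (not the exact algorithm any particular DOE package uses), the loop below fits a scikit-learn Gaussian process to a toy objective and picks the next experiment where the predicted mean plus uncertainty is highest:

```python
# One step of surrogate-based DOE: fit a Gaussian process to observed
# points, then choose the next point by an upper-confidence-bound rule.
# The objective f is a synthetic stand-in for a real measurement.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

f = lambda x: -(x - 0.3) ** 2                  # hidden objective (toy)
X_obs = np.array([[0.0], [0.5], [1.0]])        # experiments run so far
y_obs = f(X_obs).ravel()

gp = GaussianProcessRegressor().fit(X_obs, y_obs)
grid = np.linspace(0, 1, 101).reshape(-1, 1)   # candidate settings
mean, std = gp.predict(grid, return_std=True)
next_x = grid[np.argmax(mean + 1.96 * std)]    # explore where mean + uncertainty is high
print(float(next_x))
```

In a real campaign this fit-predict-acquire loop repeats after each new measurement, with the acquisition rule (here a simple upper confidence bound) being the main design choice.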
How could machine learning help ab initio algorithms minimize calculation times?
A whole lot of ways! One idea is to build a neural network that predicts density functional theory (DFT) energies as a function of structure; then you can run the simulation on the neural network potential and only call DFT to check on, and update, the potential as needed. Other clever people have integrated machine learning directly into the Hamiltonian, or used it to predict which orbital pairs to keep in a post-HF method. You can also accelerate geometry optimizations by bootstrapping a surrogate model at each step. In our work, we used machine learning models to construct good-quality starting geometries for our calculations, which reduces the number of optimization steps needed. We have also looked into using machine learning to predict when multireference methods are needed (vs. single-determinant DFT), which can save a lot of time!
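The starting-geometry idea can be caricatured in a few lines: fit a cheap surrogate to a handful of sampled energies and begin the real optimization from its predicted minimum. The Morse-like energy function below is synthetic, standing in for an expensive DFT call:

```python
# Sketch of using a cheap surrogate to produce a good starting geometry:
# fit a quadratic to a few sampled energies along a bond-stretch coordinate
# and start the real optimization from the surrogate's predicted minimum.
import numpy as np

def dft_energy(r):                        # stand-in for an expensive DFT call
    return (1 - np.exp(-2.0 * (r - 1.1))) ** 2   # Morse-like well at r = 1.1

r_samples = np.array([0.9, 1.0, 1.2, 1.4])       # a handful of "DFT" points
e_samples = dft_energy(r_samples)

a, b, c = np.polyfit(r_samples, e_samples, 2)    # quadratic surrogate
r_start = -b / (2 * a)                           # analytic minimum of the fit
print(round(r_start, 3))                         # lands near the true minimum
```

Real workflows use far richer surrogates (neural network potentials, Gaussian processes) over many coordinates, but the payoff is the same: fewer expensive optimization steps.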
Do you have any comments or suggestions for the prediction of biological activity data? How do the new methodologies perform vs. QSAR models?
We have been doing “machine learning in chemistry” for decades; the only difference now is that we have a larger toolbox of models that might or might not help build better activity models. QSAR methods in particular have benefited from a lot of optimization and seem to extract almost all the useful predictive power out of affinity data. Whether deep learning methods beat canonical QSAR approaches depends on who you ask, but in my experience, one is almost never worse off with ChemProp instead of a fingerprint method (though one might not be as much better off as one hopes). Neural networks also let us do interesting things in QSAR space, such as large-scale multitask learning or even federated learning, and I think these approaches will be the standard in the future. They give us a way to overcome the typically limited amount of affinity data we have for a particular target by bringing in more information.
How accurate can the latest machine learning methods be in predicting the possible starting synthons for a designed novel molecule or polymer material?
I can only really comment that for drug-sized organic molecules we can actually make reasonable synthon predictions in many cases; for example, MLPDS tools are used in the real world every day. As for polymers, there is a lot less published, so I am not sure. In principle, yes, but it will depend on the data that is available. In my experience, these methods are extremely reliable at finding feasible, commercially available synthons for common disconnections (amide bonds, etc.) but a little less reliable for more exotic chemistry. This is pretty cool because it shifts away some of the busy work, letting synthetic chemists focus on more interesting problems!
I am an organic chemistry Ph.D. student. I have started learning the basics of machine learning. Is it possible to work in the machine learning chemistry field in postdoctoral research, or is it already too late?
Definitely not! My Ph.D. group worked with a number of postdocs from purely chemistry backgrounds, and there is a lot of domain experience from a Ph.D. that is useful when applying machine learning methods. I would probably recommend joining a group that does machine learning so you can learn from them. That said, being comfortable with Python scripting (or a similar language) is pretty crucial, and those skills take time to practice, so that might be a great additional skill to obtain. There are a lot of good online courses.
Have you ever tried to modify PTFE to design an inorganic-organic hybrid complex?
No I haven’t, but it sounds interesting. I am more in the machine learning/comp chem side so I don’t know how easy it would be to do in practice.
How good is machine learning at predicting biological activity, and what does this prediction mainly depend on?
We have been doing “machine learning in chemistry” for decades; the only difference now is that we have a larger toolbox of models that might or might not help build better activity models. QSAR methods in particular have benefited from a lot of optimization and seem to extract almost all the useful predictive power out of affinity data. Whether deep learning methods beat canonical QSAR approaches depends on who you ask, but in my experience, one is almost never worse off with ChemProp instead of a fingerprint method (though one might not be as much better off as one hopes). Sadly, I don’t think we have gotten much better at activity prediction in the last few years, but neural networks also let us do interesting things in QSAR space, such as large-scale multitask learning or even federated learning, and I think these approaches will be the standard in the future. They give us a way to overcome the typically limited amount of affinity data we have for a particular target by bringing in more information. Some other limiting factors, apart from dataset size, are the quality of the data and the sensitivity to small structural changes (activity cliffs). Activity cliffs are pretty difficult to deal with, since machine learning models are smooth functions and struggle to learn large jumps (as humans sometimes do when trying to rationalize SAR).
How many DFT calculations (training points) do you need to parametrize a reliable NN model?
It depends on how dissimilar the new structures you want to predict are from your training data. I would say a few hundred at least to predict static properties, though if you want a machine-learned potential (i.e., a force field) to predict dynamics, you might need millions. One nice thing is that, as long as you keep an eye on uncertainty, you can be selective about which new data points you acquire.
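A minimal sketch of that uncertainty-guided selection, using the spread of a random forest’s individual trees as a stand-in for a proper uncertainty estimate (all data here is synthetic):

```python
# Uncertainty-guided data acquisition: train a model ensemble, then
# acquire new "DFT" labels where the ensemble disagrees the most.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X_train = rng.uniform(0, 1, size=(50, 3))        # descriptors already labelled
y_train = X_train.sum(axis=1)                    # toy "DFT" property

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)
X_pool = rng.uniform(0, 2, size=(500, 3))        # unlabelled candidates

# Spread of the individual trees' predictions acts as an uncertainty estimate
per_tree = np.stack([t.predict(X_pool) for t in model.estimators_])
uncertainty = per_tree.std(axis=0)
next_idx = np.argsort(uncertainty)[-10:]         # 10 most uncertain candidates to label next
```

Gaussian processes give a more principled uncertainty, but the ensemble trick above is cheap and works with essentially any model family.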
I'm confused by the statement that you can't use a DFT-level model to compare a large number of compounds. There are pretty good and cheap models such as cepC.
Is that a semiempirical method? The main issue is that the systems I was studying are open-shell, and transition metals are generally not well parameterized by these methods, which don’t handle spin-state ordering. This has implications for bond lengths and redox potentials. Another point is that machine learning methods trained on DFT data can often outcompete semiempirical methods while being much cheaper (once the DFT is done, of course!).
How many DFT calculations would be required to train the neural networks to suggest the next calculation?
In this work, I was doing about 100 DFT calculations each time, but it would be possible to do more or fewer. From a design-of-experiments perspective, it is actually optimal to update the model after every calculation, but this is time-consuming and inconvenient, so some degree of batching is needed.
Is it possible to use machine learning in environmental remediation? For example, in degradation of pollutants using photocatalysts?
Yes, I can’t see why not. Machine learning is fairly general, and predicting some property of a certain molecule doesn’t really depend on what that property is (though different methods might be more or less suited to certain tasks). I have seen some work on predicting light-harvesting abilities, but I don’t really follow the remediation literature.
How could you augment the amount of data you obtain from the DFT so as to train your model?
There are a few neat ideas that can help. One I particularly like for chemistry is transfer learning: building a model to predict some kind of chemical property from a big database, and then fine-tuning it on a smaller dataset. For example, these authors trained a model on a large number of cheap(er)-to-compute DFT energies and then fine-tuned the model on more expensive coupled-cluster calculations. One can also do ‘unsupervised’ learning, where you train a model to predict some basic thing about the molecule, such as hiding one atom or bond and training the model to supply the missing label, as in this IBM work. In either case, you hope to teach the model the basic rules of chemistry, so that it becomes easier to teach it about some specific property with fewer examples (something like: it is easier to understand that polar molecules are good if you already understand polarity). Another idea is data augmentation, which is done for image-based models by rotating or zooming in or out on the images, increasing the number of samples. This can also be applied to chemistry, for example, by writing multiple identical representations of the molecules. Neither approach is as good as getting more data, but sometimes that is all one can do.
Can we also use machine learning in the band-gap engineering of materials?
This one I can be more confident about: definitely! There are numerous examples. It seems like band gaps are quite easy to predict in many cases.
I find that with machine learning properties using quantum chemistry, the big issue is the lack of data. What are your thoughts on whether you have “enough” data - not just in amount but in variety, especially if using DFT data as opposed to experimental?
One nice thing about using QM data is that we can be really optimal in the choice of what data we acquire (as opposed to experiments, where some materials would be very informative but we can’t make them). Picking data points to cover model blind spots can be very effective. Of course, there are still limits, especially for larger or more complex molecules. One idea I particularly like for chemistry is transfer learning: building a model to predict some kind of chemical property from a big database, and then fine-tuning it on a smaller dataset. For example, these authors trained a model on a large number of cheap(er) DFT energies and then fine-tuned the model on more expensive coupled-cluster calculations. One can also do ‘unsupervised’ learning, where you train a model to predict some basic thing about the molecule, such as hiding one atom or bond and training the model to supply the missing label, as in this IBM work. In either case, you hope to teach the model the basic rules of chemistry, making it easier to teach it about some specific process or result. Another idea is raw data augmentation, which is done for image-based models by rotating or zooming in or out on the images, increasing the number of samples. This can also be applied to chemistry, for example, by writing multiple identical representations of the molecules. Neither approach is as good as getting more data, but sometimes that is all one can do.
For this case study, what were the extrapolation capabilities of these machine learning models? I also would like to hear your overall insight on the extrapolation of machine learning methods in general as well, i.e., when the maximum or minimum values are desired and are not available in the training set.
In the case study, the models worked well enough, with look-ahead errors (i.e., on the unseen next round of complexes) around 0.2 eV; you can find all the relevant metrics and plots in the paper and SI. While not flawless, the predictions are of sufficient quality to help choose which complexes to study next, so I would say they are fit for purpose. Perhaps that is the best way to look at it: what is good enough for one application might not be sufficient for another. My general feeling is that the community is fairly good at predicting some properties (band gaps, atomization energies) and worse at others, such as biological activity (which is obviously a much more complicated quantity). In terms of not having representative training data, that is obviously not a good situation, but I think it is almost more important to have some kind of uncertainty metric to warn that you are reaching too far from the training data, since even compounds that lie within the range of your training points might have different chemistry and end up on the opposite end of the scale relative to what you would expect based on the training examples.
Is logP really a good surrogate for solubility? Did you consider alternative descriptors?
Possibly it is the best we can do. I think it is a reasonable option given the uncertainties for these systems. Unfortunately, our understanding of the solubility of inorganic complexes is not nearly as good as it is for organics and it is hard to find something unambiguously better that we can access with DFT. Since we depend on QM solvation energies, we at least account for buried vs. exposed polar groups in the ligands in the relevant complex assembly. However, I think that the main contribution of our work is a method that we believe is quite general, instead of a specific prediction or property.
DFT with approximate density functionals is well known to severely fail for high-spin open-shell cases, especially since, in these cases, the ground state electronic structures are multiconfigurational. Is this aspect usually ignored in your process of training the ANNs?
It is a good question and something that we worry about a lot. One way we probe the suitability of hybrid DFT is by routinely varying the fraction of exact exchange (Hartree-Fock exchange, HFX) used, from 0% (pure GGA) to 30%, and measuring how the relative spin state energies are perturbed. It turns out that different materials have different sensitivity to HFX, and we looked at predicting this sensitivity in previous work and even how we could use this information to steer our optimization away from complexes with predicted high sensitivity. Our ANN models can provide predictions at any HFX fraction desired. However, for transition metal complexes DFT will only ever take us so far, and my colleagues have been developing machine learning models to predict the extent of multireference character in complexes before simulating them, allowing us to deploy multireference methods automatically when we sample regions of chemical space that require them. Looking at their data, DFT is actually not all that bad on average though it is sometimes very wrong. This was not used in the case study I showed, so a sensible follow-up would be to screen any ideas with more accurate methods before actually making them. Still, it is a lot easier to do 30 multireference calculations instead of hundreds! For me, I am mostly interested in the optimization algorithm and I think it would be applicable to most things one could compute with DFT.
May I ask your opinion on the direct prediction of electron density?
This has been shown by Kieron Burke and Klaus-Robert Müller, among others, and there are also orbital-free DFT approaches that predict the kinetic energy as a function of the density. It looks very neat and could be highly transferable, but it is difficult to obtain and manipulate the reference data (3D point clouds are a pain). I think I prefer to model the actual endpoint directly, such as the energy of the conformation, since that is what we are actually trying to estimate: there is only one step in the process (input -> energy) and it is generally a little simpler.
Can you please say the exact properties you considered from DFT for ANN?
We only use DFT for the endpoints, i.e., redox potential and logP, so that we don’t need to do a DFT calculation for a new complex before making a prediction. We computed these quantities from free energy differences between complexes in different oxidation states and in different solvents.
In your case study, you mentioned the product of nuclear charges (in shells) as a quantity to investigate. Is this something a human told the algorithms to try, or was it something the algorithm found that led to better predictions of dG and/or logP?
A bit of both. We started with a large list of possibilities, inspired by our domain knowledge but also trying to be as inclusive as possible, and then used machine learning to select the most relevant variables. It actually depends on which output property you choose; for example, one gets a slightly different result when predicting spin-state ordering vs. predicting redox potentials. In that case, we could rationalize the finding from a chemical perspective, so we now do this analysis for any new property to try to gain some insight into which parts of the complex are the best targets for smart design (i.e., spin states are strongly controlled by the first shell, while the second shell still contributes a lot to redox potential). You can read all the details in the original paper here. The latest trend with graph convolutional neural networks is essentially to make this more automatic and require less domain input (since we might be wrong), but I think it depends on how much data you have, and especially for smaller datasets human knowledge can add quite a bit.
Which programming languages are important for machine learning?
Python is the most common and has the most flexible, up-to-date libraries. Under the hood, the libraries (usually) call efficient functions written in C/C++. But most languages have ‘good enough’ options for most tasks.
How many molecules did you use for training the model?
In the work I showed, I started with a few hundred (~300) up to a few thousand. For these systems, it is difficult to get much more, though organic chemistry datasets are usually a lot bigger.
Your training set was based on DFT, but how well does the DFT training set represent experimental data? How many hits from the machine learning predictions were observed in the lab, and how close were they to redox/solubility predictions?
Heather’s group works only on theory, and the focus is to develop algorithms for combining quantum chemistry and machine learning for molecular optimization, so we don’t have the capacity to make the compounds in the lab. However, our approach has been compared to experiments and showed around a 0.3 eV difference. Numbers for solubility are much harder to come by! But I hope the approach is general and could be applied to other systems as well.
How do you decide which descriptors to be included for machine learning model training?
The best practice is to try a number of different approaches and use a method such as cross-validation to select the descriptors that give you the lowest generalization risk. However, other factors can be important! If your descriptors are calculated by DFT or depend on experimentally measured properties (say melting point or lattice constant), you might run into a problem using your model on new cases where this data is not available. Also, we generally prefer descriptors that we can understand and explain, even if they give slightly worse performance than others that are harder to visualize and understand.
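The cross-validation comparison can be sketched in a few lines with scikit-learn; the two descriptor blocks here are random stand-ins, one correlated with the target and one not:

```python
# Choosing between descriptor sets by cross-validated score: compare two
# candidate feature blocks and keep whichever generalizes better.
# Features and target are synthetic stand-ins for real descriptors.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 120
informative = rng.normal(size=(n, 4))            # descriptors that matter
noisy = rng.normal(size=(n, 4))                  # descriptors that do not
y = informative @ np.array([1.0, -0.5, 0.3, 0.8]) + rng.normal(0, 0.1, n)

for name, X in [("informative", informative), ("noisy", noisy)]:
    score = cross_val_score(Ridge(), X, y, cv=5).mean()   # mean CV R^2
    print(name, round(score, 2))
```

The same loop scales to any number of candidate descriptor sets (or combinations of them); the point is that the choice is made on held-out folds, not on training fit.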
What functionals and basis set were you using?
We mainly use old-fashioned B3LYP + LANL2DZ; however, we run all of our calculations at a range of Hartree-Fock exchange fractions from 0 (BLYP) to 0.3, since in our experience the fraction of exact exchange plays more of a role than the form of the exchange-correlation functional used. The sensitivity of spin-state ordering to varying HFX is actually a really interesting thing to study, and we have built a model that predicts this sensitivity on a case-by-case basis. We have also worked extensively on predicting when DFT will not be appropriate and when we need to use a multireference method, so we can now make these modeling choices in an automatic way!
That is very cool work. In a sense, enhanced sampling is related to the active learning I discussed: steering the model to a fruitful area in chemical/conformational space in a data-driven way. Machine learning is being applied to every part of chemical simulation; I look forward to running machine-learning metadynamics on a DFT-accurate neural network potential for a full protein in the near future!
Do you know about chemometrics, and what do you think about it?
It is not something I really work with or see used a lot, but I understand it as combining different measurements and signals to interpret chemical systems. I don’t really have an opinion about it, but I have seen a lot of interest in getting machine learning methods more directly involved with understanding spectra (signals), automated peak assignment, etc. I am not sure if you draw a distinction with cheminformatics, which to me is a little broader in scope, involving QSAR, chemical databases, etc. Many of these methods use techniques that I would place under the umbrella of machine learning in chemistry, e.g., PCA, PLS, or SVR. Certainly, people have been applying machine learning to chemistry for as long as there has been machine learning.
How can we deal with a limited number of samples if we can’t do more experiments or study more samples? Machine learning needs a lot of samples to train and validate models. Is it possible to generate data, or to use another approach?
There are a few neat ideas that can help. One I particularly like for chemistry is transfer learning: building a model to predict some kind of chemical property from a big database, and then fine-tuning it on a smaller dataset. For example, these authors trained a model on a large number of cheap(er)-to-compute DFT energies and then fine-tuned the model on more expensive coupled-cluster calculations. One can also do ‘unsupervised’ learning, where you train a model to predict some basic thing about the molecule, such as hiding one atom or bond and training the model to supply the missing label, as in this IBM work. In either case, you hope to teach the model the basic rules of chemistry, making it easier to teach it about some specific process or result. Another idea is augmentation, which is done for image-based models by rotating or zooming in or out on the images, increasing the number of samples. This can also be applied to chemistry, for example, by writing multiple identical representations of the molecules. Neither approach is as good as getting more data, but sometimes that is all one can do.
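The pretrain-then-fine-tune recipe can be sketched with scikit-learn’s SGDRegressor, whose partial_fit keeps the pretrained weights as the starting point; the “cheap” and “accurate” labels below are synthetic stand-ins, not real DFT or coupled-cluster data:

```python
# Transfer-learning sketch: pretrain on a large, cheap dataset, then
# fine-tune the same model on a small, expensive one. partial_fit
# continues training from the pretrained weights rather than restarting.
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])

X_cheap = rng.normal(size=(2000, 3))             # big "cheap-level" dataset
y_cheap = X_cheap @ w_true + 0.3                 # systematically shifted labels

X_acc = rng.normal(size=(30, 3))                 # small "accurate" dataset
y_acc = X_acc @ w_true

model = SGDRegressor(random_state=0)
model.fit(X_cheap, y_cheap)                      # pretrain on cheap data
for _ in range(20):                              # fine-tune on accurate data
    model.partial_fit(X_acc, y_acc)
```

The same pattern carries over to neural networks (freeze or continue training pretrained layers), where the gains from pretraining are usually much larger than in this linear toy.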
Can machine learning techniques provide any substantive advantages over traditional methods such as regression or, for example, PCA decomposition?
I would say traditional regression methods are a subset of machine learning methods, and sometimes they are still the best choice, especially with limited data. That said, modern deep learning has definitely led to extreme advances in many areas, for example, image classification and natural language processing. In those fields, the difference in performance between traditional methods such as SVR and neural networks is large and indisputable. In the chemical sciences, modern deep learning methods are allowing us to solve different kinds of problems compared to what we could do with traditional ‘shallow’ models, for example, neural network potentials, synthesis planning, and protein structure prediction. In these cases, they are not really competing with other methods as much as solving different problems. For some chemistry problems, such as QSAR, shallow methods are competitive with or only slightly worse than neural networks, though the difference is not large and we don’t really know why – it is a very difficult problem for machine learning because small changes (e.g., magic methyls) can lead to huge changes in the output, and usually the amount of data is not that large (the image classification people have millions of images to learn from). In regards to PCA specifically, as a dimensionality reduction technique, there are modern machine learning alternatives such as t-SNE or UMAP, which in my opinion tend to work better, though there may be cases where PCA is preferred, particularly because it can be easier to interpret and faster for large datasets. Since these methods are non-linear, they can recover patterns in the data that PCA cannot – here is a great online demo of t-SNE.
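For a concrete (toy) contrast between the linear and non-linear views, here is PCA next to t-SNE on two concentric rings, a pattern a linear projection cannot pull apart:

```python
# PCA vs a non-linear embedding on the same toy data: two noisy
# concentric rings, which any linear projection leaves concentric but
# a neighbour-based method like t-SNE typically separates into clusters.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 200)
radius = np.repeat([1.0, 3.0], 100)              # inner ring, outer ring
X = np.c_[radius * np.cos(theta), radius * np.sin(theta)]
X += rng.normal(0, 0.05, X.shape)                # small noise

X_pca = PCA(n_components=2).fit_transform(X)     # linear: rings stay nested
X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
```

Plotting the two embeddings side by side makes the difference obvious; the trade-off is that t-SNE distances are harder to interpret and it is slower on large datasets.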
How do we design our training set?
A good training set should cover the intended chemical space where we will apply the model. In other words, it should be diverse and look like the data we want to predict. For example, if you are predicting the solubility of molecules, I would want to have a variety of polar groups and also some weakly soluble compounds. If your training data doesn’t include halogens, then you probably shouldn’t try and predict on data that contains them. Having good coverage of the space is more important than the raw number. Practically, one usually takes all the data one can get, but if you have the option to choose which data to collect, clustering and picking one from each cluster is probably a good idea.
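The cluster-and-pick idea at the end can be sketched with scikit-learn’s KMeans; the descriptor vectors here are random stand-ins for real chemical descriptors:

```python
# Diversity-based training-set selection: cluster the candidate pool and
# take the point nearest each cluster centre, rather than a purely
# random sample, to cover the space with a fixed labelling budget.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
pool = rng.normal(size=(500, 8))                 # candidate descriptor vectors

km = KMeans(n_clusters=20, n_init=10, random_state=0).fit(pool)
picks = [np.argmin(np.linalg.norm(pool - c, axis=1)) for c in km.cluster_centers_]
train_set = pool[picks]                          # 20 diverse representatives
```

With 20 labels to spend, these representatives cover the pool far more evenly than 20 random draws typically would.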
What software is needed to do this type of machine learning?
I use PyTorch, a Python library for machine learning. Generally, there is a huge number of open-source Python packages for machine learning, and many papers also make their code available. I have also historically used R, which is great for smaller datasets and simpler models but lags behind for complex neural networks.
Is there a principled or optimal way to identify the model family that will lead to the lowest true risk when properly fitted without having to train them all? Do you have advice for this?
In general, there is no way to know for sure which model will have the lowest (estimated) true risk without trying them all. However, you can often refine the search logically. For example, let us assume you fit three neural networks: one with 10 layers, one with 5 layers, and one with 2 layers. When you estimate the risk using cross-validation (CV), you determine that the best model has 2 layers, the next best has 5, and the 10-layer model is the worst because it badly overfits (side note: this would manifest as near-zero training error alongside high CV error). For your next attempt, it makes more sense to try models with 1, 3, or 4 layers instead of 20 or 200, since it seems the 10-layer model is too complicated. In this way, you can iteratively refine the model search space, and this forms the basis of most hyperparameter search algorithms – a famous one is Hyperopt. In practice, the search space is not 1D and the dimensions are not independent, but the idea is the same: start with some random trials and then try to predict which areas of the search space are more productive. For complex models, this is the way to go; for simpler models and smaller search spaces, a grid search is usually good enough.
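A minimal version of that coarse-then-refine loop, with random-forest depth standing in for network depth and a synthetic dataset:

```python
# Iterative hyperparameter refinement: score a coarse set of model
# complexities by cross-validation, then search again near the best one.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)   # toy target

def cv_score(depth):
    model = RandomForestRegressor(max_depth=depth, random_state=0)
    return cross_val_score(model, X, y, cv=5).mean()

coarse = {d: cv_score(d) for d in (2, 8, 32)}                 # coarse pass
best = max(coarse, key=coarse.get)
fine = {d: cv_score(d) for d in (best - 1, best, best + 1)}   # refine nearby
print(max(fine, key=fine.get))
```

Libraries like Hyperopt automate exactly this kind of loop in many dimensions at once, using the earlier trials to decide where to sample next.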
Can you please suggest very basic books in machine learning for chemists?
An excellent machine learning book (not chemistry-specific) that starts with the basics is The Elements of Statistical Learning. That is where I started and I still use it! There is a PDF version on the Stanford website. Otherwise, the book Heather and I wrote is aimed at exactly this audience!
What Python libraries do you use the most?
PyTorch, RDKit, and Pandas. Previously, I also got a lot of use out of Open Babel and Keras/TensorFlow.
How good do you need to be at math to learn machine learning?
It depends on what you want to learn. You can apply machine learning methods (and get good results) without really understanding how they work – many people who run computational chemistry calculations don’t necessarily know all the details of the simulations they run. However, getting a basic understanding of how machine learning works will obviously put you at an advantage. The good news is that to understand and apply essentially all machine learning methods, you only need some linear algebra and multivariable calculus. Even “how” neural networks work is mathematically simple – they are just collections of many neurons, each of which is easy to understand on its own. That said, there are more theoretical questions, such as how models generalize (or not) and “why” large neural networks work so well, which involve more complicated mathematics (mostly probability theory).
Can machine learning be used in all areas of chemistry, including inorganic and physical chemistry? What is the greatest difficulty in using it in a new area?
I think almost all scientific fields generate data in some form, and machine learning is really about whether the patterns in that data can be used to predict future outcomes – so it can be applied across all these different areas. One issue in a new domain is the lack of precedent, as none of the existing descriptors or methods may represent your problem well – but this is also a great opportunity for domain knowledge to help set up the machine learning problem correctly! It can also be difficult to get enough data in specialized areas.
When the construction of metamodels that replace the usual numerical methods is implemented, what is the reduction in computation time in the calculations?
It depends on the model and what it is replacing. In the work I showed, I am swapping DFT calculations that take hours for machine learning models that take fractions of a second. To evaluate my full design space with DFT would have taken ~50 GPU-years, while the machine learning model takes ~4 seconds on a laptop. However, when comparing, for example, classical mechanics/force fields with neural networks, things are a lot closer. It is not my area, but I believe the current neural network potentials tend to be slightly more expensive than force fields, although they give accuracy more comparable to DFT. An example paper where this is discussed is here. Some of the more complicated machine learning models could be considerably slower, but I don’t think there is any machine learning model that takes more than a few seconds per compound. In fact, classical machine learning methods such as SVR are often a lot slower than much larger neural networks, especially for large numbers of predictions, because neural network inference typically scales linearly (while kernel inversion is cubic in principle).
Biomass decomposition is very complex. Can machine learning be used to carefully predict rate data of the decomposition of biomass, especially when we have about 50 data sets?
I doubt anyone knows – it depends on how the data points are represented and how ‘predictable’ they are. That sounds a bit vague, so let me try to explain. One can usually fit Arrhenius rate expressions to reaction rates at a handful of temperatures. The reason is that the relationship between ‘x’ (inverse temperature) and ‘y’ (log rate) is very smooth (ideally linear). In the same way, the difficulty in predicting something comes down to how directly and smoothly the representation (x value) is related to the rate. For smaller datasets, how you represent the data can be very important. Why not try it? With only 50 points you should probably use a very simple model, such as LASSO or ridge regression, and see what the results look like.
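As a hedged starting point for a ~50-point dataset (scikit-learn assumed; the features and target below are synthetic stand-ins for actual biomass descriptors and measured rates):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.normal(size=(50, 4))    # stand-in descriptors (e.g. T, composition, ...)
# Synthetic "rate" with a simple underlying relationship plus noise.
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=50)

# Ridge keeps the model simple enough for 50 points; cross-validation
# gives an honest estimate of how well it would predict new samples.
model = Ridge(alpha=1.0)
r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
print(round(r2, 2))  # high here only because the synthetic relationship is easy
```

If cross-validated performance is poor on the real data, that points to the representation (or the noise level) rather than to needing a deeper model.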
Compared with big data and other kinds of data, chemical data is a little bit small and expensive to enlarge. Which machine learning strategies can be employed to advance in this field?
While many types of chemical data are scarce, that is not always the case. There are publicly available datasets of tens of millions of DFT calculations, millions of reactions mined from patents, and large databases like the CSD, the Materials Project, the Open Materials Database, the OQMD, the PDB, etc. Also, the amount of data needed can depend a lot on what you are trying to achieve – small data can be quite useful as long as it is similar to what you want to predict. However, in some cases there is not enough data, and one neat idea is transfer learning, i.e., building a model on one of these abundant data sources and then fine-tuning it on a smaller dataset.
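The pretrain-then-fine-tune idea can be sketched in a few lines of PyTorch. Everything here is an illustrative assumption: the data is synthetic, the architecture is arbitrary, and a real application would pretrain on one of the large datasets mentioned above:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A shared "body" learns a representation; a small "head" maps it to the target.
body = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 32), nn.ReLU())
head = nn.Linear(32, 1)
model = nn.Sequential(body, head)

def train(model, X, y, params, steps=300):
    opt = torch.optim.Adam(params, lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(X), y)
        loss.backward()
        opt.step()
    return loss.item()

# 1) Pretrain on the "abundant" source data (stand-in for e.g. a DFT dataset).
X_big = torch.randn(500, 8)
y_big = X_big.sum(dim=1, keepdim=True)
train(model, X_big, y_big, model.parameters())

# 2) Freeze the body and fine-tune only the head on a small, related dataset.
for p in body.parameters():
    p.requires_grad_(False)
X_small = torch.randn(40, 8)
y_small = X_small.sum(dim=1, keepdim=True) * 1.1   # related but shifted task
final_loss = train(model, X_small, y_small, head.parameters())
print(final_loss)
```

Because the frozen body already encodes a useful representation, the 40-point fine-tuning set is enough to adapt the head – the essence of why transfer learning helps in data-poor regimes.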
That was exactly the point – Jon uses terms such as minimal data and lots of data, but it is hard to judge what minimal and lots actually mean. It is difficult to define how much data is ‘enough’, since different applications have different requirements and some datasets are just easier than others. As a toy example, consider fitting rate expressions to an Arrhenius plot (this is nothing other than a simple linear regression problem). You can probably get good kinetic parameters from a handful of data points, because the relationship between 1/T (the x variable) and log Rate (the y variable) is simple and smooth. It doesn’t need a very complex regression function to describe this data, and so we don’t need much of it. In principle, the same applies to more complicated cases – if the way you represent your data, the x values, is smoothly correlated with what you are trying to predict, it is much easier. So it is not only a function of the amount of data but also of how you represent it, and of how accurate a model needs to be to be useful. In drug discovery, even a very weak model that is correct 10% of the time is very useful, since getting one active molecule for ten tries is a really good result. If you really want to put a number on things, I wouldn’t look too much into datasets with <50 points unless you can basically see the model by eye (as in the Arrhenius case), and I would only go to ‘deep’ models like neural networks with at least a few hundred (however, many classical algorithms such as LASSO could be very useful in that range).
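The Arrhenius example can be made concrete in a few lines of NumPy; the rate constants below are generated from a made-up activation energy purely to show that five points suffice when the x–y relationship is linear:

```python
import numpy as np

R = 8.314            # gas constant, J/(mol K)
Ea_true = 50_000.0   # activation energy used to generate the toy data, J/mol
A_true = 1e6         # pre-exponential factor (arbitrary)

T = np.array([300.0, 320.0, 340.0, 360.0, 380.0])   # just five temperatures
k = A_true * np.exp(-Ea_true / (R * T))

# ln k = ln A - (Ea/R) * (1/T): a two-parameter linear least-squares fit.
slope, intercept = np.polyfit(1.0 / T, np.log(k), 1)
Ea_fit = -slope * R
A_fit = np.exp(intercept)
print(round(Ea_fit))  # → 50000
```

Five points recover both parameters exactly here because the representation (1/T vs ln k) makes the problem linear – which is the whole argument about representation mattering as much as dataset size.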
Do you know of a community (Stack, Facebook group/page) to join and interact about machine learning?
I find the machine learning subreddit very lively – even well-known people in the machine learning community, such as Yoshua Bengio, participate occasionally.
Is it possible to make predictions of free energy in reactions?
Yes – here is an example paper predicting barriers; they even make their trained model freely available, so you could try it. In general, with enough data, we can expect to predict any physical quantity. Note, however, that they use quite a large quantum-chemical dataset to train the model.