e-Books Archives - ACS Axial | ACS Publications

Learning to Communicate Inclusively: A New ACS Guide Chapter 

When done right, communication can open doors—it allows people to learn new concepts, meet one another, and share information. But often, we can unintentionally close doors with our communication when unconscious biases appear in the words and images we use. Language and images that alienate groups or perpetuate stereotypes create barriers between a communicator and their potential audience.

To break down those barriers and help advance a more diverse and inclusive culture in science, the ACS Guide to Scholarly Communication has added a new open-access chapter on inclusive language and images. The latest chapter is the ACS Inclusivity Style Guide, a resource developed by the American Chemical Society Communications Division and the Office of Diversity, Equity, Inclusion & Respect. Using accessible language and real-life examples, the new chapter helps readers learn to communicate in ways that recognize and respect diversity in all its forms.

The guidelines can be applied to all content to make it more welcoming and relevant, regardless of the topic. The chapter includes recommended language on gender and sexuality, race and ethnicity, disabilities and disorders, and more. It offers important context for each topic, including the background behind each recommendation and links to valuable resources. Examples of topics the guide discusses include:

  • when to use the description “people of color,”
  • when to use the singular pronoun “they,” and
  • when to use people-first or identity-first language for health conditions.

The guide is primarily based on recommendations from advocacy and journalistic groups. Because language is ever-evolving, the guide will be updated over time to reflect changes in language and to incorporate new topics.

The ACS Guide to Scholarly Communication provides students, researchers, educators, and librarians with the instruction and advice they need to master the art of scholarly communication beyond the scientific journal. With the valuable guidance and examples provided in this newest ACS Guide chapter, readers can learn how to keep communication opening doors, not closing them.

To give feedback on this chapter of the guide, please email ISG@acs.org.

Join ACS Publications at the 2021 AIChE Annual Meeting

ACS Publications is pleased to present its portfolio of journals and resources across the chemical engineering, applied, and material sciences at the 2021 AIChE Annual Meeting.

The AIChE Annual Meeting is an educational forum for chemical engineers interested in innovation and professional growth. Academic and industry experts cover a wide range of topics relevant to cutting-edge research, new technologies, and emerging growth areas in chemical engineering. This year the meeting will take place over two weeks, with an in-person meeting in Boston from November 7-11 and a virtual meeting from November 15-19.

ACS proudly highlights its broad resources across the chemical engineering enterprise, including a number of highly cited journals, special issues, virtual issues, digital books and ebooks, on-demand courses, and more!

I&EC Research

At the forefront of chemical engineering research since 1909, I&EC Research continues to be the journal of choice for a global community of authors and readers.


ACS Polymer journals

ACS Macro Letters, Biomacromolecules, and Macromolecules are the ACS journals of broad interest in polymer science and related, cross-disciplinary macromolecular research.


ACS Applied Polymer Materials

An interdisciplinary journal publishing original research covering all aspects of engineering, chemistry, physics, and biology relevant to applications of polymers.


JACS Au

Significant open-access reports from analytical, physical, inorganic, organic, medicinal, environmental, catalytic, and theoretical chemistry research.


ACS ES&T Engineering

ACS ES&T Engineering publishes high-impact research articles and reviews/perspectives in all areas of environmental technology and engineering.


ACS Biomaterials Science & Engineering

ACS Biomaterials Science & Engineering reports new work in biomaterials science, technology, and engineering research and applications to biomedicine, bioengineering, and clinical translation.


ACS Sensors

ACS Sensors is a peer-reviewed research journal devoted to the dissemination of new and original knowledge on all aspects of sensors that selectively detect chemical or biological species or processes.


Analytical Chemistry

Analytical Chemistry is a peer-reviewed research journal that is devoted to the dissemination of new and original knowledge in all branches of analytical chemistry.


ACS Chemical Health & Safety

Reports on safety, risk assessment, regulatory updates, chemical hygiene practice, hazard assessment, and safe laboratory design and operations.

*These articles are published as part of the forthcoming “Process Safety from Bench to Pilot to Plant” joint Virtual Special Issue from ACS CHAS, Organic Process Research & Development, and Journal of Loss Prevention in the Process Industries.


ACS In Focus Digital Books

Supporting graduate students and scientists seeking to accelerate their fundamental understanding of emerging topics and techniques from across the sciences, ACS In Focus are digital publications brought to life through videos, animations, and molecular models delivered in an e-reader that enables you to learn when it suits you, on or offline.


ACS Reagent Chemicals

The Must-Have Reference Guide for Analytical, Industrial, & Research Labs. ACS Reagent Chemicals contains purity specifications for almost 500 reagent chemicals and more than 500 standard-grade reference materials in an easy-to-use online format.


Sign up for the latest news related to chemical engineering and industrial chemistry from ACS Publications!

The Nobel Prize in Physics 2021 Goes to Syukuro Manabe, Klaus Hasselmann, and Giorgio Parisi

The Nobel Prize in Physics 2021 honors scientists who have made important contributions “to our understanding of complex systems” in two distinct ways. Half of the prize is awarded jointly to Professor Syukuro Manabe and Professor Klaus Hasselmann “for the physical modeling of Earth’s climate, quantifying variability and reliably predicting global warming.” The other half goes to Professor Giorgio Parisi “for the discovery of the interplay of disorder and fluctuations in physical systems from atomic to planetary scales.”

All the recipients of this year’s prize have helped us understand systems characterized by randomness and disorder. Manabe’s work led the development of physical climate modeling and our understanding of the link between rising carbon dioxide levels in the air and higher surface temperatures. Hasselmann’s research established the link between climate and weather, showing that the chaotic fluctuations of weather patterns don’t impact the predictability of climate models. His work became the basis for showing that our atmosphere is warming because of human-caused increases in CO2 levels. Parisi discovered hidden rules that guide the behavior of seemingly random systems, such as spin glasses, and was able to describe such hidden structures mathematically. His work is essential to our understanding of how simple individual behaviors can lead to complex collective behaviors, found in everything from the occurrence of ice ages to the flight of starlings.

Professor Giorgio Parisi published several papers in ACS Publications journals over the years, as well as a chapter of an ebook. The following publications are free-to-read for 30 days, starting October 5, 2021.

Exact Theory of Dense Amorphous Hard Spheres in High Dimension. II. The High Density Regime and the Gardner Transition
J. Phys. Chem. B 2013, 117, 42, 12979–12994
DOI: 10.1021/jp402235d
***
An Increasing Correlation Length in Off-Equilibrium Glasses
J. Phys. Chem. B 1999, 103, 20, 4128–4131
DOI: 10.1021/jp983967m
***
The Replica Approach to Glasses
ACS Symposium Series Vol. 676
DOI: 10.1021/bk-1997-0676.ch008

Read more about the winners in C&EN.

Get Up to Speed Quickly on Emerging Topics

The ACS In Focus digital books help readers of all levels accelerate their fundamental understanding of emerging topics and techniques from across the sciences. In an instructional setting, these works bridge the gap between textbooks and literature. For seasoned scientists, they satisfy the hunger for continuous growth in knowledge and capability.

ACS In Focus is a range of digital publications brought to life through videos, animations, and molecular models delivered in an e-reader that enables you to learn when it suits you, on or offline.

The ACS In Focus series is organized into collections, and titles may be purchased by collection. The Inaugural Collection, comprising 10 books, is now complete and includes titles such as Machine Learning in Chemistry and Science & Public Policy.

Future collections will be available in sets of 20 e-books. Browse the series page to view current and upcoming books, such as currently available titles from Collection 1, Astrochemistry, and Virtual Screening for Chemists.

Find out more and explore access options at the ACS Solutions Center.

 

Machine Learning in Chemistry: Now and in the Future

ACS In Focus recently held a virtual event on “Machine Learning in Chemistry: Now and in the Future” with Jon Paul Janet, Senior Scientist at AstraZeneca and co-author of the ACS In Focus Machine Learning in Chemistry e-book.

This event had a brief discussion of Dr. Janet’s ACS In Focus e-book, a conversation on the future of machine learning, and a presentation on the exciting research Dr. Janet and his colleagues have recently done using machine learning to accelerate the search for new materials.

Below you can watch the recording of the webinar and view some questions your colleagues asked.

View the Webinar Recording:

Interested in learning how to get access to Machine Learning in Chemistry? Talk to your librarian today!

Read Dr. Janet’s Answers to Community Questions

As a beginner without prior knowledge of programming language, how can one go into this field? What are the prerequisites one needs to acquire for this field?

I think learning some basic Python scripting is the best way to get started, because there is a great community and tons of tools that can help make trying machine learning on chemical problems easy – sklearn and RDKit are amazing and get you quite far. But these need at least some familiarity with scripting.
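To give a flavor of what trying machine learning on a chemical problem can look like, here is a deliberately minimal sketch in plain Python: predicting activity by the Tanimoto similarity of binary molecular fingerprints. The 8-bit fingerprints and labels below are invented for illustration; in real work you would generate fingerprints with RDKit and fit proper models with sklearn, as the answer suggests.

```python
# Toy "activity prediction": classify a query molecule by the Tanimoto
# similarity of its binary fingerprint to labeled training molecules.
# Fingerprints and labels are invented; real projects would use RDKit
# fingerprints and sklearn models instead.

def tanimoto(a, b):
    """Tanimoto similarity between two same-length binary fingerprints."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0

def predict_activity(query, training):
    """Label of the training fingerprint most similar to the query."""
    _, label = max(training, key=lambda t: tanimoto(query, t[0]))
    return label

# Hypothetical 8-bit fingerprints with known activity labels.
train = [
    ([1, 1, 0, 0, 1, 0, 0, 1], "active"),
    ([1, 1, 0, 0, 1, 1, 0, 0], "active"),
    ([0, 0, 1, 1, 0, 0, 1, 0], "inactive"),
]
query = [1, 1, 0, 0, 1, 0, 1, 1]
print(predict_activity(query, train))  # nearest neighbour is "active"
```

Tanimoto-similarity nearest neighbour is about the simplest chemical baseline there is; anything fancier should at least beat it.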

How do you compare machine learning with quantitative structure-activity relationship (QSAR) modeling, which has been around for 30 years?

I would say “QSAR modeling” is a label for a specific type of machine learning (activity prediction based on molecular structure). Many QSAR techniques such as SVR/SVM and random forest are components of traditional/shallow machine learning, so we have been doing “machine learning in chemistry” for decades, and in my experience, traditional QSAR methods are mostly competitive with deep learning approaches, especially for affinity prediction. But I don’t think that is the whole story – new machine learning methods are letting us solve new types of problems such as generative models, retrosynthesis prediction, massive multitask predictions, etc. that don’t fit neatly into the QSAR label. In these cases, they are not really competing with other methods as much as expanding the type of data we can use and using it in new ways. Whether deep learning methods beat canonical QSAR approaches depends on who you ask, but in my experience, one is almost never worse off with ChemProp instead of a fingerprint method (though one might not be as much better off as one hopes). Neural networks also let us do interesting things in QSAR space such as multitask learning or even federated learning, and I think these approaches will be the standard in the future.

Can we simulate the effect of magnetized water clusterization through machine learning tools?

I don’t know much about water magnetization, but I know there has been some work in simulating the behavior of water clusters with neural networks. I am not sure if anyone has used these methods to infer magnetic properties though!

What are the job prospects for someone with a background in chemistry and computer science who wants expertise in machine learning in the context of computer science?

Good question. There is definitely interest in industry, in pharmaceuticals but also increasingly in materials design, including lots of startups in the last few years. I think there are roles for people with more computer science experience and for people with more chemistry experience; ideally a team would include people with both backgrounds.

What would be more appropriate machine learning techniques in de novo drug discovery?

This is what I work on now and it is an open question. SMILES-based generative models are performant enough that more complex graph-generating methods don’t seem to pay off. I think the biggest issue is honestly getting useful scoring functions, so incorporating more multitask QSAR and physics-based/assisted approaches.

Do you envision a phase of disappointment after the current hype of machine learning, similar to what has happened with other promising technologies?

If you haven’t already, read about the “AI winter” of the 1980s: machine learning has been through this hype-crash cycle before. I think the current hype around general machine learning comes from a few high-profile technologies of the last decade or so: convolutional neural nets for images, recurrent networks (and now transformers) for text, and advances in reinforcement learning for game playing. Each of these arrived in rapid succession, made a big difference in its respective domain, and helped maintain and build hype. Without another flashy advance, I think we will see another slowdown. I am not sure we will see such a large ‘crash’ per se; I foresee more of a quieting down of interest, and in some ways that might already be happening with, for example, generative models.

Does machine learning have applications in the design-of-experiment optimization process?

Machine learning is a big umbrella, so yes, but I don’t put general optimization under machine learning. Instead, many of the DOE methods, at least the ones I use, depend on an explicit surrogate to predict how the objective function behaves on unseen points. Training these surrogates, which are usually Gaussian processes or related models, is a machine learning task. Maybe the DOE people would say the algorithmic choice of how you use this information is something else (i.e., not machine learning), but I don’t think the label really matters.
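The surrogate-driven DOE loop described above can be sketched in a few lines of plain Python. As a stand-in for a trained Gaussian-process mean, this toy uses a Gaussian-kernel weighted average of the observed points; the objective function and candidate grid are invented for illustration.

```python
import math

# Minimal surrogate-assisted design-of-experiments sketch.
# A Gaussian-kernel weighted average of observed (x, y) pairs stands in
# for the mean prediction of a trained Gaussian-process surrogate.

def surrogate(x, observed, bandwidth=0.5):
    """Cheap stand-in for a Gaussian-process mean: kernel-weighted average."""
    weights = [math.exp(-((x - xi) ** 2) / (2 * bandwidth ** 2))
               for xi, _ in observed]
    return sum(w * yi for w, (_, yi) in zip(weights, observed)) / sum(weights)

def objective(x):
    """Hypothetical expensive experiment (maximum at x = 2)."""
    return -(x - 2.0) ** 2

# Three experiments have already been run:
observed = [(x, objective(x)) for x in (0.0, 1.0, 3.0)]

# Propose the next experiment where the surrogate predicts the best outcome.
candidates = [i * 0.1 for i in range(41)]  # grid over [0, 4]
next_x = max(candidates, key=lambda x: surrogate(x, observed))
print(next_x)
```

Note that the greedy pick drifts toward the region farthest from the one bad observation; a real Bayesian DOE loop would add an uncertainty term (e.g., expected improvement) on top of the surrogate mean to balance exploration against exploitation.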

How could machine learning help ab initio algorithms minimize calculation times?

A whole lot of ways! One idea is to build a neural network to predict density functional theory (DFT) energies as a function of the structure, then you can run the simulation on a neural network potential, and only call the DFT to check on, and update, the potential as needed. Other clever people have used machine learning directly integrated into the Hamiltonian, or to predict which orbital pairs to keep in a post-HF method. You can also accelerate geometry optimizations by bootstrapping a surrogate model at each step. In our work, we used machine learning models to construct good-quality starting geometries for our calculations, which reduces the number of optimization steps needed. We have also looked into using machine learning to predict when multireference methods are needed (vs. single-determinant DFT), which can help save a lot of time!
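The first idea — run on a cheap learned model and only pay for the expensive method when the model is unsure — can be sketched as an uncertainty-gated loop. Everything here is a toy stand-in: `expensive_dft` is a placeholder function, and the two-nearest-neighbour "ensemble" is a crude proxy for the disagreement of a real neural-network ensemble.

```python
# Toy version of the "surrogate with DFT fallback" loop: trust the cheap
# surrogate when its uncertainty estimate is low, otherwise call the
# expensive method and retrain. All components are invented stand-ins.

def expensive_dft(x):
    """Placeholder for a costly DFT single-point energy."""
    return (x - 1.0) ** 2

class EnsembleSurrogate:
    """Crude surrogate whose uncertainty is the disagreement between the two
    nearest training points (a stand-in for neural-network ensemble spread)."""

    def __init__(self):
        self.data = []  # (structure, energy) training pairs

    def fit(self, x, y):
        self.data.append((x, y))

    def predict(self, x):
        if len(self.data) < 2:
            return 0.0, float("inf")  # too little data to trust
        nearest = sorted(self.data, key=lambda p: abs(p[0] - x))[:2]
        values = [y for _, y in nearest]
        return sum(values) / 2, max(values) - min(values)

surrogate = EnsembleSurrogate()
dft_calls = 0
for x in [0.0, 0.5, 1.0, 1.5, 2.0]:
    mean, spread = surrogate.predict(x)
    if spread > 0.3:  # uncertain: pay for DFT and update the surrogate
        surrogate.fit(x, expensive_dft(x))
        dft_calls += 1
    # else: trust the cheap surrogate prediction (mean)
print(dft_calls, "of 5 points needed DFT")
```

The gating threshold and the spread estimate here are deliberately crude; real machine-learned potentials use calibrated ensemble variance or related uncertainty measures to decide when to fall back to DFT.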

Do you have any comments or suggestions for the prediction of biological activity data? How do the new methodologies perform vs. QSAR models?

So we have been doing “machine learning in chemistry” for decades; the only difference now is that we have a larger toolbox of models that might or might not help build better activity models. QSAR methods in particular have benefited from a lot of optimization and seem to extract almost all the useful predictive power out of affinity data. Whether deep learning methods beat canonical QSAR approaches depends on who you ask, but in my experience, one is almost never worse off with ChemProp instead of a fingerprint method (though one might not be as much better off as one hopes). Neural networks also let us do interesting things in QSAR space such as large-scale multitask learning or even federated learning, and I think these approaches will be the standard in the future. These methods give us a way to overcome the typically limited amount of affinity data we have for a particular target by bringing in more information.

How accurate can the latest machine learning methods be in predicting the possible starting synthons for a designed novel molecule or polymer material?

I can only really comment that for drug-sized organic molecules we can actually make reasonable synthon predictions in many cases – machine learning-based tools are used in the real world every day. As for polymers, I think there is a lot less published, so I am not sure. In principle, yes, but it will depend on the data that is available. In my experience, these methods are extremely reliable at finding feasible, commercially available synthons for common disconnects (amide bonds, etc.) but a little less reliable for more exotic chemistry. This is pretty cool because it shifts some of the busy work away from synthetic chemists, letting them focus on more interesting problems!

I am an organic chemistry Ph.D. student. I have started learning the basics of machine learning. Is it possible to work in the machine learning chemistry field in postdoctoral research, or is it already too late?

Definitely not! My Ph.D. group worked with a number of postdocs from purely chemistry backgrounds, and there is a lot of domain experience you gain from a Ph.D. that can be useful in applying machine learning methods. I would probably recommend trying to join a group that does machine learning so you can learn from them. That said, being comfortable with Python scripting (or some similar language) is pretty crucial and those skills take time to practice, so that might be a great additional skill to obtain. There are a lot of good online courses.

Have you ever tried to modify PTFE to design an inorganic-organic hybrid complex?

No, I haven’t, but it sounds interesting. I am more on the machine learning/comp chem side, so I don’t know how easy it would be to do in practice.

How well does machine learning perform in predicting biological activity, and what does this prediction mainly depend on?

So we have been doing “machine learning in chemistry” for decades; the only difference now is that we have a larger toolbox of models that might or might not help build better activity models. QSAR methods in particular have benefited from a lot of optimization and seem to extract almost all the useful predictive power out of affinity data. Whether deep learning methods beat canonical QSAR approaches depends on who you ask, but in my experience, one is almost never worse off with ChemProp instead of a fingerprint method (though one might not be as much better off as one hopes). Sadly, I don’t think we have gotten much better at activity prediction in the last few years, but neural networks also let us do interesting things in QSAR space such as large-scale multitask learning or even federated learning, and I think these approaches will be the standard in the future. These methods give us a way to overcome the typically limited amount of affinity data we have for a particular target by bringing in more information. Some other limiting factors apart from dataset size are the quality of the data and the sensitivity to small structural changes (activity cliffs). These are pretty difficult to deal with since all machine learning models are smooth functions and struggle to learn large jumps (as humans sometimes do when trying to rationalize SAR).

How many DFT calculations (training points) do you need to parametrize a reliable NN model?

It depends on how dissimilar the new structures you want to predict are from your training data. I would say a few hundred at least to predict static properties, though if you want a machine-learned potential (i.e., a force field) to predict dynamics you might need millions. One nice thing is as long as you keep an eye on uncertainty, you can be selective in which new data points you acquire.

I'm confused by the claim that you can't use a DFT model to compare a large number of compounds. There are pretty good and cheap models such as cepC.

Is that a semiempirical method? The main issue is that the systems I was studying are open shell, and transition metals are generally not well parameterized by these methods, which don’t handle spin state ordering well. This has implications for bond lengths and redox potentials. Another point is that machine learning methods trained on DFT data can often outcompete semiempirical methods while being much cheaper (once the DFT is done, of course!).

How many DFT calculations would be required to train the neural networks to suggest the next calculation?

In this work, I was doing about 100 DFT calculations each time, but it would be possible to do more or fewer. From a design-of-experiments perspective, it is actually optimal to update the model after every calculation, but this is time-consuming and inconvenient, so some degree of batching is needed.

Is it possible to use machine learning in environmental remediation? For example, in degradation of pollutants using photocatalysts?

Yes, I can’t see why not. Machine learning is fairly general, and predicting some property of a certain molecule doesn’t really depend on what that property is (though different methods might be more or less suited to certain tasks). I have seen some work on predicting light-harvesting abilities, but I don’t really follow the remediation literature.

How could you augment the amount of data you obtain from the DFT so as to train your model?

There are a few neat ideas that can help. One idea I particularly like for chemistry is transfer learning, which is building a model to predict some kind of chemical property from a big database, and then fine-tuning it on some smaller dataset. For example, these authors trained a model on a large number of cheap(er)-to-compute DFT energies and then fine-tuned the model on more expensive coupled-cluster calculations. One can also do ‘unsupervised’ learning, where you train a model to predict some basic thing about the molecule such as hiding one atom or bond and training the model to supply the missing atom or bond label, as in this IBM work. In either case, you hope to teach the model the basic rules of chemistry, and it might be easier to teach it about some specific property with fewer examples (much as it is easier to understand why polar molecules are good if you already understand polarity). Another idea is data augmentation, which is done for image-based models by rotating or zooming in or out on the images, increasing the number of samples. This can also be applied to chemistry, for example, by writing multiple identical representations of the molecules. Neither case is as good as getting more data, but sometimes that is all one can do.
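The cheap-to-expensive transfer idea can be illustrated with a deliberately tiny, delta-learning-style sketch: fit a one-parameter model on plenty of "cheap level of theory" data, then learn a small multiplicative correction from a handful of "expensive level of theory" points. All numbers, including the 1.1x relationship between the two levels, are invented for illustration.

```python
# Tiny transfer/delta-learning sketch: closed-form slope fit on cheap data,
# then a small correction learned from a few expensive points. The datasets
# and the 1.1x offset between levels of theory are invented.

def lsq_slope(data):
    """Closed-form least-squares slope for a y = w * x model."""
    return sum(x * y for x, y in data) / sum(x * x for x, _ in data)

cheap = [(float(x), 2.0 * x) for x in range(1, 101)]  # big cheap dataset, y = 2x
expensive = [(1.0, 2.2), (2.0, 4.4), (3.0, 6.6)]      # few points, y = 2.2x

w_cheap = lsq_slope(cheap)  # "pretrained" model from cheap data (w ~ 2.0)

# "Fine-tune": learn a correction mapping cheap-level predictions to the
# expensive level, using only the three expensive points.
correction = lsq_slope([(w_cheap * x, y) for x, y in expensive])

def predict(x):
    """Expensive-level estimate: cheap model plus learned correction."""
    return correction * w_cheap * x

print(round(predict(5.0), 2))  # close to the expensive-level 2.2 * 5 = 11.0
```

The point of the exercise is the data economy: the bulk of the model comes from plentiful cheap data, and only the small correction needs expensive data — the same logic that motivates fine-tuning a pretrained neural network.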

Can we also use machine learning in the band-gap engineering of materials?

This one I can be more confident about: definitely! There are numerous examples. It seems like band gaps are quite easy to predict in many cases.

I find that with machine learning properties using quantum chemistry, the big issue is the lack of data. What are your thoughts on whether you have “enough” data - not just in amount but in variety, especially if using DFT data as opposed to experimental?

One nice thing about using QM data is that we can be really optimal in the choice of what data we acquire (as opposed to experiments, where some materials would be very informative but we can’t make them). Picking data points to cover model blind spots can be very effective. Of course, there are still limits, especially for larger or complex molecules. One idea I particularly like for chemistry is transfer learning, which is building a model to predict some kind of chemical property from a big database, and then fine-tuning it on some smaller dataset. For example, these authors trained a model on a large number of cheap(er) DFT energies and then fine-tuned the model on more expensive coupled-cluster calculations. One can also do ‘unsupervised’ learning, where you train a model to predict some basic thing about the molecule such as hiding one atom or bond and training the model to supply the missing atom or bond label, as in this IBM work. In either case, you hope to teach the model the basic rules of chemistry, so that it becomes easier to teach it about some specific process or result. Another idea is raw data augmentation, which is done for image-based models by rotating or zooming in or out on the images, increasing the number of samples. This can also be applied to chemistry, for example, by writing multiple identical representations of the molecules. Neither case is as good as getting more data, but sometimes that is all one can do.

For this case study, what were the extrapolation capabilities of these machine learning models? I also would like to hear your overall insight on the extrapolation of machine learning methods in general as well, i.e., when the maximum or minimum values are desired and are not available in the training set.

In the case study, the models worked well enough, with look-ahead errors (i.e., on the unseen next round of complexes) around 0.2 eV; you can find all the relevant metrics and plots in the paper and SI. While not flawless, the predictions are of sufficient quality to help choose which complexes to study next, so I would say they are fit for purpose. Perhaps that is the best way to look at it: what is good enough for one application might not be sufficient in another. My general feeling is that the community is fairly good at predicting some properties (band gaps, atomization energies) and worse at others, such as biological activity (which is obviously a much more complicated quantity). In terms of not having representative training data, that is obviously not a good situation, but I think it is almost more important to have some kind of uncertainty metric to warn that you are reaching too far from the training data, since even compounds that lie in the range of the training points might have different chemistry and end up on the opposite end of the scale relative to where you would expect based on the training examples.

Is logP really a good surrogate for solubility? Did you consider alternative descriptors?

Possibly it is the best we can do. I think it is a reasonable option given the uncertainties for these systems. Unfortunately, our understanding of the solubility of inorganic complexes is not nearly as good as it is for organics and it is hard to find something unambiguously better that we can access with DFT. Since we depend on QM solvation energies, we at least account for buried vs. exposed polar groups in the ligands in the relevant complex assembly. However, I think that the main contribution of our work is a method that we believe is quite general, instead of a specific prediction or property.

DFT with approximate density functionals is well known to severely fail for high-spin open-shell cases, especially since, in these cases, the ground state electronic structures are multiconfigurational. Is this aspect usually ignored in your process of training the ANNs?

It is a good question and something that we worry about a lot. One way we probe the suitability of hybrid DFT is by routinely varying the fraction of exact exchange (Hartree-Fock exchange, HFX) used, from 0% (pure GGA) to 30%, and measuring how the relative spin state energies are perturbed. It turns out that different materials have different sensitivity to HFX, and we looked at predicting this sensitivity in previous work and even how we could use this information to steer our optimization away from complexes with predicted high sensitivity. Our ANN models can provide predictions at any HFX fraction desired. However, for transition metal complexes DFT will only ever take us so far, and my colleagues have been developing machine learning models to predict the extent of multireference character in complexes before simulating them, allowing us to deploy multireference methods automatically when we sample regions of chemical space that require them. Looking at their data, DFT is actually not all that bad on average though it is sometimes very wrong. This was not used in the case study I showed, so a sensible follow-up would be to screen any ideas with more accurate methods before actually making them. Still, it is a lot easier to do 30 multireference calculations instead of hundreds! For me, I am mostly interested in the optimization algorithm and I think it would be applicable to most things one could compute with DFT.

May I ask your opinion on the direct prediction of electron density?

This has been shown by Kieron Burke and Klaus-Robert Müller among others, and there are also orbital-free DFT approaches that predict kinetic energy as a function of density. It looks very neat and could be highly transferable, but it is difficult to obtain and manipulate reference data (3D point clouds are a pain). I think I prefer to model the actual endpoint directly, be that the energy of the conformation, since that is what we are actually trying to estimate; there is only one step in the process (input -> energy) and it is generally a little simpler.

Can you please say the exact properties you considered from DFT for ANN?

We only use DFT for the endpoints, i.e., redox potential and logP, so that we don’t need to do a DFT calculation for a new complex before making a prediction. We computed these quantities from free energy differences between complexes in different oxidation states and in different solvents.

In your case study, you mentioned the product of nuclear charges (in shells) as a quantity to investigate. Is this something a human told the algorithms to try, or was this something the algorithm found that led to better predictions of dG and/or log P?

A bit of both. We started with a large list of possibilities, inspired by our domain knowledge but also trying to be as inclusive as possible, and then used machine learning to select the most relevant variables. It actually depends on which output property you choose, for example, one gets a slightly different result when predicting spin state ordering vs predicting redox potentials. In that case, we could rationalize the finding from a chemical perspective, so we now do this analysis for any new properties to try and gain some insights into which parts of the complex are the best targets for smart design (i.e., spin states are strongly controlled by the first shell, while the second shell still contributes a lot to redox potential). You can read all the details in the original paper here. The latest trend with graph convolutional neural networks is to essentially make this more automatic and require less domain input (since we might be wrong), but I think it depends on how much data you have, and especially for the smaller datasets human knowledge can add quite a bit.

Which programming languages are important for machine learning?

Python is the most common and has the most flexible, up-to-date libraries. Under the hood, the libraries usually call efficient functions written in C/C++. But most languages have ‘good enough’ options for most tasks.

How many molecules did you use for training the model?

In the work I showed, I started with a few hundred (~300) up to a few thousand. For these systems, it is difficult to get much more, though organic chemistry datasets are usually a lot bigger.

Your training set was based on DFT, but how well does the DFT training set represent experimental data? How many hits from the machine learning predictions were observed in the lab, and how close were they to redox/solubility predictions?

Heather’s group only works on theory, and the focus is to develop algorithms for combining quantum chemistry and machine learning for molecular optimization, so we don’t have the capacity to make the compounds in the lab. However, our approach has been compared to experiments and showed around a 0.3 eV difference. Numbers for solubility are much harder to come by! But I hope the approach is general and could be applied to other systems as well.

How do you decide which descriptors to include in machine learning model training?

The best practice is to try a number of different approaches and use a method such as cross-validation to select the descriptors that give you the lowest generalization risk. However, other factors can be important! If your descriptors are calculated by DFT or depend on experimentally measured properties (say melting point or lattice constant), you might run into a problem using your model on new cases where this data is not available. Also, we generally prefer descriptors that we can understand and explain, even if they give slightly worse performance than others that are harder to visualize and understand.
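To make the cross-validation idea concrete, here is a minimal sketch in NumPy on synthetic data (the descriptors and targets here are illustrative, not from the DFT work discussed above): every candidate descriptor subset is scored by k-fold CV error, and the subset with the lowest estimated generalization risk is kept.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n = 60
X = rng.normal(size=(n, 4))  # 4 candidate descriptors
# Only descriptors 0 and 2 actually influence the target.
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + 0.1 * rng.normal(size=n)

def cv_error(cols, k=5):
    """Mean squared error of a least-squares fit under k-fold cross-validation."""
    idx = np.arange(n)
    errs = []
    for fold in range(k):
        test = idx[fold::k]
        train = np.setdiff1d(idx, test)
        coef, *_ = np.linalg.lstsq(X[np.ix_(train, cols)], y[train], rcond=None)
        pred = X[np.ix_(test, cols)] @ coef
        errs.append(np.mean((pred - y[test]) ** 2))
    return float(np.mean(errs))

# Score every non-empty subset and keep the one with the lowest CV error.
subsets = [list(c) for r in (1, 2, 3, 4) for c in combinations(range(4), r)]
best = min(subsets, key=cv_error)
print(best)  # should contain the informative descriptors 0 and 2
```

With more than a handful of descriptors, exhaustive subset search becomes infeasible and one would switch to greedy forward selection or a sparsity-inducing method such as LASSO, but the selection criterion (lowest CV error) is the same.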

What functionals and basis set were you using?

We mainly use old-fashioned B3LYP + LANL2DZ; however, we run all of our calculations at a range of Hartree-Fock exchange fractions from 0 (BLYP) to 0.3, since in our experience the fraction of exact exchange plays more of a role than the form of the exchange-correlation functional. The sensitivity of spin state ordering to varying HFX is actually a really interesting thing to study, and we have built a model that predicts this sensitivity on a case-by-case basis. We have also worked extensively on predicting when DFT will not be appropriate and when we need to use a multireference method, so we can now make these modeling choices in an automatic way!


That is very cool work. In a sense, enhanced sampling is related to the active learning I discussed, steering the model to a fruitful area in chemical/conformational space in a data-driven way. Machine learning is being applied to every part of a chemical simulation; I look forward to running machine learning metadynamics on a DFT-accurate neural network potential for a full protein in the near future!

Do you know about chemometrics, and what do you think about it?

It is not something I really work with or see used a lot, but I understand it as combining different measurements and signals to interpret chemical systems. I don’t really have an opinion about it, but I have seen a lot of interest in getting machine learning methods more directly involved with understanding spectra (signals), automated peak assignment, etc. I am not sure if you draw a distinction with cheminformatics, which to me is a little broader in scope, involving QSAR, chemical databases, etc. I think many of these methods use techniques that I would place under the umbrella of machine learning in chemistry in some sense, e.g., PCA, PLS, or SVR. Certainly, people have been applying machine learning to chemistry for as long as there has been machine learning.

How can we deal with a small number of samples if we can’t do more experiments or study more samples? Machine learning needs a lot of samples to train and validate models. Is it possible to generate data or to use another approach?

There are a few neat ideas that can help. One idea I particularly like for chemistry is transfer learning, which is building a model to predict some kind of chemical property from a big database and then fine-tuning it on some smaller dataset. For example, these authors trained a model on a large number of cheap(er)-to-compute DFT energies and then fine-tuned the model on more expensive coupled-cluster calculations. One can also do ‘unsupervised’ learning, where you train a model to predict some basic thing about the molecule, such as hiding one atom or bond and training the model to supply the missing atom or bond label, as in this IBM work. In either case, you hope to first teach the model the basic rules of chemistry, which makes it easier to then teach it about some specific process or result. Another idea is augmentation, which is done for image-based models by rotating or zooming in or out on the images, increasing the number of samples. This can also be applied to chemistry, for example, by writing multiple equivalent representations of the same molecule. Neither approach is as good as getting more data, but sometimes that is all one can do.
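The transfer learning idea can be illustrated with a deliberately tiny linear model in NumPy (everything here is synthetic and illustrative, not the method from the papers mentioned): pretrain on plentiful "cheap" data whose labels are slightly biased, then fine-tune on ten accurate "expensive" points by shrinking toward the pretrained weights rather than toward zero.

```python
import numpy as np

rng = np.random.default_rng(1)
true_w = np.array([1.5, -2.0, 0.5])

# Plentiful "cheap" data with slightly biased labels (stand-in for DFT);
# pretrain with ordinary least squares.
X_big = rng.normal(size=(1000, 3))
y_big = X_big @ (true_w + 0.1) + 0.05 * rng.normal(size=1000)
w_pre, *_ = np.linalg.lstsq(X_big, y_big, rcond=None)

# Scarce "expensive" data with accurate labels (stand-in for coupled cluster).
X_small = rng.normal(size=(10, 3))
y_small = X_small @ true_w + 0.01 * rng.normal(size=10)

def fit_toward(prior, alpha=5.0):
    """Ridge fit on the small dataset that shrinks toward `prior`, not zero."""
    return np.linalg.solve(X_small.T @ X_small + alpha * np.eye(3),
                           X_small.T @ y_small + alpha * prior)

w_transfer = fit_toward(w_pre)       # fine-tune: start from the pretrained model
w_scratch = fit_toward(np.zeros(3))  # no pretraining: plain ridge regression

err_transfer = float(np.sum((w_transfer - true_w) ** 2))
err_scratch = float(np.sum((w_scratch - true_w) ** 2))
print(err_transfer, err_scratch)
```

The fine-tuned model ends up much closer to the truth because the cheap data already put it in roughly the right place; real transfer learning with neural networks (warm-starting weights, freezing early layers) follows the same logic at larger scale.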

In, for example, PCA decomposition, can machine learning techniques provide any substantive advantages over traditional regression methods?

I would say traditional regression methods are a subset of machine learning methods, and sometimes they are still the best choice, especially with limited data. That said, modern deep learning has definitely led to extreme advances in many areas, for example, image classification and natural language processing. In those fields, the difference in performance between traditional methods such as SVR and neural networks is large and indisputable. In the chemical sciences, modern deep learning methods are allowing us to solve different kinds of problems compared to what we could do with traditional ‘shallow’ models, for example, neural network potentials, synthesis planning, and protein structure prediction. In these cases, they are not really competing with other methods as much as solving different problems. For some chemistry problems, such as QSAR, shallow methods are competitive with or only slightly worse than neural networks, though the difference is not large and we don’t really know why – it is a very difficult problem for machine learning because small changes (e.g., magic methyls) can lead to huge changes in the output, and usually the amount of data is not that large (the image classification people have millions of images to learn from). In regards to PCA specifically, as a dimensionality reduction technique, there are modern machine learning alternatives such as t-SNE or UMAP, which in my opinion tend to work better, but there may be some cases where PCA is preferred, particularly because it can be easier to interpret and faster for large datasets. Since these methods are non-linear, they can recover patterns in the data that PCA cannot – here is a great online demo of t-SNE.
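For intuition on what PCA actually does, here is a from-scratch version in NumPy on synthetic 2D data: center the data, take an SVD, and read off the directions of maximum variance (t-SNE and UMAP are nonlinear and considerably more involved, so they are not sketched here).

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data stretched along one axis, then rotated by 30 degrees so the
# dominant direction is no longer a coordinate axis.
Z = rng.normal(size=(500, 2)) * np.array([3.0, 0.5])
theta = np.pi / 6
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X = Z @ R.T

Xc = X - X.mean(axis=0)                     # 1. center the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)  # 2. SVD of centered data
components = Vt                             # rows = principal directions
scores = Xc @ Vt.T                          # 3. project onto those directions

explained_var = S ** 2 / (len(X) - 1)
print(explained_var)  # first component carries most of the variance
```

The recovered first component should line up (up to sign) with the rotated stretch direction, which is exactly the kind of linear structure PCA can find; curved manifolds are where the nonlinear methods pull ahead.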

How do we design our training set?

A good training set should cover the intended chemical space where we will apply the model. In other words, it should be diverse and look like the data we want to predict. For example, if you are predicting the solubility of molecules, you would want a variety of polar groups and also some weakly soluble compounds. If your training data doesn’t include halogens, then you probably shouldn’t try to predict on data that contains them. Having good coverage of the space is more important than the raw number of points. Practically, one usually takes all the data one can get, but if you have the option to choose which data to collect, clustering the candidates and picking one from each cluster is probably a good idea.

What software is needed to do this type of machine learning?

I use PyTorch, a Python library for machine learning. Generally, there is a huge number of open-source Python packages for machine learning, and many papers also make their code available. I have also historically used R, which is great for smaller datasets and simpler models but lags behind for complex neural networks.

Is there a principled or optimal way to identify the model family that will lead to the lowest true risk when properly fitted without having to train them all? Do you have advice for this?

In general, there is no way to know for sure which model will have the lowest (estimated) true risk without trying them all. However, you can often logically refine the search. For example, let us assume you fit 3 neural networks: one with 10 layers, one with 5 layers, and one with 2 layers. When you estimate the risk using cross-validation (CV), you determine that the best model has 2 layers, the next best has 5, and the 10-layer model is the worst since it is severely overfit (side note: this would manifest as near-zero training error alongside high CV error). For your next attempt, it makes more sense to try models with 1, 3, or 4 layers instead of 20 or 200, since it seems like the 10-layer model is too complicated. In this way, you can iteratively refine the model search space, and this forms the basis of most hyperparameter search algorithms – a famous one is hyperopt. In practice, the search space is not 1D and the dimensions are NOT independent, but the idea is the same: start with some random trials and then try to predict which areas of the search space are more productive. For complex models, this is the way to go; for simpler models and smaller search spaces, a grid search is usually good enough.
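Here is that coarse-to-fine refinement in miniature, tuning a single ridge regularization strength by 5-fold CV on synthetic data (hyperopt and friends do a smarter, model-based version of the same loop over many dimensions at once):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 10))
w = np.zeros(10)
w[:3] = [1.0, -1.0, 0.5]  # only 3 of 10 features matter
y = X @ w + 0.5 * rng.normal(size=80)

def cv_score(alpha, k=5):
    """5-fold CV mean squared error of ridge regression at this alpha."""
    errs = []
    for fold in range(k):
        test = np.arange(80)[fold::k]
        train = np.setdiff1d(np.arange(80), test)
        A = X[train]
        coef = np.linalg.solve(A.T @ A + alpha * np.eye(10), A.T @ y[train])
        errs.append(np.mean((X[test] @ coef - y[test]) ** 2))
    return float(np.mean(errs))

# Coarse pass over orders of magnitude, then a finer pass around the winner.
coarse = [10.0 ** p for p in range(-3, 4)]
best = min(coarse, key=cv_score)
fine = [best * f for f in (0.25, 0.5, 1.0, 2.0, 4.0)]
best = min(fine, key=cv_score)
print(best, cv_score(best))
```

Because the fine grid includes the coarse winner, refinement can only improve (or match) the CV error; the same zoom-in logic applies to layer counts, learning rates, or any other hyperparameter.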

Can you please suggest very basic books in machine learning for chemists?

An excellent machine learning book (not chemistry-specific) that starts with the basics is The Elements of Statistical Learning. That is where I started, and I still use it! There is a PDF version on the Stanford website. Otherwise, the book Heather and I wrote is aimed at exactly this audience!

What Python libraries do you use the most?

PyTorch, RDKit, Pandas. I got a lot of use out of OpenBabel and Keras/Tensorflow previously.

How good do you need to be at math to learn machine learning?

It depends on what you want to learn. You can apply machine learning methods (and get good results) without really understanding how they work – many people who run computational chemistry calculations don’t necessarily know all the details of the simulations they run. However, a basic understanding of how machine learning works will obviously put you at an advantage. The good news is that to understand and apply essentially all machine learning methods, you only need some linear algebra and multivariable calculus. Even “how” neural networks work is mathematically simple – they are just collections of many neurons, each of which is pretty easy to understand. That said, there are more theoretical aspects, such as how models generalize (or not) and “why” large neural networks work so well, which involve some more complicated mathematics (mostly probability theory).
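To back up the claim that a single neuron is mathematically simple, here it is in a few lines of NumPy: a weighted sum plus a bias, passed through a nonlinearity (ReLU here). A layer is just many of these at once, and a network chains layers.

```python
import numpy as np

def neuron(x, w, b):
    """One neuron: a weighted sum of the inputs plus a bias, then a ReLU."""
    return float(np.maximum(0.0, np.dot(w, x) + b))

x = np.array([1.0, 2.0])
print(neuron(x, np.array([0.5, -0.25]), 0.1))  # 0.5*1 - 0.25*2 + 0.1 = 0.1
print(neuron(x, np.array([-1.0, 0.0]), 0.0))   # negative pre-activation -> 0.0

def layer(x, W, b):
    """A layer is just many neurons evaluated at once (one row of W each)."""
    return np.maximum(0.0, W @ x + b)
```

Everything else in a neural network – backpropagation, optimizers – is applying multivariable calculus (the chain rule) to stacks of these functions.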

Can machine learning be used in all areas of chemistry, including inorganic and physical chemistry? What is the greatest difficulty in using it in a new area?

I think almost all scientific fields generate data in some form or another, and machine learning is really about whether the patterns in that data can be used to predict future outcomes, so it can be applied across all these different areas. One issue in a new domain is the lack of precedent, as none of the existing descriptors or methods might represent your problem well – but this is also a great opportunity for domain knowledge to help set the machine learning problem up correctly! It can also be difficult to get enough data in specialized areas.

When the construction of metamodels that replace the usual numerical methods is implemented, what is the reduction in computation time in the calculations?

It depends on the model and what it is replacing. In the work I showed, I am swapping DFT calculations that take hours for machine learning models that take fractions of a second. To evaluate my full design space with DFT would have taken ~50 GPU-years, while the machine learning model takes ~4 seconds on a laptop. However, when comparing, for example, classical mechanics/force fields with neural networks, things are a lot closer. It is not my area but I believe the current neural network potentials tend to be slightly more expensive than force fields, although they give accuracy more comparable to DFT. An example paper where this is discussed is here. Some of the more complicated machine learning models could be considerably slower, but I don’t think there is any machine learning model that takes more than a few seconds per compound. Actually, classical machine learning methods such as SVR are often a lot slower than much larger neural networks, especially for large numbers of predictions, because neural networks typically scale linearly (while kernel inversion is cubic in principle).

Biomass decomposition is very complex. Can machine learning be used to carefully predict rate data of the decomposition of biomass, especially when we have about 50 data sets?

I doubt anyone knows – it depends on how the data points are represented and how ‘predictable’ they are. That sounds a bit vague, so let me try to explain. One can usually fit Arrhenius rate expressions to reaction rates at a handful of temperatures. The reason is that the relationship between ‘x’ (inverse temperature) and ‘y’ (log rate) is very smooth (ideally linear). In the same way, the difficulty in predicting something comes down to how directly and smoothly the representation (the x value) is related to the rate. For small datasets, how you represent the data can be very important. Why not try it – with only 50 points you should probably use a very simple model, such as LASSO or ridge regression, and see what the results look like?
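The Arrhenius point is worth seeing concretely: fitting the rate expression is just linear regression of log(rate) against 1/T, so a handful of temperatures is enough when the data follows the law. The data below is synthetic (Ea = 50 kJ/mol, A = 1e10, chosen for illustration), not from any real biomass measurement.

```python
import numpy as np

R = 8.314  # gas constant, J/(mol K)
Ea_true = 50_000.0          # activation energy, J/mol
lnA_true = np.log(1e10)     # log pre-exponential factor

# Five "measured" temperatures following ln k = ln A - Ea/(R T) exactly.
T = np.array([500.0, 550.0, 600.0, 650.0, 700.0])
ln_k = lnA_true - Ea_true / (R * T)

# Arrhenius fitting is a straight line in (1/T, ln k) coordinates.
slope, intercept = np.polyfit(1.0 / T, ln_k, 1)
Ea_fit = -slope * R
print(Ea_fit)  # recovers the activation energy, ~50,000 J/mol
```

With real, noisy rate data the same fit works, just with error bars; the regression only gets hard when the x-to-y relationship stops being this smooth.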

Compared with big data and other kinds of data, chemical data is a little bit small and expensive to enlarge. Which machine learning strategies can be employed to advance in this field?

While many types of chemical data are scarce, that is not always the case. There are publicly available datasets of tens of millions of DFT calculations, millions of reactions mined from patents, and large databases like the CSD, the Materials Project, the Open Materials Database, the OQMD, the PDB, etc. Also, the amount of data needed can depend a lot on what you are trying to achieve – it may be that small data can be quite useful as long as it is similar to what you want to predict. However, in some cases there is not enough data, and one neat idea is transfer learning, i.e., building a model on one of these abundant data sources and then fine-tuning it on smaller datasets.

That was exactly the point – Jon uses terms such as minimal data and lots of data, but it is hard to judge what minimal and lots actually mean. It is difficult to define how much data is ‘enough’, since different applications have different requirements and some datasets are just easier than others. As a toy example, consider fitting rate expressions to an Arrhenius plot (this is nothing other than a simple linear regression problem). You can probably get good kinetic parameters from a handful of reactions because the relationship between 1/T (the x variable) and log rate (the y variable) is simple and smooth. It doesn’t need a very complex regression function to describe this data, and so we don’t need much of it. In principle, the same applies to more complicated cases – if the way you represent your data, the x values, is smoothly correlated with what you are trying to predict, it is much easier. So it is not only a function of the amount of data but also of how you represent it, and of how accurate a model needs to be to be useful. In drug discovery, even a very weak model that is correct 10% of the time is very useful, since getting one active molecule in ten tries is a really good result. If you really want to put a number on things, I wouldn’t look too much into datasets with <50 points unless you can basically see the model by eye (as in the Arrhenius case), and I would only go to ‘deep’ models like neural networks with at least a few hundred points (however, many classical algorithms such as LASSO can be very useful in that range).

Do you know of a community (Stack, Facebook group/page) to join and interact about machine learning?

I find the machine learning subreddit is very lively and even well-known people in the machine learning community such as Yoshua Bengio participate occasionally.

Is it possible to make predictions of free energy in reactions?

Yes, here is an example paper predicting barriers – they even make their trained model freely available, so you could try it. In general, with enough data, we can expect to predict any physical quantity. However, note that they use quite a large quantum chemical dataset to train the model.


Connect with ACS Editors at the 25th Annual Green Chemistry & Engineering Conference

ACS Publications is a proud sponsor of the 25th Annual Green Chemistry & Engineering (GC&E) Conference and is pleased to have several of its journals and their editors participating in this year’s event. GC&E will be a virtual conference for the second time and will take place June 14-18, 2021, with a focus on the theme “Sustainable Production to Advance the Circular Economy.”

ACS Sustainable Chemistry & Engineering Lectureship Awards Keynotes

The Green Chemistry Institute is the co-sponsor of the 2021 ACS Sustainable Chemistry & Engineering Lectureship Awards, and each of the three award winners will be presenting a keynote address at GC&E on Wednesday, June 16, 10:30 a.m. – 12:00 p.m. EDT.

Nominate an outstanding colleague for the 2022 ACS Sustainable Chemistry & Engineering Lectureship Awards or for one of these two awards from Environmental Science & Technology and Environmental Science & Technology Letters:

Meet ACS Editors During GC&E Daily Networking Breaks

Each day of the conference, attendees will have the opportunity to join a variety of online networking breaks, including several hosted by ACS Publications journals and ACS Symposium Series eBooks.  During these ACS journal sessions, Editors-in-Chief and members of their editorial teams will give brief presentations and then take questions from attendees.

Use these opportunities to connect with ACS Editors and learn more about their journals or eBooks, recent updates and advances,  and the publishing process:

Monday—ACS Symposium Series eBooks

A discussion of the Sustainability & Green Polymer Chemistry Volume 1: Green Products and Processes from ACS Symposium Series eBooks will be held with H.N. Cheng, Editor and 2021 American Chemical Society president, and Richard A. Gross, Editor and 2018 recipient of the Affordable Green Chemistry Award from the ACS.

Monday, June 14, 12:00 p.m. – 12:55 p.m. EDT

Tuesday—ACS Sustainable Chemistry & Engineering

What’s new with ACS Sustainable Chemistry & Engineering: Letters from Emerging Scholars video series, a new approach for recruiting editorial board members, and the latest in journal front matter with:

  • Editor-in-Chief David Allen
  • Associate Editor Audrey Moores
  • Editorial Advisory Board Chair Paul Anastas
  • Executive Editor Peter Licence

Tuesday, June 15, 12:00 p.m. – 12:55 p.m. EDT

Wednesday—ACS Chemical Health & Safety

Learn tips for publishing your work in ACS Chemical Health & Safety from the editors. Plus, find out more about the upcoming Virtual Special Issue “Safety Policy, Regulations, and Codes from Around the World” and its connection to green chemistry.

Wednesday, June 16, 12:00 p.m. – 12:55 p.m. EDT

Thursday—ACS Agricultural Science & Technology and ACS Food Science & Technology

Learn about the Journal of Agriculture and Food Chemistry expansion from a single journal to a family of three with the 2020 launch of ACS Agricultural Science & Technology and ACS Food Science & Technology with:

  • Laura L. McConnell, Deputy Editor of ACS Agricultural Science & Technology
  • Coralia Osorio Roa, Deputy Editor of ACS Food Science & Technology

Thursday, June 17, 12:00 p.m. – 12:55 p.m. EDT

Friday—ACS Omega

Learn about the research featured in the new ACS Omega joint virtual issue “Green Chemistry: A Framework for a Sustainable Future,” which also includes articles from ACS Sustainable Chemistry & Engineering, I&ECR, The Journal of Organic Chemistry, Organic Letters, OPR&D, and Organometallics. Also, learn more about ACS Omega and publishing open access in general with:

  • Associate Editor Adelina Voutchkova-Kostal
  • Publishing Editor Paul Goring

Friday, June 18, 12:00 p.m. – 12:55 p.m. EDT

Call for Authors: New Dynamic ACS In Focus Series

ACS In Focus is seeking an author or author team to write on a broad range of emerging topics for a new series of brief e-books.

The purpose of these e-books is to get new graduate students up to speed on topics they might not have learned as an undergrad but need in their graduate research. ACS In Focus e-books are also designed to introduce new topics to scientists interested in bringing additional and often multidisciplinary perspectives into their lab. With this work, you have the opportunity to reach a wide audience of young minds and prepare them for a successful research experience.

This is your chance to be part of an amazing new e-book series! In 2020, ACS In Focus will publish 10 titles as part of the Inaugural Collection; in 2021 we will begin a regular schedule of publishing twenty titles per year. ACS in Focus Editors are seeking authors for a broad range of topics, including drug development, solar energy conversion, neurochemistry, biology for chemists, and more. As these works are intended to be read in four-to-six hours and offer the fundamentals of a topic, the length of your work should be approximately 35,000 words.

If you are interested in participating in this series, please email: infocus@acs.org.

Call for Authors: Robots in Chemistry

ACS In Focus is a new series of brief, dynamic e-books designed to bring new graduate students up to speed on topics they need to know in their research. As well, these works are designed to introduce topics to scientists interested in bringing additional, and often, multidisciplinary perspectives into their research questions. The series is off to a great start and covers a broad range of topics. The first title, Machine Learning in Chemistry, by Heather Kulik & Jon Paul Janet, published in May 2020 and has had over 25,000 visitors to the product page in June 2020.

Robots in chemistry is a topic of great interest to ACS readers. The ACS In Focus editors are seeking an author or author team to write on the foundations of this topic, keeping the work to approximately 35,000 words, which translates into a four- to six-hour read. Many of our authors select co-authors from their research team, including grad students, post-docs, or former students and colleagues. With this work, you have the opportunity to reach an even wider audience of scientists and introduce them to this flourishing area of research.

If you are interested in authoring, please contact infocus@acs.org

Call for Authors: Mathematics for Machine Learning in Chemistry

ACS In Focus is a new series of brief, dynamic e-books designed to bring new graduate students up to speed on topics they need to know in their research. As well, these works are designed to introduce topics to scientists interested in bringing additional and often multidisciplinary perspectives into their research questions.

The series is off to a great start and covers a broad range of topics. The first title, Machine Learning in Chemistry, by Heather Kulik and Jon Paul Janet, had over 25,000 visitors to the product page in June 2020. In response, students have shared that they’d like an overview in this series on mathematics for machine learning.

The ACS In Focus editors are seeking an author or author team to write approximately 35,000 words on the foundations of “Mathematics for Machine Learning in Chemistry.” With this work, you will have the opportunity to reach an even wider audience of scientists and introduce them to this flourishing area of research. Many of our authors select co-authors from their research team, including grad students, post-docs, or former students and colleagues.

If you are interested in authoring, please contact infocus@acs.org.

Call for Authors: CRISPR/Cas9 For Chemists

ACS In Focus is a new series of brief, dynamic e-books designed to bring new graduate students up to speed on topics they need to know in their research. As well, these works are designed to introduce topics to scientists interested in bringing additional and often multidisciplinary perspectives into their research questions.

CRISPR’s potential across STEM research is exploding. The ACS In Focus editors are seeking an author or author team to write approximately 35,000 words on the foundations around this topic. Many of our authors select co-authors from their research team, including grad students, post-docs, or former students and colleagues.

The series is off to a great start and covers a broad range of topics. The first title, Machine Learning in Chemistry, by Heather Kulik & Jon Paul Janet, published in May 2020 and had over 25,000 visitors to the product page in June 2020.

With this work, you have the opportunity to reach an even wider audience of scientists and introduce them to this flourishing area of research that won researchers Emmanuelle Charpentier and Jennifer Doudna the 2020 Nobel Prize in Chemistry.  

If you are interested in authoring, please contact infocus@acs.org.