Using artificial intelligence to predict outcomes of organic chemistry at IBM Research – Zurich
With a text-based representation of molecules, a neural machine translation model of the kind used in natural language processing accurately predicts chemical reactions.
Philippe Schwaller1, Theophile Gaudin1, Teodoro Laino1
1IBM Research – Zurich (Rüschlikon, Switzerland)
A problem of memorizing and assimilating concepts
For more than 200 years, the synthesis of organic molecules has remained one of the most important tasks in organic chemistry, being a key part of an efficient drug discovery process.
Over the last five years, a growing body of research has sought to transform these age-old practices and usher in a new era of productivity by bringing artificial intelligence (AI) technologies directly into the organic chemist’s laboratory.
The challenge for organic chemists is that hundreds of thousands of reactions have been collected over several decades. While it is manageable to remember a few dozen within a narrow specialist field, it is impossible to be an expert generalist.
Organic chemistry and natural languages
To address this, IBM Research – Zurich set out to determine whether current artificial intelligence technologies can help synthetic organic chemists predict the reactions of organic compounds.
To construct suitable mathematical models, the IBM scientists turned to a well-known string representation in which the atoms of organic molecules are represented by letters and the molecules by words. Building on this analogy, other scientists suggested a few years ago that organic chemistry datasets and natural language datasets have a lot in common: both depend on a specific grammar and on long-range dependencies, and in both a small element can change the entire meaning. In a sentence, a small word like “not” can reverse what is said. Similarly, stereochemistry can turn thalidomide into either a medication or a deadly poison.
With healthy skepticism, the IBM team decided to explore natural language processing technologies to train mathematical models that capture the essence of organic chemical reactivity and can predict the outcome of novel chemical reactions. Treating organic chemistry as a language led naturally to state-of-the-art neural machine translation methods, which are sequence-to-sequence (seq2seq) models. These are the same kind of AI models routinely used to translate between natural languages. With this technology, the IBM team paves the way for an entirely data-driven solution to the forward-reaction prediction problem, in which the starting materials are known and the goal is to generate the products.
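To make the translation framing concrete, the sketch below turns one reaction into a "source sentence" (starting materials) and a "target sentence" (product). It assumes the common reaction SMILES convention `reactants>agents>products`; the function name `split_reaction`, the choice to merge agents into the source, and the esterification example are illustrative assumptions, not details taken from the IBM work.

```python
def split_reaction(reaction_smiles):
    """Split a reaction SMILES into (source, target) strings.

    Assumes the 'reactants>agents>products' layout; molecules within
    each part are separated by '.'. Merging agents into the source is
    a simplification chosen for this sketch.
    """
    parts = reaction_smiles.split(">")
    if len(parts) != 3:
        raise ValueError("expected 'reactants>agents>products'")
    reactants, agents, products = parts
    if not reactants or not products:
        raise ValueError("reactants and products must be non-empty")
    # Everything that goes into the flask forms the source sequence.
    source = ".".join(part for part in (reactants, agents) if part)
    return source, products


# Acetic acid + ethanol -> ethyl acetate (illustrative example).
source, target = split_reaction("CC(=O)O.OCC>>CC(=O)OCC")
print(source)  # CC(=O)O.OCC
print(target)  # CC(=O)OCC
```

A seq2seq model would then be trained on many such pairs, learning to produce the target string from the source string much as a translation model maps a sentence in one language to another.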
The idea has been built into software that takes a graphical representation of the reactants and predicts, with a single click, how the molecules will react.
The overall tool is simple, and the model is trained end-to-end in a fully data-driven way, without querying a database or using any additional external information. With this approach, the IBM team outperforms current solutions on their own training and test sets, achieving a top-1 accuracy of 80.3%, and sets a first score of 65.4% on a noisy dataset of single-product reactions extracted from US patents.
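For readers unfamiliar with the metric, top-1 accuracy is the fraction of test reactions for which the model's highest-ranked prediction matches the recorded product. The following is a minimal sketch of that calculation; it assumes predictions and targets are already normalized to the same string form (for example, canonical SMILES), and the function name and toy data are hypothetical rather than taken from the published evaluation.

```python
def top_k_accuracy(ranked_predictions, targets, k=1):
    """Fraction of examples whose target is among the first k candidates.

    ranked_predictions: list of candidate lists, best candidate first.
    targets: list of reference product strings, one per example.
    Plain string equality is used, so both sides are assumed to be
    written in the same normalized form.
    """
    if len(ranked_predictions) != len(targets):
        raise ValueError("predictions and targets differ in length")
    if not targets:
        raise ValueError("no examples to score")
    hits = sum(
        1
        for candidates, target in zip(ranked_predictions, targets)
        if target in candidates[:k]
    )
    return hits / len(targets)


# Toy data, invented purely for illustration.
predictions = [["CCO", "CCN"], ["CC=O", "CCO"], ["c1ccccc1"]]
references = ["CCO", "CCO", "C1CCCCC1"]
print(top_k_accuracy(predictions, references, k=1))
print(top_k_accuracy(predictions, references, k=2))
```

In this toy set only the first example is right at rank one, so top-1 accuracy is one in three, while allowing two candidates also recovers the second example.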
The secret behind our tool is the simplified molecular-input line-entry system, or SMILES. SMILES represents a molecule as a sequence of characters; for instance, the molecule in the image above can be written as a single SMILES string.
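As a separate illustration (not the molecule from the image), the sketch below shows how a SMILES string can be broken into the "letters" a language model would read. The regular expression is a simplified pattern written for this example and is an assumption, not the tokenizer used by the IBM team; it covers common organic atoms, bracket atoms, bonds, branches and ring closures only.

```python
import re

# Simplified SMILES token pattern (an assumption for this sketch).
# Order matters: bracket atoms and two-letter halogens come first.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize(smiles):
    """Split a SMILES string into tokens, rejecting unknown characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    # If any character was skipped, the tokens will not rebuild the input.
    if "".join(tokens) != smiles:
        raise ValueError(f"could not tokenize: {smiles!r}")
    return tokens


# Ethyl acetate: every atom, bond and bracket becomes one token.
print(tokenize("CC(=O)OCC"))
# ['C', 'C', '(', '=', 'O', ')', 'O', 'C', 'C']

# Bromobenzene: 'Br' stays together, digits mark the ring closure.
print(tokenize("c1ccccc1Br"))
# ['c', '1', 'c', 'c', 'c', 'c', 'c', '1', 'Br']
```

Keeping multi-character symbols such as `Br` or a bracket atom like `[C@H]` as single tokens matters, because splitting them would change what the "word" means, much like splitting letters out of a word in a sentence.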
The AI model was trained entirely on an openly available chemical reaction dataset comprising 1 million patent reactions.
The complete details of this work have been presented in a publication in Chemical Science.
Artificial Intelligence as a tireless assistant
This work is a great example of how AI can draw upon large quantities of knowledge that would be astonishingly difficult for a human to master, particularly when that knowledge needs to be updated on a daily basis. A frequently discussed fear is that AI will slowly replace humans in their daily tasks. In organic chemistry, AI won’t replace the human; it will augment the ability of scientists by digesting large amounts of data that are otherwise difficult to assimilate. It will provide insight into possible new chemical reactions, but in the end it will be human creativity and judgement that decide whether a prediction is meaningful.
In the future, the IBM team aims to enhance the model, with the goal of reaching and surpassing 90% prediction accuracy. Currently the data is taken from information publicly available in US patents published online, but there is no reason why the tool couldn’t be trained on data from other sources, such as chemistry textbooks, scientific publications or proprietary company data.
IBM Research plans to make this tool publicly available for free on the cloud in August 2018. Those interested can sign up at www.zurich.ibm.com/foundintranslation to receive an alert when the web tool is ready for use.
- Cadeddu A, Wylie EK, Jurczak J, Wampler-Doty M and Grzybowski BA. Organic chemistry as a language and the implications of chemical linguistics for structural and retrosynthetic analysis. Angew. Chem. Int. Ed. Engl. 53(31), 8108–8112 (2014)
- Schwaller P, Gaudin T, Lanyi D, Bekas C and Laino T. “Found in Translation”: predicting outcomes of complex organic chemistry reactions using neural sequence-to-sequence models. Chem. Sci. doi:10.1039/C8SC02339E (Epub ahead of print) (2018)