Researchers from the Massachusetts Institute of Technology (MIT; MA, USA) have created a machine learning model to improve the speed and quality of lead optimization. The research, presented at the International Conference on Machine Learning (10–15 July 2018; Stockholm, Sweden), highlights the ability of the model to select lead compounds based on both desired properties and potency.
Chemists often rely on expert knowledge to manually ‘tweak’ lead molecules by adding or subtracting functional groups. While systems to predict optimal chemical properties do already exist, chemists are required to carry out each modification step, in a process that often takes hours and may not result in a drug candidate.
The new model, designed by researchers from both the Computer Science and Artificial Intelligence Laboratory and the Department of Electrical Engineering and Computer Science, is intended to replace these manual modification steps.
"The motivation behind this was to replace the inefficient human modification process of designing molecules with automated iteration and assure the validity of the molecules we generate," commented Wengong Jin, a PhD student and lead author of a paper describing the model.
Unlike other models, which operate on simplified molecular-input line-entry system (SMILES) strings, this system works directly on molecular graphs. Molecular structure data are used to build the graphs, in which nodes represent atoms and edges represent bonds. The graphs are then broken into clusters of functional groups, which serve as building blocks for reconstructing and modifying molecules more accurately.
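To make the graph representation concrete, the following is a minimal sketch of a molecular graph in which atoms are nodes and bonds are edges. It is not the researchers' code: the `MolecularGraph` class and the `bond_clusters` helper are hypothetical names introduced here for illustration, and treating each bond as its own cluster is a deliberate simplification that is only sensible for molecules without rings.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Tuple


@dataclass
class MolecularGraph:
    """Illustrative molecular graph: atoms are nodes, bonds are undirected edges."""
    atoms: List[str] = field(default_factory=list)
    bonds: Dict[FrozenSet[int], int] = field(default_factory=dict)

    def add_atom(self, element: str) -> int:
        """Add an atom and return its index."""
        self.atoms.append(element)
        return len(self.atoms) - 1

    def add_bond(self, a: int, b: int, order: int = 1) -> None:
        """Join two distinct existing atoms with a bond of the given order."""
        count = len(self.atoms)
        if a == b or not (0 <= a < count and 0 <= b < count):
            raise ValueError("a bond must join two distinct existing atoms")
        self.bonds[frozenset((a, b))] = order

    def neighbors(self, a: int) -> List[int]:
        """Return the indices of atoms bonded to atom `a`, in ascending order."""
        return sorted(next(iter(pair - {a})) for pair in self.bonds if a in pair)


def bond_clusters(graph: MolecularGraph) -> List[Tuple[int, int]]:
    """Hypothetical, simplified clustering: one cluster per bond (acyclic molecules only)."""
    return sorted((min(pair), max(pair)) for pair in graph.bonds)


# Ethanol with hydrogens omitted: C-C-O
ethanol = MolecularGraph()
c1 = ethanol.add_atom("C")
c2 = ethanol.add_atom("C")
o = ethanol.add_atom("O")
ethanol.add_bond(c1, c2)
ethanol.add_bond(c2, o)

print(ethanol.neighbors(c2))
print(bond_clusters(ethanol))
```

In this toy example the central carbon has two neighbours and the molecule splits into two overlapping clusters that share that carbon. The model described in the article groups atoms into larger building blocks than this sketch does.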
The system then moves on to the decoding phase: the molecular graph is reconstructed in a “coarse-to-fine” manner, comparable to gradually increasing the resolution of a low-resolution image. This first generates the tree-structured scaffold and then assembles the associated clusters, or functional groups, into a coherent molecular graph that is an exact replica of the original.
During lead optimization the model can be utilized to modify lead compounds based on a preferred property; this is done with the help of a prediction algorithm that scores each molecule with a potency value of that property.
The model optimizes a lead molecule by using the algorithm to modify its vector representation, and therefore its structure, changing the functional groups to achieve a higher potency score. This is repeated until the highest predicted potency score based on the functional groups is found. The new molecule is then decoded from the updated vector by compiling the corresponding clusters.
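The iterative step described above can be sketched as a simple ascent loop over a vector. This is an illustration under stated assumptions, not the published method: `toy_score` is a hypothetical stand-in for the learned potency predictor, the gradient is estimated by finite differences, and decoding the final vector back into a molecule is not shown.

```python
from typing import Callable, List

Vector = List[float]


def toy_score(z: Vector) -> float:
    """Hypothetical stand-in for a learned predictor; its peak is at (1.0, -2.0)."""
    return -((z[0] - 1.0) ** 2 + (z[1] + 2.0) ** 2)


def numerical_gradient(f: Callable[[Vector], float], z: Vector, eps: float = 1e-5) -> Vector:
    """Estimate the gradient of f at z with central finite differences."""
    grad = []
    for i in range(len(z)):
        up, down = list(z), list(z)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def optimize_latent(
    z0: Vector,
    score: Callable[[Vector], float],
    step: float = 0.1,
    tol: float = 1e-8,
    max_iters: int = 1000,
) -> Vector:
    """Move the vector uphill until the predicted score stops improving."""
    z = list(z0)
    best = score(z)
    for _ in range(max_iters):
        grad = numerical_gradient(score, z)
        candidate = [zi + step * gi for zi, gi in zip(z, grad)]
        new_score = score(candidate)
        if new_score - best <= tol:
            break  # no meaningful improvement: stop iterating
        z, best = candidate, new_score
    return z


start = [0.0, 0.0]
optimized = optimize_latent(start, toy_score)
print(toy_score(start), toy_score(optimized))
```

The loop stops once a step no longer raises the predicted score, mirroring the "repeat until the highest predicted score is found" behaviour the article describes. With a real predictor the score surface may have several peaks, so a loop like this would only find a local optimum.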
The artificial intelligence system was trained on approximately 250,000 molecular graphs, taken from the ZINC database, and was tested on tasks including generating valid molecules, finding the best lead candidates and designing novel molecules with higher potencies.
Research highlighted at the conference suggests that every molecule the model generated from a sample distribution was chemically valid, compared with only 43% of the molecules generated by SMILES-based models.
The data also suggest that the model can create new molecules, based on existing leads, with an average increase in potency of more than 80%.
The researchers aim to test the model on additional, more therapeutically relevant properties, but this will require more data.
"Pharmaceutical companies are more interested in properties that fight against biological targets, but they have less data on those. A challenge is developing a model that can work with a limited amount of training data," Jin concluded.