Tales of Similarity

The concept of molecular similarity is central to many applications in chemoinformatics and medicinal chemistry. While similarity is an intrinsically subjective concept, we attempt to quantify it by comparing molecular representations.

Jürgen Bajorath
Jun 12, 2015

First and foremost, this is attempted through application of the famous (or infamous, as one or another medicinal chemist might say) Tanimoto similarity metric. To avoid reiterating the literature (e.g., Maggiora et al. Molecular similarity in medicinal chemistry. J Med Chem 57, 3186, 2014), I'd like to draw attention to only one of the key issues in molecular similarity analysis (perhaps the most important one): we typically try to predict active compounds on the basis of similarity calculations. Simply put, if a test compound is found to be 'similar' to known active molecules (e.g., on the basis of Tanimoto similarity calculations), we assume that the test compound also has a high probability of being active. As has been pointed out at least a few times in the literature, there are SAR caveats to consider when judging similarity relationships in this way. Nonetheless, essentially all computational approaches currently employed for ligand-based virtual screening operate on the basis of this similarity relationship argument.

What should be stressed, however, and this is the point of this post, is that there currently is no reliable way to correlate calculated molecular similarity with observed biological activity similarity (regardless of how molecular similarity is quantified). In fact, I would consider our current inability to infer activity similarity from calculated molecular similarity (and to generalize activity predictions) to be one of the major unsolved problems in computational medicinal chemistry. Importantly, this is not only an academic issue, but one with profound practical implications. Consider that database rankings produced by virtual screening generally contain many false positives at high ranks, a problem practitioners in computational compound screening face on a daily basis. Tackling this problem scientifically is a challenging, interesting, and rewarding task. At the least, raising awareness of it is important for the medicinal chemistry field.
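For readers less familiar with the calculation itself, here is a minimal sketch of the Tanimoto coefficient and of the similarity-based ranking used in ligand-based virtual screening. It assumes fingerprints are given as sets of 'on' bit positions; the toy fingerprints, the compound names, and the helper `rank_by_max_similarity` are hypothetical and serve only as illustration. For two fingerprints with a and b bits set, c of them shared, Tc = c / (a + b - c).

```python
from typing import AbstractSet, Dict, List, Tuple


def tanimoto(fp_a: AbstractSet[int], fp_b: AbstractSet[int]) -> float:
    """Tanimoto coefficient Tc = c / (a + b - c) for two sets of 'on' bits."""
    if not fp_a and not fp_b:
        # Convention chosen for this sketch: two empty fingerprints score 0.
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def rank_by_max_similarity(
    actives: List[AbstractSet[int]],
    candidates: Dict[str, AbstractSet[int]],
) -> List[Tuple[str, float]]:
    """Rank candidates by their highest Tanimoto similarity to any known active."""
    scored = [
        (name, max(tanimoto(fp, active) for active in actives))
        for name, fp in candidates.items()
    ]
    # Highest similarity first; this is the 'database ranking' of a screen.
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical toy data: bit positions only, not real compounds.
known_actives = [{1, 2, 3, 4, 5}, {2, 3, 4, 8}]
test_compounds = {
    "cand_A": {1, 2, 3, 4, 6},
    "cand_B": {7, 8, 9},
    "cand_C": {2, 3, 4, 8, 9},
}

ranking = rank_by_max_similarity(known_actives, test_compounds)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Note that the ranking only orders compounds by calculated similarity. Nothing in the calculation says that a top-ranked compound is actually active, which is exactly the gap discussed above.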


Jürgen Bajorath

Professor, University of Bonn

For further details, see: http://www.limes.uni-bonn.de/forschung/abteilungen/Bajorath/labwebsite


Gerald Lushington almost 5 years ago

Well put! Any reliance on any current generic measure of molecular similarity (especially, but not limited to, the heavily used distance-based formalism within a space defined by substructural fingerprints) as a basis for med chem goals such as SAR analoging is highly imperfect. Conventional similarity concepts are not without some technical value, but surely we as a community can do far better.

Some time ago my colleagues and I took a crack at this issue with an article entitled "Novel Algorithms for the Identification of Biologically Informative Chemical Diversity Metrics" (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2753527/). In this work we asserted that a major flaw with diversity metrics (which, I would argue, are basically similarity metrics examined through a mirror) lay in the attempt to find generic measures to quantify the biological disposition of compounds, without recognizing how the property dependence differs vastly from target to target, and even among different mechanistic variants of the same target. The paper thus proposed a simple-minded formalism for winnowing down a large pool of molecular properties to home in on those that seem best suited to discriminate between active and inactive molecules within a screening data set for a given target.

In retrospect, the specific test cases chosen in that paper to illustrate the method are faulty (we focused on several of the old NCI tumor cell lines, which have since been largely deprecated), so perhaps the data quality was inadequate for deriving real insight. Still, the simple method (and some more sophisticated schemes we've mapped out since) may provide some incremental progress toward the challenge Professor Bajorath has identified.