How drug-like are vendor libraries?

Vendor libraries overlap

We use a few chemical compound vendors in our drug discovery programs (let us call them “Provider I”, “Provider II”, and “Provider III”). Nowadays chemical providers typically claim about 1M drug-like compounds readily available for shipping at a very reasonable cost. With a few dozen providers, up to 10 of them very large, there could be up to 10M distinct compounds overall. How many compounds are really out there?

To optimize our in-silico operations we use clustering to collect molecules of similar structure. The cluster centroids represent the clusters in the lead identification process, and hence the number of clusters is a good measure of the chemical diversity of a library. To cluster the compounds we use the Tanimoto similarity criterion, since this metric lets us use fingerprint sorting to avoid most of the pairwise distance calculations. The number of common structures (shared clusters) is a measure of the (dis)similarity between two chemical libraries. With this in mind we performed a co-clustering of the three vendor libraries (see the Graph on the left). It appears that vendors II and III are roughly of the same “size” (diversity), whereas vendor I has the most diverse collection of compounds. Remarkably, the number of compounds is the largest in collection II and the smallest in collection III, so the compound count alone says little about diversity.
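To make the clustering step more concrete, here is a minimal sketch of Tanimoto similarity and a greedy "leader" clustering over bit fingerprints. It is an illustration under stated assumptions, not our production code: fingerprints are modeled as plain Python integers, the names `tanimoto` and `leader_cluster` and the 0.7 threshold are hypothetical, and the pruning rule is one common way in which sorting by bit count can save comparisons (two fingerprints with a and b bits set cannot have a Tanimoto similarity above min(a, b) / max(a, b)).

```python
def popcount(x: int) -> int:
    """Number of set bits in a fingerprint stored as a Python int."""
    return bin(x).count("1")


def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity: |A and B| / |A or B| over the set bits."""
    union = popcount(a | b)
    if union == 0:
        return 1.0  # convention chosen here: two empty fingerprints are identical
    return popcount(a & b) / union


def leader_cluster(fingerprints, threshold=0.7):
    """Greedy leader clustering (an assumed, simplified scheme).

    Returns a list of [centroid_fp, centroid_bits, member_indices].
    The first member of each cluster serves as its centroid.
    """
    # Process fingerprints in order of increasing bit count.
    order = sorted(range(len(fingerprints)),
                   key=lambda i: popcount(fingerprints[i]))
    clusters = []
    for i in order:
        fp = fingerprints[i]
        n = popcount(fp)
        for cluster in clusters:
            centroid_fp, centroid_bits, members = cluster
            hi = max(n, centroid_bits)
            # Cheap upper bound on the similarity: skip the full
            # comparison when the bit counts alone rule out a match.
            if hi and min(n, centroid_bits) / hi < threshold:
                continue
            if tanimoto(fp, centroid_fp) >= threshold:
                members.append(i)
                break
        else:
            # No existing centroid is similar enough: start a new cluster.
            clusters.append([fp, n, [i]])
    return clusters


if __name__ == "__main__":
    toy = [0b1111, 0b1110, 0b110000, 0b111000]  # toy 6-bit fingerprints
    for centroid_fp, _, members in leader_cluster(toy):
        print(bin(centroid_fp), members)
```

The number of clusters returned is the diversity measure discussed above. Real fingerprints would come from a cheminformatics toolkit, and a production version would also exploit the sort order to stop scanning centroids early rather than only skipping them one by one.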

Now we are ready to see what the claimed “drug likeness” of the compounds might mean. There are two great online libraries: DrugBank, which contains all the small-molecule drugs, and BindingDB, a large collection of compounds with identified activity against specific molecular (protein) targets. To see how the compound libraries and the biologically active compounds relate, we co-clustered each of the vendor libraries with the compounds obtained from the DrugBank and BindingDB databases:

The results are remarkable in a few ways. First of all, the three vendor libraries are very similar in their properties. Each of them contains roughly half of the similarity classes representing the known drugs. This means that half of the current drugs are not “drug-like enough” to be picked up by modern “drug-like” compound selection algorithms. Still, the number of stable compounds of reasonable size is about one or two orders of magnitude larger than the size of modern vendor libraries. This means that chemical synthesis still has a long way to go before libraries cover enough chemical diversity to “rediscover” even the already known drugs!

There is another remarkable conclusion: about 30% of the compound classes overlap with the biologically active compounds from BindingDB, and therefore up to one third of the compounds are biologically active! This may be a signature of a major library construction flaw: the compounds were selected to be “drug-like”, which effectively meant similar to compounds with known biological activity. In practice such promiscuity could mean a lot of side effects and toxicity.
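The two overlap figures quoted above (the share of drug similarity classes found in a vendor library, and the share of vendor classes that also contain known actives) can be read off a co-clustering result with a few lines of code. The sketch below is only an illustration: it assumes the co-clustering has already produced (cluster id, source label) pairs, and the function name `cluster_overlap`, the labels, and the toy data are all hypothetical.

```python
from collections import defaultdict


def cluster_overlap(assignments, library, reference):
    """Summarize how two compound sources share clusters.

    assignments: iterable of (cluster_id, source_label) pairs, one per
    compound, taken from a co-clustering of both sources together.
    """
    sources_by_cluster = defaultdict(set)
    for cluster_id, source in assignments:
        sources_by_cluster[cluster_id].add(source)

    lib = {c for c, s in sources_by_cluster.items() if library in s}
    ref = {c for c, s in sources_by_cluster.items() if reference in s}
    shared = lib & ref  # clusters with members from both sources

    return {
        "library_clusters": len(lib),
        "reference_clusters": len(ref),
        "shared_clusters": len(shared),
        # Share of reference classes (e.g. known drugs) found in the library.
        "reference_coverage": len(shared) / len(ref) if ref else 0.0,
        # Share of library classes that also contain reference compounds.
        "library_fraction_shared": len(shared) / len(lib) if lib else 0.0,
    }


if __name__ == "__main__":
    # Toy data: clusters 0..3, compounds labeled by their source.
    toy = [(0, "vendor"), (0, "drugs"), (1, "vendor"),
           (2, "drugs"), (3, "vendor"), (3, "vendor")]
    print(cluster_overlap(toy, library="vendor", reference="drugs"))
```

Counting clusters rather than individual compounds keeps a single heavily populated class from dominating the overlap estimate.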

About Peter Fedichev, Quantum CTO

Peter Fedichev, Ph.D., Chief Scientific Officer, co-founder