Virtual Screening. Validated.
In so far as science is about large-scale validation using diverse data, the field of computational drug discovery has always been at something of a disadvantage. Anecdotal data and cherry-picking are rife, and publications are partly distorted by the ambition of proving that some modeling technique is “the best”.
The field of virtual screening has been a particular victim of this problem. While virtual screening has undoubtedly come of age and has proven useful for many targets, one has to be very careful with both the methodology and the interpretation. Many pronouncements from virtual screening are of the “We screened a million molecules, skimmed the top 10%, tested them and found four hits ranging between 10 and 50 μM” kind. While such pronouncements indicate useful hits, they are not exactly a validation of the accuracy and range of the screening effort. Would the hit rate have been different had more molecules, or perhaps a subset, been screened? How novel are the hits? How does the method compare to simpler ones? Is there something about the target, or perhaps about the sampling and scoring technique, that privileges that particular method over the others? Many such questions remain unanswered.
That is why a pair of recent papers from the Shoichet and Irwin labs is so valuable - first, because they validate what had only been conjectured, that larger libraries will lead to more and better hits; and second, because they perform a significant public service in making available what I believe is the largest open-access collection of virtual screening results and experimental follow-up released into the public domain to date.
In the first paper, the authors do a head-to-head comparison of two VS campaigns by screening 1.7 billion molecules using their longstanding DOCK docking protocol against AmpC, and comparing the screen to a previous 99 million molecule screen. They synthesize and test 1521 compounds from the ensuing results, compared to 44 for the previous campaign. They ask important questions, like whether hit rates correlate at all with docking scores (which are a notoriously unreliable measure of binding affinity), how well the poses compare with crystallographic poses and whether screening more compounds leads to more real hits. Several interesting observations emerge.
In the previous 99 million screen, where 44 high-ranking molecules had been selected for synthesis and testing, five new inhibitors with activities ranging from 1.3 to 400 μM emerged. In the current 1.7 billion compound campaign, 1521 were synthesized and tested: of those, 1447 were experimentally well-behaved and 1296 were among the top 1% by score. Of those 1296, 168 had an apparent Ki of <166 μM and 122 had Ki values between 166 and 400 μM. Appropriate controls for aggregation (a confounding artifact whose study the group has pioneered) were put in place. Overall, the hit rate from the larger campaign was 22.4%, about twice the 11.4% from the earlier 99 million campaign (for perspective, the bigger screen was 17x the size of the smaller one).
There are some differences in the way the molecules were selected: unlike in the smaller screen, the molecules from the larger screen included ones picked both manually and by score alone. Interestingly, if you pick the same number of molecules (44) from the big and small screens, the hit rate difference is even more pronounced: 47.7% vs 11.4%. This observation indicates that picking the cream of the crop, so to speak, from a bigger screen could lead to very high enrichment; 47.7% is a stupendous hit rate, since success from virtual (and high-throughput) screens can vary wildly depending on the library and the target, with hit rates between 0.1% and 1% not being uncommon.
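Just to keep the arithmetic straight, here is a minimal sketch in plain Python (using only the numbers quoted above) of how those hit rates and the 17x library-size ratio work out:

```python
# Hit-rate arithmetic from the figures quoted above (all numbers from the post).

# Earlier campaign: 44 molecules synthesized and tested, 5 inhibitors found
small_tested, small_hits = 44, 5
small_rate = small_hits / small_tested            # ~0.114 -> 11.4%

# Larger campaign: 1296 top-1%-by-score molecules tested,
# 168 + 122 = 290 inhibitors at or below the 400 uM cutoff
large_tested, large_hits = 1296, 168 + 122
large_rate = large_hits / large_tested            # ~0.224 -> 22.4%

print(f"small-screen hit rate: {small_rate:.1%}")
print(f"large-screen hit rate: {large_rate:.1%}")
print(f"library size ratio:    {1.7e9 / 99e6:.0f}x")  # ~17x bigger library
```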
The authors also find, perhaps unsurprisingly, that when larger numbers of molecules are tested (this time looking at three targets, including the D4 and sigma-2 receptors), there is an exponential correlation between affinity and hit rates. But crucially, simulations indicated that as the number of molecules tested drops, the variability of the results increases and confidence in the true hit rate falls. Therefore - and if there’s a single practical message that should be taken away from the paper, it’s this one - testing more than 100 molecules from a virtual screening campaign might be a good minimum cutoff for getting reliable, high-affinity hits. This does not of course mean that it will work for any target and any library, but it does seem like a sensible floor.
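The intuition behind that ~100-molecule floor can be illustrated with a toy binomial simulation of my own (a sketch, not the authors' actual analysis): assume some true hit rate for the top of a screen and watch how wildly the observed hit rate swings when only a handful of molecules are sent for testing.

```python
# Toy illustration (not the paper's simulation): how noisy is an observed hit
# rate when only n molecules from a virtual screen are experimentally tested?
import numpy as np

rng = np.random.default_rng(0)
true_hit_rate = 0.2        # assumed "true" hit rate for the top of the screen
n_campaigns = 10_000       # number of simulated selection-and-testing campaigns

for n_tested in (10, 30, 100, 300, 1000):
    # each simulated campaign tests n_tested molecules, each a hit with
    # probability true_hit_rate; record the observed hit rate
    observed = rng.binomial(n_tested, true_hit_rate, size=n_campaigns) / n_tested
    lo, hi = np.percentile(observed, [2.5, 97.5])
    print(f"n = {n_tested:4d}: 95% of observed hit rates fall in "
          f"{lo:.0%}-{hi:.0%} (true rate {true_hit_rate:.0%})")
```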
There are other valuable lessons in the paper. For instance, hit rates fell monotonically as the docking scores worsened, which tells you that, whatever their merits, better docking scores are more reliable than worse ones. This allowed the authors to surmise a loosely predictive relationship between docking score and a qualitative affinity range (high, medium, low). But the exercise also points to the need for improving scoring functions, especially in the medium and low bins.
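That monotonic trend is also the kind of thing anyone can check once the experimental follow-up is in hand; a hypothetical sketch with pandas (the file and column names "dock_score" and "is_hit" are made up for illustration, not the schema of any published dataset) might look like this:

```python
# Hypothetical sketch: bin tested molecules by docking score and compute the
# hit rate per bin. The file and column names are illustrative placeholders.
import pandas as pd

df = pd.read_csv("tested_molecules.csv")          # one row per tested molecule
df["score_bin"] = pd.qcut(df["dock_score"], q=5)  # quintiles of docking score

hit_rate_by_bin = (
    df.groupby("score_bin", observed=True)["is_hit"]
      .agg(hit_rate="mean", n_tested="size")
)
print(hit_rate_by_bin)  # hit rate should fall monotonically as scores worsen
```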
The general conclusions of the paper seem clear. Larger libraries do seem to give better hit rates and higher potencies. Novel chemotypes are discovered, many crystallographically verified. The number of new inhibitors scales almost linearly with the number of molecules tested; testing 29-fold more molecules led to a 59-fold bigger inhibitor pool. Testing smaller numbers leads to more variability and errors. And all this should be put in perspective with the standard caveats: nobody can test or crystallize more than a fraction of molecules from a large library, even with ample resources, nobody can measure full dose-response IC50 curves for every single inhibitor, and so on. But the central conclusion of “larger means better” seems to be clear in this paper, and that provides a welcome message for both computational chemists and chemical library vendors.
While this kind of statistical study is undoubtedly valuable, wouldn’t it be even more valuable to be able to do a meta-analysis, something that’s standard in areas like clinical trials but hardly seen in computational chemistry? Glad you asked! Because the second paper, which came out last month, does exactly that. The Shoichet-Irwin lab has been at the forefront of virtual screening for at least two decades, and they have decided to put all their accumulated results online for our benefit. To this end, they have developed a website (conveniently abbreviated as “LSD”, perhaps with the implication that the wealth of data it provides might put some modelers into a dreamlike state) providing access to recent large library campaigns, including poses, scores, and in vitro results for campaigns against 11 targets, with 6.3 billion molecules docked and 3729 compounds experimentally tested. As far as I know, this is the largest cache of modeling results that has been made publicly available. The data is conveniently organized by target and paper (since sometimes multiple papers refer to the same target, or a single paper refers to multiple targets).
And of course, every time there’s such a deluge of data these days, there’s a machine learning technique ready to be applied to it, and that’s what the authors do as well. They use a method called Chemprop, which predicts chemical properties, on three targets (AmpC, 5HT2A and sigma-2) and experiment with training set sizes of 1000, 10,000, 100,000 and a million compounds using a variety of sampling strategies ranging from random to more focused. There’s a lot of statistics in the results that’s worth a look if you are a cheminformatics geek, but some major conclusions emerge.
First, increasing the training set size unsurprisingly improved model performance across all techniques (noting that even a million molecules constitute only 0.07% of the total library size). More interestingly, they trained a technique called Retrieval Augmented Docking to predict the DOCK score at much reduced computational cost, and found that the best-performing models were comparable to the full DOCK scoring function; this is an exercise that is of abiding interest, since ML techniques are often much cheaper and faster than physics-based techniques.
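To make the surrogate idea concrete, here is a generic sketch of training a cheap model to predict docking scores from molecular fingerprints (my own illustration using RDKit and scikit-learn with made-up toy data; the paper itself uses Chemprop-style message passing networks and its retrieval-augmented scheme, not this):

```python
# Generic sketch of a docking-score surrogate: fit a cheap ML model on a docked
# subset of the library, then use it to triage undocked molecules far faster
# than running the physics-based docking itself. Toy data, illustration only.
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestRegressor

def featurize(smiles_list, radius=2, n_bits=2048):
    """Morgan fingerprints as a (n_molecules, n_bits) numpy array."""
    rows = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
        arr = np.zeros((n_bits,), dtype=np.float32)
        DataStructs.ConvertToNumpyArray(fp, arr)
        rows.append(arr)
    return np.vstack(rows)

# A docked "training" subset (hypothetical toy values standing in for the
# millions of (SMILES, DOCK score) pairs one would actually train on).
train_smiles = ["CCO", "c1ccccc1O", "CC(=O)Nc1ccc(O)cc1", "CCN(CC)CC"]
train_scores = [-18.3, -25.1, -31.7, -20.4]   # made-up DOCK scores

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(featurize(train_smiles), train_scores)

# The trained surrogate can now rank molecules that were never docked.
print(model.predict(featurize(["CC(C)Cc1ccc(cc1)C(C)C(=O)O"])))
```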
Over the course of just the last decade or so, two developments have revolutionized virtual screening. One is the availability of make-on-demand molecules, from Enamine in particular but also from other vendors. If you had said ten years ago that you would routinely be able to virtually screen billions of purchasable and synthesizable compounds in a matter of hours, you would have been seen as a dreamer. But the dream is now reality thanks to the dedicated folks at places like Enamine. The second huge advance is the effort by scientists - and the Shoichet-Irwin lab must rank at the very top of these efforts - to bring a variety of tools and careful investigations to bear on the problem and make large-scale virtual screening a reality. From head-to-head comparisons of VS and HTS to teasing apart the impact of artifacts like aggregators on screening assays to performing careful statistical analysis of the results, the group has deconvoluted pretty much every important aspect of the science and the technology. Nowhere are the results of that meticulous work clearer than in this pair of papers that quantify the effects of scale and make all the data available, setting up the technique for our brave new world of AI.
And there’s an obvious message in there for AI practitioners: without the decades-long hard work leading to results like these, not to mention the countless analysis and curation efforts that went into building the PDB and other databases, there would be nothing but empty air for those techniques to bite on. AI is sometimes seen as a fresh new spring gushing from a mountainside, but it’s crucial to recognize - both scientifically and materially - the wealth of processes and data hidden inside the mountain that make it possible and useful.
There’s often a tendency to downplay the impact of computation in medicinal chemistry. But that is precisely why it is even more important to highlight dogged efforts like this which demonstrate that good work takes time and teamwork, and that its impact is slow but inevitable. It’s Very Good Science, in the classic definition of the term, and we as a community are all better off for it.