ACTIVE LEARNING AND ITS APPLICATIONS IN DRUG DISCOVERY
Active learning is a machine learning paradigm in which the algorithm interactively queries a user (or another information source) to label new data points with the desired outputs. It is a key method for building a strong model while keeping the amount of labeled data to a minimum, by selecting only the most informative points for annotation.
The technique is especially valuable in scenarios where labeling is challenging or time-consuming. Unlike passive learning, in which a human oracle labels a large volume of data up front at a significant cost in person-hours, active learning optimizes the labeling process.
In an effective active learning system, the algorithm should be capable of selecting the most informative data points based on a specific metric, which are then passed to a human labeler. These labeled data points are progressively added to the training set. The process is illustrated in the diagram below.
Need for Active Learning:
“Not all data points are equally valuable for training a model”
Consider the data points shown in Fig-2a: clusters separated by a decision boundary. With tens of thousands of unlabeled data points, manually labeling all of them would be impractical and costly. Labeling a random subset for training typically yields a poor model (Fig-2b): the decision boundary learned from random sampling leads to lower accuracy and other subpar performance metrics. Selectively labeling the data points near the decision boundary is far more effective, and this selective approach to sampling is the foundation of active learning (Fig-2c).
How does Active Learning work?
In a typical active learning process, the algorithm identifies the most valuable data points, which might include edge cases, and asks for them to be labeled by a human. These newly labeled data points are then incorporated into the training set, and the model undergoes retraining. This process of selecting data points is commonly referred to as "querying".
Process of training an active learning model:
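As a concrete illustration, here is a minimal sketch of a pool-based training loop in Python using scikit-learn. The classifier choice, the budget and batch size, and the `oracle` callable standing in for the human labeler are all illustrative assumptions, not a prescribed implementation.

```python
# Minimal pool-based active learning loop (sketch, not a full implementation).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def query_most_uncertain(model, X_pool, batch_size):
    """Pick the pool points whose top-class probability is lowest."""
    proba = model.predict_proba(X_pool)
    uncertainty = 1.0 - proba.max(axis=1)          # least-confident score
    return np.argsort(uncertainty)[-batch_size:]   # most uncertain indices

def active_learning_loop(X_labeled, y_labeled, X_pool, oracle,
                         budget=100, batch_size=10):
    model = RandomForestClassifier(n_estimators=200)
    while budget > 0 and len(X_pool) > 0:
        model.fit(X_labeled, y_labeled)            # retrain on all labels so far
        idx = query_most_uncertain(model, X_pool, min(batch_size, budget))
        y_new = oracle(X_pool[idx])                # human labeler / experiment
        X_labeled = np.vstack([X_labeled, X_pool[idx]])
        y_labeled = np.concatenate([y_labeled, y_new])
        X_pool = np.delete(X_pool, idx, axis=0)    # remove queried points
        budget -= len(idx)
    return model
```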
When to stop Active Learning:
When implementing active learning, it is crucial to determine the optimal stopping point, ideally before the labeling budget is exhausted and once model performance is near its peak. This requires a stopping criterion that monitors specific metrics against predefined conditions.
The stopping condition can be one of the following (a simple plateau check is sketched below):
- The labeling budget is exhausted.
- Model performance on a held-out validation set plateaus across successive query rounds.
- The model's uncertainty on the remaining unlabeled pool falls below a threshold.
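For instance, a plateau-based check might look like this; the `patience` and `tol` values are arbitrary illustrative choices.

```python
# Stop once validation accuracy has not improved by more than `tol`
# over the last `patience` query rounds (sketch).
def should_stop(history, patience=3, tol=0.005):
    """history: list of validation accuracies, one per query round."""
    if len(history) <= patience:
        return False
    best_recent = max(history[-patience:])
    best_before = max(history[:-patience])
    return best_recent - best_before < tol
```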
Active Learning strategies:
Selective/Stream-Based Sampling:
Stream-based selective sampling involves the continuous input of unlabeled data into an active learning system. The system then uses a predefined learning strategy to determine whether to send the data to a human oracle for labeling. This approach is suitable for situations where the model is operational and the data sources or distributions change over time.
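A minimal sketch of this decision rule, assuming an already-trained scikit-learn-style classifier and an illustrative confidence threshold:

```python
# Stream-based selective sampling (sketch): each incoming point is either
# sent to the oracle (if the current model is unsure) or discarded.
def stream_sampling(model, stream, oracle, threshold=0.6):
    X_new, y_new = [], []
    for x in stream:                         # unlabeled points arrive one at a time
        confidence = model.predict_proba([x]).max()
        if confidence < threshold:           # model is unsure: worth labeling
            X_new.append(x)
            y_new.append(oracle(x))
    return X_new, y_new                      # fed back into retraining
```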
Pool-Based Sampling: This method involves establishing a pool of unlabeled data, from which the model identifies the most informative examples to be labeled either by an expert or a human annotator. Pool-based sampling encompasses various techniques such as uncertainty sampling, query-by-committee, and density-weighted sampling.
Synthesis Methods:
The core idea is that examples near the classification boundary are the least clear-cut, so labeling them yields the most insight for the learning algorithm. Such points are found either by synthesizing new examples close to the decision boundary or by picking existing ones whose Euclidean distance to the boundary is smallest, as sketched below.
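One way to pick existing points near the boundary, shown here with a linear SVM purely for illustration, is to rank pool points by the absolute value of the decision function, which is proportional to their Euclidean distance from the separating hyperplane:

```python
# Select the k pool points closest to an SVM decision boundary (sketch).
import numpy as np
from sklearn.svm import LinearSVC

def closest_to_boundary(svm: LinearSVC, X_pool, k=10):
    # |w.x + b| is proportional to Euclidean distance from the hyperplane
    dist = np.abs(svm.decision_function(X_pool))
    return np.argsort(dist)[:k]   # indices of the k borderline points
```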
Active learning strategies for subsampling:
Choosing an effective query strategy is crucial, since it determines which data points are informative enough to label and train on. The active learning pipeline's efficiency hinges on how swiftly the query strategy can identify the most impactful samples from the unlabeled dataset. A few strategies used for active learning subsampling are the following:
Uncertainty Based Sampling:
Uncertainty sampling is a query strategy designed to select the samples that will most effectively reduce a model's uncertainty. This uncertainty is usually quantified with metrics such as entropy, least confidence, or margin-based uncertainty. By focusing on samples with high uncertainty, the model can significantly improve its performance once these samples are labeled.
The heatmaps above show query behavior for these uncertainty measures in a three-label classification task. The corners of the simplex indicate high probability for one label, with the opposite edge splitting probability between the other two; dark red areas highlight the most informative query regions. All three measures can be computed directly from a model's predicted class probabilities, as sketched below.
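A self-contained sketch of the three scores; the toy probability matrix is illustrative, and in every case a higher score marks a more informative query:

```python
# Common uncertainty scores computed from predicted class probabilities.
import numpy as np

def least_confident(proba):
    return 1.0 - proba.max(axis=1)

def margin(proba):
    part = np.sort(proba, axis=1)
    return -(part[:, -1] - part[:, -2])      # small margin -> high score

def entropy(proba):
    return -np.sum(proba * np.log(proba + 1e-12), axis=1)

# toy probabilities for two samples over three labels
proba = np.array([[0.40, 0.35, 0.25],        # near the simplex centre: uncertain
                  [0.90, 0.05, 0.05]])       # near a corner: confident
for score in (least_confident, margin, entropy):
    print(score.__name__, score(proba))      # sample 0 scores higher on all three
```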
Diversity Sampling:
Diversity-weighted methods prioritize examples for labeling based on how much they differ from the data already in the training set. This involves ranking the pool of unlabeled examples by a diversity measure, such as their dissimilarity to previously selected examples, sometimes combined with the uncertainty of the model's predictions. A simple greedy variant is sketched below.
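One common, easy-to-implement diversity measure is greedy farthest-point (k-center) selection, sketched here under the assumption that examples are represented as feature vectors:

```python
# Greedy farthest-point selection: repeatedly add the pool point farthest
# from everything selected or labeled so far (sketch).
import numpy as np
from scipy.spatial.distance import cdist

def farthest_point_sampling(X_pool, X_labeled, k=10):
    selected = []
    # distance of each pool point to its nearest already-labeled point
    min_dist = cdist(X_pool, X_labeled).min(axis=1)
    for _ in range(k):
        i = int(np.argmax(min_dist))          # most "novel" point
        selected.append(i)
        # update distances to account for the newly selected point
        min_dist = np.minimum(min_dist, cdist(X_pool, X_pool[i:i+1]).ravel())
    return selected
```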
Expected Error Reduction:
Expected error reduction is an active learning method that selects examples for labeling based on their potential to decrease the model's prediction error. The goal of this approach is to identify examples that, when labeled, will most significantly enhance the model's performance on new, unseen data.
In this method, unlabeled examples are ranked by estimating how much each example's labeling would reduce the model's prediction error. This estimation can be based on various criteria, such as the distance to the decision boundary, the margin between predicted labels, or the expected reduction in entropy.
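A brute-force sketch of this idea, assuming a scikit-learn-style classifier with `predict_proba` and using total pool entropy as a proxy for future error: it retrains once per candidate-label pair, so it is only practical on small candidate sets.

```python
# Expected error reduction (sketch): query the candidate whose labeling is
# expected to leave the model least uncertain about the rest of the pool.
import numpy as np
from sklearn.base import clone

def expected_error_reduction(model, X_lab, y_lab, X_pool, candidates):
    scores = []
    for i in candidates:
        p = model.predict_proba(X_pool[i:i+1])[0]   # current belief p(y|x)
        expected = 0.0
        for y, p_y in zip(model.classes_, p):
            # retrain as if candidate i had label y
            m = clone(model).fit(np.vstack([X_lab, X_pool[i:i+1]]),
                                 np.append(y_lab, y))
            proba = m.predict_proba(X_pool)
            # total pool entropy, weighted by how likely label y is
            expected += p_y * -np.sum(proba * np.log(proba + 1e-12))
        scores.append(expected)
    return candidates[int(np.argmin(scores))]       # lowest expected entropy
```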
Expected Model Change:
In this strategy, the algorithm selects the data instance that would induce the greatest change in the model. This change can be quantified by the length of the training gradient that the instance would contribute during the stochastic gradient descent (SGD) process.
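For binary logistic regression the per-example loss gradient is (p − y)·x, so the expected gradient length has a closed form; the sketch below assumes that model family purely for illustration.

```python
# Expected gradient length for binary logistic regression (sketch).
import numpy as np

def expected_gradient_length(weights, X_pool):
    p = 1.0 / (1.0 + np.exp(-X_pool @ weights))   # p(y=1 | x) per pool point
    norm_x = np.linalg.norm(X_pool, axis=1)
    g1 = (1.0 - p) * norm_x                       # ||(p - 1) x|| if label were 1
    g0 = p * norm_x                               # ||(p - 0) x|| if label were 0
    return p * g1 + (1.0 - p) * g0                # expectation under the model
# Candidates with the largest expected gradient length are queried first.
```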
Query-by-committee:
This query strategy involves training multiple models on different subsets of the labeled dataset and selecting samples based on the level of disagreement among these models.
This approach is particularly effective when the model tends to make errors on certain samples or classes. By choosing samples where the committee of models disagrees, the model can learn to better recognize these sample patterns, thereby enhancing its performance in those specific classes.
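A minimal sketch of query-by-committee using a bagged committee of decision trees and vote entropy as the disagreement measure; the committee size and base learner are illustrative choices:

```python
# Query-by-committee (sketch): bootstrap a committee, let it vote on every
# pool point, and query the point with the highest vote entropy.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def qbc_query(X_lab, y_lab, X_pool, n_members=5, seed=0):
    rng = np.random.default_rng(seed)
    votes = []
    for _ in range(n_members):
        idx = rng.integers(0, len(X_lab), size=len(X_lab))   # bootstrap resample
        member = DecisionTreeClassifier().fit(X_lab[idx], y_lab[idx])
        votes.append(member.predict(X_pool))
    votes = np.stack(votes)                                  # (members, pool)

    def vote_entropy(col):
        _, counts = np.unique(col, return_counts=True)
        f = counts / len(col)
        return -np.sum(f * np.log(f))

    disagreement = np.apply_along_axis(vote_entropy, 0, votes)
    return int(np.argmax(disagreement))                      # most contested point
```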
Applications of Active Learning in Drug Discovery:
Predictive Modeling:
Active Learning iteratively selects the most informative compounds for model training, improving the predictive power of models for various properties such as activity, toxicity, and pharmacokinetics, which reduces late-stage failures.
Ex: It is useful for reducing prediction uncertainty by actively selecting the compounds the model is most uncertain about in tasks such as cardiotoxicity and hepatotoxicity prediction, leading to more robust, reliable, and accurate models. A sketch of this compound-selection step follows.
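As a hedged illustration of that workflow, the sketch below featurizes compounds with RDKit Morgan fingerprints and flags the SMILES a trained classifier is least confident about; the model, the SMILES list, and the fingerprint settings are placeholder assumptions.

```python
# Flag the compounds a toxicity model is least sure about (sketch).
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def featurize(smiles_list, n_bits=2048):
    X = np.zeros((len(smiles_list), n_bits))
    for i, s in enumerate(smiles_list):
        fp = AllChem.GetMorganFingerprintAsBitVect(
            Chem.MolFromSmiles(s), 2, nBits=n_bits)
        DataStructs.ConvertToNumpyArray(fp, X[i])
    return X

def most_uncertain_compounds(model, smiles_list, k=10):
    proba = model.predict_proba(featurize(smiles_list))
    uncertainty = 1.0 - proba.max(axis=1)     # least-confident score per compound
    top = np.argsort(uncertainty)[-k:]
    return [smiles_list[i] for i in top]      # send these for assay / annotation
```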
Lead Optimization:
Active learning aids in selecting compounds that maximize information gain about the structure–activity relationship, helping to refine and improve lead compounds efficiently.
It also helps in optimizing multiple properties (e.g., potency, selectivity, solubility, and permeability) simultaneously by focusing on compounds that improve these properties.
Ex: In molecule optimization, when properties such as solubility and Caco-2 permeability must be satisfied, active learning samples the compounds that maximize information gain, which in turn guides the generation of molecules that satisfy the given QSAR property constraints.
Chemical Space Exploration:
Active learning can guide the exploration of chemical space to discover novel compounds with desired biological activities, focusing on areas that are most likely to yield fruitful results.
By selecting diverse compounds, active learning ensures a broad exploration of chemical space, which is crucial for identifying unique scaffolds and avoiding redundancy.
Ex: It helps in building docking surrogate models from far fewer molecules while still capturing the chemical space of huge libraries such as Enamine, ZINC, and ChemSpace, so the surrogate can accurately predict binding affinity values for any given set of molecules.
Clinical Trial Design:
Active Learning can optimize patient selection for clinical trials by identifying the most informative patient subgroups, improving trial efficiency and success rates.
Ex: It supports adaptive trial designs by continuously updating the trial based on incoming data to focus on the most promising treatment arms.
Conclusion:
Active learning, a machine learning technique attracting significant research interest, particularly in natural language processing and computer vision, allows algorithms to iteratively select the most informative data points from a large unlabeled pool for human annotation. This targeted approach reduces annotation costs and improves model performance compared to traditional methods. However, implementing active learning can be challenging, as readily available off-the-shelf solutions are scarce. For teams with the necessary expertise, developing a custom system based on existing frameworks is a viable option. Alternatively, commercially available tools offer a faster route to implementation, but often require significant customization for specific use cases.
Author: Divyavani Darapuneni