ACTIVE LEARNING AND ITS APPLICATIONS IN DRUG DISCOVERY

Active learning is a machine learning paradigm in which the algorithm interactively queries a user (or another information source) to label new data points with the desired outputs. It is a key method for building a strong model while keeping the amount of labeled data to a minimum, by selecting only the most informative points for labeling.

This technique is especially valuable in scenarios where labeling is difficult, expensive, or time-consuming. Unlike passive learning, in which a human oracle labels a large volume of data up front at significant cost in man-hours, active learning optimizes the labeling process itself.

In an effective active learning system, the algorithm should be capable of selecting the most informative data points based on a specific metric, which are then passed to a human labeler. These labeled data points are progressively added to the training set. The process is illustrated in the diagram below.

Need for Active Learning:

“Not all data points are equally valuable for training a model”

Consider the data points shown in Fig-2a: clusters separated by a decision boundary. With tens of thousands of unlabeled data points, manually labeling all of them would be impractical and costly. Randomly labeling a subset for training tends to produce a poor model (Fig-2b): the decision boundary learned from random sampling often yields lower accuracy and other subpar performance metrics. Selectively labeling data points near the decision boundary is far more effective. This selective approach to sampling is the foundation of active learning (Fig-2c).

  • Minimizing Labeling Cost and Efficient use of resources: Labeling data, particularly when it requires expert knowledge as in drug discovery, can be both costly and time-intensive. Active learning addresses this by selectively identifying and requesting labels for only the most informative samples, significantly reducing the need for extensive labeled datasets and making the process resource-efficient.
  • Higher accuracy and Enhanced learning from limited data: Active learning prioritizes data points that are most likely to enhance the model's performance, enabling it to achieve higher accuracy with fewer labeled instances than traditional methods. In situations where labeled data is scarce, active learning maximizes what the model can learn from the data that is available.
  • Addressing Uncertainty and Managing Imbalanced Datasets: Active learning algorithms give preference to samples with the most uncertainty. This approach directly focuses on areas where the model requires improvement, resulting in a more accurate and robust model. Additionally, active learning aids in crafting a well-balanced and representative training set by actively choosing diverse and informative examples. This particular capability is important for managing imbalanced datasets effectively.
  • Managing issues caused by Edge scenarios: Edge scenarios range from harmless, like a classification model confusing sloths with pastries in the example below, to dangerous, such as unexpected adverse reactions observed in a small subset of patients during clinical trials in drug discovery. Even rare edge-case failures can therefore lead to highly negative outcomes with significant financial repercussions.

  • Robustness to Noisy data points: Active learning algorithms train on the most informative samples, ensuring they capture the essence of the entire dataset. As a result, models trained this way perform effectively on outliers as well as on the main body of the data.

How does Active Learning work?

In a typical active learning process, the algorithm identifies the most valuable data points, which might include edge cases, and asks for them to be labeled by a human. These newly labeled data points are then incorporated into the training set, and the model undergoes retraining. This process of selecting data points is commonly referred to as "querying".

Process of training an active learning model:

  • The active learning algorithm evaluates the uncertainty of the current model or identifies areas where confidence is lacking.
  • Using this evaluation, the algorithm selects (or, in query-synthesis settings, generates) the data points that are anticipated to offer the most useful insights for enhancing the model.
  • These data points are then given to human annotators for labeling, since their labels are not part of the original dataset.
  • Following the labeling process, the freshly labeled data points are included in the training set, and the model undergoes retraining to integrate this updated information.
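The loop above can be sketched in a few lines. The following is a minimal illustration, not a production pipeline: it assumes a pool-based setting, uses a toy nearest-centroid classifier as a hypothetical stand-in for the real model, and replaces the human annotator with an `oracle` function that simply looks up a known label.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy pool: two Gaussian blobs (hypothetical stand-in for unlabeled compounds).
pool = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
true_labels = np.array([0] * 100 + [1] * 100)

def oracle(idx):
    """Stand-in for the human annotator: returns the true label."""
    return true_labels[idx]

def predict_proba(X, centroids):
    """Tiny nearest-centroid model: softmax over negative distances."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    e = np.exp(-d)
    return e / e.sum(axis=1, keepdims=True)

# Seed with one labeled point per class, then query iteratively.
labeled_idx = [0, 100]
for _ in range(10):
    labels = np.array([oracle(i) for i in labeled_idx])
    X_lab = pool[labeled_idx]
    centroids = np.vstack([X_lab[labels == c].mean(axis=0) for c in (0, 1)])
    proba = predict_proba(pool, centroids)
    uncertainty = 1.0 - proba.max(axis=1)      # least-confident score
    uncertainty[labeled_idx] = -np.inf         # never re-query labeled points
    labeled_idx.append(int(np.argmax(uncertainty)))  # query the most uncertain

print(len(labeled_idx))  # 12 labels acquired in total
```

Each pass retrains on the labeled set, scores the whole pool, and asks the oracle for the single most uncertain point, exactly the evaluate-select-label-retrain cycle described above.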

When to stop Active Learning:

When implementing Active Learning, it's crucial to determine the optimal stopping point before exhausting the budget, ideally when model performance is near its peak. This requires a stopping criterion that monitors specific metrics against predefined conditions.

The stopping condition can be one of the following:

  • Monitoring performance on a held-out development set; this is intuitive, but results can be unstable if the set is too small.
  • Comparing metrics to a predefined threshold, such as stopping when uncertainty or error is below this threshold.
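The threshold-based criterion can be made concrete with a small helper. This is a sketch under stated assumptions: the metric name and values are hypothetical (e.g., mean pool uncertainty recorded after each query round), and the `patience` window guards against stopping on a single noisy round.

```python
def should_stop(metric_history, threshold=0.02, patience=3):
    """Stop when the monitored metric (e.g. mean uncertainty over the
    unlabeled pool) has stayed below `threshold` for `patience`
    consecutive query rounds."""
    if len(metric_history) < patience:
        return False
    return all(m < threshold for m in metric_history[-patience:])

# Hypothetical per-round mean-uncertainty values:
history = [0.31, 0.18, 0.09, 0.015, 0.012, 0.011]
print(should_stop(history))  # True: the last three rounds are all below 0.02
```

The same shape works for any monitored metric; only the threshold direction changes (e.g. stop when validation accuracy stays *above* a target).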

Active Learning strategies:

Selective/Stream-Based Sampling:

Stream-based selective sampling involves the continuous input of unlabeled data into an active learning system. The system then uses a predefined learning strategy to determine whether to send the data to a human oracle for labeling. This approach is suitable for situations where the model is operational and the data sources or distributions change over time.
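A stream-based selector can be as simple as a per-point threshold test. The sketch below assumes a hypothetical 1-D stream and a toy uncertainty function that peaks at the decision boundary (here, zero); in practice the uncertainty would come from the deployed model's predictions.

```python
import numpy as np

def stream_filter(stream, uncertainty_fn, threshold=0.4):
    """Stream-based selective sampling: for each incoming point, query the
    human oracle only if the model's uncertainty exceeds a threshold."""
    to_label = []
    for i, x in enumerate(stream):
        if uncertainty_fn(x) > threshold:
            to_label.append(i)   # send to the oracle for labeling
        # else: discard, or accept the model's own prediction
    return to_label

# Hypothetical 1-D stream and a toy uncertainty function peaked at 0:
stream = [3.0, 0.1, -2.5, 0.05, 1.0]
uncertainty = lambda x: float(np.exp(-abs(x)))  # high near the boundary at 0
print(stream_filter(stream, uncertainty))  # [1, 3]: only near-boundary points
```

Unlike pool-based sampling, each point is judged once as it arrives, which suits drifting data sources where no fixed pool exists.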

Pool-Based Sampling: This method involves establishing a pool of unlabeled data, from which the model identifies the most informative examples to be labeled either by an expert or a human annotator. Pool-based sampling encompasses various techniques such as uncertainty sampling, query-by-committee, and density-weighted sampling.

Synthesis Methods:

The core idea is that examples near the classification boundary are the least clear-cut, so labeling them yields the most insight for the learning algorithm. Such near-boundary points are obtained either by generating new ones (query synthesis) or by picking existing ones, for example by ranking pool points by their Euclidean distance to the decision boundary.
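For a linear decision boundary this ranking has a closed form: the Euclidean distance of a point x to the hyperplane w·x + b = 0 is |w·x + b| / ‖w‖. The sketch below uses a hypothetical 2-D pool and boundary to illustrate.

```python
import numpy as np

def closest_to_boundary(X_pool, w, b, k=2):
    """Rank pool points by Euclidean distance to the hyperplane w.x + b = 0
    and return the indices of the k closest (most ambiguous) points."""
    dist = np.abs(X_pool @ w + b) / np.linalg.norm(w)
    return np.argsort(dist)[:k]

# Hypothetical 2-D pool and a linear decision boundary x0 + x1 = 0.
X_pool = np.array([[3.0, 3.0], [0.1, -0.2], [-4.0, -4.0], [0.05, 0.0]])
w, b = np.array([1.0, 1.0]), 0.0
print(closest_to_boundary(X_pool, w, b))  # [3 1]: the points nearest the boundary
```

The same scoring extends to nonlinear models by substituting a margin or probability-based proxy for geometric distance.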

Active learning strategies for subsampling:

Choosing an effective query strategy is crucial, since it determines which data points are informative enough to label and train on. The active learning pipeline's efficiency hinges on how swiftly the query strategy can identify the most impactful samples in the unlabeled dataset. A few strategies used in active learning subsampling are the following:

Uncertainty-Based Sampling:

Uncertainty sampling is a query strategy designed to select samples that will most effectively reduce a model's uncertainty. This uncertainty is usually quantified using metrics such as entropy, least confidence, and margin-based uncertainty. By focusing on samples with high uncertainty, the model can significantly improve its performance once these samples are labeled.
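The three measures can be written directly from a model's predicted class probabilities. The probabilities below are hypothetical softmax outputs chosen to show that all three scores agree on the clearest cases, though they can rank borderline samples differently.

```python
import numpy as np

def least_confident(p):
    """1 - max class probability: high when no class dominates."""
    return 1.0 - p.max(axis=1)

def margin(p):
    """Negated gap between the top-two probabilities: a small gap means
    high uncertainty, so negating makes larger scores more informative."""
    part = np.sort(p, axis=1)
    return -(part[:, -1] - part[:, -2])

def entropy(p):
    """Shannon entropy of the predicted class distribution."""
    return -(p * np.log(p + 1e-12)).sum(axis=1)

# Hypothetical softmax outputs for three pool samples, three classes:
proba = np.array([[0.90, 0.05, 0.05],   # confident
                  [0.40, 0.35, 0.25],   # uncertain
                  [0.34, 0.33, 0.33]])  # most uncertain
for score in (least_confident, margin, entropy):
    print(score.__name__, int(np.argmax(score(proba))))  # all select sample 2
```

Entropy considers the full distribution, margin only the top two classes, and least-confident only the top one, which is why their heatmaps over the probability simplex (shown below) differ.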

Heatmaps above show query behavior for uncertainty measures in a three-label classification. Simplex corners indicate high probability for one label, with the opposite edge showing probabilities for the other two. Dark red areas highlight the most informative query regions.

Diversity Sampling:

Diversity-weighted methods prioritize selecting examples for labeling based on their diversity within the current training set. This involves ranking the pool of unlabeled examples using a diversity measure, such as the dissimilarity between examples or the uncertainty in the model’s predictions.
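One common diversity measure is greedy farthest-point (max-min) selection. The sketch below uses hypothetical 1-D "compound descriptors" arranged in three tight clusters to show that the selected subset covers all clusters instead of redundantly sampling one.

```python
import numpy as np

def diverse_subset(X_pool, k, start=0):
    """Greedy farthest-point (max-min) sampling: repeatedly add the pool
    point whose distance to its nearest already-selected point is largest."""
    selected = [start]
    min_dist = np.linalg.norm(X_pool - X_pool[start], axis=1)
    while len(selected) < k:
        nxt = int(np.argmax(min_dist))      # farthest from everything chosen
        selected.append(nxt)
        min_dist = np.minimum(min_dist,
                              np.linalg.norm(X_pool - X_pool[nxt], axis=1))
    return selected

# Hypothetical 1-D descriptors: three tight clusters near 0, 5, and 10.
X = np.array([[0.0], [0.1], [5.0], [5.1], [10.0], [10.1]])
print(diverse_subset(X, 3))  # [0, 5, 2]: one point drawn from each cluster
```

In practice the Euclidean distance here would be replaced by a chemically meaningful dissimilarity, such as Tanimoto distance on molecular fingerprints.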

Expected Error Reduction:

Expected error reduction is an active learning method that selects examples for labeling based on their potential to decrease the model's prediction error. The goal of this approach is to identify examples that, when labeled, will most significantly enhance the model's performance on new, unseen data.

In this method, unlabeled examples are ranked by estimating how much each example's labeling would reduce the model's prediction error. This estimation can be based on various criteria, such as the distance to the decision boundary, the margin between predicted labels, or the expected reduction in entropy.
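The estimation loop can be sketched directly: for each candidate, marginalize over its possible labels (weighted by the current model's probabilities), retrain with the hypothetical label, and measure the remaining pool's total entropy as a proxy for future error. The classifier here is a toy centroid model and the data are hypothetical; real implementations amortize the retraining cost.

```python
import numpy as np

def centroid_proba(X, X_lab, y_lab):
    """Toy probabilistic classifier: softmax over negative distances to
    per-class centroids of the labeled set (hypothetical stand-in model)."""
    cents = np.vstack([X_lab[y_lab == c].mean(axis=0) for c in (0, 1)])
    d = np.linalg.norm(X[:, None, :] - cents[None, :, :], axis=2)
    e = np.exp(-d)
    return e / e.sum(axis=1, keepdims=True)

def expected_error_reduction(X_pool, X_lab, y_lab):
    """Score each pool point by the expected total entropy of the rest of
    the pool after hypothetically labeling it; lower is better."""
    p_now = centroid_proba(X_pool, X_lab, y_lab)
    scores = []
    for i, x in enumerate(X_pool):
        exp_ent = 0.0
        for y in (0, 1):                       # marginalize the unknown label
            Xp = np.vstack([X_lab, x])
            yp = np.append(y_lab, y)
            rest = np.delete(X_pool, i, axis=0)
            p = centroid_proba(rest, Xp, yp)
            exp_ent += p_now[i, y] * (-(p * np.log(p + 1e-12)).sum())
        scores.append(exp_ent)
    return int(np.argmin(scores))              # most error-reducing query

# Hypothetical seed set and pool: two labeled points and three candidates.
X_lab = np.array([[-2.0, 0.0], [2.0, 0.0]])
y_lab = np.array([0, 1])
X_pool = np.array([[-2.1, 0.0], [0.0, 0.0], [2.1, 0.0]])
print(expected_error_reduction(X_pool, X_lab, y_lab))  # 1: the midpoint
```

Note the contrast with uncertainty sampling: the score is computed on the *other* pool points after the hypothetical retrain, not on the candidate itself.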

Expected Model Change:

In this strategy, the algorithm selects the data instance that induces the greatest change in the model. This change can be quantified by measuring the gradient associated with that data instance during the stochastic gradient descent (SGD) process.
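A common instantiation is the expected gradient length for a logistic-regression model, where the per-example gradient is (sigma(w.x) - y)x and the unknown label y is marginalized under the current model. The pool and weights below are hypothetical, chosen so a confident far point, an uncertain point, and a near-zero point can be compared.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def expected_gradient_length(X_pool, w):
    """Expected-model-change score for logistic regression:
    E_y[ ||grad_w loss(x, y)|| ] with grad_w loss = (sigmoid(w.x) - y) * x,
    weighting each possible label y by the model's current probability."""
    p = sigmoid(X_pool @ w)              # P(y=1 | x) under the current model
    norms = np.linalg.norm(X_pool, axis=1)
    # y=1 (prob p) contributes (1-p)*||x||; y=0 (prob 1-p) contributes p*||x||
    return (p * (1.0 - p) + (1.0 - p) * p) * norms   # = 2*p*(1-p)*||x||

# Hypothetical pool: a confidently classified far point, an uncertain point,
# and a point with near-zero magnitude (tiny gradient no matter the label).
X_pool = np.array([[8.0, 0.0], [0.2, 0.1], [0.01, 0.0]])
w = np.array([1.0, 1.0])
scores = expected_gradient_length(X_pool, w)
print(int(np.argmax(scores)))  # 1: the uncertain point would change the model most
```

The resulting score favors points that are both uncertain (p near 0.5) and high-magnitude, since either factor alone produces only a small expected update.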

Query-by-committee:

This query strategy involves training multiple models on different subsets of the labeled dataset and selecting samples based on the level of disagreement among these models.

This approach is particularly effective when the model tends to make errors on certain samples or classes. By choosing samples where the committee of models disagrees, the model can learn to better recognize these sample patterns, thereby enhancing its performance in those specific classes.
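Disagreement is often quantified with vote entropy: the entropy of the distribution of class votes across the committee. The committee predictions below are hypothetical hard labels from a four-model ensemble.

```python
import numpy as np

def vote_entropy(committee_votes, n_classes):
    """Per-sample disagreement: entropy of the committee's vote distribution.
    `committee_votes` has shape (n_models, n_samples) with integer labels."""
    n_models, n_samples = committee_votes.shape
    scores = np.empty(n_samples)
    for j in range(n_samples):
        counts = np.bincount(committee_votes[:, j], minlength=n_classes)
        p = counts / n_models
        p = p[p > 0]                       # 0*log(0) terms drop out
        scores[j] = -(p * np.log(p)).sum()
    return scores

# Hypothetical hard predictions of a 4-model committee on 3 samples:
votes = np.array([[0, 1, 2],
                  [0, 1, 0],
                  [0, 0, 1],
                  [0, 1, 2]])
scores = vote_entropy(votes, n_classes=3)
print(int(np.argmax(scores)))  # 2: votes split 1/1/2, maximal disagreement
```

Sample 0 is unanimous (entropy zero) and would never be queried, while sample 2 splits the committee three ways and is queried first.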

Applications of Active Learning in Drug Discovery:

Predictive Modeling:

Active Learning iteratively selects the most informative compounds for model training, improving the predictive power of models for various properties such as activity, toxicity, and pharmacokinetics, which reduces late-stage failures.

Ex:- Active learning reduces prediction uncertainty by actively selecting compounds where the model is most uncertain in tasks like cardiotoxicity and hepatotoxicity prediction, leading to more robust, reliable, and accurate models.

Lead Optimization:

Active learning aids in selecting compounds that maximize information gain about structure-activity relationships, helping to refine and improve lead compounds efficiently.

It also helps optimize multiple properties simultaneously (e.g., potency, selectivity, solubility, and permeability) by focusing on compounds that improve these properties.

Ex:- In molecule optimization, where properties such as solubility and Caco-2 permeability must be satisfied, active learning samples the compounds that maximize information gain, which in turn helps generate molecules that satisfy the given QSAR property targets.

Chemical Space Exploration:

Active learning can guide the exploration of chemical space to discover novel compounds with desired biological activities, focusing on areas that are most likely to yield fruitful results.

By selecting diverse compounds, active learning ensures a broad exploration of chemical space, which is crucial for identifying unique scaffolds and avoiding redundancy.

Ex:- It helps in building docking surrogate models that are trained on relatively few molecules yet capture the chemical space of huge libraries such as Enamine, ZINC, and ChemSpace, allowing them to accurately predict binding-affinity values for any given set of molecules.

Clinical Trial Design:

Active Learning can optimize patient selection for clinical trials by identifying the most informative patient subgroups, improving trial efficiency and success rates.

Ex:- It supports adaptive trial designs by continuously updating the trial based on incoming data to focus on the most promising treatment arms.

Conclusion:

Active learning, a machine learning technique gaining significant research interest particularly in natural language processing and computer vision, allows algorithms to iteratively select the most informative data points from a large unlabeled pool for human annotation. This targeted approach reduces annotation costs and improves model performance compared to traditional methods. However, implementing active learning can be challenging, as readily available solutions are scarce. For those with the necessary expertise, developing a custom system based on existing frameworks is a viable option. Alternatively, commercially available tools offer a faster route to implementation, but often necessitate significant customization for specific use cases.


Author: Divyavani Darapuneni
