How Does Active Learning Machine Learning Work?

How can machines learn better by asking questions? Active learning lets machine learning models choose which data points from a larger dataset they want labeled, improving prediction accuracy without requiring every piece of data to be classified.

In this article, we’ll explore the key components of active learning and how it differs from traditional approaches, using real-life data projects taken from interviews. Let’s start with the fundamentals.


What is Active Learning in Machine Learning?

Active learning is a specialized branch of machine learning. The model actively queries an oracle rather than passively training on pre-labeled data. Often, this oracle is a human expert who labels uncertain or ambiguous data points.

Interestingly, not all data points are equally helpful. That's the core idea behind active learning. The model achieves better results with fewer labeled examples by concentrating on the most difficult or uncertain instances.

This makes the approach more efficient and cost-effective, especially when labeling data is expensive or time-consuming.
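
To make this concrete, here is a minimal sketch of a single query step, using toy data and a hypothetical label_with_oracle call standing in for the human annotator:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy setup: a small labeled seed set and a larger unlabeled pool (illustrative data)
rng = np.random.default_rng(0)
X_seed, y_seed = rng.normal(size=(20, 5)), rng.integers(0, 2, size=20)
X_pool = rng.normal(size=(200, 5))

model = LogisticRegression().fit(X_seed, y_seed)

# Query the pool sample the model is least confident about
confidence = model.predict_proba(X_pool).max(axis=1)
query_index = int(np.argmin(confidence))
# new_label = label_with_oracle(X_pool[query_index])  # hypothetical annotator call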


Traditional Machine Learning: The Passive Approach

Traditional machine learning models depend heavily on large pre-labeled datasets for pattern recognition and prediction. In this typical setup, the model passively trains on whatever data it's given.

This often means requiring a considerable amount of labeled data for good performance. However, this can be extremely costly and time-consuming.


Active Learning: A Dynamic Alternative

Active learning takes a more dynamic approach.

Instead of being fed pre-labeled data, the model actively chooses which points need labels by judging its own uncertainty. By focusing on the most informative data, active learning substantially decreases the number of labels needed while also improving accuracy.

This is invaluable when labeled data is scarce and expensive.


Why Use Active Learning?

Active learning is a great tool when gathering labeled data is complex or expensive.

In many real-world scenarios, labeling data requires nontrivial human effort and domain expertise, or simply takes too long. You can't tag every data point.

This is precisely where active learning comes in. By choosing which examples to label, it reduces the need for large volumes of labeled data.

Model training becomes faster and often yields better performance. Accuracy also improves because the model targets the regions where prediction is most difficult.

The result is a smarter use of both time and resources.


Key Benefits of Active Learning in Machine Learning


  • Reduces labeling costs
  • Accelerates training
  • Enhances accuracy
  • Optimizes resource use

Active learning has become the go-to choice in medical diagnosis, speech recognition, and personalized recommendations.


How Does Active Learning Work?

Active learning is built around a training and validation loop that refines prediction accuracy by querying labels for the points the model is least certain about.


Decoding the Iterative Loop

The loop usually works like this:

  1. Initial Model Training: The first step is to train a model on a small labeled dataset, which serves as the groundwork for later predictions.
  2. Encountering Unlabeled Data: Given a pool of unlabeled data, the model identifies where labeling would help most, i.e., where its uncertainty is highest, instead of marking everything.
  3. Querying an Oracle: Based on this uncertainty, the model consults an oracle, often you or another human annotator, to label specific data points, focusing on the most informative ones.
  4. Updating the Model: Once the oracle provides these labels, the model retrains on the newly labeled data. This narrows the field of possible predictions, making them more precise and less uncertain.
  5. Repeating the Cycle: This continues until the desired accuracy is reached or manual labeling becomes too expensive.


This iterative active learning loop improves the model without labeling the entire dataset, significantly decreasing annotation costs.
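
Put together, the loop looks something like this minimal sketch, where uncertainty_score and oracle_label are hypothetical stand-ins for a concrete query strategy and a human annotator:

import numpy as np

# Illustrative skeleton of the loop above (X_initial/y_initial: small seed set,
# X_pool: unlabeled pool, model: any scikit-learn-style estimator)
labeled_X, labeled_y = X_initial.copy(), y_initial.copy()
for _ in range(10):  # assumed labeling budget of 10 queries
    model.fit(labeled_X, labeled_y)
    scores = uncertainty_score(model, X_pool)   # strategy-specific uncertainty
    query_idx = int(np.argmax(scores))          # most informative point
    labeled_X = np.vstack([labeled_X, X_pool[query_idx]])
    labeled_y = np.append(labeled_y, oracle_label(X_pool[query_idx]))
    X_pool = np.delete(X_pool, query_idx, axis=0)  # remove it from the pool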


Active Learning Strategies

Have you ever wondered which data points your model should target for labeling? Several active learning strategies can help you make this critical decision.

These techniques are essential for improving your model’s accuracy while optimizing your resources. Let’s start with the first key technique, Uncertainty Sampling, and see it in action on a real-world data project.


Uncertainty Sampling

Uncertainty Sampling is perhaps the best-known active learning strategy.

With this method, you ask your model for the examples where it is least sure of its predictions. It can then concentrate on those uncertain data points, so it will likely perform better even with fewer labeled samples.
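
For classifiers, the three classic uncertainty scores are least confidence, margin, and entropy. Here is a minimal sketch, assuming class probabilities from a scikit-learn-style predict_proba:

import numpy as np

def least_confidence(probs):
    # 1 minus the probability of the most likely class; higher = more uncertain
    return 1.0 - probs.max(axis=1)

def margin(probs):
    # Gap between the top two class probabilities; smaller = more uncertain
    sorted_probs = np.sort(probs, axis=1)
    return sorted_probs[:, -1] - sorted_probs[:, -2]

def entropy(probs):
    # Shannon entropy of the predicted distribution; higher = more uncertain
    return -(probs * np.log(probs + 1e-12)).sum(axis=1)

Since the project below is a regression task, the implementation that follows approximates uncertainty differently, but the recipe is identical: score every unlabeled point and label the highest-scoring ones.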


Model Building on a Synthetic Dataset

In this section, we’ll use a data project that was originally a take-home assignment for data science positions at Capital One.

The goal is to predict target values on the test set while optimizing predictive accuracy. Here is the link to this project: https://platform.stratascratch.com/data-projects/model-building-synthetic-dataset


Let's break down the code that implements Uncertainty Sampling on the provided dataset.

  1. Loading the Data
  2. Handling Missing Data
  3. Initial Model Training
  4. Uncertainty Sampling
  5. Retraining the Model
  6. Model Evaluation


Here is the code.

import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error

train_data = pd.read_csv('codetest_train.txt', delimiter='\t')
test_data = pd.read_csv('codetest_test.txt', delimiter='\t')

X_train = train_data.drop(columns=['target'])  # Separate the features from the 'target' column
y_train = train_data['target']

X_train = X_train.apply(pd.to_numeric, errors='coerce')
test_data_cleaned = test_data.apply(pd.to_numeric, errors='coerce')

imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train)
test_data_imputed = imputer.transform(test_data_cleaned)

X_initial, X_pool, y_initial, y_pool = train_test_split(X_train_imputed, y_train, test_size=0.8, random_state=42)

initial_model = DecisionTreeRegressor(max_depth=5, random_state=42)
initial_model.fit(X_initial, y_initial)

pool_predictions = initial_model.predict(X_pool)

# Proxy for uncertainty: distance of each prediction from the mean prediction
# (a single tree reports no confidence of its own, so deviation stands in as the signal)
uncertain_indices = np.argsort(np.abs(pool_predictions - np.mean(pool_predictions)))[-100:]  # Select the 100 most uncertain samples

X_uncertain = X_pool[uncertain_indices]
y_uncertain = y_pool.iloc[uncertain_indices]

X_train_updated = np.vstack([X_initial, X_uncertain])
y_train_updated = np.concatenate([y_initial, y_uncertain])

updated_model = DecisionTreeRegressor(max_depth=5, random_state=42)
updated_model.fit(X_train_updated, y_train_updated)

# Evaluate on the pool (note: it still contains the newly labeled points, so this MAE is optimistic)
val_predictions = updated_model.predict(X_pool)
mae = mean_absolute_error(y_pool, val_predictions)
print(f"Updated MAE after uncertainty sampling: {mae}")

test_predictions = updated_model.predict(test_data_imputed)

with open('test_predictions_uncertainty_sampling.txt', 'w') as f:
    for prediction in test_predictions:
        f.write(f"{prediction}\n")

print("First 5 test predictions: ", test_predictions[:5])        

Here is the output.

In this implementation, we see how Uncertainty Sampling helps the model become more accurate by focusing on the most ambiguous data points.

The mean absolute error (MAE) reflects the model’s performance after applying this strategy, and we also generate predictions for the test set to evaluate the final result.
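
Note that the deviation-from-the-mean proxy used above is a simplification, since a single decision tree reports no confidence of its own. A common alternative, sketched here under the assumption that you switch to a random forest, is to use the disagreement among the individual trees as the uncertainty signal:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Sketch: per-sample spread of the trees' predictions as an uncertainty score
forest = RandomForestRegressor(n_estimators=50, random_state=42)
forest.fit(X_initial, y_initial)

tree_predictions = np.stack([tree.predict(X_pool) for tree in forest.estimators_])
uncertainty = tree_predictions.std(axis=0)          # high std = trees disagree
uncertain_indices = np.argsort(uncertainty)[-100:]  # 100 most uncertain samples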


Query-By-Committee

Query-By-Committee (QBC) is another popular strategy in active learning in machine learning. In this approach, multiple models (a "committee") are trained on the same labeled data, and their predictions on the unlabeled data are compared.

The data points on which the committee disagrees most are selected for labeling. This disagreement, measured as the variance among the predictions, highlights uncertainty, making it a powerful tool for improving model performance.

We’ll continue using the synthetic dataset from the earlier project. In this case, we’ll use three different models—a decision tree, a random forest, and a linear regression model—to form the committee. The models will be trained on an initially labeled dataset, and we will use the disagreement among their predictions to select new samples for labeling.


Here’s a breakdown of the code that implements the Query-By-Committee strategy:

  1. Loading the Data
  2. Handling Missing Data
  3. Initial Training
  4. Query-By-Committee
  5. Retraining the Models
  6. Model Evaluation


Here is the code.

import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error

train_data = pd.read_csv('codetest_train.txt', delimiter='\t')
test_data = pd.read_csv('codetest_test.txt', delimiter='\t')

X_train = train_data.drop(columns=['target'])  # Separate the features from the 'target' column
y_train = train_data['target']

X_train = X_train.apply(pd.to_numeric, errors='coerce')
test_data_cleaned = test_data.apply(pd.to_numeric, errors='coerce')

imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train)
test_data_imputed = imputer.transform(test_data_cleaned)

X_initial, X_pool, y_initial, y_pool = train_test_split(X_train_imputed, y_train, test_size=0.8, random_state=42)

model_1 = DecisionTreeRegressor(max_depth=5, random_state=42)
model_2 = RandomForestRegressor(n_estimators=10, random_state=42)
model_3 = LinearRegression()

model_1.fit(X_initial, y_initial)
model_2.fit(X_initial, y_initial)
model_3.fit(X_initial, y_initial)

pred_1 = model_1.predict(X_pool)
pred_2 = model_2.predict(X_pool)
pred_3 = model_3.predict(X_pool)

# Stack the committee's predictions and use their variance as the disagreement score
predictions_stack = np.vstack([pred_1, pred_2, pred_3])
variance = np.var(predictions_stack, axis=0)

uncertain_indices = np.argsort(variance)[-100:]  # Top 100 most uncertain

X_uncertain = X_pool[uncertain_indices]
y_uncertain = y_pool.iloc[uncertain_indices]

X_train_updated = np.vstack([X_initial, X_uncertain])
y_train_updated = np.concatenate([y_initial, y_uncertain])

model_1.fit(X_train_updated, y_train_updated)
model_2.fit(X_train_updated, y_train_updated)
model_3.fit(X_train_updated, y_train_updated)

val_predictions_1 = model_1.predict(X_pool)
val_predictions_2 = model_2.predict(X_pool)
val_predictions_3 = model_3.predict(X_pool)

final_predictions = (val_predictions_1 + val_predictions_2 + val_predictions_3) / 3

mae = mean_absolute_error(y_pool, final_predictions)
print(f"Updated MAE after Query-By-Committee: {mae}")

test_predictions_1 = model_1.predict(test_data_imputed)
test_predictions_2 = model_2.predict(test_data_imputed)
test_predictions_3 = model_3.predict(test_data_imputed)

test_predictions_final = (test_predictions_1 + test_predictions_2 + test_predictions_3) / 3

with open('test_predictions_qbc.txt', 'w') as f:
    for prediction in test_predictions_final:
        f.write(f"{prediction}\n")

print("First 5 test predictions: ", test_predictions_final[:5])        

Here is the output.

In this implementation, the Query-By-Committee strategy highlights how disagreement among the committee members can help identify the most uncertain data points. We improve the overall accuracy by retraining the models using these uncertain samples. The mean absolute error (MAE) helps measure the updated model's performance.


Expected Model Change

In the Expected Model Change strategy, the model selects data points that, when labeled, are expected to lead to the most significant change in the model’s predictions. This is typically done by estimating how much the model's predictions on other data points will shift after training on the new, labeled samples.

For this implementation, we’ll use the synthetic dataset and measure the expected change in the model by adding one data point at a time from the pool, retraining the model, and then observing how much the predictions change for the other samples.


Code Walkthrough

  1. Loading and Preprocessing
  2. Initial Training
  3. Expected Model Change
  4. Retraining



import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error

train_data = pd.read_csv('codetest_train.txt', delimiter='\t')
test_data = pd.read_csv('codetest_test.txt', delimiter='\t')

X_train = train_data.drop(columns=['target'])  # Separate the features from the 'target' column
y_train = train_data['target']

X_train = X_train.apply(pd.to_numeric, errors='coerce')
test_data_cleaned = test_data.apply(pd.to_numeric, errors='coerce')

imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train)
test_data_imputed = imputer.transform(test_data_cleaned)

X_initial, X_pool, y_initial, y_pool = train_test_split(X_train_imputed, y_train, test_size=0.8, random_state=42)

initial_model = DecisionTreeRegressor(max_depth=5, random_state=42)
initial_model.fit(X_initial, y_initial)

initial_predictions = initial_model.predict(X_pool)

n_samples = X_pool.shape[0]
subset_size = 500  # Define a subset size for faster computation (you can adjust this)
sample_indices = np.random.choice(np.arange(n_samples), size=subset_size, replace=False)

expected_changes = []

for i in sample_indices:
    # Temporarily add the i-th pool sample to the training data
    X_temp = np.vstack([X_initial, X_pool[i]])
    y_temp = np.append(y_initial, y_pool.iloc[i])
    
    # Retrain the model on this temporary dataset
    temp_model = DecisionTreeRegressor(max_depth=5, random_state=42)
    temp_model.fit(X_temp, y_temp)
    
    # Get new predictions on the pool
    new_predictions = temp_model.predict(X_pool)
    
    # Calculate the change in predictions
    change = np.mean(np.abs(new_predictions - initial_predictions))
    expected_changes.append((i, change))

sorted_samples = sorted(expected_changes, key=lambda x: x[1], reverse=True)

uncertain_indices = [sample[0] for sample in sorted_samples[:100]]

X_uncertain = X_pool[uncertain_indices]
y_uncertain = y_pool.iloc[uncertain_indices]

X_train_updated = np.vstack([X_initial, X_uncertain])
y_train_updated = np.concatenate([y_initial, y_uncertain])

updated_model = DecisionTreeRegressor(max_depth=5, random_state=42)
updated_model.fit(X_train_updated, y_train_updated)

updated_predictions = updated_model.predict(X_pool)
mae = mean_absolute_error(y_pool, updated_predictions)
print(f"Updated MAE after Expected Model Change (subset sampling): {mae}")

test_predictions = updated_model.predict(test_data_imputed)

with open('test_predictions_expected_model_change_optimized.txt', 'w') as f:
    for prediction in test_predictions:
        f.write(f"{prediction}\n")

print("First 5 test predictions: ", test_predictions[:5])        

Here is the output.

Types of Active Learning

In this section, we will explore three different types of active learning in machine learning: Pool-Based Sampling, Stream-Based Sampling, and Membership Query Synthesis. We will demonstrate each using the Predicting Price dataset.


For each strategy, we will:

  1. Explain the purpose of the strategy.
  2. Walk through the code implementation.
  3. Evaluate the model improvement based on the percentage drop in MSE (Mean Squared Error) before and after applying the strategy, as computed by the small helper sketched after this list.
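
The improvement metric itself is just the relative drop in MSE; as a small illustrative helper (not part of the original project code):

def mse_improvement(mse_before, mse_after):
    # Percentage drop in MSE; a positive value means the strategy helped
    return (mse_before - mse_after) / mse_before * 100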


Also, for this section, we’ll use the following data project.


Predicting Price

This data project aims to predict prices. Haensel AMS used it during their recruitment process. Here is the link to this data project: https://platform.stratascratch.com/data-projects/predicting-price

Let’s clean this dataset before applying the types of active learning.


Data Cleaning and Preparation

Before applying any active learning strategy, we first clean and prepare the dataset.

import pandas as pd

df = pd.read_csv("/mnt/data/sample.csv")

# Drop rows where loc1 contains the letters "S" or "T" (non-numeric codes)
df = df[(df["loc1"].str.contains("S") == False) & (df["loc1"].str.contains("T") == False)]

# Convert the location columns to numeric, coercing bad values to NaN, then drop those rows
df["loc2"] = pd.to_numeric(df["loc2"], errors='coerce')
df["loc1"] = pd.to_numeric(df["loc1"], errors='coerce')
df.dropna(inplace=True)

# One-hot encode the day-of-week column
dow_dummies = pd.get_dummies(df['dow'])
df = df.drop(columns='dow').join(dow_dummies)

# Drop the loc2 column
df.drop(columns=['loc2'], inplace=True)

df.head()        

Now that the dataset is cleaned and prepared, let’s apply each active learning strategy and measure how much the model has improved.


Pool-Based Sampling

Pool-based sampling selects the most uncertain samples from a large pool of unlabeled data. The model is initially trained on a small labeled dataset, and we query the most uncertain samples to improve performance.


Code Explanation

  1. Initial Model Training
  2. Uncertainty Sampling
  3. Model Retraining

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X = df.drop(columns='price')
y = df['price']

X_initial, X_pool, y_initial, y_pool = train_test_split(X, y, test_size=0.8, random_state=42)

model = RandomForestRegressor(random_state=42)
model.fit(X_initial, y_initial)

y_pred_initial = model.predict(X_pool)
mse_initial = mean_squared_error(y_pool, y_pred_initial)
print(f"Initial MSE (Pool-Based Sampling): {mse_initial}")

# Score "uncertainty" as the absolute prediction error (the known pool labels stand in for an oracle here)
pool_predictions = model.predict(X_pool)
uncertainty = abs(y_pool - pool_predictions)
top_uncertain_indices = uncertainty.nlargest(100).index

X_train_updated = pd.concat([X_initial, X_pool.loc[top_uncertain_indices]])
y_train_updated = pd.concat([y_initial, y_pool.loc[top_uncertain_indices]])

model.fit(X_train_updated, y_train_updated)

y_pred_updated = model.predict(X_pool)
mse_updated = mean_squared_error(y_pool, y_pred_updated)
print(f"Updated MSE after Pool-Based Sampling: {mse_updated}")

improvement_percentage = ((mse_initial - mse_updated) / mse_initial) * 100
print(f"Percentage Improvement after Pool-Based Sampling: {improvement_percentage:.2f}%")        

Here is the output.

Evaluation:

  • Initial MSE: The model's performance on the pool before querying uncertain samples.
  • Updated MSE: The model's performance after labeling the uncertain samples.
  • Improvement: We calculate the percentage reduction in MSE to measure model improvement.


Stream-Based Sampling

In Stream-Based Sampling, data points arrive in a stream, and the model decides whether to label each point as it comes. The decision is typically based on the uncertainty of the model’s predictions.


Before the code, here is a simple outline of the approach:

  1. Initial Model Training
  2. Streaming Data
  3. Threshold-Based Querying


Here is the code. As a sketch of the stream-based idea (the uncertainty signal and the 90th-percentile query threshold are assumptions for illustration), we use the spread of a random forest's tree predictions to decide, point by point, whether to query a label.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np
import pandas as pd

X_initial, X_pool, y_initial, y_pool = train_test_split(X, y, test_size=0.8, random_state=42)

model = RandomForestRegressor(random_state=42)
model.fit(X_initial, y_initial)

y_pred_initial = model.predict(X_pool)
mse_initial = mean_squared_error(y_pool, y_pred_initial)
print(f"Initial MSE (Stream-Based Sampling): {mse_initial}")

# Disagreement among the forest's trees serves as the per-point uncertainty signal
tree_predictions = np.stack([tree.predict(X_pool.to_numpy()) for tree in model.estimators_])
uncertainty = tree_predictions.std(axis=0)
threshold = np.percentile(uncertainty, 90)  # query roughly the most uncertain 10%

# Stream through the pool one point at a time and decide whether to query its label
queried_indices = [idx for position, idx in enumerate(X_pool.index)
                   if uncertainty[position] > threshold]

X_train_updated = pd.concat([X_initial, X_pool.loc[queried_indices]])
y_train_updated = pd.concat([y_initial, y_pool.loc[queried_indices]])

model.fit(X_train_updated, y_train_updated)

y_pred_updated = model.predict(X_pool)
mse_updated = mean_squared_error(y_pool, y_pred_updated)
print(f"Updated MSE after Stream-Based Sampling: {mse_updated}")

improvement_percentage = ((mse_initial - mse_updated) / mse_initial) * 100
print(f"Percentage Improvement after Stream-Based Sampling: {improvement_percentage:.2f}%")

Here is the output.

  • Initial MSE: The model's performance before querying labels from the stream.
  • Updated MSE: The model's performance after labeling is based on uncertainty from the stream.
  • Improvement: We calculate the percentage drop in MSE.


Membership Query Synthesis

In Membership Query Synthesis, the model generates new synthetic data points and queries their labels. This strategy is useful when real-world data is limited.

Here is the explanation of the code that we’ll apply in a bit.

  1. Synthetic Data Generation
  2. Model Retraining
  3. Evaluation


Here is the code.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_initial_scaled, X_pool_scaled, y_initial, y_pool = train_test_split(X_scaled, y, test_size=0.8, random_state=42)

model = LinearRegression()
model.fit(X_initial_scaled, y_initial)

y_pred_synth_initial = model.predict(X_pool_scaled)
mse_synth_initial = mean_squared_error(y_pool, y_pred_synth_initial)
print(f"Initial MSE (Membership Query Synthesis): {mse_synth_initial}")

n_samples = 100
mean = X_initial_scaled.mean(axis=0)
std = X_initial_scaled.std(axis=0)

X_synth = np.random.normal(loc=mean, scale=std, size=(n_samples, X_initial_scaled.shape[1]))
y_synth = model.predict(X_synth)  # The initial model stands in for the oracle when labeling synthetic points

X_combined_scaled = np.vstack([X_initial_scaled, X_synth])
y_combined = np.concatenate([y_initial, y_synth])

model.fit(X_combined_scaled, y_combined)
y_pred_synth_updated = model.predict(X_pool_scaled)
mse_synth_updated = mean_squared_error(y_pool, y_pred_synth_updated)
print(f"Updated MSE after Membership Query Synthesis: {mse_synth_updated}")

synth_improvement_percentage = ((mse_synth_initial - mse_synth_updated) / mse_synth_initial) * 100
print(f"Percentage Improvement after Membership Query Synthesis: {synth_improvement_percentage:.2f}%")        

Here is the output.

  • Initial MSE: The model's performance before adding synthetic data.
  • Updated MSE: The model's performance after adding synthetic data.
  • Improvement: We calculate the percentage drop in MSE.


Conclusion

In this article, we have explored how active learning can improve machine learning models using different real-life data projects. These projects matter because they have been used in actual interviews. Applying your knowledge to real-life data projects will get you where you want to be.
