How Does Active Learning Machine Learning Work?
How can machines learn better by asking questions? Active learning allows machine learning models to choose which data points from a larger dataset they wish to have labeled, increasing the accuracy of predictions without requiring that every piece of data be labeled.
In this article, we’ll explore the key components of active learning and how it differs from traditional approaches, using real-life data projects from interviews. But let’s start with the fundamentals.
What is Active Learning in Machine Learning?
Active learning is a specialized branch of machine learning. The model actively queries an oracle rather than passively training on pre-labeled data. Often, this oracle is a human expert who labels uncertain or ambiguous data points.
Interestingly, not all data points are equally helpful. That's the core idea behind active learning. The model achieves better results with fewer labeled examples by concentrating on the most difficult or uncertain instances.
This makes training more efficient and cost-effective, especially when labeling data is expensive or time-consuming.
Traditional Machine Learning: The Passive Approach
Traditional machine learning models depend heavily on large pre-labeled datasets for pattern recognition and prediction. In this typical setup, the model passively trains on whatever data it's given.
This often means requiring a considerable amount of labeled data for good performance. However, this can be extremely costly and time-consuming.
Active Learning: A Dynamic Alternative
Active learning takes a more dynamic approach.
Instead of being fed pre-labeled data, the model actively chooses which points should be labeled, based on its own uncertainty. By focusing on the most informative data, active learning can substantially decrease the number of labels needed while maintaining or improving accuracy.
This is especially valuable when labeled data is scarce and expensive.
Why Use Active Learning?
If gathering labeled data gets complex or expensive, active learning is a great tool to help.
Labeled data in many real-world scenarios might require nontrivial human effort and domain expertise or take too long to collect. You can't tag every data point.
This is precisely where active learning comes in. Choosing which examples to label reduces the need for large amounts of labeled data.
Model training is accelerated and can reach better performance. It also enhances model accuracy by targeting the regions where the model has the most difficulty.
This results in a smarter use of time and resources.
Key Benefits of Active Learning in Machine Learning
Active learning has become the go-to choice in medical diagnosis, speech recognition, and personalized recommendations.
How Does Active Learning Work?
Active learning is based on a training and validation loop that refines prediction accuracy by querying labels for the points the model is least sure about.
Decoding the Iterative Loop
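It usually goes like this: train the model on a small labeled set, score the unlabeled data, ask the oracle to label the most uncertain points, add them to the training set, and retrain.
This iterative process improves the model without labeling the entire dataset, significantly decreasing the data annotation cost.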
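The loop can be sketched in a few lines. This is a minimal illustration on synthetic classification data, not the code used in the projects later in the article; the function name `active_learning_loop`, its parameters, and the choice of logistic regression are all assumptions made for the example, and the known labels in `y_oracle` stand in for a human labeler.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def active_learning_loop(X, y_oracle, n_initial=20, batch_size=10, n_rounds=5, seed=0):
    """Train, score the pool, query the least certain points, retrain."""
    rng = np.random.default_rng(seed)
    # Start with a small random labeled set; everything else is the pool.
    labeled = rng.choice(len(X), size=n_initial, replace=False)
    pool = np.setdiff1d(np.arange(len(X)), labeled)
    history = []
    for _ in range(n_rounds):
        model = LogisticRegression(max_iter=1000).fit(X[labeled], y_oracle[labeled])
        # Track accuracy on the remaining pool (possible here only because
        # the synthetic data comes with known labels).
        history.append((len(labeled), model.score(X[pool], y_oracle[pool])))
        # Least-confidence score: 1 minus the highest class probability.
        uncertainty = 1.0 - model.predict_proba(X[pool]).max(axis=1)
        query = pool[np.argsort(uncertainty)[-batch_size:]]
        # "Ask the oracle": move the queried points into the labeled set.
        labeled = np.concatenate([labeled, query])
        pool = np.setdiff1d(pool, query)
    # Final fit on everything labeled so far.
    model = LogisticRegression(max_iter=1000).fit(X[labeled], y_oracle[labeled])
    return model, labeled, history

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model, labeled, history = active_learning_loop(X, y)
for n_labels, acc in history:
    print(f"{n_labels} labels -> pool accuracy {acc:.3f}")
```

Each round adds only `batch_size` labels, so the labeling cost is controlled directly by the number of rounds.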
Active Learning Strategies
Have you ever wondered which data points your model should send for labeling? Several active learning strategies help with this critical decision.
These techniques are essential for improving your model's accuracy while making the most of your resources. We'll start with Uncertainty Sampling and see it in action with a real-world data project.
Uncertainty Sampling
Uncertainty Sampling is perhaps the best-known active learning strategy.
With this method, you ask your model for the examples where it is least sure of its predictions. Labeling them lets the model concentrate on those uncertain data points, so it will probably perform better even with fewer labeled samples.
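For a classifier that outputs class probabilities, "least sure" is usually turned into a number with one of three common scores. The sketch below shows them on a hand-written probability matrix; the function names are illustrative, and the project code that follows uses a different, regression-specific proxy.

```python
import numpy as np

def least_confidence(proba):
    """1 minus the top class probability; higher means more uncertain."""
    return 1.0 - proba.max(axis=1)

def margin(proba):
    """Gap between the two most likely classes; lower means more uncertain."""
    ordered = np.sort(proba, axis=1)
    return ordered[:, -1] - ordered[:, -2]

def entropy(proba):
    """Shannon entropy of the predicted distribution; higher means more uncertain."""
    p = np.clip(proba, 1e-12, 1.0)  # avoid log(0)
    return -(p * np.log(p)).sum(axis=1)

# Three samples, three classes: confident, nearly uniform, somewhere in between.
proba = np.array([
    [0.98, 0.01, 0.01],
    [0.40, 0.35, 0.25],
    [0.60, 0.30, 0.10],
])

print("least confidence picks sample", np.argmax(least_confidence(proba)))
print("margin picks sample", np.argmin(margin(proba)))
print("entropy picks sample", np.argmax(entropy(proba)))
```

All three scores agree on the nearly uniform row here, but they can rank samples differently when there are many classes.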
Model Building on a Synthetic Dataset
In this section, we’ll use a data project that was originally a take-home assignment for data science positions at Capital One.
The goal is to predict target values on the test set while optimizing predictive accuracy. Here is the link to this project: https://platform.stratascratch.com/data-projects/model-building-synthetic-dataset
Let's break down the code that implements Uncertainty Sampling on the provided dataset.
Here is the code.
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
train_data = pd.read_csv('codetest_train.txt', delimiter='\t')
test_data = pd.read_csv('codetest_test.txt', delimiter='\t')
X_train = train_data.drop(columns=['target']) # Assuming the last column is 'target'
y_train = train_data['target']
X_train = X_train.apply(pd.to_numeric, errors='coerce')
test_data_cleaned = test_data.apply(pd.to_numeric, errors='coerce')
imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train)
test_data_imputed = imputer.transform(test_data_cleaned)
X_initial, X_pool, y_initial, y_pool = train_test_split(X_train_imputed, y_train, test_size=0.8, random_state=42)
initial_model = DecisionTreeRegressor(max_depth=5, random_state=42)
initial_model.fit(X_initial, y_initial)
pool_predictions = initial_model.predict(X_pool)
# Proxy for uncertainty: pick the 100 pool samples whose predictions lie closest to the mean prediction
uncertain_indices = np.argsort(np.abs(pool_predictions - np.mean(pool_predictions)))[:100]
X_uncertain = X_pool[uncertain_indices]
y_uncertain = y_pool.iloc[uncertain_indices]
X_train_updated = np.vstack([X_initial, X_uncertain])
y_train_updated = np.concatenate([y_initial, y_uncertain])
updated_model = DecisionTreeRegressor(max_depth=5, random_state=42)
updated_model.fit(X_train_updated, y_train_updated)
val_predictions = updated_model.predict(X_pool)
mae = mean_absolute_error(y_pool, val_predictions)
print(f"Updated MAE after uncertainty sampling: {mae}")
test_predictions = updated_model.predict(test_data_imputed)
with open('test_predictions_uncertainty_sampling.txt', 'w') as f:
    for prediction in test_predictions:
        f.write(f"{prediction}\n")
print("First 5 test predictions: ", test_predictions[:5])
Here is the output.
In this implementation, we see how Uncertainty Sampling helps the model become more accurate by focusing on the most ambiguous data points.
The mean absolute error (MAE) reflects the model’s performance after applying this strategy, and we also generate predictions for the test set to evaluate the final result.
Query-By-Committee
Query-By-Committee (QBC) is another popular strategy in active learning in machine learning. In this approach, multiple models (a "committee") are trained on the same labeled data, and their predictions on the unlabeled data are compared.
The data points on which the committee disagrees the most are selected for labeling. This disagreement, or variance among predictions, highlights the uncertainty, making it a powerful tool for improving model performance.
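The selection step can be isolated into a small helper. The sketch below scores disagreement as the variance across committee members and returns the most contested samples; the function names and the toy predictions are made up for illustration.

```python
import numpy as np

def committee_disagreement(predictions):
    """Variance across members; predictions has shape (n_models, n_samples)."""
    return np.var(np.asarray(predictions, dtype=float), axis=0)

def select_queries(predictions, k):
    """Indices of the k samples with the highest disagreement, most contested first."""
    scores = committee_disagreement(predictions)
    return np.argsort(scores)[::-1][:k]

# Three committee members predicting four samples.
preds = [
    [1.0, 5.0, 3.0, 2.0],
    [1.1, 2.0, 3.0, 2.5],
    [0.9, 8.0, 3.0, 1.5],
]
print(select_queries(preds, k=2))
```

Sample 1 is selected first because the members predict 5.0, 2.0, and 8.0 for it, while sample 2 is never selected because all three agree.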
We’ll continue using the synthetic dataset from the earlier project. In this case, we’ll use three different models—a decision tree, a random forest, and a linear regression model—to form the committee. The models will be trained on an initially labeled dataset, and we will use the disagreement among their predictions to select new samples for labeling.
Here’s a breakdown of the code that implements the Query-By-Committee strategy:
Here is the code.
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
train_data = pd.read_csv('codetest_train.txt', delimiter='\t')
test_data = pd.read_csv('codetest_test.txt', delimiter='\t')
X_train = train_data.drop(columns=['target']) # Assuming the last column is 'target'
y_train = train_data['target']
X_train = X_train.apply(pd.to_numeric, errors='coerce')
test_data_cleaned = test_data.apply(pd.to_numeric, errors='coerce')
imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train)
test_data_imputed = imputer.transform(test_data_cleaned)
X_initial, X_pool, y_initial, y_pool = train_test_split(X_train_imputed, y_train, test_size=0.8, random_state=42)
model_1 = DecisionTreeRegressor(max_depth=5, random_state=42)
model_2 = RandomForestRegressor(n_estimators=10, random_state=42)
model_3 = LinearRegression()
model_1.fit(X_initial, y_initial)
model_2.fit(X_initial, y_initial)
model_3.fit(X_initial, y_initial)
pred_1 = model_1.predict(X_pool)
pred_2 = model_2.predict(X_pool)
pred_3 = model_3.predict(X_pool)
predictions_stack = np.vstack([pred_1, pred_2, pred_3])
variance = np.var(predictions_stack, axis=0)
uncertain_indices = np.argsort(variance)[-100:] # Top 100 most uncertain
X_uncertain = X_pool[uncertain_indices]
y_uncertain = y_pool.iloc[uncertain_indices]
X_train_updated = np.vstack([X_initial, X_uncertain])
y_train_updated = np.concatenate([y_initial, y_uncertain])
model_1.fit(X_train_updated, y_train_updated)
model_2.fit(X_train_updated, y_train_updated)
model_3.fit(X_train_updated, y_train_updated)
val_predictions_1 = model_1.predict(X_pool)
val_predictions_2 = model_2.predict(X_pool)
val_predictions_3 = model_3.predict(X_pool)
final_predictions = (val_predictions_1 + val_predictions_2 + val_predictions_3) / 3
mae = mean_absolute_error(y_pool, final_predictions)
print(f"Updated MAE after Query-By-Committee: {mae}")
test_predictions_1 = model_1.predict(test_data_imputed)
test_predictions_2 = model_2.predict(test_data_imputed)
test_predictions_3 = model_3.predict(test_data_imputed)
test_predictions_final = (test_predictions_1 + test_predictions_2 + test_predictions_3) / 3
with open('test_predictions_qbc.txt', 'w') as f:
    for prediction in test_predictions_final:
        f.write(f"{prediction}\n")
print("First 5 test predictions: ", test_predictions_final[:5])
Here is the output.
In this implementation, the Query-By-Committee strategy highlights how disagreement among the committee members can help identify the most uncertain data points. We improve the overall accuracy by retraining the models using these uncertain samples. The mean absolute error (MAE) helps measure the updated model's performance.
Expected Model Change
In the Expected Model Change strategy, the model selects data points that, when labeled, are expected to lead to the most significant change in the model’s predictions. This is typically done by estimating how much the model's predictions on other data points will shift after training on the new, labeled samples.
For this implementation, we’ll use the synthetic dataset and measure the expected change in the model by adding one data point at a time from the pool, retraining the model, and then observing how much the predictions change for the other samples.
Code Walkthrough
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
train_data = pd.read_csv('codetest_train.txt', delimiter='\t')
test_data = pd.read_csv('codetest_test.txt', delimiter='\t')
X_train = train_data.drop(columns=['target']) # Assuming the last column is 'target'
y_train = train_data['target']
X_train = X_train.apply(pd.to_numeric, errors='coerce')
test_data_cleaned = test_data.apply(pd.to_numeric, errors='coerce')
imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train)
test_data_imputed = imputer.transform(test_data_cleaned)
X_initial, X_pool, y_initial, y_pool = train_test_split(X_train_imputed, y_train, test_size=0.8, random_state=42)
initial_model = DecisionTreeRegressor(max_depth=5, random_state=42)
initial_model.fit(X_initial, y_initial)
initial_predictions = initial_model.predict(X_pool)
n_samples = X_pool.shape[0]
subset_size = 500 # Define a subset size for faster computation (you can adjust this)
sample_indices = np.random.choice(np.arange(n_samples), size=subset_size, replace=False)
expected_changes = []
for i in sample_indices:
    # Temporarily add the i-th pool sample to the training data
    X_temp = np.vstack([X_initial, X_pool[i]])
    y_temp = np.append(y_initial, y_pool.iloc[i])
    # Retrain the model on this temporary dataset
    temp_model = DecisionTreeRegressor(max_depth=5, random_state=42)
    temp_model.fit(X_temp, y_temp)
    # Get new predictions on the pool
    new_predictions = temp_model.predict(X_pool)
    # Calculate the change in predictions
    change = np.mean(np.abs(new_predictions - initial_predictions))
    expected_changes.append((i, change))
sorted_samples = sorted(expected_changes, key=lambda x: x[1], reverse=True)
uncertain_indices = [sample[0] for sample in sorted_samples[:100]]
X_uncertain = X_pool[uncertain_indices]
y_uncertain = y_pool.iloc[uncertain_indices]
X_train_updated = np.vstack([X_initial, X_uncertain])
y_train_updated = np.concatenate([y_initial, y_uncertain])
updated_model = DecisionTreeRegressor(max_depth=5, random_state=42)
updated_model.fit(X_train_updated, y_train_updated)
updated_predictions = updated_model.predict(X_pool)
mae = mean_absolute_error(y_pool, updated_predictions)
print(f"Updated MAE after Expected Model Change (subset sampling): {mae}")
test_predictions = updated_model.predict(test_data_imputed)
with open('test_predictions_expected_model_change_optimized.txt', 'w') as f:
    for prediction in test_predictions:
        f.write(f"{prediction}\n")
print("First 5 test predictions: ", test_predictions[:5])
Here is the output.
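The listing above retrains with each candidate's true label, which works for a demonstration but is not available before querying. One way around this, sketched here under stated assumptions, is to score candidates by expected gradient length: for a linear model with squared loss, the gradient for a point is its residual times its feature vector, and the unknown label can be approximated by the predictions of a small bootstrap ensemble. The function name, the ensemble size, and the synthetic data are illustrative choices, not part of the project.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def expected_gradient_length(X_labeled, y_labeled, X_pool, n_boot=5, seed=0):
    """Score pool points by the expected size of the parameter gradient they would cause."""
    rng = np.random.default_rng(seed)
    base_pred = LinearRegression().fit(X_labeled, y_labeled).predict(X_pool)
    n = len(X_labeled)
    gaps = np.zeros(len(X_pool))
    for _ in range(n_boot):
        # Each bootstrap model provides one plausible label for every pool point.
        idx = rng.integers(0, n, size=n)
        member = LinearRegression().fit(X_labeled[idx], y_labeled[idx])
        gaps += np.abs(base_pred - member.predict(X_pool))
    gaps /= n_boot
    # Gradient w.r.t. (weights, intercept) is residual * [x, 1].
    feature_norm = np.sqrt(1.0 + np.sum(X_pool ** 2, axis=1))
    return gaps * feature_norm

rng = np.random.default_rng(42)
X_all = rng.normal(size=(200, 3))
y_all = X_all @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=200)

scores = expected_gradient_length(X_all[:20], y_all[:20], X_all[20:])
top = np.argsort(scores)[-10:]  # the 10 candidates expected to change the model most
print(top)
```

Unlike the retrain-per-candidate loop, this needs only `n_boot` extra model fits regardless of the pool size.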
Types of Active Learning
In this section, we will explore three different types of active learning in machine learning: Pool-Based Sampling, Stream-Based Sampling, and Membership Query Synthesis. We will demonstrate each using the Predicting Price dataset.
For each strategy, we will measure the model's error before and after querying and report the percentage improvement.
For this section, we’ll use the following data project.
Predicting Price
This data project aims to predict price. Haensel AMS used this project during the recruitment process. Here is the link to this Data Project: https://platform.stratascratch.com/data-projects/predicting-price
Let’s clean this dataset before applying each type of active learning.
Data Cleaning and Preparation
Before applying any active learning strategy, we first clean and prepare the dataset.
import pandas as pd
df = pd.read_csv("/mnt/data/sample.csv")
df = df[(df["loc1"].str.contains("S") == False) & (df["loc1"].str.contains("T") == False)]
df["loc2"] = pd.to_numeric(df["loc2"], errors='coerce')
df["loc1"] = pd.to_numeric(df["loc1"], errors='coerce')
df.dropna(inplace=True)
dow_dummies = pd.get_dummies(df['dow'])
df = df.drop(columns='dow').join(dow_dummies)
df.drop(columns=['loc2'], inplace=True)
df.head()
Now that the dataset is cleaned and prepared, let’s apply each active learning strategy and measure how much the model has improved.
Pool-Based Sampling
Pool-based sampling selects the most uncertain samples from a large pool of unlabeled data. The model is initially trained on a small labeled dataset, and we query the most uncertain samples to improve performance.
Code Explanation
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
X = df.drop(columns='price')
y = df['price']
X_initial, X_pool, y_initial, y_pool = train_test_split(X, y, test_size=0.8, random_state=42)
model = RandomForestRegressor(random_state=42)
model.fit(X_initial, y_initial)
y_pred_initial = model.predict(X_pool)
mse_initial = mean_squared_error(y_pool, y_pred_initial)
print(f"Initial MSE (Pool-Based Sampling): {mse_initial}")
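proba_pool = model.predict(X_pool)
# Proxy for uncertainty: absolute prediction error. Note that this uses the pool's
# true labels, which would not be known before querying in a real labeling setup.
uncertainty = abs(y_pool - proba_pool)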
top_uncertain_indices = uncertainty.nlargest(100).index
X_train_updated = pd.concat([X_initial, X_pool.loc[top_uncertain_indices]])
y_train_updated = pd.concat([y_initial, y_pool.loc[top_uncertain_indices]])
model.fit(X_train_updated, y_train_updated)
y_pred_updated = model.predict(X_pool)
mse_updated = mean_squared_error(y_pool, y_pred_updated)
print(f"Updated MSE after Pool-Based Sampling: {mse_updated}")
improvement_percentage = ((mse_initial - mse_updated) / mse_initial) * 100
print(f"Percentage Improvement after Pool-Based Sampling: {improvement_percentage:.2f}%")
Here is the output.
Evaluation:
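One caveat when reading these numbers: the listing above ranks pool samples by their absolute error against `y_pool`, so it relies on labels that would not be available before querying. A minimal sketch of a label-free alternative, assuming a random forest and using the spread of its per-tree predictions as the uncertainty score, is shown here on synthetic data. The `make_regression` data and the names `X_demo`, `forest`, and `spread` are illustrative and not part of the project.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_regression(n_samples=600, n_features=8, noise=10.0, random_state=0)
X_init, X_rest, y_init, y_rest = train_test_split(
    X_demo, y_demo, test_size=0.8, random_state=0
)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_init, y_init)

# One row of predictions per tree; no pool labels are involved.
per_tree = np.stack([tree.predict(X_rest) for tree in forest.estimators_])
spread = per_tree.std(axis=0)

# Query the 100 pool points where the trees disagree the most.
query_idx = np.argsort(spread)[-100:]
print("mean spread of queried points:", spread[query_idx].mean())
print("mean spread of whole pool:", spread.mean())
```

Only after `query_idx` is chosen would the corresponding entries of `y_rest` be requested from the labeler.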
Stream-Based Sampling
In Stream-Based Sampling, data points arrive in a stream, and the model decides whether to label each point as it comes. The decision is typically based on the uncertainty of the model’s predictions.
Put simply, the model looks at each incoming point once, requests its label only if the prediction is uncertain enough, and otherwise lets the point pass.
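A minimal sketch of the stream setting, on synthetic data rather than the project dataset, could look like this. It assumes a random forest whose per-tree spread serves as the uncertainty score, a threshold calibrated on the first few unlabeled stream points, and a fixed labeling budget; all names and numbers are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X_s, y_s = make_regression(n_samples=400, n_features=6, noise=5.0, random_state=1)
X_lab, y_lab = X_s[:50], y_s[:50]          # small labeled seed set
stream_X, stream_y = X_s[50:], y_s[50:]    # points that arrive one at a time

def tree_spread(forest, x):
    """Standard deviation of the individual trees' predictions for one point."""
    preds = [tree.predict(x.reshape(1, -1))[0] for tree in forest.estimators_]
    return float(np.std(preds))

forest = RandomForestRegressor(n_estimators=10, random_state=1).fit(X_lab, y_lab)

# Calibrate the threshold on the first 50 stream points (no labels needed).
calibration = np.array([tree_spread(forest, x) for x in stream_X[:50]])
threshold = np.percentile(calibration, 80)

budget = 30   # maximum number of labels we are willing to pay for
queried = 0
for x, y_true in zip(stream_X[50:], stream_y[50:]):
    if queried >= budget:
        break
    # Decide on the spot: query only if the trees disagree enough.
    if tree_spread(forest, x) > threshold:
        X_lab = np.vstack([X_lab, x])
        y_lab = np.append(y_lab, y_true)   # y_true plays the role of the oracle
        forest.fit(X_lab, y_lab)
        queried += 1

print(f"Queried {queried} of the streamed points; labeled set size is now {len(X_lab)}")
```

The key difference from pool-based sampling is that each decision is made without seeing the rest of the data, so the threshold and the budget carry most of the design weight.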
Membership Query Synthesis
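In Membership Query Synthesis, the model generates new synthetic data points and queries their labels. This strategy is useful when real-world data is limited.
The code below scales the features, trains a linear model on a small labeled split, draws synthetic points from a normal distribution fitted to that split, labels them, and retrains.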
Here is the code.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_initial_scaled, X_pool_scaled, y_initial, y_pool = train_test_split(X_scaled, y, test_size=0.8, random_state=42)
model = LinearRegression()
model.fit(X_initial_scaled, y_initial)
y_pred_synth_initial = model.predict(X_pool_scaled)
mse_synth_initial = mean_squared_error(y_pool, y_pred_synth_initial)
print(f"Initial MSE (Membership Query Synthesis): {mse_synth_initial}")
n_samples = 100
mean = X_initial_scaled.mean(axis=0)
std = X_initial_scaled.std(axis=0)
X_synth = np.random.normal(loc=mean, scale=std, size=(n_samples, X_initial_scaled.shape[1]))
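# Stand-in for the oracle: labels come from the initial model's own predictions.
# Because these points lie exactly on the fitted model, refitting on them leaves a
# linear regression essentially unchanged; a real oracle would supply true labels here.
y_synth = model.predict(X_synth)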
X_combined_scaled = np.vstack([X_initial_scaled, X_synth])
y_combined = np.concatenate([y_initial, y_synth])
model.fit(X_combined_scaled, y_combined)
y_pred_synth_updated = model.predict(X_pool_scaled)
mse_synth_updated = mean_squared_error(y_pool, y_pred_synth_updated)
print(f"Updated MSE after Membership Query Synthesis: {mse_synth_updated}")
synth_improvement_percentage = ((mse_synth_initial - mse_synth_updated) / mse_synth_initial) * 100
print(f"Percentage Improvement after Membership Query Synthesis: {synth_improvement_percentage:.2f}%")
Here is the output.
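The listing above labels its synthetic points with the model's own predictions. To see what the strategy looks like when a separate labeler answers the queries, here is a small sketch on made-up data. The `oracle` function, the true weights `true_w`, and all sizes are hypothetical; in practice the oracle would be a human expert or an experiment.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
true_w = np.array([3.0, -2.0, 0.5])  # hidden relationship, known only to the oracle

def oracle(X):
    """Hypothetical labeler: returns noisy labels for any queried points."""
    return X @ true_w + rng.normal(scale=0.1, size=len(X))

# Very small labeled seed set, plus a clean evaluation set.
X_seed = rng.normal(size=(5, 3))
y_seed = oracle(X_seed)
X_eval = rng.normal(size=(500, 3))
y_eval = X_eval @ true_w

model = LinearRegression().fit(X_seed, y_seed)
mse_before = mean_squared_error(y_eval, model.predict(X_eval))

# Synthesize new query points around the seed data and ask the oracle for labels.
X_synth = rng.normal(loc=X_seed.mean(axis=0), scale=X_seed.std(axis=0), size=(100, 3))
y_synth = oracle(X_synth)

model.fit(np.vstack([X_seed, X_synth]), np.concatenate([y_seed, y_synth]))
mse_after = mean_squared_error(y_eval, model.predict(X_eval))

print(f"MSE before queries: {mse_before:.4f}")
print(f"MSE after queries:  {mse_after:.4f}")
```

Here the synthesized points carry new information because their labels come from outside the model.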
Conclusion
In this article, we have explored how active learning can improve machine learning models using different real-life data projects. These projects are important because they have already been used during interviews. Applying your knowledge using real-life data projects will get you to where you want to be.