"Dynamic Approach to Tackling Multicollinearity & Overfitting with Lasso Regression: Real-Time Health Data Insights"
Vaidyanathan Ravichandran
Professor of Practice (Finance) - Business Schools , Bangalore
Brief on Regularization on Coefficients in Regression Models
Regularization is a crucial technique in regression models designed to tackle the problems of overfitting and multicollinearity.
By adding a penalty term to the regression cost function, regularization discourages the model from becoming overly complex, thereby promoting simplicity and generalizability. This becomes particularly important when working with high-dimensional data, where many features might contribute little to the model and cause overfitting.
What Causes Overfitting?
How to Mitigate Overfitting:
Types of Regularization
1/ L1 Regularization (Lasso)
2/ L2 Regularization (Ridge)
Refer previous article -Ridge Regression - on Linkedin : https://www.dhirubhai.net/pulse/mitigating-multicollinearity-health-data-ridge-ravichandran-c2flc/?trackingId=8BFwhWcpQdiMuul1l9rk%2BQ%3D%3D
3/ Elastic Net
Use Case: Elastic Net is useful when both irrelevant features and multicollinearity are concerns. It strikes a balance between Lasso's feature selection and Ridge's regularization of all features.
Benefits of Regularization
Elastic Net: A Hybrid Approach
Elastic Net combines the strengths of both Lasso and Ridge regression. It applies both L1 and L2 penalties, thus achieving feature selection (Lasso’s strength) while maintaining some flexibility by shrinking coefficients rather than eliminating them entirely (Ridge’s strength). Elastic Net is useful when you want to combine the benefits of both methods, particularly in datasets with many highly correlated variables.
Introduction to Lasso Regression
Lasso Regression (Least Absolute Shrinkage and Selection Operator) is a type of linear regression that introduces regularization in the model. This method is especially useful when the dataset contains multiple features, some of which may not contribute significantly to the model. By applying a penalty to the regression coefficients, Lasso encourages simpler models that avoid overfitting and reduce multicollinearity, ultimately improving generalization to new data.
The Need for Lasso Regression
As datasets grow in complexity, they often contain a large number of features (variables), many of which may be irrelevant or redundant. Traditional Ordinary Least Squares (OLS) regression models may overfit the data, meaning they perform well on the training data but poorly on unseen test data. Overfitting becomes more problematic when the features are highly correlated (multicollinearity), leading to unstable and unreliable coefficients.
Here are the main reasons for using Lasso regression:
Understanding Multicollinearity
Multicollinearity occurs when two or more predictor variables in a regression model are highly correlated. This can lead to inflated standard errors for the regression coefficients, making it difficult to assess the effect of each independent variable on the dependent variable.
Effects of Multicollinearity:
Detecting Multicollinearity:
领英推荐
Steps to Apply Lasso Regression
Lasso vs. Ridge Regression
Both Lasso and Ridge are regularization techniques, but they differ in how they penalize the regression coefficients.
When to Use Lasso or Ridge Regression?
Lasso Regression is a powerful tool for regression models that require automatic feature selection and regularization to handle multicollinearity and prevent overfitting. It outperforms traditional linear regression models by creating simpler, more interpretable models. When compared with Ridge Regression, Lasso is preferred when you expect many irrelevant features, while Ridge is better for retaining all features but controlling their impact. Together with other advanced techniques, Lasso and Ridge play a vital role in modern machine learning applications.
Data Source:
The data was retrieved from the World Bank API (WBData). The indicators were related to health expenditure, life expectancy, smoking prevalence, mortality rate, and population growth in India from 1990 to 2023.
Details:
What the Below Python Code Does:
import wbdata
import pandas as pd
import datetime
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
import seaborn as sns
import matplotlib.pyplot as plt
# Step 1: Fetch Real-Time Health Data from World Bank API
start_date = datetime.datetime(1990, 1, 1)
end_date = datetime.datetime(2023, 1, 1)
# Select health-related indicators
indicators = {
'SH.XPD.CHEX.PC.CD': 'Health Expenditure Per Capita', # Health Expenditure
'SP.DYN.LE00.IN': 'Life Expectancy', # Life Expectancy
'SH.STA.SMSS.ZS': 'Smoking Prevalence', # Smoking Prevalence
'SH.DYN.MORT': 'Mortality Rate', # Mortality Rate
'SP.POP.GROW': 'Population Growth' # Population Growth
}
# Fetching data for India (you can replace 'IN' with other country codes)
data = wbdata.get_dataframe(indicators, country='IN', date=(start_date, end_date))
# Step 2: Clean and Prepare the Data
data.reset_index(inplace=True)
data.dropna(inplace=True)
# Defining the target variable (for simplicity, using Mortality Rate)
y = data['Mortality Rate']
# Independent variables (features)
X = data[['Health Expenditure Per Capita', 'Life Expectancy', 'Smoking Prevalence', 'Population Growth']]
# Step 3: Show Correlation Matrix
print("Correlation Matrix:")
correlation_matrix = X.corr()
print(correlation_matrix)
# Visualizing the correlation matrix
plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix of Health Data')
plt.show()
# Step 4: Split the Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Standardize the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Step 5: Apply Lasso Regression
lasso_model = Lasso(alpha=0.1)
lasso_model.fit(X_train_scaled, y_train)
# Predicting
y_pred = lasso_model.predict(X_test_scaled)
# Step 6: Evaluate the Model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("\nLasso Regression Coefficients:")
coefficients = pd.DataFrame({'Feature': X.columns, 'Coefficient': lasso_model.coef_})
print(coefficients)
print(f"\nPerformance Metrics:")
print(f"Mean Squared Error: {mse}")
print(f"R-Squared: {r2}")
# Step 7: Display Actual vs Predicted Mortality Rate
results = pd.DataFrame({
'Actual Mortality Rate': y_test,
'Predicted Mortality Rate': y_pred
})
print("\nActual vs Predicted Mortality Rate:")
print(results.head())
# Dynamic Analysis of Features Removed
print("\n--- Dynamic Analysis ---")
removed_features = coefficients[coefficients['Coefficient'] == 0]
if len(removed_features) > 0:
print(f"The following features were removed by Lasso regularization:\n{removed_features}")
else:
print("No features were removed by Lasso regularization.")
Output from the above code:
These relationships highlight the multicollinearity issues in the data, which makes it an ideal candidate for applying Lasso Regression to identify which features (like Smoking Prevalence) might be dropped during the regularization process.
Analysis of Lasso Regression Results
Lasso Regression Coefficients:
Performance Metrics:
Actual vs Predicted Mortality Rate:
Dynamic Analysis – Features Removed:
Interpretation:
Conclusion :
In this article, we demonstrated how Lasso Regression can be a powerful tool to handle multicollinearity while providing accurate predictions with a simpler, more interpretable model. By applying Lasso to real-time health data from the World Bank, we were able to not only predict mortality rates effectively but also identify and remove features that did not significantly contribute to the model, like Smoking Prevalence. This highlights the utility of Lasso for both feature selection and regularization.
Lasso allowed us to focus on the most impactful variables—Health Expenditure Per Capita, Life Expectancy, and Population Growth—while discarding irrelevant features, thereby simplifying the model and ensuring that it generalizes well to unseen data. With a high R-squared value, the model demonstrates strong predictive accuracy, proving Lasso's effectiveness in handling complex real-world datasets.
In the next article, we will explore Elastic Net Regression, which combines the strengths of both Lasso and Ridge Regression. Elastic Net is particularly useful when dealing with datasets that exhibit both multicollinearity and irrelevant features, making it a flexible and robust approach for even more challenging datasets.
Stay tuned as we dive deeper into Elastic Net Regression and uncover how it can enhance model performance in real-world applications!
MBA | Independent Investor??| Data Analyst | Derivatives Trading?? | Valuations and Financial Modeling | Equity Research Aspirant
1 个月This was incredibly insightful. I love the depth of information shared sir.