"Dynamic Approach to Tackling Multicollinearity & Overfitting with Lasso Regression: Real-Time Health Data Insights"

"Dynamic Approach to Tackling Multicollinearity & Overfitting with Lasso Regression: Real-Time Health Data Insights"

Brief on Regularization on Coefficients in Regression Models

Regularization is a crucial technique in regression models designed to tackle the problems of overfitting and multicollinearity.

By adding a penalty term to the regression cost function, regularization discourages the model from becoming overly complex, thereby promoting simplicity and generalizability. This becomes particularly important when working with high-dimensional data, where many features might contribute little to the model and cause overfitting.

What Causes Overfitting?

  • Irrelevant Variables: Variables that do not provide useful information for the prediction but are included in the model.
  • Redundant Variables: Variables that provide the same information as other variables, leading to redundancy.
  • Complex Models: Models with too many parameters relative to the amount of training data.

How to Mitigate Overfitting:

  • Regularization: Techniques like Lasso and Ridge regression add a penalty term to the cost function, discouraging the model from fitting the noise in the training data.
  • Feature Selection: Removing irrelevant and redundant variables helps in simplifying the model.
  • Cross-Validation: Ensures that the model's performance is consistent across different subsets of the data.

Types of Regularization

1/ L1 Regularization (Lasso)

  • Penalty Term: Adds the absolute values of the coefficients to the cost function.
  • Effect: Shrinks some of the coefficients to exactly zero, effectively performing feature selection. It is a great method for automatically removing irrelevant features from the model.

2/ L2 Regularization (Ridge)

  • Penalty Term: Adds the squared values of the coefficients to the cost function.
  • Effect: Shrinks all coefficients towards zero, but none of them will be exactly zero. Ridge regression retains all variables but penalizes them according to their relevance.

  • Use Case: Ridge is suitable when you believe that all features contribute to the prediction, even if some only have a minor impact.

Refer previous article -Ridge Regression - on Linkedin : https://www.dhirubhai.net/pulse/mitigating-multicollinearity-health-data-ridge-ravichandran-c2flc/?trackingId=8BFwhWcpQdiMuul1l9rk%2BQ%3D%3D

3/ Elastic Net

  • Penalty Term: Combines both L1 (Lasso) and L2 (Ridge) penalties.
  • Effect: Shrinks some coefficients to zero (like Lasso), while reducing others (like Ridge). Elastic Net provides the benefits of both Lasso and Ridge, making it a more flexible option for different types of data.


Use Case: Elastic Net is useful when both irrelevant features and multicollinearity are concerns. It strikes a balance between Lasso's feature selection and Ridge's regularization of all features.

Benefits of Regularization

  1. Prevents Overfitting: Regularization prevents the model from learning the noise in the data by penalizing large coefficients. This makes the model simpler and less prone to overfitting, especially in high-dimensional spaces.
  2. Reduces Multicollinearity: When predictors are highly correlated, the model may suffer from inflated standard errors. Regularization helps by shrinking coefficients, reducing the model's sensitivity to correlated predictors.
  3. Improves Generalization: Regularized models tend to perform better on unseen test data because they are simpler and focus on the most important relationships, improving the model’s performance in real-world applications.

Elastic Net: A Hybrid Approach

Elastic Net combines the strengths of both Lasso and Ridge regression. It applies both L1 and L2 penalties, thus achieving feature selection (Lasso’s strength) while maintaining some flexibility by shrinking coefficients rather than eliminating them entirely (Ridge’s strength). Elastic Net is useful when you want to combine the benefits of both methods, particularly in datasets with many highly correlated variables.

Introduction to Lasso Regression

Lasso Regression (Least Absolute Shrinkage and Selection Operator) is a type of linear regression that introduces regularization in the model. This method is especially useful when the dataset contains multiple features, some of which may not contribute significantly to the model. By applying a penalty to the regression coefficients, Lasso encourages simpler models that avoid overfitting and reduce multicollinearity, ultimately improving generalization to new data.

The Need for Lasso Regression

As datasets grow in complexity, they often contain a large number of features (variables), many of which may be irrelevant or redundant. Traditional Ordinary Least Squares (OLS) regression models may overfit the data, meaning they perform well on the training data but poorly on unseen test data. Overfitting becomes more problematic when the features are highly correlated (multicollinearity), leading to unstable and unreliable coefficients.

Here are the main reasons for using Lasso regression:

  1. Feature Selection: Lasso automatically selects important features by shrinking the coefficients of less relevant features to zero. This leads to sparse models that include only the most significant predictors.
  2. Reducing Multicollinearity: Lasso helps address multicollinearity by penalizing large coefficients, ensuring that the model is not overly sensitive to small changes in the data.
  3. Preventing Overfitting: By regularizing the regression coefficients, Lasso ensures that the model doesn’t overfit the training data, improving its performance on new, unseen data.

Understanding Multicollinearity

Multicollinearity occurs when two or more predictor variables in a regression model are highly correlated. This can lead to inflated standard errors for the regression coefficients, making it difficult to assess the effect of each independent variable on the dependent variable.

Effects of Multicollinearity:

  • Unstable Coefficients: Small changes in the data can result in large changes in the regression coefficients.
  • Difficult Interpretation: It's challenging to determine the individual effect of each predictor on the outcome variable.
  • Reduced Predictive Power: The model may overfit the training data and perform poorly on test data.

Detecting Multicollinearity:

  • Correlation Matrix: A simple way to detect multicollinearity is by examining the correlation matrix. Highly correlated independent variables are indicators of multicollinearity.
  • Variance Inflation Factor (VIF): VIF quantifies how much the variance of the regression coefficients is inflated due to multicollinearity. A VIF value above 10 is typically considered problematic.

Steps to Apply Lasso Regression

  1. Data Preparation: Gather and clean your data, ensuring that all missing values are handled and features are properly scaled (especially for Lasso, as it’s sensitive to feature scaling).
  2. Train-Test Split: Divide the data into training and test sets (e.g., 70% for training and 30% for testing) to ensure that the model generalizes well to unseen data.
  3. Model Initialization: Initialize the Lasso regression model by specifying the penalty term (alpha). A larger alpha value leads to stronger regularization, shrinking coefficients more aggressively.
  4. Model Training: Fit the Lasso model to the training data.
  5. Feature Selection: Evaluate which features have non-zero coefficients. Lasso automatically selects important features by shrinking the coefficients of less relevant features to zero.
  6. Model Evaluation: Use metrics such as R-squared (for goodness-of-fit) and Mean Squared Error (MSE) to evaluate the performance of the model on the test data.
  7. Fine-tuning: Adjust the alpha parameter to find the optimal balance between underfitting and overfitting.

Lasso vs. Ridge Regression

Both Lasso and Ridge are regularization techniques, but they differ in how they penalize the regression coefficients.

When to Use Lasso or Ridge Regression?

  • Use Lasso: When you believe that many of your features are irrelevant or redundant. Lasso performs feature selection, leading to a simpler model with fewer predictors.
  • Use Ridge: When you believe that all your features contribute to the prediction, even if some have small effects. Ridge regression retains all features, simply reducing the size of their coefficients.

Lasso Regression is a powerful tool for regression models that require automatic feature selection and regularization to handle multicollinearity and prevent overfitting. It outperforms traditional linear regression models by creating simpler, more interpretable models. When compared with Ridge Regression, Lasso is preferred when you expect many irrelevant features, while Ridge is better for retaining all features but controlling their impact. Together with other advanced techniques, Lasso and Ridge play a vital role in modern machine learning applications.

Data Source:

The data was retrieved from the World Bank API (WBData). The indicators were related to health expenditure, life expectancy, smoking prevalence, mortality rate, and population growth in India from 1990 to 2023.

Details:

  • Data Source: World Bank API (wbdata)
  • Country: India
  • Time Period: 1990 - 2023 (33 Years of Data)
  • Target Variable (Dependent Variable): Mortality Rate -This is the health outcome we are trying to predict.
  • Independent Variables (Features):Health Expenditure Per Capita (USD): Measures the per capita spending on healthcare.
  • Life Expectancy (Years): The average number of years a person is expected to live.
  • Smoking Prevalence (%): Percentage of people who smoke.Population Growth (%): Annual population growth rate.

What the Below Python Code Does:

  1. Fetches Real-Time Health Data: The World Bank API retrieves real-time data for selected health indicators for India.
  2. Applies Lasso Regression: After splitting and standardizing the data, Lasso regression is applied to remove irrelevant features.
  3. Displays Correlation Matrix: It shows the correlation matrix of the independent variables to assess multicollinearity.
  4. Predicts Mortality Rate: Compares actual vs predicted mortality rates and prints them.
  5. Dynamic Analysis: The code dynamically analyzes and reports the features that Lasso regularization drops based on the coefficient values.


import wbdata
import pandas as pd
import datetime
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
import seaborn as sns
import matplotlib.pyplot as plt

# Step 1: Fetch Real-Time Health Data from World Bank API
start_date = datetime.datetime(1990, 1, 1)
end_date = datetime.datetime(2023, 1, 1)

# Select health-related indicators
indicators = {
    'SH.XPD.CHEX.PC.CD': 'Health Expenditure Per Capita',  # Health Expenditure
    'SP.DYN.LE00.IN': 'Life Expectancy',                   # Life Expectancy
    'SH.STA.SMSS.ZS': 'Smoking Prevalence',                # Smoking Prevalence
    'SH.DYN.MORT': 'Mortality Rate',                       # Mortality Rate
    'SP.POP.GROW': 'Population Growth'                     # Population Growth
}

# Fetching data for India (you can replace 'IN' with other country codes)
data = wbdata.get_dataframe(indicators, country='IN', date=(start_date, end_date))

# Step 2: Clean and Prepare the Data
data.reset_index(inplace=True)
data.dropna(inplace=True)

# Defining the target variable (for simplicity, using Mortality Rate)
y = data['Mortality Rate']

# Independent variables (features)
X = data[['Health Expenditure Per Capita', 'Life Expectancy', 'Smoking Prevalence', 'Population Growth']]

# Step 3: Show Correlation Matrix
print("Correlation Matrix:")
correlation_matrix = X.corr()
print(correlation_matrix)

# Visualizing the correlation matrix
plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix of Health Data')
plt.show()

# Step 4: Split the Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 5: Apply Lasso Regression
lasso_model = Lasso(alpha=0.1)
lasso_model.fit(X_train_scaled, y_train)

# Predicting
y_pred = lasso_model.predict(X_test_scaled)

# Step 6: Evaluate the Model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("\nLasso Regression Coefficients:")
coefficients = pd.DataFrame({'Feature': X.columns, 'Coefficient': lasso_model.coef_})
print(coefficients)

print(f"\nPerformance Metrics:")
print(f"Mean Squared Error: {mse}")
print(f"R-Squared: {r2}")

# Step 7: Display Actual vs Predicted Mortality Rate
results = pd.DataFrame({
    'Actual Mortality Rate': y_test,
    'Predicted Mortality Rate': y_pred
})
print("\nActual vs Predicted Mortality Rate:")
print(results.head())

# Dynamic Analysis of Features Removed
print("\n--- Dynamic Analysis ---")
removed_features = coefficients[coefficients['Coefficient'] == 0]
if len(removed_features) > 0:
    print(f"The following features were removed by Lasso regularization:\n{removed_features}")
else:
    print("No features were removed by Lasso regularization.")
        

Output from the above code:


  1. Health Expenditure Per Capita has a strong positive correlation with Life Expectancy (0.90), suggesting that increased spending on healthcare is associated with a longer life expectancy.
  2. Smoking Prevalence has a strong negative correlation with Population Growth (-0.98). This could indicate that in populations with higher smoking rates, growth rates may be lower, though this would require further investigation.
  3. Population Growth is negatively correlated with Health Expenditure Per Capita (-0.98) and Life Expectancy (-0.88), suggesting that higher population growth may be associated with lower health outcomes and expenditures.

These relationships highlight the multicollinearity issues in the data, which makes it an ideal candidate for applying Lasso Regression to identify which features (like Smoking Prevalence) might be dropped during the regularization process.


Analysis of Lasso Regression Results

Lasso Regression Coefficients:

  • Health Expenditure Per Capita: Coefficient of -3.416 suggests that an increase in health expenditure per capita has a negative effect on mortality rate, implying that higher spending may help reduce mortality.
  • Life Expectancy: Coefficient of -8.792 shows a strong negative impact, meaning that as life expectancy increases, mortality rate decreases significantly, which is expected.
  • Smoking Prevalence: This variable was removed by Lasso (coefficient = 0). This suggests that, in the presence of the other variables, smoking prevalence does not have a strong predictive impact on mortality rate for this dataset.
  • Population Growth: Coefficient of 5.737 shows a positive impact on mortality, implying that as population growth increases, mortality rates might rise, which could indicate the stress of higher population on health systems.

Performance Metrics:

  • Mean Squared Error (MSE): 7.79 is relatively low, indicating that the model's predictions are close to the actual mortality rates.
  • R-Squared: 0.98, which is very close to 1, shows that the model explains 98% of the variance in mortality rates. This indicates a highly accurate model.

Actual vs Predicted Mortality Rate:

  • The actual vs. predicted mortality rates show how closely the model's predictions align with reality. The predictions are very close to the actual values, further supporting the high R-squared value.

Dynamic Analysis – Features Removed:

  • Smoking Prevalence: Lasso identified that smoking prevalence does not contribute significantly to predicting mortality rate in this model, possibly due to multicollinearity with other variables (such as health expenditure) or because its effect is negligible when combined with other predictors.

Interpretation:

  • Lasso Regularization has effectively simplified the model by removing a feature (smoking prevalence) that does not provide much predictive power.
  • The model provides a highly accurate prediction of mortality rate based on health expenditure, life expectancy, and population growth. However, the removal of smoking prevalence indicates that, for this dataset, it does not offer additional explanatory power once the other variables are accounted for.

Conclusion :

In this article, we demonstrated how Lasso Regression can be a powerful tool to handle multicollinearity while providing accurate predictions with a simpler, more interpretable model. By applying Lasso to real-time health data from the World Bank, we were able to not only predict mortality rates effectively but also identify and remove features that did not significantly contribute to the model, like Smoking Prevalence. This highlights the utility of Lasso for both feature selection and regularization.

Lasso allowed us to focus on the most impactful variables—Health Expenditure Per Capita, Life Expectancy, and Population Growth—while discarding irrelevant features, thereby simplifying the model and ensuring that it generalizes well to unseen data. With a high R-squared value, the model demonstrates strong predictive accuracy, proving Lasso's effectiveness in handling complex real-world datasets.

In the next article, we will explore Elastic Net Regression, which combines the strengths of both Lasso and Ridge Regression. Elastic Net is particularly useful when dealing with datasets that exhibit both multicollinearity and irrelevant features, making it a flexible and robust approach for even more challenging datasets.

Stay tuned as we dive deeper into Elastic Net Regression and uncover how it can enhance model performance in real-world applications!

Shantanu Bhardwaj

MBA | Independent Investor??| Data Analyst | Derivatives Trading?? | Valuations and Financial Modeling | Equity Research Aspirant

1 个月

This was incredibly insightful. I love the depth of information shared sir.

要查看或添加评论,请登录

Vaidyanathan Ravichandran的更多文章

社区洞察

其他会员也浏览了