Multicollinearity
Multicollinearity refers to a high correlation between two or more predictor variables in a regression model. It occurs when there is a linear relationship between independent variables, making it difficult to determine the individual effect of each variable on the dependent variable. Multicollinearity can cause issues in regression analysis, such as unstable coefficient estimates, high standard errors, and difficulty in interpreting the importance of individual variables.
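Before looking at detection code, here is a quick synthetic sketch of the problem (all data below is invented for illustration): when two predictors are nearly identical, ordinary least squares can only pin down the sum of their coefficients, not each one individually, which is exactly the "unstable coefficient estimates" symptom described above.

```python
import numpy as np

# Invented data: x2 is almost an exact copy of x1
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)
y = 3 * x1 + rng.normal(scale=0.1, size=n)

# Ordinary least squares with an intercept column
X = np.column_stack([np.ones(n), x1, x2])
beta = np.linalg.lstsq(X, y, rcond=None)[0]

# Individually, beta[1] and beta[2] are poorly determined;
# only their sum is pinned down by the data.
print(beta[1] + beta[2])  # close to 3
```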
################
1st Approach
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Create a design matrix X and add a constant column
X = sm.add_constant(X)
# Calculate the VIF for each feature
vif = pd.DataFrame()
vif["Variable"] = X.columns
vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
# Check for variables with high VIF values (typically VIF > 5 indicates high multicollinearity)
high_vif = vif[vif["VIF"] > 5]
##############
2nd Approach
###############
import numpy as np
from sklearn.linear_model import LinearRegression
# Compute the VIF for each predictor by regressing it on the remaining
# predictors and using the resulting R-squared: VIF_i = 1 / (1 - R_i^2).
# Here X is a NumPy array of predictors (one column per variable).
vifs = []
for i in range(X.shape[1]):
    others = np.delete(X, i, axis=1)
    model = LinearRegression().fit(others, X[:, i])
    r2 = model.score(others, X[:, i])
    vifs.append(1 / (1 - r2))
# Check for variables with high VIF values (typically VIF > 5 indicates high multicollinearity)
high_vif = [i for i, vif in enumerate(vifs) if vif > 5]
##################
To detect multicollinearity without any dedicated statistics libraries, you can examine the correlation matrix and compute the variance inflation factor (VIF) yourself using only NumPy.
Here's an example that detects multicollinearity using these methods:
###################
3rd Approach
import numpy as np
# Calculate the correlation matrix of the predictors
corr_matrix = np.corrcoef(X, rowvar=False)
# Identify highly correlated pairs (absolute correlation above 0.7, off-diagonal)
highly_correlated = [(i, j)
                     for i in range(X.shape[1])
                     for j in range(i + 1, X.shape[1])
                     if abs(corr_matrix[i, j]) > 0.7]
# Calculate the VIF for each variable by regressing it on all the others:
# VIF_i = 1 / (1 - R_i^2)
VIF = []
for i in range(X.shape[1]):
    others = np.delete(X, i, axis=1)
    # Least squares with an intercept column
    A = np.column_stack([np.ones(X.shape[0]), others])
    coeffs, *_ = np.linalg.lstsq(A, X[:, i], rcond=None)
    residuals = X[:, i] - A @ coeffs
    # With an intercept, the residuals have zero mean, so R^2 follows
    # from the ratio of residual variance to total variance
    r_squared = 1 - residuals.var() / X[:, i].var()
    VIF.append((i, 1 / (1 - r_squared)))
# Check for variables with high VIF values (typically VIF > 5 indicates high multicollinearity)
high_vif = [(i, vif) for i, vif in VIF if vif > 5]
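This no-library approach can be sanity-checked on invented data: a column that nearly duplicates another should receive a very large VIF, while an independent column stays near 1. The `vif` helper below is a hypothetical repackaging of the same regress-on-the-others calculation:

```python
import numpy as np

# Invented data: x2 is almost a copy of x1, x3 is independent
rng = np.random.default_rng(42)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

def vif(X, i):
    """VIF of column i: regress it on the other columns."""
    others = np.delete(X, i, axis=1)
    A = np.column_stack([np.ones(X.shape[0]), others])
    coeffs, *_ = np.linalg.lstsq(A, X[:, i], rcond=None)
    resid = X[:, i] - A @ coeffs
    r_squared = 1 - resid.var() / X[:, i].var()
    return 1 / (1 - r_squared)

print([round(vif(X, i), 1) for i in range(3)])  # x1 and x2 large, x3 near 1
```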
###################