Multicollinearity

Multicollinearity refers to a high correlation between two or more predictor variables in a regression model. It occurs when there is a linear relationship between independent variables, making it difficult to determine the individual effect of each variable on the dependent variable. Multicollinearity can cause issues in regression analysis, such as unstable coefficient estimates, high standard errors, and difficulty in interpreting the importance of individual variables.
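To make the "unstable coefficient estimates" point concrete, here is a small synthetic sketch (the variable names, seeds, and noise scales are illustrative assumptions, not from any real dataset): when two predictors are nearly identical, the individual coefficients swing between resamples even though their sum stays stable.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # x2 is nearly identical to x1
y = 3 * x1 + rng.normal(scale=0.5, size=n)

# Fit OLS on two bootstrap resamples and compare the coefficients
for seed in (1, 2):
    idx = np.random.default_rng(seed).integers(0, n, n)
    Xb = np.column_stack([np.ones(n), x1[idx], x2[idx]])
    beta, *_ = np.linalg.lstsq(Xb, y[idx], rcond=None)
    # Individual coefficients on x1 and x2 vary wildly between resamples,
    # but their sum stays close to the true combined effect of 3
    print(beta[1], beta[2], beta[1] + beta[2])
```

Because x1 and x2 carry almost the same information, OLS can split the true effect between them almost arbitrarily; only their sum is well identified.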

################

1st Approach

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# X is a DataFrame of predictor variables; add a constant column
X = sm.add_constant(X)

# Calculate the VIF for each feature
vif = pd.DataFrame()
vif["Variable"] = X.columns
vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

# Check for variables with high VIF values (typically VIF > 5 indicates high multicollinearity)
high_vif = vif[vif["VIF"] > 5]

##############

2nd Approach

###############

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# For each predictor, fit an auxiliary regression of that predictor on all
# the others; the VIF is 1 / (1 - R^2) of this auxiliary regression.
# X is assumed to be a NumPy array of predictor variables.
vifs = []
for i in range(X.shape[1]):
    others = np.delete(X, i, axis=1)
    model = LinearRegression().fit(others, X[:, i])
    r2 = r2_score(X[:, i], model.predict(others))
    vifs.append(1.0 / (1.0 - r2))

# Check for variables with high VIF values (typically VIF > 5 indicates high multicollinearity)
high_vif = [i for i, v in enumerate(vifs) if v > 5]


##################


To detect multicollinearity in linear regression without using any dedicated libraries, you can examine the correlation matrix and calculate the variance inflation factor (VIF):

  1. Correlation matrix: calculate the correlation matrix of the predictor variables and check for high correlations (typically above 0.7 or 0.8). High correlations suggest potential multicollinearity issues.
  2. Variance inflation factor (VIF): the VIF measures the extent to which the variance of an estimated regression coefficient is increased due to multicollinearity. To calculate the VIF for each predictor variable X[i]:

  • Regress X[i] against all other predictor variables X[j] (where j ≠ i) and obtain the coefficient of determination (R^2).
  • Calculate the VIF as VIF[i] = 1 / (1 - R^2).

Variables with high VIF values (typically above 5) indicate high multicollinearity.

Here's an example to detect multicollinearity using these methods:

###################

3rd Approach

import numpy as np

# Calculate the correlation matrix of the predictors (X is a NumPy array)
corr_matrix = np.corrcoef(X, rowvar=False)

# Identify highly correlated pairs (|correlation| above 0.7, excluding the diagonal)
highly_correlated = np.argwhere(np.triu(np.abs(corr_matrix) > 0.7, k=1))

# Calculate the VIF for each variable by regressing it on all the others
n, p = X.shape
VIF = []
for i in range(p):
    # Auxiliary regression of X[:, i] on the remaining columns (with intercept)
    others = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
    beta, *_ = np.linalg.lstsq(others, X[:, i], rcond=None)
    residuals = X[:, i] - others @ beta
    r_squared = 1 - residuals.var() / X[:, i].var()
    VIF.append((i, 1 / (1 - r_squared)))

# Check for variables with high VIF values (typically VIF > 5 indicates high multicollinearity)
high_vif = [(i, v) for i, v in VIF if v > 5]
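A handy cross-check for any manual implementation: when each VIF is computed from an auxiliary regression, the result equals the corresponding diagonal entry of the inverse correlation matrix of the predictors. A minimal sketch on synthetic data (the variable names and scales are illustrative assumptions):

```python
import numpy as np

# Synthetic predictors: x2 is nearly a linear function of x1, x3 is independent
rng = np.random.default_rng(42)
x1 = rng.normal(size=500)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=500)
x3 = rng.normal(size=500)
X = np.column_stack([x1, x2, x3])

# Closed form: VIF_i is the i-th diagonal element of the inverse correlation matrix
R = np.corrcoef(X, rowvar=False)
vif = np.diag(np.linalg.inv(R))
# vif[0] and vif[1] are large (x1 and x2 are nearly collinear); vif[2] is close to 1
```

Comparing this closed form against the loop-based VIF above is a quick way to catch indexing or intercept mistakes.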


###################

