Multicollinearity

Multicollinearity refers to a high correlation between two or more predictor variables in a regression model. It occurs when there is a linear relationship between independent variables, making it difficult to determine the individual effect of each variable on the dependent variable. Multicollinearity can cause issues in regression analysis, such as unstable coefficient estimates, high standard errors, and difficulty in interpreting the importance of individual variables.
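To make the "unstable coefficient estimates" point concrete, here is a small synthetic sketch (the variable names, seeds, and noise scales are illustrative assumptions, not from any real dataset): when two predictors are nearly identical, the individual coefficients swing between resamples even though their sum stays stable.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # x2 is nearly identical to x1
y = 3 * x1 + rng.normal(scale=0.5, size=n)

# Fit OLS on two bootstrap resamples and compare the coefficients
for seed in (1, 2):
    idx = np.random.default_rng(seed).integers(0, n, n)
    Xb = np.column_stack([np.ones(n), x1[idx], x2[idx]])
    beta, *_ = np.linalg.lstsq(Xb, y[idx], rcond=None)
    # Individual coefficients on x1 and x2 vary wildly between resamples,
    # but their sum stays close to the true combined effect of 3
    print(beta[1], beta[2], beta[1] + beta[2])
```

Because x1 and x2 carry almost the same information, OLS can split the true effect between them almost arbitrarily; only their sum is well identified.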

################

1st Approach

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# X is a DataFrame of predictor variables; add a constant column
X = sm.add_constant(X)

# Calculate the VIF for each feature
vif = pd.DataFrame()
vif["Variable"] = X.columns
vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

# Check for variables with high VIF values (typically VIF > 5 indicates high multicollinearity)
high_vif = vif[vif["VIF"] > 5]

##############

2nd Approach

###############

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# For each predictor, fit an auxiliary regression of that predictor on all
# the others; the VIF is 1 / (1 - R^2) of this auxiliary regression.
# X is assumed to be a NumPy array of predictor variables.
vifs = []
for i in range(X.shape[1]):
    others = np.delete(X, i, axis=1)
    model = LinearRegression().fit(others, X[:, i])
    r2 = r2_score(X[:, i], model.predict(others))
    vifs.append(1.0 / (1.0 - r2))

# Check for variables with high VIF values (typically VIF > 5 indicates high multicollinearity)
high_vif = [i for i, v in enumerate(vifs) if v > 5]


##################


To detect multicollinearity in linear regression without using any dedicated libraries, you can examine the correlation matrix and calculate the variance inflation factor (VIF):

  1. Correlation matrix: calculate the correlation matrix of the predictor variables and check for high correlations (typically above 0.7 or 0.8). High correlations suggest potential multicollinearity issues.
  2. Variance inflation factor (VIF): the VIF measures the extent to which the variance of an estimated regression coefficient is increased due to multicollinearity. To calculate the VIF for each predictor variable X[i]:

  • Regress X[i] against all other predictor variables X[j] (where j ≠ i) and obtain the coefficient of determination (R^2).
  • Calculate the VIF as VIF[i] = 1 / (1 - R^2).

Variables with high VIF values (typically above 5) indicate high multicollinearity.

Here's an example to detect multicollinearity using these methods:

###################

3rd Approach

import numpy as np

# Calculate the correlation matrix of the predictors (X is a NumPy array)
corr_matrix = np.corrcoef(X, rowvar=False)

# Identify highly correlated pairs (|correlation| above 0.7, excluding the diagonal)
highly_correlated = np.argwhere(np.triu(np.abs(corr_matrix) > 0.7, k=1))

# Calculate the VIF for each variable by regressing it on all the others
n, p = X.shape
VIF = []
for i in range(p):
    # Auxiliary regression of X[:, i] on the remaining columns (with intercept)
    others = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
    beta, *_ = np.linalg.lstsq(others, X[:, i], rcond=None)
    residuals = X[:, i] - others @ beta
    r_squared = 1 - residuals.var() / X[:, i].var()
    VIF.append((i, 1 / (1 - r_squared)))

# Check for variables with high VIF values (typically VIF > 5 indicates high multicollinearity)
high_vif = [(i, v) for i, v in VIF if v > 5]
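A handy cross-check for any manual implementation: when each VIF is computed from an auxiliary regression, the result equals the corresponding diagonal entry of the inverse correlation matrix of the predictors. A minimal sketch on synthetic data (the variable names and scales are illustrative assumptions):

```python
import numpy as np

# Synthetic predictors: x2 is nearly a linear function of x1, x3 is independent
rng = np.random.default_rng(42)
x1 = rng.normal(size=500)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=500)
x3 = rng.normal(size=500)
X = np.column_stack([x1, x2, x3])

# Closed form: VIF_i is the i-th diagonal element of the inverse correlation matrix
R = np.corrcoef(X, rowvar=False)
vif = np.diag(np.linalg.inv(R))
# vif[0] and vif[1] are large (x1 and x2 are nearly collinear); vif[2] is close to 1
```

Comparing this closed form against the loop-based VIF above is a quick way to catch indexing or intercept mistakes.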


###################

