How to Deal with Multicollinearity?

In the age of AutoML, drag-and-drop applications, and quick-fix models, it is easy to overlook a major factor that can skew your model's results.


Multiple linear regression does not work equally well in all situations. We will run into problems if the underlying model is not linear or if we have heteroscedasticity, clustering, or outliers.


In this article, we will discuss techniques for addressing the first of two common problems that can skew the results of multiple linear regression even when the model is linear and homoscedastic, with no clustering or outliers. These problems are:

1. Multicollinearity

When some of the predictor variables are too similar or too closely correlated with one another, it becomes very difficult to sort out their separate effects. In this blog, we will examine how to identify and address multicollinearity.

2. Model specification issues

Because the various x variables interact with one another, regression results can change significantly based on which variables are included in the model. We must use care in choosing which predictor variables to include in order to get the most useful results for analysis and prediction. In my next blog, we will examine techniques for assessing which variables should and should not be included in the model.

Multicollinearity occurs when two or more of the predictor (x) variables are correlated with each other. When one or more predictors move together, it is difficult for multiple regression to distinguish between their individual effects. This may affect your estimated coefficients in several ways.

1. High Standard Errors

The standard errors for the coefficients will be inflated, which may result in higher p-values for the hypothesis test of significance on the individual coefficients.

In extreme cases, the model may be highly significant overall, yet none of the individual coefficients passes the test of significance.
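A quick NumPy sketch of this effect (the sample size, coefficients, and noise levels are illustrative assumptions, not from the original post): the standard error of the x1 coefficient balloons when the second predictor is nearly collinear with x1, compared with an independent control.

```python
import numpy as np

def coef_se(X, y):
    """OLS coefficient standard errors: sqrt(diag(sigma^2 * (X'X)^-1))."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)
    return np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)
x_indep = rng.normal(size=n)              # control unrelated to x1
x_coll = x1 + 0.05 * rng.normal(size=n)   # control nearly collinear with x1
y = x1 + rng.normal(size=n)

ones = np.ones(n)
se_indep = coef_se(np.column_stack([ones, x1, x_indep]), y)
se_coll = coef_se(np.column_stack([ones, x1, x_coll]), y)
print("SE of x1 coefficient, independent control:", se_indep[1])
print("SE of x1 coefficient, collinear control:  ", se_coll[1])
# The collinear version is roughly 20x larger, so the t statistic shrinks
# and the p-value rises even though the data-generating process is the same.
```

The inflation factor here is about 1/sqrt(1 − r²), where r is the correlation between the two predictors, which is why near-perfect correlation makes individual coefficients look insignificant.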

2. Incorrect Signs

One or more of the coefficient estimates may have a sign that is inconsistent with intuition. For example, a model may indicate that the less satisfied customers are with service levels, the more satisfied they are overall. This does not make logical sense and is a clue that satisfaction with customer service levels is likely correlated with another variable that is skewing its estimated impact.

3. Instability

When predictor variables are correlated, the estimated coefficients can change wildly as variables are added to or dropped from the model. This is because multiple linear regression estimates the impact of a given predictor variable while controlling for (or holding constant) the other predictor variables in the model. In a model in which the various predictor variables are related to one another, this results in a very different interpretation than if you were looking at the impact of any variable alone.
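As a simulated sketch of this instability (the data-generating model and its coefficients are assumptions for illustration): when x2 is dropped from the model, the x1 coefficient absorbs part of x2's effect and swings away from its true value.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.3 * rng.normal(size=n)      # strongly correlated with x1
y = 2.0 * x1 + 1.0 * x2 + rng.normal(size=n)  # assumed "true" model

def ols_slopes(y, *cols):
    """OLS slope estimates (intercept included in the fit, then dropped)."""
    A = np.column_stack([np.ones(len(y))] + list(cols))
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[1:]

b_alone = ols_slopes(y, x1)       # x2 omitted from the model
b_joint = ols_slopes(y, x1, x2)   # both predictors included
print("x1 alone:   ", b_alone)    # near 2 + 1 * 0.9 = 2.9 (absorbs x2's effect)
print("x1 with x2: ", b_joint)    # x1 coefficient near the true value of 2
```

The same coefficient tells two different stories depending on what else is in the model, which is exactly why interpretation requires care here.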



This is shown by the omitted variable bias theorem. If a true model includes two related predictor variables (x1 and x2) and one of those variables (x2) is left out, the coefficient estimate of the remaining variable (x1) will be biased. This bias occurs because the model compensates for the missing x2 variable by either overestimating or underestimating the effect of x1. The direction of the bias will depend on the sign of the correlation between x1 and x2 and the sign of β2 (the regression coefficient of x2).
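A numerical check of this theorem (simulated data; the specific coefficients are assumptions): if the true model is y = β1·x1 + β2·x2 + ε and x2 is omitted, the short-regression estimate converges to β1 + β2·δ, where δ is the slope from regressing x2 on x1.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)   # delta = 0.5: slope of x2 on x1
beta1, beta2 = 2.0, 3.0              # assumed true coefficients
y = beta1 * x1 + beta2 * x2 + rng.normal(size=n)

# "short" regression: y on x1 only, with x2 omitted
b1_short = np.cov(x1, y)[0, 1] / np.var(x1, ddof=1)
delta = np.cov(x1, x2)[0, 1] / np.var(x1, ddof=1)

print(b1_short)               # close to beta1 + beta2 * delta = 3.5
print(beta1 + beta2 * delta)  # the theorem's predicted (biased) value
```

Because β2 and δ are both positive here, the bias is upward; flipping the sign of either one would push the x1 estimate below its true value instead.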


While multicollinearity affects the stability and accuracy of coefficient estimates, it does not affect the results when the model is used for prediction purposes alone.
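A common way to identify multicollinearity in practice is the variance inflation factor (VIF): regress each predictor on all the others and compute 1 / (1 − R²); values above roughly 5 to 10 are usually taken as a warning sign. A minimal NumPy sketch (the toy data are illustrative assumptions):

```python
import numpy as np

def vif(X):
    """VIF of each column: 1 / (1 - R^2) from regressing that column
    on all the other columns (with an intercept)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for j in range(k):
        target = X[:, j]
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ beta
        ss_res = resid @ resid
        ss_tot = ((target - target.mean()) ** 2).sum()
        out.append(ss_tot / ss_res)   # = 1 / (1 - R^2)
    return out

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)   # nearly collinear with x1
x3 = rng.normal(size=200)              # unrelated predictor
vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs)   # x1 and x2 blow past the usual 5-10 warning level; x3 stays near 1
```

In production work you would typically reach for `variance_inflation_factor` from statsmodels rather than rolling your own, but the calculation underneath is exactly this.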

Let's look at an example. Suppose we have three observations in which

[Image: table of three observations; source: Wikipedia]

x2 = 2x1, so the two variables are perfectly correlated. Suppose that the true regression model is ŷ = 2x1 + x2. However, there are many other models that could potentially fit this same data set, such as:

ŷ = 4x1 (with no effect of x2)

ŷ = 2x2 (with no effect of x1)

ŷ = 6x1 – x2 (where x1 has a positive effect and x2 has a negative effect)

With all of these models, the overall model will give you the correct prediction of y, but the interpretation of the individual coefficients is completely different.

What happens if we try to extrapolate beyond our data set by predicting y if both x1 and x2 equal one? The correct answer according to the true regression model is that y equals three, but the other models give estimates of four, two, and five.

This extreme example shows that even with perfectly correlated variables, a model can still be used for prediction purposes as long as you do not extrapolate beyond the established range of data. However, clients are often interested not only in prediction but also in interpretation and examination of the effects of individual variables.
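The example above can be sketched directly (the specific observed values are an assumption, chosen only to satisfy x2 = 2x1): every candidate model reproduces the observed data exactly, yet they disagree as soon as we leave the x2 = 2x1 line.

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0])
x2 = 2.0 * x1                  # perfectly collinear with x1
y = 2.0 * x1 + 1.0 * x2        # the assumed "true" model: y = 2*x1 + x2

# (coefficient on x1, coefficient on x2) for each candidate model
models = {
    "y = 2*x1 + x2": (2.0, 1.0),
    "y = 4*x1":      (4.0, 0.0),
    "y = 2*x2":      (0.0, 2.0),
    "y = 6*x1 - x2": (6.0, -1.0),
}

at_11 = {}
for name, (b1, b2) in models.items():
    fits = np.allclose(b1 * x1 + b2 * x2, y)   # perfect fit in-sample?
    at_11[name] = b1 * 1.0 + b2 * 1.0          # extrapolate to x1 = x2 = 1
    print(f"{name}: fits data = {fits}, prediction at (1, 1) = {at_11[name]}")
# Every model fits the observed data exactly, but at (1, 1) they predict
# 3, 4, 2, and 5 respectively -- only the true model gives 3.
```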

When choosing variables, we also need to consider the business significance of each variable, over and above any multicollinearity concerns.

This article gives an understanding of multicollinearity; as a best practice, present clients with a model that is not overly complicated by correlated predictor variables.

Agree? Kindly comment with your experience dealing with multicollinearity.

Please subscribe to the weekly newsletter for more insights like this every Wednesday.


#datascience #machinelearning #deeplearning #artificialintelligence
