Multicollinearity - understanding the relationship between variables


Multicollinearity

Multicollinearity, or simply collinearity, is the phenomenon in which two or more independent features of a dataset are highly correlated while working on a linear model. In other words, if one of the independent features can be linearly predicted from the other independent features with substantial accuracy, this kind of dependence is called multicollinearity.

There are two types of multicollinearity:

  1. Structural multicollinearity is a mathematical artifact caused by creating new predictors from other predictors, such as creating the predictor x² (the square of x) from the predictor x.
  2. Data-based multicollinearity, on the other hand, is a result of a poorly designed experiment, reliance on purely observational data, or the inability to manipulate the system on which the data are collected.

Before we go deeper into multicollinearity, let’s understand the two basic concepts of Covariance and Correlation.

Covariance

We use covariance to measure how much one independent variable changes as another independent variable changes. Consider a dataset with three independent variables and one dependent variable. The dependent variable will obviously have some correlation with the three independent variables; covariance, however, describes the relationship of one independent variable with another independent variable.

In statistics, covariance is a measure of how much two random variables change together. If the two variables tend to increase and decrease together, the covariance is positive. If one variable tends to decrease as the other increases, and vice versa, the covariance is negative.

The value of covariance can be calculated using the below formula:

COV(X, Y) = E([X – E(X)]*[Y – E(Y)])

Where X and Y are the two variables, and E(X) and E(Y) are the mean values of X and Y respectively.
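To make the formula concrete, here is a minimal Python sketch (not from the original article; the data values are made up) that computes the covariance directly from the definition and checks it against NumPy's built-in estimate.

```python
import numpy as np

# Hypothetical example data: two variables that tend to move together
x = np.array([2.1, 2.5, 3.6, 4.0, 4.8])
y = np.array([8.0, 9.5, 11.2, 12.1, 14.3])

# Covariance from the definition: E[(X - E(X)) * (Y - E(Y))]
cov_manual = np.mean((x - x.mean()) * (y - y.mean()))

# NumPy's estimate (bias=True uses the same 1/n normalisation as the line above)
cov_numpy = np.cov(x, y, bias=True)[0, 1]

print(cov_manual, cov_numpy)  # both positive, since x and y increase together
```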

Correlation

Where covariance measures how much the value of one variable changes with a change in another variable, correlation refers to the extent to which two independent variables have a linear relationship with each other. It can be calculated by dividing the covariance of the two variables by the product of their standard deviations:

COR(X, Y) = COV(X, Y) / (SD(X) * SD(Y))

Where SD(X) & SD(Y) are standard deviations of X & Y respectively.

Usually, if the correlation between two independent variables is high (> 0.80), we drop one of the variables before training the model. There are many tools for determining the correlation between two variables: SAS provides PROC CORR and Excel has CORREL().
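For readers working in Python rather than SAS or Excel, a comparable check can be done with pandas. The sketch below uses a small synthetic dataset (the column names and coefficients are purely illustrative) to build the correlation matrix and flag predictor pairs above the 0.80 threshold.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset: predictor C is deliberately constructed to be collinear with B
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = 0.9 * b + 0.1 * rng.normal(size=200)
y = 2 * a + b + rng.normal(size=200)
df = pd.DataFrame({"A": a, "B": b, "C": c, "Y": y})

# Pairwise Pearson correlations (the pandas analogue of PROC CORR / CORREL)
corr = df.corr()
print(corr.round(2))

# Flag pairs of predictors whose absolute correlation exceeds 0.80
predictors = ["A", "B", "C"]
for i, first in enumerate(predictors):
    for second in predictors[i + 1:]:
        if abs(corr.loc[first, second]) > 0.80:
            print(f"High collinearity: {first} vs {second} ({corr.loc[first, second]:.2f})")
```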

[Image: correlation matrix of the independent variables and the dependent variable Y]

We can see from the above chart that the maximum correlation between independent variables is 0.87, between X3 and X6. The correlations of X3 and X6 with the dependent variable Y are 0.61 and 0.58 respectively, so X6 explains Y slightly less well and we can drop the column X6 from our computation.

Identifying multicollinearity

The Variance Inflation Factor (aka VIF) is the most commonly used collinearity diagnostic for identifying multicollinearity among the independent variables. In addition to VIF, there is another measure of collinearity, called Tolerance, reported by many statistical programs such as SPSS. The tolerance of a variable is calculated as 1 − R², where R² comes from regressing that variable on all the other independent variables.

A small tolerance value indicates that the variable is highly linearly related to the other independent variables in the equation and is a candidate for removal. All variables involved in the linear relationship will have a small tolerance. If a low tolerance value is accompanied by large standard errors and non-significant coefficients, multicollinearity may be an issue.
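As a rough illustration of how tolerance is obtained, the sketch below (an illustrative example on synthetic data, not a definitive implementation) regresses each predictor on the remaining ones with scikit-learn and reports 1 − R² for each column.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def tolerance(X, j):
    """Tolerance of predictor j: 1 - R^2 from regressing column j on the other columns."""
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    return 1.0 - r2

# Hypothetical predictor matrix: column 2 is nearly a copy of column 1
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
X[:, 2] = 0.95 * X[:, 1] + 0.05 * rng.normal(size=200)

for j in range(X.shape[1]):
    print(f"Tolerance of column {j}: {tolerance(X, j):.3f}")
# The two collinear columns get small tolerances, signalling a potential problem.
```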

Variance Inflation Factor (aka VIF)

The Variance Inflation Factor (VIF) measures the impact of collinearity among the variables in a regression model. VIF is 1/Tolerance, so it is always greater than or equal to 1. Many statistics programs report both an individual R² value for each predictor (distinct from the overall R² of the model) and its VIF. When those R² and VIF values are high for any of the variables in your model, multicollinearity is probably an issue.

It is important to understand that VIF ranges from 1 upwards and tells you (in decimal form) by what percentage the variance, i.e. the squared standard error, of each coefficient is inflated.

Example:

A VIF of 1.9 indicates that the variance of that particular coefficient is 90% larger than it would be if the predictor were uncorrelated with the other predictors.

Rules for identifying collinearity using the VIF technique:

  • VIF values near 1 indicate no collinearity between the predictor variables
  • VIF of >1 to 5 indicates moderate collinearity
  • VIF of >5 indicates serious collinearity

VIF values greater than 10 may indicate multicollinearity is unduly influencing your regression results.
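In Python, a common way to obtain these VIF values is statsmodels' variance_inflation_factor. The sketch below applies it to a small synthetic dataset (the column names and coefficients are illustrative only).

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Hypothetical predictors: C is close to a copy of B
rng = np.random.default_rng(2)
df = pd.DataFrame({"A": rng.normal(size=300), "B": rng.normal(size=300)})
df["C"] = 0.9 * df["B"] + 0.1 * rng.normal(size=300)

# VIF is usually computed with an intercept column included
exog = add_constant(df)
vif = pd.Series(
    [variance_inflation_factor(exog.values, i) for i in range(1, exog.shape[1])],
    index=df.columns,
)
print(vif.round(2))  # B and C show inflated values, while A stays near 1
```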

How to remove multicollinearity?

1.     Dropping an independent variable: when we find two variables with high collinearity between them, it is advisable not to use both of them when training the model. The question then is which one to drop and which one to keep. To make that decision, we usually look at each variable's correlation with the dependent variable, to determine which of the two explains the dependent variable better. The variable that explains the dependent variable less well can then be dropped from the equation (a small sketch of this appears after this list). This certainly comes at a price, as we are discarding one piece of information while training the model.

2.     Obtain more data: this is the preferred solution. More data can produce more precise parameter estimates (with lower standard errors), as can be seen from the formula for the variance of an estimated regression coefficient, which depends on both the sample size and the degree of multicollinearity.

3.     Shapley value: a tool from game theory that can be used to deal with multicollinearity. The Shapley value assigns an importance value to each predictor by assessing all possible combinations of predictors.

4.     If the correlated explanators are different lagged values of the same underlying explanator, then a distributed lag technique can be used, imposing a general structure on the relative values of the coefficients to be estimated.

5.     Try seeing what happens if you use independent subsets of your data for estimation and apply those estimates to the whole data set. Theoretically you should obtain somewhat higher variance from the smaller datasets used for estimation, but the expectation of the coefficient values should be the same. Naturally, the observed coefficient values will vary, but look at how much they vary.
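The sketch below illustrates point 1 above: given a pair of collinear predictors, it keeps the one that correlates more strongly with the dependent variable and drops the other (the helper function and column names are hypothetical, not from the article).

```python
import pandas as pd

def drop_weaker_of_pair(df: pd.DataFrame, a: str, b: str, target: str):
    """Of two collinear predictors a and b, drop the one less correlated with the target."""
    corr_a = abs(df[a].corr(df[target]))
    corr_b = abs(df[b].corr(df[target]))
    weaker = a if corr_a < corr_b else b
    return df.drop(columns=[weaker]), weaker

# Usage with the hypothetical frame from the earlier correlation sketch:
# reduced, dropped = drop_weaker_of_pair(df, "B", "C", "Y")
# print("Dropped:", dropped)
```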

Can multicollinearity be ignored?

So far we have been talking about the adverse effects of multicollinearity while training the model; in other words, we should avoid high VIF values when building models. However, there are scenarios where we can actually live with high VIF values:

  • The variables with high VIF values are dummy variables that represent a categorical feature with three or more categories.
  • The variables with high VIFs are control variables, and the variables of interest do not have high VIFs.
  • The high VIFs are caused by the inclusion of powers or products of other variables.

 

