Violating Linear Regression Assumptions: A guide on what not to do


Parametric models have lost their sheen in the age of deep learning. But for smaller datasets, and when interpretability outweighs predictive power, models like linear and logistic regression still hold their own. While most data scientists evaluate at least some form of regression model at the start, these models are often discarded for not performing on par with non-parametric models; the fault, though, is not always the model's. This article is a brief overview of how models are often corrupted by violations of the assumptions below:

1. There is a linear relationship between X (the independent variables) and Y (the dependent variable).

2. The effects of the independent variables are additive.

3. No autocorrelation: the residual terms are uncorrelated, which translates to no relationship between individual data points.

4. No multicollinearity: the X variables are not correlated with each other.

5. Homoskedasticity: the error term has a constant variance.

6. Residuals are normally distributed (ideally; in practice you can often get away without it).

Terms you will see throughout the article:

y: actual value of the dependent variable

ŷ: predicted value of the dependent variable

residual error: y − ŷ


Now, let's look at how the violation of each of these assumptions can wreck your model.

1. Linear relationship

Due to the parametric nature of linear regression, we are limited to a straight-line relationship between X and Y. If the true relationship is non-linear, the conclusions drawn from the model are wrong, and this leads to a wide divergence between training and test performance.

How to identify:

1. When one of the X variables is non-linear: plot the residuals (y − ŷ) against each predictor x.
2. When there are too many variables, or Y itself is non-linear: plot the residuals (y − ŷ) against the fitted values (ŷ).

For example, looking at the plot below, there is clearly a violation of the assumption.

[Image: residual plot showing a clear non-linear pattern]
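To make this concrete, here is a minimal sketch of how such a residual-vs-predictor plot can be produced. It uses made-up synthetic data and assumes NumPy, scikit-learn and matplotlib are available; a curved band of residuals is the tell-tale sign of non-linearity.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Synthetic data where the true relationship is quadratic (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2 + 0.5 * X[:, 0] ** 2 + rng.normal(scale=2, size=200)

# Fit a plain straight-line model and inspect the residuals against the predictor.
model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

plt.scatter(X[:, 0], residuals, alpha=0.6)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("x")
plt.ylabel("residual (y - y_hat)")
plt.title("Residuals vs predictor: a curved pattern signals non-linearity")
plt.show()
```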


Dealing With It:

If the residual plots show signs of non-linearity, non-linear transformations of the predictors such as log X, √X, X^2 or X^4 can help build a better model. For example, using X^2 on the model above, we see that the non-linearity has eased; with more experimentation we can often find an even better transformation.

[Image: residual plot after the X^2 transformation, with the non-linearity largely gone]
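A sketch of the fix under the same hypothetical, synthetic setup: add the transformed feature and refit. If the transformation captures the curvature, the residual plot flattens out.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Same kind of synthetic, genuinely quadratic data as before (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2 + 0.5 * X[:, 0] ** 2 + rng.normal(scale=2, size=200)

# Refit with an added X^2 column; the residuals should lose their curved pattern.
X_aug = np.column_stack([X, X ** 2])
model = LinearRegression().fit(X_aug, y)
residuals = y - model.predict(X_aug)

plt.scatter(model.predict(X_aug), residuals, alpha=0.6)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("fitted values (y_hat)")
plt.ylabel("residual (y - y_hat)")
plt.show()
```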

2. Additive relationship between independent variables

This is perhaps the most violated assumption, and a primary reason why tree-based models outperform linear models at scale. Since the output of linear/logistic regression depends on the sum of the variables multiplied by their coefficients, the implicit assumption is that the effect of each variable on Y does not depend on the values of the others, which is rarely the case.

Think of it this way: suppose we are building a revenue model for Uber and discover that the coefficient for the number of cars available in a city is 1000, and the coefficient for the number of drivers is 100. Now Uber buys 100 more cars but hires no drivers. The model will predict 1000*100 in additional revenue; but alas, there is no one to drive them. This is called a synergy, or interaction, between variables, and it is perhaps the main reason why trees beat linear models.

How to identify:

As we saw in the example above, if we add 100 cars and no drivers, the linear model will still predict 1000*100 in additional revenue, an overestimate. But if we add 100 cars and 100 drivers, the model will underestimate the revenue. In short, along the X1 ≈ X2 line the residuals will be positive, whereas where X1 and X2 are very different the residuals will be negative.

[Image: residual plot showing the over- and under-estimation pattern described above]


How to Deal with it:

Add an interaction term (X1*X2) that captures the combined effect of X1 and X2, as sketched below.
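A minimal sketch of what that looks like in practice, on made-up synthetic data with hypothetical column names, assuming pandas and statsmodels are available. In the statsmodels formula API, `cars * drivers` expands to the two main effects plus their interaction term.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical synthetic data: revenue depends on cars and drivers jointly.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "cars": rng.integers(50, 500, size=300),
    "drivers": rng.integers(50, 500, size=300),
})
df["revenue"] = 10 * np.minimum(df["cars"], df["drivers"]) + rng.normal(scale=50, size=300)

# Purely additive model vs. one with the cars:drivers interaction included.
additive = smf.ols("revenue ~ cars + drivers", data=df).fit()
with_interaction = smf.ols("revenue ~ cars * drivers", data=df).fit()

print(additive.rsquared, with_interaction.rsquared)  # the interaction model should fit better
```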


3. Autocorrelation

The assumption here is that there is no autocorrelation, i.e. no correlation between the residual errors. Residual errors are just (y − ŷ), so this effectively means the errors of different rows are uncorrelated, or that the data points are independent of one another.

In the days of under- and over-sampling, you might wonder what difference this makes. From a pure prediction point of view, none. But if you care about statistical confidence intervals, it makes a lot of difference.

The width of a confidence interval shrinks as the number of observations grows. Think of it this way: the more points you have, the more confident you are about your model. Say we duplicated every observation: we would get exactly the same model, but the standard errors, and hence the confidence intervals, would shrink by a factor of √2, even though no new information was added, as the sketch below demonstrates.
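A minimal sketch of that effect, on synthetic data and assuming statsmodels is available: duplicating every row leaves the coefficients untouched but shrinks the reported standard errors by roughly √2.

```python
import numpy as np
import statsmodels.api as sm

# Synthetic data for a simple linear model with an intercept.
rng = np.random.default_rng(2)
X = sm.add_constant(rng.normal(size=(100, 1)))
y = X @ np.array([1.0, 2.0]) + rng.normal(size=100)

# Fit once on the original data, and once on the data stacked with an exact copy of itself.
fit_once = sm.OLS(y, X).fit()
fit_twice = sm.OLS(np.tile(y, 2), np.tile(X, (2, 1))).fit()

print(fit_once.params, fit_twice.params)   # identical coefficients
print(fit_once.bse / fit_twice.bse)        # ratio is close to sqrt(2) ≈ 1.41
```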

But again, this only affects the confidence intervals, so if you are after predictions rather than statistical inference, it won't make a difference.

4. Collinearity

Collinearity is the presence of highly correlated variables within X. Again, this is one of the violations that doesn't really hurt predictions much, but it makes a mess of the model's interpretability. This makes intuitive sense: if two variables are highly correlated, you cannot tell which one is actually explaining the variance in the dependent variable Y.

How to identify: check the pairwise correlations between predictors, or compute the variance inflation factor (VIF) for each of them, as in the sketch below.

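A sketch of both checks on hypothetical synthetic data, assuming pandas and statsmodels are available. A pairwise correlation close to ±1, or a VIF well above the usual ~5–10 rule of thumb, is the warning sign.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical predictors; x3 is almost a copy of x1, so the two are collinear.
rng = np.random.default_rng(3)
X = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
X["x3"] = X["x1"] + rng.normal(scale=0.05, size=200)

print(X.corr())  # pairwise correlations; |r| close to 1 is a red flag

# VIF regresses each predictor on the others (constant added so the baseline is fair).
X_const = sm.add_constant(X)
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(1, X_const.shape[1])],
    index=X.columns,
)
print(vif)  # values far above ~5-10 signal problematic collinearity
```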

How to Deal with it: Drop one of the correlated variables!

5. Homoskedasticity (Error term has a constant variance)

This is perhaps the silent killer among all the assumptions! Everything about linear regression inference, the hypothesis tests, the standard errors and the confidence intervals, depends on the assumption that the residual errors have constant variance.

How to identify it:

Pretty simple: plot the residuals against the fitted values (ŷ); there should not be any noticeable pattern, such as a funnel shape.

[Image: residual plot showing non-constant variance in the residuals]
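Beyond eyeballing the plot, a formal check such as the Breusch–Pagan test (available in statsmodels) can back up the visual impression. A minimal sketch on synthetic data where the error variance is assumed to grow with x:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Synthetic heteroskedastic data: the noise scale grows with x.
rng = np.random.default_rng(4)
x = rng.uniform(1, 10, size=300)
y = 3 + 2 * x + rng.normal(scale=x, size=300)

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

# Breusch-Pagan: a small p-value rejects the constant-variance hypothesis.
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, X)
print(lm_pvalue)
```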


How to deal with it:

A log transformation of the response works well most of the time.

[Image: residual plot after the log transformation, with roughly constant variance]
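A sketch of that fix under an assumed multiplicative-error setup: refit on log(y) and the heteroskedasticity largely disappears, here checked with the Breusch–Pagan p-value rather than a plot.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Synthetic data with multiplicative errors, so the raw-scale variance grows with the mean.
rng = np.random.default_rng(5)
x = rng.uniform(1, 10, size=300)
y = np.exp(1 + 0.3 * x + rng.normal(scale=0.3, size=300))

X = sm.add_constant(x)
raw_fit = sm.OLS(y, X).fit()
log_fit = sm.OLS(np.log(y), X).fit()

# A larger p-value after the log transform means less evidence of heteroskedasticity.
print(het_breuschpagan(raw_fit.resid, X)[1])
print(het_breuschpagan(log_fit.resid, X)[1])
```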


