Violating Linear Regression Assumptions: A guide on what not to do
Parametric models have lost their sheen in the age of Deep Learning. But for smaller datasets, and when interpretability outweighs predictive power, models like linear and logistic regression still hold sway. While most data scientists evaluate at least some form of regression model at the start, these models are generally discarded for not performing at par with non-parametric models; the fault, though, is not always the model's. This article is a brief overview of how models are often corrupted by violating the assumptions below:
1. X and Y have a linear relationship
2. Independent variables have an additive effect on Y (no interactions)
3. No autocorrelation (no relationship between residual terms, which translates to no relationship between datapoints)
4. No multicollinearity (the X variables are not correlated with each other)
5. Homoskedasticity (Error term has a constant variance)
6. Residuals are normally distributed (ideally they should be; but meh, you can often get away without it)
Terms you will see throughout the article:
y : Actual value of dependent variable
y^ : Predicted value of dependent variable
residual error: y - y^
Now, let's have a look at how violating each of these assumptions can wreck your model.
1. Linear relationship
Due to the parametric nature of linear regression, we are limited to a straight-line relationship between X and Y. If the true relationship is non-linear, the conclusions drawn from the model are wrong, and this leads to a wide divergence between training and test performance.
How to identify:
Plot the residuals against the fitted values y^. If the points form a clear pattern (for example, a U-shape) instead of a random scatter, the linearity assumption is being violated.
How to deal with it:
If the residual plots show signs of non-linearity, simple non-linear transformations of the predictors, such as log X, √X, X^2, or X^4, help in building a better model. For example, adding an X^2 term to the above model eases the non-linearity; with more experimentation we can often find an even better transformation.
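A minimal sketch of this workflow, with a single simulated predictor X and response y (the data and variable names here are purely illustrative, not from the article):

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Illustrative data with a quadratic (non-linear) relationship
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, 200)
y = 2 + 0.5 * X**2 + rng.normal(0, 3, 200)

# Plain linear fit: residuals vs fitted values show a clear U-shape
lin_fit = sm.OLS(y, sm.add_constant(X)).fit()
plt.scatter(lin_fit.fittedvalues, lin_fit.resid)
plt.xlabel("fitted values"); plt.ylabel("residuals"); plt.show()

# Add an X^2 term: the residual pattern largely disappears
X_quad = sm.add_constant(np.column_stack([X, X**2]))
quad_fit = sm.OLS(y, X_quad).fit()
plt.scatter(quad_fit.fittedvalues, quad_fit.resid)
plt.xlabel("fitted values"); plt.ylabel("residuals"); plt.show()
```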
2. Additive relationship between independent variables
This is perhaps the most violated assumption, and a primary reason why tree-based models so often outperform linear models. Since the output of linear/logistic regression is a sum of the variables multiplied by their coefficients, the implicit assumption is that the effect of each variable on Y does not depend on the values of the others. Which is rarely the case.
Think of it this way: suppose we are building a revenue model for Uber and we discover that the coefficient for the number of cars available in a city is 1000, while the coefficient for the number of drivers in a city is 100. Now Uber buys 100 more cars but hires no drivers. The model will predict 1000*100 in additional revenue; but alas, there is nobody to drive the new cars. This is called a synergy/interaction between variables, and it is perhaps the main reason trees beat linear models.
How to identify:
As we saw in the above example, if we add cars without drivers, the linear model will still credit 1000 per car and overestimate revenue. But if we add cars and drivers in balance, the model will underestimate revenue because it misses the synergy. In short, along the line X1 = X2 (balanced inputs) the residuals will tend to be positive, whereas where X1 and X2 are imbalanced the residuals will tend to be negative.
How to Deal with it:
Add an interaction term (X1*X2) to the model so that it captures the combined effect of X1 and X2.
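A hedged sketch of adding an interaction term with statsmodels' formula API; the column names (cars, drivers, revenue) and the simulated numbers are made up for the Uber-style example above:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy data with a synergy effect: revenue depends on cars * drivers
rng = np.random.default_rng(1)
df = pd.DataFrame({"cars": rng.integers(50, 500, 300),
                   "drivers": rng.integers(50, 500, 300)})
df["revenue"] = (10 * df["cars"] + 5 * df["drivers"]
                 + 0.2 * df["cars"] * df["drivers"]
                 + rng.normal(0, 500, 300))

# Purely additive model vs a model with an interaction term
additive = smf.ols("revenue ~ cars + drivers", data=df).fit()
interact = smf.ols("revenue ~ cars * drivers", data=df).fit()  # expands to cars + drivers + cars:drivers

print(additive.rsquared, interact.rsquared)  # the interaction model fits markedly better
```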
3. Autocorrelation
Autocorrelation refers to correlation between the residual errors. A residual error is just (y - y^), so correlated residuals effectively mean the observations (rows) are related to one another rather than independent.
In the days of under- and over-sampling, you might wonder what difference this makes. From a prediction point of view, none. But if you care about statistical inference and confidence intervals, it makes a lot of difference.
Confidence intervals narrow as the number of observations grows. Think of it this way: the more points you have, the more confident you are about your model. Say we duplicated all the observations: we would get the exact same fitted model, but our confidence intervals would shrink by a factor of √2, even though we have gained no new information.
But again, this only affects the confidence interval, so if you are looking for predictions and not statistical confidence, this won't make a difference.
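A quick sketch of the duplication effect described above, using simulated data (the data and variable names are illustrative):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
X = sm.add_constant(rng.normal(size=200))
y = 3 + 2 * X[:, 1] + rng.normal(size=200)

original = sm.OLS(y, X).fit()
# Duplicate every row: the rows are now perfectly correlated in pairs
doubled = sm.OLS(np.tile(y, 2), np.tile(X, (2, 1))).fit()

print(original.params, doubled.params)  # identical coefficients
print(original.bse / doubled.bse)       # standard errors differ by roughly sqrt(2)
```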
4. Collinearity
Collinearity is the presence of highly correlated variables within X. Again, this doesn't really hurt prediction to a great extent, but it makes a mess of the model's interpretability. This makes intuitive sense: if two variables are highly correlated, you can't tell which of them is actually explaining the variance in the dependent variable Y, and the standard errors of their coefficients get inflated.
How to identify:
Look at the pairwise correlation matrix of the predictors, or compute the variance inflation factor (VIF) for each variable; a VIF above roughly 5-10 is usually taken as a warning sign.
How to deal with it: Drop it! That is, drop one of the correlated variables (or combine them into a single feature) and refit the model.
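A sketch of the VIF check using statsmodels; the DataFrame and its columns (x1, x2, x3) are hypothetical, constructed so that two of them are nearly duplicates:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical design matrix with two highly correlated columns
rng = np.random.default_rng(3)
x1 = rng.normal(size=300)
df = pd.DataFrame({"x1": x1,
                   "x2": x1 + rng.normal(scale=0.05, size=300),  # near-duplicate of x1
                   "x3": rng.normal(size=300)})

print(df.corr())  # x1 and x2 correlate at ~0.99

# VIF above ~5-10 is usually taken as a sign of problematic collinearity
X = sm.add_constant(df)
for i, col in enumerate(X.columns):
    print(col, variance_inflation_factor(X.values, i))  # ignore the constant's VIF
```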
5. Homoskedasticity (Error term has a constant variance)
This is perhaps the silent killer among all the assumptions! Everything about inference in linear regression (the hypothesis tests, the standard errors, and the confidence intervals) depends on the assumption that the residual errors have a constant variance.
How to identify it:
Pretty simple: plot the residuals against the fitted values y^. There should not be any noticeable pattern, such as a funnel shape where the spread of the residuals grows with the fitted value.
How to deal with it:
A log transformation of the response (modelling log Y instead of Y) works well most of the time, since it damps down the larger values where the variance is usually highest.
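A minimal sketch of both the check and the fix, using simulated heteroskedastic data (the data and names are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Simulated data whose noise grows with X (classic funnel shape)
rng = np.random.default_rng(4)
X = rng.uniform(1, 10, 300)
y = np.exp(1 + 0.3 * X + rng.normal(0, 0.3, 300))  # multiplicative noise

# Raw fit: residual spread increases with the fitted value
raw = sm.OLS(y, sm.add_constant(X)).fit()
plt.scatter(raw.fittedvalues, raw.resid); plt.title("y ~ X"); plt.show()

# Log-transforming the response stabilises the variance
logged = sm.OLS(np.log(y), sm.add_constant(X)).fit()
plt.scatter(logged.fittedvalues, logged.resid); plt.title("log(y) ~ X"); plt.show()
```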