Violating Linear Regression Assumptions: A guide on what not to do
Parametric models have lost their sheen in the age of Deep Learning. But for smaller datasets, and when interpretability outweighs predictive power, models like linear and logistic regression still hold sway. While most data scientists evaluate at least some form of regression model at the start, these models are generally discarded for not performing at par with non-parametric models; the fault, though, is not always the model's. This article is a brief overview of how models are often corrupted by violating the assumptions below:
1. X and Y have a linear relationship
2. Independent variables have an additive effect on Y (no interactions)
3. No autocorrelation (no relationship between residual terms, which translates to no relationship between datapoints)
4. No multicollinearity (the X variables are not correlated with each other)
5. Homoskedasticity (Error term has a constant variance)
6. Residuals are normally distributed (ideally they should be; but meh, you can often get away without it)
Terms you will see throughout the article:
y : Actual value of dependent variable
y^ : Predicted value of dependent variable
residual error: y - y^
Now, let's have a look at how violating each of these assumptions can wreck your model.
1. Linear relationship
Due to the parametric nature of linear regression, we are limited to a straight-line relationship between X and Y. If the true relationship is non-linear, the conclusions drawn from the model are wrong, and this leads to a wide divergence between training and test performance.
How to identify:
Plot the residuals against the fitted values y^. If the points form a clear pattern (for example, a U-shape) instead of a random scatter, the linearity assumption is being violated.
How to deal with it:
If the residual plots show signs of non-linearity, simple non-linear transformations of the predictors, such as log X, √X, X^2, or X^4, help in building a better model. For example, adding an X^2 term to the above model eases the non-linearity; with more experimentation we can often find an even better transformation.
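A minimal sketch of this workflow, with a single simulated predictor X and response y (the data and variable names here are purely illustrative, not from the article):

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Illustrative data with a quadratic (non-linear) relationship
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, 200)
y = 2 + 0.5 * X**2 + rng.normal(0, 3, 200)

# Plain linear fit: residuals vs fitted values show a clear U-shape
lin_fit = sm.OLS(y, sm.add_constant(X)).fit()
plt.scatter(lin_fit.fittedvalues, lin_fit.resid)
plt.xlabel("fitted values"); plt.ylabel("residuals"); plt.show()

# Add an X^2 term: the residual pattern largely disappears
X_quad = sm.add_constant(np.column_stack([X, X**2]))
quad_fit = sm.OLS(y, X_quad).fit()
plt.scatter(quad_fit.fittedvalues, quad_fit.resid)
plt.xlabel("fitted values"); plt.ylabel("residuals"); plt.show()
```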
2. Additive relationship between independent variables
This is perhaps the most violated assumption, and a primary reason why tree-based models so often outperform linear models. Since the output of linear/logistic regression is a sum of the variables multiplied by their coefficients, the implicit assumption is that the effect of each variable on Y does not depend on the values of the others. Which is rarely the case.
Think of it this way: suppose we are building a revenue model for Uber and we discover that the coefficient for the number of cars available in a city is 1000, while the coefficient for the number of drivers in a city is 100. Now Uber buys 100 more cars but hires no drivers. The model will predict 1000*100 in additional revenue; but alas, there is nobody to drive the new cars. This is called a synergy/interaction between variables, and it is perhaps the main reason trees beat linear models.
How to identify:
As we saw in the above example, if we add cars without drivers, the linear model will still credit 1000 per car and overestimate revenue. But if we add cars and drivers in balance, the model will underestimate revenue because it misses the synergy. In short, along the line X1 = X2 (balanced inputs) the residuals will tend to be positive, whereas where X1 and X2 are imbalanced the residuals will tend to be negative.
How to Deal with it:
Add an interaction term (X1*X2) to the model so that it captures the combined effect of X1 and X2.
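A hedged sketch of adding an interaction term with statsmodels' formula API; the column names (cars, drivers, revenue) and the simulated numbers are made up for the Uber-style example above:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy data with a synergy effect: revenue depends on cars * drivers
rng = np.random.default_rng(1)
df = pd.DataFrame({"cars": rng.integers(50, 500, 300),
                   "drivers": rng.integers(50, 500, 300)})
df["revenue"] = (10 * df["cars"] + 5 * df["drivers"]
                 + 0.2 * df["cars"] * df["drivers"]
                 + rng.normal(0, 500, 300))

# Purely additive model vs a model with an interaction term
additive = smf.ols("revenue ~ cars + drivers", data=df).fit()
interact = smf.ols("revenue ~ cars * drivers", data=df).fit()  # expands to cars + drivers + cars:drivers

print(additive.rsquared, interact.rsquared)  # the interaction model fits markedly better
```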
3. Autocorrelation
Autocorrelation refers to correlation between the residual errors. A residual error is just (y - y^), so correlated residuals effectively mean the observations (rows) are related to one another rather than independent.
In the days of under- and over-sampling, you might wonder what difference this makes. From a prediction point of view, none. But if you care about statistical inference and confidence intervals, it makes a lot of difference.
Confidence intervals narrow as the number of observations grows. Think of it this way: the more points you have, the more confident you are about your model. Say we duplicated all the observations: we would get the exact same fitted model, but our confidence intervals would shrink by a factor of √2, even though we have gained no new information.
But again, this only affects the confidence interval, so if you are looking for predictions and not statistical confidence, this won't make a difference.
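A quick sketch of the duplication effect described above, using simulated data (the data and variable names are illustrative):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
X = sm.add_constant(rng.normal(size=200))
y = 3 + 2 * X[:, 1] + rng.normal(size=200)

original = sm.OLS(y, X).fit()
# Duplicate every row: the rows are now perfectly correlated in pairs
doubled = sm.OLS(np.tile(y, 2), np.tile(X, (2, 1))).fit()

print(original.params, doubled.params)  # identical coefficients
print(original.bse / doubled.bse)       # standard errors differ by roughly sqrt(2)
```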
4. Collinearity
Collinearity is the presence of highly correlated variables within X. Again, this doesn't really hurt prediction to a great extent, but it makes a mess of the model's interpretability. This makes intuitive sense: if two variables are highly correlated, you can't tell which of them is actually explaining the variance in the dependent variable Y, and the standard errors of their coefficients get inflated.
How to identify:
Look at the pairwise correlation matrix of the predictors, or compute the variance inflation factor (VIF) for each variable; a VIF above roughly 5-10 is usually taken as a warning sign.
How to deal with it: Drop it! That is, drop one of the correlated variables (or combine them into a single feature) and refit the model.
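A sketch of the VIF check using statsmodels; the DataFrame and its columns (x1, x2, x3) are hypothetical, constructed so that two of them are nearly duplicates:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical design matrix with two highly correlated columns
rng = np.random.default_rng(3)
x1 = rng.normal(size=300)
df = pd.DataFrame({"x1": x1,
                   "x2": x1 + rng.normal(scale=0.05, size=300),  # near-duplicate of x1
                   "x3": rng.normal(size=300)})

print(df.corr())  # x1 and x2 correlate at ~0.99

# VIF above ~5-10 is usually taken as a sign of problematic collinearity
X = sm.add_constant(df)
for i, col in enumerate(X.columns):
    print(col, variance_inflation_factor(X.values, i))  # ignore the constant's VIF
```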
5. Homoskedasticity (Error term has a constant variance)
This is perhaps the silent killer among all the assumptions! Everything about inference in linear regression (the hypothesis tests, the standard errors, and the confidence intervals) depends on the assumption that the residual errors have a constant variance.
How to identify it:
Pretty simple: plot the residuals against the fitted values y^. There should not be any noticeable pattern, such as a funnel shape where the spread of the residuals grows with the fitted value.
How to deal with it:
A log transformation of the response (modelling log Y instead of Y) works well most of the time, since it damps down the larger values where the variance is usually highest.
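A minimal sketch of both the check and the fix, using simulated heteroskedastic data (the data and names are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Simulated data whose noise grows with X (classic funnel shape)
rng = np.random.default_rng(4)
X = rng.uniform(1, 10, 300)
y = np.exp(1 + 0.3 * X + rng.normal(0, 0.3, 300))  # multiplicative noise

# Raw fit: residual spread increases with the fitted value
raw = sm.OLS(y, sm.add_constant(X)).fit()
plt.scatter(raw.fittedvalues, raw.resid); plt.title("y ~ X"); plt.show()

# Log-transforming the response stabilises the variance
logged = sm.OLS(np.log(y), sm.add_constant(X)).fit()
plt.scatter(logged.fittedvalues, logged.resid); plt.title("log(y) ~ X"); plt.show()
```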