New Aspects to consider while moving from Simple Linear Regression to Multiple Linear Regression

Simple Linear Regression: the most elementary model, which explains the relationship between a dependent variable and one independent variable using a straight line. However, in real-life scenarios a single independent variable is often not enough to explain the output (dependent) variable.

So, it might be a good idea to use multiple variables to explain a dependent variable. 

Advantages:

  1. Adding variables helps capture more information about the variance in the dependent variable.
  2. In general, we expect the explanatory power of the model to increase as more variables are added.

Hence, this brings us to multiple linear regression, which is an extension of simple linear regression.
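
To make this concrete, below is a minimal sketch of fitting a multiple linear regression with statsmodels. The dataset, the target 'sales', and the predictors 'tv_spend' and 'radio_spend' are hypothetical, generated only to make the example runnable.

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical data: two predictors and a target built from them plus noise
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "tv_spend": rng.uniform(0, 100, 200),
    "radio_spend": rng.uniform(0, 50, 200),
})
df["sales"] = 3.0 + 0.05 * df["tv_spend"] + 0.1 * df["radio_spend"] + rng.normal(0, 1, 200)

X = sm.add_constant(df[["tv_spend", "radio_spend"]])  # add the intercept term
model = sm.OLS(df["sales"], X).fit()
print(model.summary())  # coefficients, p-values, R-squared

The summary output is where most of the aspects discussed below (p-values, R-squared, signs of coefficients) are read off.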

Below are a few aspects that need to be taken into consideration while moving from SLR to MLR:

Overfitting:

Overfitting is a modeling error that occurs when a function is fit too closely to a limited set of data points. As we keep adding variables to the model, it may fit the training set ‘too well’ and fail to generalize to unseen data. The result is high train accuracy but low test accuracy, which is the classic symptom of over-fitting.
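
As a toy illustration (using synthetic data, not the article's dataset), the sketch below keeps adding polynomial features to a model whose true relationship is linear; train R-squared keeps rising while test R-squared deteriorates, which is exactly the symptom described above.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(40, 1))
y = 0.5 * x[:, 0] + rng.normal(0, 1, 40)  # the true relationship is a simple straight line

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5, random_state=0)

for degree in (1, 5, 15):  # more derived features = a more flexible model
    poly = PolynomialFeatures(degree)
    model = LinearRegression().fit(poly.fit_transform(x_train), y_train)
    train_r2 = model.score(poly.transform(x_train), y_train)  # keeps improving
    test_r2 = model.score(poly.transform(x_test), y_test)     # starts degrading
    print(f"degree={degree:2d}  train R2={train_r2:.2f}  test R2={test_r2:.2f}")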

Multicollinearity:

Multicollinearity is a state of very high inter-correlation or inter-association among the independent variables. It is a type of disturbance in the data, and if it is present, the statistical inferences made from the model may not be reliable.

Multicollinearity mainly affects:

1) Interpretation: does the usual reading, “the change in Y for a unit change in X, when all others are held constant”, still apply?

2) Inference:

a. Coefficients can swing wildly and their signs can even invert.

b. p-values are, therefore, not reliable.

Detecting Multicollinearity:

Following are two ways to detect multicollinearity in a model:

1. Looking at pairwise correlations between the independent variables:

Some pairs of variables can be highly correlated; hence, when the model is built, one variable from each such pair may turn out to be redundant.
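
A minimal sketch of this check with pandas is shown below; the DataFrame and column names are hypothetical, with x2 deliberately constructed to be correlated with x1.

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
df = pd.DataFrame({
    "x1": x1,
    "x2": 0.9 * x1 + rng.normal(scale=0.3, size=200),  # deliberately correlated with x1
    "x3": rng.normal(size=200),                         # independent of the others
})

corr = df.corr()  # pairwise Pearson correlations between the predictors
print(corr.round(2))
# Flag strongly correlated pairs (the 0.8 cut-off is a common rule of thumb, not a fixed rule)
print((corr.abs() > 0.8) & (corr.abs() < 1.0))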

2. Checking the Variance Inflation Factor (VIF): sometimes pairwise correlations aren’t enough, i.e. one variable might not be fully explained by any single other variable, yet a combination of the other variables might explain it. Essentially, VIF measures how well one independent variable is explained by all the other independent variables combined (a sketch of the computation follows the heuristic list below).

The common heuristic to follow for VIF is:

  1. A variable with a VIF value above 10 is considered highly collinear and should be eliminated.
  2. A variable with a VIF value between 5 and 10 is considered okay but is worth inspecting.
  3. A variable with a VIF value below 5 need not be eliminated.
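
Below is a minimal sketch of computing VIF with statsmodels; the data and column names are made up, with x2 constructed to be nearly collinear with x1 so that both show inflated values, and the 5/10 cut-offs follow the heuristic above.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
x1 = rng.normal(size=300)
X = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.1, size=300),  # nearly collinear with x1
    "x3": rng.normal(size=300),
})

X_const = sm.add_constant(X)  # VIF is computed on a design matrix that includes the intercept
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(1, X_const.shape[1])],
    index=X.columns,
)
print(vif.round(1))  # expect very large VIFs for x1 and x2, and a value close to 1 for x3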

How to deal with Multicollinearity:

The following methods can be used to deal with multicollinearity:

a) Dropping variables

i)  Drop the variable that is highly correlated with others.

ii) Pick the business-interpretable variable (if interpretation and explainability are important).

b) Creating new variables from the older ones

i)  Add interaction features, i.e. features derived by combining some of the original variables.

ii) Apply variable transformations.
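
The sketch below illustrates both remedies on hypothetical columns: dropping one variable of a nearly duplicated pair, and replacing two related variables with a single derived feature.

import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
length = rng.uniform(1, 10, 100)
df = pd.DataFrame({
    "length_cm": length,
    "length_in": length / 2.54 + rng.normal(scale=0.01, size=100),  # near-duplicate of length_cm
    "width_cm": rng.uniform(1, 5, 100),
})

# a) Drop the redundant variable from the highly correlated pair
reduced = df.drop(columns=["length_in"])

# b) Derive a combined (interaction) feature and drop the originals it replaces
df["area_cm2"] = df["length_cm"] * df["width_cm"]
combined = df.drop(columns=["length_cm", "length_in"])

print(reduced.columns.tolist())
print(combined.columns.tolist())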

Feature Selection:

Following are the methods for optimal feature selection:

I.  Manual feature selection:

a) Build the model with all the features.

b) Drop the features that are least helpful for prediction (high p-value).

c) Drop the features that are redundant (using correlations and VIF).

d) Rebuild the model and repeat.

II. Automated feature selection:

a) Recursive Feature Elimination (RFE)

b) Forward/backward/stepwise selection based on AIC

It is generally recommended to use a combination of automated (coarse tuning) + manual (fine tuning) selection in order to get an optimal model.
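
As an illustration of the automated (coarse) step, the sketch below uses scikit-learn's RFE on synthetic data; the choice of keeping 5 features is an arbitrary assumption. The manual (fine) step would then re-examine the survivors' p-values and VIFs.

from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data: 15 candidate features, only 5 of which are truly informative
X, y = make_regression(n_samples=200, n_features=15, n_informative=5, noise=5.0, random_state=0)

rfe = RFE(estimator=LinearRegression(), n_features_to_select=5)
rfe.fit(X, y)

print(rfe.support_)   # boolean mask of the selected features
print(rfe.ranking_)   # 1 = selected; larger numbers were eliminated earlier
# Next (manual fine tuning): refit an OLS on the selected columns and inspect
# p-values and VIFs before settling on the final model.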
