Simple and Multiple Linear Regression Model - Python and ScikitLearn

Basic Assumptions of a Linear Regression Model:

There are four assumptions associated with a linear regression model:

Linearity: The relationship between X and the mean of Y is linear.

Homoscedasticity: The variance of the residuals is the same for any value of X.

Independence: Observations are independent of each other.

Normality: For any fixed value of X, Y is normally distributed.

Theoretical Concepts:

We need to find the best fit line y = mx + c, where m is the slope and c is the intercept.

If x is 0, then y is c, i.e. the y-intercept. The slope m is the change in the y-value for a unit change along the x-axis: m = (y2 - y1) / (x2 - x1)

The values of m and c should be chosen so that the sum of squared errors is minimised. Here, the cost function measures the distance between the points predicted by the line and the actual points, and it should be as small as possible. Thus, cost function = (1/2n) ∑(y’ - y)², where the sum runs over the points i = 1 to n. Here, n is the number of points. Predicted points: y’ and actual points: y.

But if we just evaluate this cost for one candidate line after another, there are many lines to try, and finding the best one this way is time consuming.

So, what can we do?

Example: take the points (1, 1), (2, 2) and (3, 3), which lie on y = x. The candidate line is y’ = mx + c; here, we can assume the best fit line passes through the origin, so c = 0 and y’ = mx.

Now, we can substitute m = 1: for x = 1, y’ = 1 and y = 1.

For x = 2, y’ = 2 and y = 2, and so on. This is the best fit line, obtained when the slope is 1. For each candidate line, we calculate the cost function and try to reduce it.

Here, cost function J(m) = 1/2n((1-1)² + (2-2)² + (3-3)²) = 0, with n = 3.

With respect to every m value, we can plot our cost function. For m = 1, J(m) = 0.

For m = 0.5 and x = 1, y’ = 0.5,

For x = 2, y’ = 1,

For x = 3, y’ = 1.5

Then our cost function J(m) = 1/2n((1-0.5)² + (2-1)² + (3-1.5)²) = 3.5/6 ≈ 0.58. Thus, a curve of J(m) versus m can be plotted, and gradient descent is the process of moving down this curve. How do we arrive at the global minimum?
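To make the arithmetic concrete, here is a minimal sketch in plain Python that evaluates J(m) for a few slopes on the three points (1, 1), (2, 2) and (3, 3). The function name cost and the extra m values (0 and 1.5) are assumptions chosen for illustration; the intercept is fixed at 0, as in the example.

```python
def cost(m, xs, ys):
    """Cost J(m) = (1/2n) * sum((y' - y)^2) for the line y' = m * x."""
    n = len(xs)
    return sum((m * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * n)


# Actual points, all lying on y = x
xs = [1, 2, 3]
ys = [1, 2, 3]

# Evaluate the cost for several candidate slopes
for m in (0.0, 0.5, 1.0, 1.5):
    print(f"m = {m}: J(m) = {cost(m, xs, ys):.2f}")

# m = 0.0: J(m) = 2.33
# m = 0.5: J(m) = 0.58
# m = 1.0: J(m) = 0.00
# m = 1.5: J(m) = 0.58
```

Plotting these values against m gives the bowl-shaped cost curve, with its lowest point at m = 1.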

Based on some initial m value, we get some initial J(m) value. In order to move downwards, we should use the convergence theorem (the gradient descent update rule).

Convergence theorem: m = m - ɑ ∂J/∂m, where ɑ is the learning rate

If the slope (∂J/∂m) is negative, the curve points downwards as m increases. In that case the update becomes m = m + (a small positive value), so m increases; if the slope is positive, m decreases instead. With a small ɑ, each step is very small and m moves slowly towards the global minimum. If we take a larger ɑ, the jumps are bigger, and m might keep oscillating around the minimum and not converge even after many iterations.

At the global minimum, the slope of the cost curve will be zero. The m value at that point is the slope of the best fit line. Until then, we keep applying the convergence theorem.
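The update rule can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the line is y’ = mx with the intercept fixed at 0, the data are the three points on y = x, and the function names, starting value m = 0 and learning rates 0.1 and 0.5 are hypothetical choices. For this cost function the derivative is ∂J/∂m = (1/n) ∑(mx - y)x.

```python
def gradient(m, xs, ys):
    """Derivative of J(m) = (1/2n) * sum((m*x - y)^2) with respect to m."""
    n = len(xs)
    return sum((m * x - y) * x for x, y in zip(xs, ys)) / n


def gradient_descent(xs, ys, m=0.0, alpha=0.1, iterations=100):
    """Repeatedly apply m = m - alpha * dJ/dm and return the final m."""
    for _ in range(iterations):
        m = m - alpha * gradient(m, xs, ys)
    return m


xs = [1, 2, 3]
ys = [1, 2, 3]

# A small learning rate converges to the best fit slope
m_fit = gradient_descent(xs, ys, alpha=0.1)
print(f"alpha = 0.1 -> m = {m_fit:.4f}")  # alpha = 0.1 -> m = 1.0000

# A learning rate that is too large overshoots the minimum on every
# step, so m oscillates with growing jumps instead of converging
m_diverged = gradient_descent(xs, ys, alpha=0.5, iterations=20)
print(f"alpha = 0.5 -> m = {m_diverged:.1f}")
```

With ɑ = 0.1 the slope settles at 1, where the derivative is zero. With ɑ = 0.5 each jump lands farther from the minimum than the last, which is the oscillation described above.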

If we have multiple independent features, each feature has its own coefficient, and every coefficient is updated in the same way until the cost function reaches its global minimum.

For multiple linear regression, y = β₀ + β₁x₁ + β₂x₂ + β₃x₃ + β₄x₄ + β₅x₅; here, β₀ is the y-intercept; β₁, β₂ and β₃ are the changes in y for unit changes in x₁, x₂ and x₃ respectively.
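As a rough illustration of fitting such a model, the sketch below uses scikit-learn's LinearRegression on synthetic data. The data, the three features, the "true" coefficients and the noise level are all made up for this example and do not come from any dataset in this article.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): 200 rows, 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

# Assumed "true" intercept and coefficients used to generate y
true_intercept = 4.0
true_coef = np.array([2.0, -1.0, 0.5])
y = true_intercept + X @ true_coef + rng.normal(scale=0.1, size=200)

# Fit the model; intercept_ estimates the intercept term and
# coef_ holds one estimated coefficient per feature
model = LinearRegression().fit(X, y)
print("intercept:", model.intercept_)
print("coefficients:", model.coef_)
```

Each entry of coef_ is the estimated change in y for a unit change in that feature while the other features are held fixed, so the printed values should land close to the assumed ones.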


In regression, "multicollinearity" refers to predictors that are correlated with other predictors. It occurs when your model includes multiple factors that are correlated not just with your response variable, but also with each other. In other words, it results when you have factors that are somewhat redundant. One way to handle multicollinearity is to remove one of the highly correlated features (check the p-values in the model summary and remove the feature with the higher p-value). This works when we have relatively few features and a small dataset, but we lose some information whenever we remove a feature. If we have a large dataset, we would instead go for Ridge or Lasso regression.
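One simple way to spot candidate features for removal is to look at the pairwise correlation between predictors. The sketch below does this in plain Python. The feature names, the values and the 0.9 threshold are hypothetical and chosen only for illustration, and a pairwise check will not catch a feature that is a combination of several others.

```python
from itertools import combinations
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


# Hypothetical predictors (made-up values, for illustration only)
features = {
    "age": [25, 32, 41, 47, 53, 60],
    "years_experience": [2, 8, 17, 22, 30, 36],
    "commute_km": [12, 3, 25, 8, 30, 5],
}

THRESHOLD = 0.9  # assumed cut-off for "highly correlated"

# Collect every pair of predictors whose correlation exceeds the threshold
correlated_pairs = []
for f1, f2 in combinations(features, 2):
    r = pearson(features[f1], features[f2])
    print(f"{f1} vs {f2}: r = {r:.2f}")
    if abs(r) > THRESHOLD:
        correlated_pairs.append((f1, f2, r))

print("highly correlated pairs:", [(f1, f2) for f1, f2, _ in correlated_pairs])
```

Here only age and years_experience are flagged, so one of the two would be the candidate for removal.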

R Square and Adjusted R Square:

R² = 1 - SSres/SStot

Here, SSres is the sum of squares of residuals or errors = ∑(y’ - y)²

SStot is the total sum of squares = ∑(y - ymean)²

The closer the value of R² is to 1, the better the model fits.

R² will be less than zero only when the best fit line predicts worse than simply using the average value of y.

As we go on adding new independent features, our R² value usually increases. This is because the model assigns some coefficient value to each new feature such that our SSres value decreases, and as a result our R² value increases. We should also note that R² will never decrease when we keep adding independent features. But there might be a scenario where the independent feature being added is not correlated with the target output. R² might still show an increase in such a case, because it does not penalise newly added features that have no correlation with the target. For this reason, we use adjusted R square:

Adjusted R² = 1 - (1 - R²)(N - 1)/(N - p - 1)

Here, N is the number of data points and p is the number of independent features.

From the above formula, we can note that as p increases when independent features that are not correlated with the target variable are added, N - p - 1 decreases, so a larger term is subtracted from 1 and the adjusted R² value decreases. If the added features are correlated with the target variable, then R² rises enough that the multiplication factor makes little difference, and the adjusted R² value does not fall.

Differences between R² value and Adjusted R² value:

Every time an independent variable is added to a model, the R² value increases or stays the same, even if the independent variable is insignificant; it never declines. The adjusted R² value, in contrast, increases only when the independent variable is significant and affects the dependent variable.

Adjusted R² value is always less than or equal to R² value.
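Both quantities are easy to compute by hand. The sketch below is a minimal plain-Python version, assuming the standard adjusted R² formula 1 - (1 - R²)(N - 1)/(N - p - 1), where N is the number of data points and p is the number of features. The function names and the small set of actual and predicted values are made up for illustration.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SSres / SStot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (N - 1) / (N - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Made-up actual and predicted values (illustration only)
y_true = [3.0, 5.0, 7.0, 9.0, 11.0]
y_pred = [2.8, 5.3, 6.9, 9.2, 10.8]

r2 = r_squared(y_true, y_pred)
print(f"R^2 = {r2:.4f}")  # R^2 = 0.9945

# Same R^2, but pretend it came from models with 1 and 2 features:
# the penalty grows with the number of features
adj_1 = adjusted_r_squared(r2, n_samples=5, n_features=1)
adj_2 = adjusted_r_squared(r2, n_samples=5, n_features=2)
print(f"adjusted R^2 (p = 1) = {adj_1:.4f}")  # adjusted R^2 (p = 1) = 0.9927
print(f"adjusted R^2 (p = 2) = {adj_2:.4f}")  # adjusted R^2 (p = 2) = 0.9890
```

For the same R², the adjusted value drops as p grows, which is the penalty for adding features that do not improve the fit.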

Advantages of Linear Regression:

  1. Linear regression performs exceptionally well when the relationship between the features and the target is linear
  2. Easy to implement and train the model
  3. Overfitting can be handled using dimensionality reduction techniques, cross validation and regularization

Disadvantages of Linear Regression:

  1. Sometimes a lot of feature engineering is required
  2. If the independent features are correlated, it may affect performance
  3. It is often quite sensitive to noise and prone to overfitting

Overfitting And Underfitting:

We will use polynomial linear regression to explain the bias and variance tradeoff as well as overfitting and underfitting. Please note that if the degree of the polynomial is one, then the curve is a straight line. When the data follows a curve, the mean squared error (cost function) of the straight line is higher than that of a polynomial curve with degree greater than one. When the error is very high even on the training dataset, the scenario is called underfitting. Suppose instead we use a much higher order polynomial, such that almost all the training data points are fitted very closely by the model. This scenario is called overfitting, because the accuracy goes down on the test data even though it is very high on the training data.

Our main aim is to achieve an optimum solution such that accuracy is high for both training and test data. That model will give us low bias and low variance.

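In an underfitting scenario, we have high bias: the error is high on the training data and, as a result, on the test data too. (In the strict sense, variance measures how much the fitted model changes from one training set to another, so an underfit model usually has low variance even though its test error is high.) In an overfitting scenario, we have low bias but high variance: the training error is low, but the test error is high.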
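The effect of the polynomial degree can be seen with a small experiment. The sketch below is only an illustration under assumptions: the target curve sin(2πx), the noise level, the 20 training and 20 test points, and the degrees 1, 3 and 10 are all made up for this example. It uses NumPy's Polynomial.fit for the least squares fit.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(42)


def noisy_target(x):
    """Assumed nonlinear relationship plus random noise."""
    return np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=x.shape)


# Training points, and test points that lie between them
x_train = np.linspace(0.0, 1.0, 20)
x_test = np.linspace(0.025, 0.975, 20)
y_train = noisy_target(x_train)
y_test = noisy_target(x_test)


def mse(model, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((model(x) - y) ** 2))


# Fit polynomials of increasing degree and compare the two errors
results = {}
for degree in (1, 3, 10):
    model = Polynomial.fit(x_train, y_train, degree)
    results[degree] = (mse(model, x_train, y_train), mse(model, x_test, y_test))
    print(f"degree {degree:2d}: train MSE = {results[degree][0]:.4f}, "
          f"test MSE = {results[degree][1]:.4f}")
```

The straight line (degree 1) has a high error on both sets, which is underfitting. The training error can only go down as the degree increases; if the test error stops improving or rises again at degree 10 while the training error keeps shrinking, that gap is the sign of overfitting.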

[Figure: learning curve of an overfitting model]

Practical Implementation:

I. Multiple Linear Regression Using Python And ScikitLearn

Dataset used:

II. Checking for multicollinearity in a small dataset

Dataset used:

In the above dataset, we noticed that there is no multicollinearity. To see a case of multicollinearity, follow the same steps used above on the following dataset:

Different problem statements you can solve using linear regression: price prediction (houses, flights, stocks)


