Teaching Note: Linear Regression Explained
by Auther

Linear Regression

Linear and logistic regression are among the first algorithms students meet on the statistics and data science learning path. There are many forms of regression, each suited to a particular context and type of problem, but linear regression is considered an essential concept in data science and machine learning. In this document, we explain linear regression and how to perform it in a Python environment.

Linear Regression in Data Science and Machine Learning

In data science, linear regression is taught both in statistics and in machine learning, where it is covered as one of the supervised learning methods.

Problem Statements:

  1. A restaurant chain wants to understand future revenue and profits.
  2. What will property rates in Gurugram be over the next five years?
  3. How many customers will place orders on our web platform this Diwali?
  4. What will sales be in the coming festive season?

Can you think of some more problems where simple predictions are required?

This is where linear regression plays its role. The problem statements above can be addressed with the help of linear regression.

So what is linear regression?

If we simply search on Google for 'what is linear regression', Wikipedia gives a one-line answer: “Linear regression is the most basic and commonly used predictive analysis”.

In simple language, linear regression is the simplest form of predictive analysis: it uses one set of variables to predict the value of another variable.

Dependent and Independent Variables:

The variable we want to predict is known as the dependent variable, and the variables used to predict it are known as independent variables.

The regression equation:

Linear regression predicts the dependent variable by estimating the coefficients of the independent variables in a linear equation:

Yi = B0 + B1Xi + Ei

Where:

Yi is the dependent variable

B0 is the constant (intercept)

B1 is the slope

Xi is the independent variable

Ei is the random error
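
As a minimal sketch of how B0 and B1 can be estimated for a single independent variable, the snippet below applies the standard closed-form least-squares formulas with NumPy. The five data points are hypothetical values made up for illustration; they do not come from any data set discussed in this note.

```python
import numpy as np

# Hypothetical toy data: X is the independent variable, Y the dependent one.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([3.1, 4.9, 7.2, 8.8, 11.0])

# Closed-form least-squares estimates for one independent variable.
x_mean, y_mean = X.mean(), Y.mean()
B1 = np.sum((X - x_mean) * (Y - y_mean)) / np.sum((X - x_mean) ** 2)  # slope
B0 = y_mean - B1 * x_mean  # constant (intercept)

print(f"B0 = {B0:.3f}, B1 = {B1:.3f}")
```

With more than one independent variable the same idea applies, but the coefficients are usually obtained with a library routine rather than by hand.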

Graph of the linear regression:

[Figure: scatter plot with the independent variable on the X-axis, the dependent variable on the Y-axis, and the best fit line]

Random Errors AKA Residuals:

The random error of each observation is estimated by its residual, which is the difference between the actual value and the value predicted by the fitted line. The sign follows from the regression equation Yi = B0 + B1Xi + Ei:

Ei = Yactual - Ypredicted

Where Ypredicted = B0 + B1Xi
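
The short sketch below computes one residual per observation, using the sign implied by the regression equation Yi = B0 + B1Xi + Ei (actual minus predicted). The data are hypothetical, and np.polyfit is used here only as a convenient way to obtain B0 and B1.

```python
import numpy as np

# Hypothetical toy data.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y_actual = np.array([3.1, 4.9, 7.2, 8.8, 11.0])

# Fit a straight line; np.polyfit returns the slope first, then the constant.
B1, B0 = np.polyfit(X, Y_actual, deg=1)

Y_predicted = B0 + B1 * X
residuals = Y_actual - Y_predicted  # one residual per observation

print(residuals)
print(residuals.sum())  # very close to zero for a least-squares line with a constant
```

Because positive and negative residuals cancel out, their plain sum is not a useful measure of fit, which is why the squared residuals are used instead.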

The best fit line in linear regression

With the independent variable on the X-axis and the dependent variable on the Y-axis, the data can be drawn as a scatter plot. The best fit line is the line through that plot that captures the trend with the minimum sum of squared errors.
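
To illustrate what "best fit" means under the usual least-squares criterion, the sketch below fits a line to synthetic data and checks that shifting the constant or the slope makes the sum of squared errors larger. The data, the random seed, and the helper name sum_squared_errors are assumptions made for this example.

```python
import numpy as np

# Synthetic data: a straight line plus random noise (illustrative only).
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 50)
Y = 2.0 + 0.5 * X + rng.normal(scale=1.0, size=X.size)

def sum_squared_errors(b0, b1):
    """Sum of squared differences between actual and predicted values."""
    return np.sum((Y - (b0 + b1 * X)) ** 2)

# Least-squares line (slope first, then constant).
B1, B0 = np.polyfit(X, Y, deg=1)
best = sum_squared_errors(B0, B1)
print(f"best fit line: SSE = {best:.2f}")

# Any other line gives a larger sum of squared errors.
shifts = [(0.5, 0.0), (0.0, 0.1), (-0.5, -0.1)]
for d0, d1 in shifts:
    other = sum_squared_errors(B0 + d0, B1 + d1)
    print(f"shifted by ({d0}, {d1}): SSE = {other:.2f}")
```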

The Evaluation Metrics:

Evaluation metrics are used to assess the strength of a linear regression model. They tell us how accurately the model predicts relative to the actual observed values. There are two main metrics used to evaluate a regression model.

  1. R-Squared or Coefficient of Determination: The value of R-squared normally ranges from 0 to 1 (it can turn negative when a model predicts worse than simply using the mean, for example on unseen data). The higher the value, the better the model fits the data. It measures the proportion of the variance in the dependent variable that the model explains.

Mathematically it is represented as follows:

R2 = 1 - (RSS / TSS)

Where RSS stands for Residual Sum of Squares and TSS stands for Total Sum of Squares.

  • RSS is the sum of the squared differences between the actual and predicted values:

RSS = Σ (Yactual - Ypredicted)^2

  • TSS is the sum of the squared differences between each actual value of the target variable and the mean of the target variable:

TSS = Σ (Yactual - Ymean)^2

  2. Root Mean Squared Error (RMSE): It is the square root of the mean of the squared residuals, where n is the number of observations:

RMSE = sqrt( Σ (Yactual - Ypredicted)^2 / n )
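
A minimal sketch of both metrics computed by hand with NumPy is shown below. The data are hypothetical, and RMSE is taken here as the square root of RSS divided by the number of observations.

```python
import numpy as np

# Hypothetical toy data and a least-squares line fitted to it.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y_actual = np.array([3.1, 4.9, 7.2, 8.8, 11.0])
B1, B0 = np.polyfit(X, Y_actual, deg=1)
Y_predicted = B0 + B1 * X

# Residual Sum of Squares and Total Sum of Squares.
RSS = np.sum((Y_actual - Y_predicted) ** 2)
TSS = np.sum((Y_actual - Y_actual.mean()) ** 2)

r_squared = 1 - RSS / TSS
rmse = np.sqrt(RSS / len(Y_actual))

print(f"R2 = {r_squared:.4f}, RMSE = {rmse:.4f}")
```

In practice the same values can be obtained from sklearn.metrics with r2_score and mean_squared_error; the manual version is shown only to mirror the formulas.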

Linear Regression Assumptions

  1. Linearity: The relationship between the independent variable X and the dependent variable Y should be linear.
  2. Independence: Observations should be independent of each other. There should not be a correlation between the observations.
  3. Homoscedasticity: The variance of the residuals should be the same for any value of the X variables.
  4. Normality: The residuals should be approximately normally distributed, with a mean equal to or near zero.
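
The assumptions above are usually checked by looking at the residuals. The sketch below computes a few rough numeric hints with NumPy on synthetic data; the specific checks (comparing residual spread in the lower and upper half of X, lag-1 correlation, sample skewness) are simple illustrations chosen for this example, not formal statistical tests.

```python
import numpy as np

# Synthetic data that satisfies the assumptions by construction.
rng = np.random.default_rng(42)
X = np.linspace(0, 10, 200)
Y = 1.0 + 2.0 * X + rng.normal(scale=1.5, size=X.size)

B1, B0 = np.polyfit(X, Y, deg=1)
residuals = Y - (B0 + B1 * X)

# Mean of the residuals: should be at or near zero.
mean_residual = residuals.mean()

# Homoscedasticity hint: residual spread for small X versus large X.
half = X.size // 2
spread_low = residuals[:half].std()
spread_high = residuals[half:].std()

# Independence hint: correlation between each residual and the next one.
lag1_corr = np.corrcoef(residuals[:-1], residuals[1:])[0, 1]

# Normality hint: sample skewness (near zero for a symmetric distribution).
skewness = np.mean((residuals - mean_residual) ** 3) / residuals.std() ** 3

print(f"mean residual     : {mean_residual:.2e}")
print(f"spread low / high : {spread_low:.2f} / {spread_high:.2f}")
print(f"lag-1 correlation : {lag1_corr:.3f}")
print(f"skewness          : {skewness:.3f}")
```

A residual plot against the predicted values is the more common visual check; the numbers here only summarise what such a plot would show.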

Overfitting and Underfitting in Linear Regression

Overfitting in Linear Regression: When the model fits the noise in the training data rather than the significant variables, its performance on the test data and on unseen future data suffers. This is called overfitting.

Dealing with Overfitting

The following are the methods of dealing with overfitting in linear regression:

  • Cross-validation
  • Regularization
  • If there is too little data, add more, cleaner data
  • If there are too many variables, remove some through feature selection
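
As a hedged sketch of the first two methods, the snippet below uses scikit-learn to compare plain linear regression with Ridge regression (one common regularization technique) under 5-fold cross-validation. The synthetic data, in which only three of thirty variables matter, and the value alpha=10.0 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic data: few observations, many variables, only three of them useful.
rng = np.random.default_rng(0)
n_samples, n_features = 60, 30
X = rng.normal(size=(n_samples, n_features))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0 * X[:, 2] + rng.normal(size=n_samples)

models = {
    "linear regression": LinearRegression(),
    "ridge (alpha=10.0)": Ridge(alpha=10.0),  # alpha is an illustrative value
}

# 5-fold cross-validation: each model is scored on data it was not fitted on.
cv_scores = {}
for name, model in models.items():
    cv_scores[name] = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: mean R2 across folds = {cv_scores[name].mean():.3f}")
```

The regularization strength is normally tuned rather than fixed, for example by trying several alpha values and keeping the one with the best cross-validated score.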

Underfitting in Linear Regression: When the regression model learns too little, for example because it ignores relevant variables or data points, it does not fit the data well and its predictions suffer. This is called underfitting.

Methods to Deal with Underfitting

  • Increase the model complexity so that it fits the data well
  • Remove noise from the data
  • Add more variables and data points
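
One way to increase model complexity, sketched below under the assumption that the true relationship is curved, is to add polynomial terms before fitting the linear model. The synthetic quadratic data and the choice of degree 2 are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a curved (quadratic) relationship.
rng = np.random.default_rng(1)
X = np.linspace(-3, 3, 100).reshape(-1, 1)
y = X[:, 0] ** 2 + rng.normal(scale=0.5, size=100)

# A straight line is too simple for this data and underfits.
straight_line = LinearRegression().fit(X, y)

# Adding a squared term lets the linear model follow the curve.
quadratic = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

r2_line = straight_line.score(X, y)
r2_quad = quadratic.score(X, y)
print(f"R2 straight line: {r2_line:.3f}")
print(f"R2 quadratic    : {r2_quad:.3f}")
```

The scores here are measured on the training data, which is enough to reveal underfitting; checking for overfitting still requires held-out data.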

Bias-Variance Trade-Off in Linear Regression

  • Bias: Bias is the error introduced by the simplifying assumptions a model makes so that the target variable is easier to predict.
  • Variance: Variance is the amount by which the estimate of the target variable would change if the model were trained on different training data.
  • The Trade-Off: Our regression model has to find the balance between bias and variance, because the two tend to move in opposite directions. Reducing bias by making the model more flexible usually increases variance, and vice versa.
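
The trade-off can be made visible with a small simulation, sketched below with NumPy. Many training sets are drawn from the same synthetic process, a simple model (degree 1) and a more flexible one (degree 5) are fitted to each, and the squared bias and the variance of their predictions are compared. The sine-shaped true function, the noise level, and the two degrees are assumptions chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_function(x):
    """Hypothetical true relationship used to generate the data."""
    return np.sin(x)

x_test = np.linspace(0.5, 5.5, 20)
n_repeats, n_train = 200, 30

def bias_and_variance(degree):
    """Fit a polynomial of the given degree to many training sets."""
    predictions = np.empty((n_repeats, x_test.size))
    for i in range(n_repeats):
        x_train = rng.uniform(0, 6, n_train)
        y_train = true_function(x_train) + rng.normal(scale=0.3, size=n_train)
        coeffs = np.polyfit(x_train, y_train, deg=degree)
        predictions[i] = np.polyval(coeffs, x_test)
    # Squared bias: how far the average prediction is from the truth.
    bias_sq = np.mean((predictions.mean(axis=0) - true_function(x_test)) ** 2)
    # Variance: how much predictions change from one training set to another.
    variance = np.mean(predictions.var(axis=0))
    return bias_sq, variance

results = {degree: bias_and_variance(degree) for degree in (1, 5)}
for degree, (bias_sq, variance) in results.items():
    print(f"degree {degree}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")
```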

Steps to Perform Linear Regression in Python

  1. Install Python
  2. Open the notebook
  3. Import the NumPy, pandas, matplotlib.pyplot, and sklearn libraries
  4. Read the data file
  5. Make a data frame
  6. Perform exploratory data analysis with NumPy, pandas, and Matplotlib
  7. Split the data into dependent and independent variables
  8. Split the data into train and test sets
  9. Perform the linear regression
  10. Check the model performance
  11. Check for underfitting and overfitting
  12. Tune the model to improve the performance
  13. Perform the prediction
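
The steps above can be sketched roughly as follows. This example assumes scikit-learn and pandas are installed and uses the diabetes data set bundled with scikit-learn in place of a data file; the 80/20 split and random_state=42 are arbitrary choices. Plotting and model tuning (steps 6 and 12) are only hinted at.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Steps 4-5: load the data and make a data frame.
raw = load_diabetes()
df = pd.DataFrame(raw.data, columns=raw.feature_names)
df["target"] = raw.target

# Step 6: a quick exploratory look at the data.
print(df.shape)
print(df.describe().T[["mean", "std"]])

# Step 7: split into independent variables (X) and the dependent variable (y).
X = df.drop(columns="target")
y = df["target"]

# Step 8: split into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 9: perform the linear regression.
model = LinearRegression().fit(X_train, y_train)

# Steps 10-11: compare train and test performance; a large gap between
# the two scores is a sign of overfitting, two low scores of underfitting.
train_r2 = r2_score(y_train, model.predict(X_train))
test_r2 = r2_score(y_test, model.predict(X_test))
test_rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(f"train R2 = {train_r2:.3f}, test R2 = {test_r2:.3f}, test RMSE = {test_rmse:.2f}")

# Step 13: predict for the first five test rows.
predictions = model.predict(X_test.iloc[:5])
print(predictions)
```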

Qualitative Questions:

  1. What are the evaluation metrics?
  2. Where do we use regularization techniques?
  3. What are the applications of regression analysis?
  4. List five use cases of regression analysis.
  5. What is the bias-variance trade-off?
  6. What are underfitting and overfitting?
  7. What is the error term in the regression equation?
  8. What are the regression assumptions?

Coding Questions:

  1. How do you split data into train and test sets in a Python environment?
  2. Which popular Python library is used for machine learning?
  3. Perform linear regression on the Boston House Prices data (note that load_boston was removed from scikit-learn in version 1.2, so the data must be loaded from another source).
  4. Assess the model performance of the Boston price prediction.

References:

  1. Analytics Vidhya
  2. Boston University
  3. Wikipedia
  4. Kaggle
