Are you into Data Science? Learn Linear Regression first (Introduction, some Pitfalls and how to avoid them)
Linear Regression Models and Their Pitfalls, by Sourav Nandi

"Simplicity is the Ultimate Sophistication"- Leonardo Da Vinci

I know, we all love "cool" models with big names. But the fact is, simple models like Linear Regression perform very well in many practical cases. If you are interested in data crunching, it's good to have an understanding of the following concepts, which find repeated application in many more complex ML models.

Linear Regression models the relationship between a scalar response (the dependent variable y) and one or more explanatory variables (the independent variables X). It is a very simple yet powerful approach for predicting the dependent variable, and it has extensive applications in areas like Time Series Analysis, Finance (e.g., the Capital Asset Pricing Model), Epidemiology and other social sciences, and Machine Learning.

Below is a practical example from onlinestatbook.com. Here we believe a student's University GPA (our y-variable) can be explained by his or her High School GPA. Of course, this is a very naive model, and it won't be very accurate. But as we can see from the plot, our regression curve (the red line) captures a roughly accurate trend. Adding more relevant predictors (like study hours or courses taken at university) would give us a better model.

Similarly, we can try to predict sales (y) as a function of different modes of advertising, like TV, Newspaper and Internet. The early evidence relating tobacco smoking to cancer and morbidity came from observational studies employing Regression Analysis, though this created some disputes among scientists, since Correlation does not always imply Causation.

And now you understand why you took the Statistics-101 class, back in college!

(Comic Credit- xkcd)


Usually, we assume a linear relationship between y and X of the form y = Xb + error, and estimate the parameters of the model by standard techniques like Least Squares, Maximum Likelihood Estimation, Adaptive Estimation, Least Angle Regression etc. Linear regression has also been extended to Generalized Linear Models, Heteroscedastic Models, Hierarchical Linear Models and Measurement Error Models. Follow this Wikipedia article for more details- https://en.wikipedia.org/wiki/Linear_regression . Comment below if you have any particular query regarding these techniques.
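If you prefer code to formulas, here is a minimal sketch of fitting such a model by least squares in Python with statsmodels. The data is synthetic and the variable names are purely illustrative (my own worked examples, linked below, are in R):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Synthetic data: y depends linearly on a single predictor x, plus noise
x = rng.uniform(0, 10, size=100)
y = 5.0 + 0.5 * x + rng.normal(scale=1.0, size=100)

X = sm.add_constant(x)        # adds the intercept column to the design matrix
fit = sm.OLS(y, X).fit()      # ordinary least squares estimates of the coefficients

print(fit.params)             # [intercept, slope], close to [5.0, 0.5]
print(fit.summary())          # coefficient t-tests, R-squared, F-statistic, etc.
```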

If you know R, here is my implementation of Linear Regression on my GitHub- https://github.com/souravstat/Statistical-Learning-Solved-excercises-from-ISLR-book/blob/master/ch3.R . Here is a more detailed implementation in Python from Aishwarya Ramachandran- https://www.dhirubhai.net/pulse/tutorial-3-applying-linear-regression-python-aishwarya-c-ramachandran/.

" All models are Wrong, but some are Useful"- George Box

Essentially, the "Usefulness" of a particular model depends upon how its assumptions suits the real world application. Assumptions of Linear Regression includes Linearity (in Parameters), Constant Variance, Independence of Errors etc. The analysis might go south if these are violated to a large extent! So, here are some common problems while implementing Linear Regression (It started with this article, where I got a very enthusiastic response from my friends).

Here is how to deal with these practical problems while fitting a Regression Line. This is equally applicable to other regression models discussed here. I am roughly following the guidelines from the ISLR book.

1) Non-Linearity: We can plot the residuals (errors) against the fitted values to detect non-linearity in the data. Use non-linear transformations of the predictors (e.g., algebraic, logarithm or power transforms- see figure on the right) so that linear regression methods still apply.
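As a rough sketch of this check, here is synthetic data where the true relationship is log-linear (so the names and numbers are illustrative only):

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, size=200)
y = 3.0 + 2.0 * np.log(x) + rng.normal(scale=0.3, size=200)   # truly log-linear

# Fit y on x directly: the residuals-vs-fitted plot shows a curved pattern
fit_raw = sm.OLS(y, sm.add_constant(x)).fit()
plt.scatter(fit_raw.fittedvalues, fit_raw.resid)
plt.xlabel("Fitted values"); plt.ylabel("Residuals"); plt.show()

# Fit y on log(x): the curvature disappears and the fit improves
fit_log = sm.OLS(y, sm.add_constant(np.log(x))).fit()
print(fit_raw.rsquared, fit_log.rsquared)
```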

2) Outliers are observations with unusual y-values. Calculate the studentized residuals by dividing each residual e[i] by its estimated standard error. If the absolute value is > 3, it is a possible outlier. Carefully examine the observation, and decide whether to remove it or to proceed with imputation (replacement with a suitable value like the mean/median).
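In Python, statsmodels exposes studentized residuals through the influence object of a fitted model; a minimal sketch, reusing the fitted model `fit` from the first code snippet above:

```python
import numpy as np

influence = fit.get_influence()                     # OLSInfluence object for the fit above
stud_resid = influence.resid_studentized_external   # externally studentized residuals

# Flag observations whose absolute studentized residual exceeds 3
possible_outliers = np.where(np.abs(stud_resid) > 3)[0]
print(possible_outliers)
```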

3) Observations with unusual X-values are called High-Leverage Points. Calculate the leverage statistic h ( https://lnkd.in/fURc8ji ), which is bounded in [0,1]; a higher value denotes higher leverage. As we can see in the following example (Source- https://slideplayer.com/slide/9173878/), though the data cloud is similar in both cases, inclusion of the leverage point (the red observation) completely changes the regression curve.


We can also use the Cook's Distance plot (a plot of studentized residuals vs leverage for each point) to find such points (called Influential Points, as they significantly impact the fitted model). For a more detailed understanding of leverage and influence diagnostics, refer to this Lecture Note from Prof Shalabh, IIT Kanpur.
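Both diagnostics come from the same influence object in statsmodels; a sketch under the same assumptions as the earlier snippets (the cut-offs 2p/n and 4/n are common rules of thumb, not the only choices):

```python
import numpy as np

influence = fit.get_influence()
leverage = influence.hat_matrix_diag        # h_ii for each observation, bounded in [0, 1]
cooks_d, _ = influence.cooks_distance       # Cook's distance per observation

n, p = fit.model.exog.shape                 # p counts the intercept column too
high_leverage = np.where(leverage > 2 * p / n)[0]
influential = np.where(cooks_d > 4 / n)[0]

print(high_leverage, influential)
```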

4) Correlation of Error Terms: Sometimes the error terms in the model are correlated. For example, in Time Series data, observations corresponding to adjacent time points might be correlated. This also happens if the population is clustered in some way (e.g., members of the same family, or persons exposed to the same diet or similar environmental factors). This is why we use Controlled Experiments to minimize the effects of unwanted variables (those not included as independent variables).
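When the rows have a natural order (as in time series), a quick numerical check for first-order autocorrelation of the residuals is the Durbin-Watson statistic; a sketch, again on the fitted model `fit` from the first snippet:

```python
from statsmodels.stats.stattools import durbin_watson

# Values near 2 suggest no first-order autocorrelation in the residuals;
# values well below 2 suggest positive autocorrelation, well above 2 negative.
print(durbin_watson(fit.resid))
```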

5) Heteroscedasticity, or non-constant variance of the error terms, is another vital pitfall. We can plot the residuals (sometimes in standardized form) against the predicted values to detect it. As we can see from Fig. (b) below, a funnel-like shape denotes heteroscedasticity. We can then transform y using a concave function like the log or square root, which shrinks the larger values more. Linear Regression becomes applicable again, and we get a plot like (a).
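Besides eyeballing the residual plot, a formal check is the Breusch-Pagan test; a sketch of the test plus the log transformation of y, reusing the fitted model `fit` and response `y` from the first snippet (the log transform assumes y is positive):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Test the residuals of the existing fit for non-constant variance
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, fit.model.exog)
print(lm_pvalue)    # a small p-value suggests heteroscedasticity

# One remedy: shrink the larger responses by modelling log(y) instead of y
fit_logy = sm.OLS(np.log(y), fit.model.exog).fit()
```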

6) Multicollinearity: This refers to the situation where two or more predictor variables are closely related to each other. This is bad, as it introduces redundancy into the model. The least squares optimization is affected (i.e., it becomes unstable), since the design matrix X loses, or nearly loses, full column rank. This problem is especially prevalent in Survival Analysis, and in modelling interest rates for different maturity terms (as the rates move together).

Another popular(!) practice ensuring perfect multicollinearity is the dummy variable trap: including a dummy variable for every category (e.g., for each of the 12 months) along with the constant term as predictors.

This can be detected by random perturbation of the data (noting how the coefficients change after re-running the regression many times with small random noise added to the data). Another method is to calculate the Variance Inflation Factor (VIF); a VIF > 5 indicates the presence of multicollinearity.
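A minimal sketch of the VIF check with statsmodels, on synthetic data where one predictor is almost a copy of another (all names and thresholds here are illustrative):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.05, size=200)    # nearly a copy of x1 -> collinear
x3 = rng.normal(size=200)

X = sm.add_constant(np.column_stack([x1, x2, x3]))

# VIF for each predictor column (index 0 is the intercept, so skip it)
vifs = [variance_inflation_factor(X, j) for j in range(1, X.shape[1])]
print(vifs)    # x1 and x2 show very large VIFs; x3 stays near 1
```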

THAT'S ALL, FRIENDS!

Hope you have enjoyed this article. Let me know about YOUR experiences with Linear Regression!

Liked the article? Feel free to send a "Hello"! And connect with me on different platforms. I am active on GitHub, LinkedIn, Quora, Twitter, and Medium. My handle: souravstat

Read my other articles and find more Actionable Insights (And some sneak peek into my thought process!) - https://www.dhirubhai.net/pulse/my-linkedin-brain-dump-some-posts-from-2018-i-enjoyed-sourav-nandi/

Saurabh Dwivedy

Principal IT Architect at Boston Consulting Group (BCG)

6 years ago

Keep rocking Sourav Nandi!

Parth Suresh

Machine Learning Engineer @ Meta

6 years ago

Very informative! Looking forward to the next one..

Dr. Tapas Chakravarty

Principal Scientist at TCS Research (Senior Member of IEEE, MIET); Recipient of 'Distinguished Scientist' award from TCS Research

6 years ago

Most elegant. Thank you for this note

Sourav Nandi

Building Dezerv | IIT Kanpur | Ex- Morgan Stanley

6 years ago

Which one of these 6 pitfalls have you found most prevalent in your journey with data? Anything to add? - Saswata Sahoo, Hieu, Avik Sarkar, Jalaj, Nikhil, Himel Mallick, Vincent, Vin.

Aritra Chattaraj

Data Scientist | "Reluctant" Data Engineer | Analytics | Business Intelligence | SQL | Python | Power BI | Tableau | dbt

6 years ago

Wonderful explanation and thank you so much for accompanying them with the pictures, that surely puts things into perspective.
