Are you into Data Science? Learn Linear Regression first (Introduction, some Pitfalls and how to avoid them)
Linear Regression Models and Their Pitfalls, by Sourav Nandi

"Simplicity is the Ultimate Sophistication"- Leonardo Da Vinci

I know, we all love "cool" models with big names. But the fact is, simple models like Linear Regression perform very well in many practical cases. If you are interested in data crunching, it's good to have an understanding of the following concepts, which find repeated application in many more complex ML models.

Linear Regression models the relationship between a scalar response (the dependent variable y) and one or more explanatory variables (the independent variables X). It is a very simple yet powerful approach for predicting the dependent variable, and it has extensive applications in areas like Time Series Analysis, Finance (e.g., the Capital Asset Pricing Model), Epidemiology and other social sciences, and Machine Learning.

Below is a practical example from onlinestatbook.com. Here we believe a student's University GPA (our y-variable) can be explained by his or her High School GPA. Of course, this is a very naive model, and it won't be very accurate. But as we can see from the plot, our regression curve (the red line) captures a roughly accurate trend. Adding more relevant predictors (like study hours or courses taken at university) would give us a better model.

Similarly, we can try to predict sales (y) as a function of different modes of advertising, like TV, Newspaper and Internet. The early evidence relating tobacco smoking to cancer and morbidity came from observational studies employing Regression Analysis, though this created some disputes among scientists, since Correlation does not always imply Causation.

And now you understand why you took the Statistics-101 class, back in college!

(Comic Credit- xkcd)


Usually, we assume a linear relationship between y and X of the form y = Xb + error, and estimate the parameters of the model by standard techniques like Least Squares, Maximum Likelihood Estimation, Adaptive Estimation, Least Angle Regression etc. Linear regression has also been extended to Generalized Linear Models, Heteroscedastic Models, Hierarchical Linear Models and Measurement Error Models. Follow this Wikipedia article for more details- https://en.wikipedia.org/wiki/Linear_regression . Comment below if you have any particular query regarding these techniques.
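If you prefer code to formulas, here is a minimal sketch of fitting such a model by least squares in Python with statsmodels. The data is synthetic and the variable names are purely illustrative (my own worked examples, linked below, are in R):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Synthetic data: y depends linearly on a single predictor x, plus noise
x = rng.uniform(0, 10, size=100)
y = 5.0 + 0.5 * x + rng.normal(scale=1.0, size=100)

X = sm.add_constant(x)        # adds the intercept column to the design matrix
fit = sm.OLS(y, X).fit()      # ordinary least squares estimates of the coefficients

print(fit.params)             # [intercept, slope], close to [5.0, 0.5]
print(fit.summary())          # coefficient t-tests, R-squared, F-statistic, etc.
```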

If you know R, here is my implementation of Linear Regression on my GitHub- https://github.com/souravstat/Statistical-Learning-Solved-excercises-from-ISLR-book/blob/master/ch3.R . Here is a more detailed implementation in Python from Aishwarya Ramachandran- https://www.dhirubhai.net/pulse/tutorial-3-applying-linear-regression-python-aishwarya-c-ramachandran/.

" All models are Wrong, but some are Useful"- George Box

Essentially, the "Usefulness" of a particular model depends upon how its assumptions suits the real world application. Assumptions of Linear Regression includes Linearity (in Parameters), Constant Variance, Independence of Errors etc. The analysis might go south if these are violated to a large extent! So, here are some common problems while implementing Linear Regression (It started with this article, where I got a very enthusiastic response from my friends).

Here is how to deal with these practical problems while fitting a Regression Line. This is equally applicable to other regression models discussed here. I am roughly following the guidelines from the ISLR book.

1) Non-Linearity: We can plot the residuals (errors) against the fitted values to detect non-linearity in the data. Use non-linear transformations of the predictors (e.g., algebraic, logarithm or power transforms- see figure on the right) so that linear regression methods still apply.
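As a rough sketch of this check, here is synthetic data where the true relationship is log-linear (so the names and numbers are illustrative only):

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, size=200)
y = 3.0 + 2.0 * np.log(x) + rng.normal(scale=0.3, size=200)   # truly log-linear

# Fit y on x directly: the residuals-vs-fitted plot shows a curved pattern
fit_raw = sm.OLS(y, sm.add_constant(x)).fit()
plt.scatter(fit_raw.fittedvalues, fit_raw.resid)
plt.xlabel("Fitted values"); plt.ylabel("Residuals"); plt.show()

# Fit y on log(x): the curvature disappears and the fit improves
fit_log = sm.OLS(y, sm.add_constant(np.log(x))).fit()
print(fit_raw.rsquared, fit_log.rsquared)
```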

2) Outliers are observations with unusual y-values. Calculate the studentized residuals by dividing each residual e[i] by its estimated standard error. If the absolute value is > 3, it is a possible outlier. Carefully examine the observation, and decide whether to remove it or to proceed with imputation (replacement with a suitable value like the mean/median).
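In Python, statsmodels exposes studentized residuals through the influence object of a fitted model; a minimal sketch, reusing the fitted model `fit` from the first code snippet above:

```python
import numpy as np

influence = fit.get_influence()                     # OLSInfluence object for the fit above
stud_resid = influence.resid_studentized_external   # externally studentized residuals

# Flag observations whose absolute studentized residual exceeds 3
possible_outliers = np.where(np.abs(stud_resid) > 3)[0]
print(possible_outliers)
```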

3) Observations with unusual X-values are called High-Leverage Points. Calculate the leverage statistic h ( https://lnkd.in/fURc8ji ), which is bounded in [0,1]; a higher value denotes higher leverage. As we can see in the following example (Source- https://slideplayer.com/slide/9173878/), though the data cloud is similar in both cases, inclusion of the leverage point (the red observation) completely changes the regression curve.


We can also use the Cook's Distance plot (a plot of studentized residuals vs leverage for each point) to find such points (called Influential Points, as they significantly impact the fitted model). For a more detailed understanding of leverage and influence diagnostics, refer to this Lecture Note from Prof Shalabh, IIT Kanpur.
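Both diagnostics come from the same influence object in statsmodels; a sketch under the same assumptions as the earlier snippets (the cut-offs 2p/n and 4/n are common rules of thumb, not the only choices):

```python
import numpy as np

influence = fit.get_influence()
leverage = influence.hat_matrix_diag        # h_ii for each observation, bounded in [0, 1]
cooks_d, _ = influence.cooks_distance       # Cook's distance per observation

n, p = fit.model.exog.shape                 # p counts the intercept column too
high_leverage = np.where(leverage > 2 * p / n)[0]
influential = np.where(cooks_d > 4 / n)[0]

print(high_leverage, influential)
```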

4) Correlation of Error Terms: Sometimes the error terms in the model are correlated. For example, in Time Series data, observations corresponding to adjacent time points might be correlated. This also happens if the population is clustered in some way (e.g., members of the same family, or persons exposed to the same diet or similar environmental factors). This is why we use Controlled Experiments to minimize the effects of unwanted variables (those not included as independent variables).
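When the rows have a natural order (as in time series), a quick numerical check for first-order autocorrelation of the residuals is the Durbin-Watson statistic; a sketch, again on the fitted model `fit` from the first snippet:

```python
from statsmodels.stats.stattools import durbin_watson

# Values near 2 suggest no first-order autocorrelation in the residuals;
# values well below 2 suggest positive autocorrelation, well above 2 negative.
print(durbin_watson(fit.resid))
```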

5) Heteroscedasticity, or non-constant variance of the error terms, is another vital pitfall. We can plot the residuals (sometimes in standardized form) against the predicted values to detect it. As we can see from Fig. (b) below, a funnel-like shape denotes heteroscedasticity. We can then transform y using a concave function like the log or square root, which shrinks the larger values more. Linear Regression becomes applicable again, and we get a plot like (a).
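Besides eyeballing the residual plot, a formal check is the Breusch-Pagan test; a sketch of the test plus the log transformation of y, reusing the fitted model `fit` and response `y` from the first snippet (the log transform assumes y is positive):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Test the residuals of the existing fit for non-constant variance
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, fit.model.exog)
print(lm_pvalue)    # a small p-value suggests heteroscedasticity

# One remedy: shrink the larger responses by modelling log(y) instead of y
fit_logy = sm.OLS(np.log(y), fit.model.exog).fit()
```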

6) Multicollinearity: This refers to the situation where two or more predictor variables are closely related to each other. This is bad, as it introduces redundancy into the model. The least squares optimization is affected (i.e., it becomes unstable), since the design matrix X loses, or nearly loses, full column rank. This problem is especially prevalent in Survival Analysis, and in modelling interest rates for different maturity terms (as the rates move together).

Another popular(!) practice ensuring perfect multicollinearity is the dummy variable trap: including a dummy variable for every category (e.g., for each of the 12 months) along with the constant term as predictors.

This can be detected by random perturbation of the data (noting how the coefficients change after re-running the regression many times with small random noise added to the data). Another method is to calculate the Variance Inflation Factor (VIF); a VIF > 5 indicates the presence of multicollinearity.
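A minimal sketch of the VIF check with statsmodels, on synthetic data where one predictor is almost a copy of another (all names and thresholds here are illustrative):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.05, size=200)    # nearly a copy of x1 -> collinear
x3 = rng.normal(size=200)

X = sm.add_constant(np.column_stack([x1, x2, x3]))

# VIF for each predictor column (index 0 is the intercept, so skip it)
vifs = [variance_inflation_factor(X, j) for j in range(1, X.shape[1])]
print(vifs)    # x1 and x2 show very large VIFs; x3 stays near 1
```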

THAT'S ALL, FRIENDS!

Hope you have enjoyed this article. Let me know about YOUR experiences with Linear Regression!

Liked the article? Feel free to send a "Hello"! And connect with me on different platforms. I am active on GitHub, LinkedIn, Quora, Twitter, and Medium. My handle: souravstat

Read my other articles and find more Actionable Insights (And some sneak peek into my thought process!) - https://www.dhirubhai.net/pulse/my-linkedin-brain-dump-some-posts-from-2018-i-enjoyed-sourav-nandi/

Saurabh Dwivedy

Principal IT Architect at Boston Consulting Group (BCG)

6 years ago

Keep rocking Sourav Nandi!

Parth Suresh

Machine Learning Engineer @ Meta

6 years ago

Very informative! Looking forward to the next one..

Dr. Tapas Chakravarty

Principal Scientist at TCS Research (Senior Member of IEEE, MIET); Recipient of 'Distinguished Scientist' award from TCS Research

6 years ago

Most elegant. Thank you for this note

Sourav Nandi

Building Dezerv | IIT Kanpur | Ex- Morgan Stanley

6 years ago

Which one of these 6 pitfalls have you found most prevalent in your journey with data? Anything to add? - Saswata Sahoo, Hieu, Avik Sarkar, Jalaj, Nikhil, Himel Mallick, Vincent, Vin.

Aritra Chattaraj

Data Scientist | "Reluctant" Data Engineer | Analytics | Business Intelligence | SQL | Python | Power BI | Tableau | dbt

6 years ago

Wonderful explanation and thank you so much for accompanying them with the pictures, that surely puts things into perspective.
