Machine Learning: 'Regression' - Day 3
In the post before last, we discussed what regression is, and in the last one, we talked about the assumptions, or so-called limitations, of linear regression.
Linear regression is often considered the simplest machine learning algorithm the world has ever seen, and yes! It is!
Anyway! We also discussed how your model can give you poor predictions in the real world if you don't obey the assumptions of linear regression. Whatever you are going to predict, whether it is a stock value, sales, or some revenue, linear regression must be handled with care if you want to get the best out of it.
Linear regression says the data should be linear in nature; there must be a linear relationship. But wait! Real-world data is almost always non-linear. So, what should we do? Should we try to bring non-linearity into the regression model, or keep checking the residuals and fitted values, applying transformations, and working harder and harder to squeeze the best predictive model out of linear regression?
Now, the question is: should that be considered the solution, or is there another way to deal with this, so that I can get a better predictive model without getting caught up in the assumptions of linear regression?
Yes! There is a solution, in fact a whole bunch of solutions.
There are many different analytic procedures for fitting regression models of a non-linear nature (e.g., Generalized Linear/Nonlinear Models (GLZ), Generalized Additive Models (GAM), etc.), and there are even better options: tree-based regression models!
Most of us know about Random Forests and Decision Trees; they are very common, and in both classification and regression they often perform far better than other models.
In this post, I will mainly be talking about tree-based models such as Decision Trees and ensemble tree-based models like Random Forests. Tree-based models have proven themselves to be both reliable and effective, and are now part of any modern predictive modeler's toolkit.
But there are some cases where linear regression is considered better than tree-based models, such as the following:
- When the underlying function is truly linear
- When there are a very large number of features, especially with a very low signal-to-noise ratio. Tree-based models have a little trouble modeling linear combinations of a large number of features.
The point is: there are probably only a few cases in which linear models like simple linear regression (SLR) are better than tree-based models or other non-linear models, because the latter fit the data better from the get-go, without transformations.
They’re more forgiving in almost every way. You don’t need to scale your data, and you don’t need to do any monotonic transformations (log, square root, etc.). You often don’t even need to remove outliers. You can throw in features, and the model will automatically partition the data if it aids the fit. You don’t have to spend any time generating interaction terms as you do with linear models. And perhaps most important: in most cases, it’ll probably be notably more accurate.
The bottom line is: you can spend 3 hours playing with the data, generating features and interaction variables, and get a 77% R-squared; and I can “from sklearn.ensemble import RandomForestRegressor” and in 3 minutes get an 82% R-squared.
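To give a feel for what that "3 minutes" looks like, here is a minimal sketch. It is only an illustration, not the code from my repo: it uses scikit-learn's California housing dataset as a stand-in (the Boston dataset used later in this post has been removed from recent scikit-learn releases), so the exact scores will differ from the ones quoted above.

```python
# Minimal sketch: plain linear regression vs. an out-of-the-box random forest,
# both fit on the same raw, untransformed features.
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

X, y = fetch_california_housing(return_X_y=True)  # stand-in dataset (assumption)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Linear regression: no scaling, no transformations, no interaction terms.
lr = LinearRegression().fit(X_train, y_train)
print("Linear regression R^2:", r2_score(y_test, lr.predict(X_test)))

# Random forest on exactly the same raw features.
rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_train, y_train)
print("Random forest R^2:", r2_score(y_test, rf.predict(X_test)))
```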
I am not just creating hype around these tree models; I am going to show you this below in a Python implementation, using the same housing data we have been using for the last two posts.
Let me explain it with an example for clearer intuition.
Linear regression is a linear model, which means it works really nicely when the data has a linear shape. But when the data has a non-linear shape, a linear model cannot capture the non-linear features. In that case, you can use decision trees, which do a better job of capturing the non-linearity in the data by dividing the space into smaller sub-spaces depending on the questions asked.
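For a quick intuition, here is a tiny toy sketch (my own synthetic data, not part of the original post): a straight line vs. a shallow decision tree fit on data that follows a sine curve.

```python
# Toy illustration: a decision tree can capture a non-linear shape
# (a noisy sine curve) that a straight line cannot.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 6, size=(200, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)  # clearly non-linear target

lin = LinearRegression().fit(X, y)
tree = DecisionTreeRegressor(max_depth=4).fit(X, y)  # splits the space into sub-intervals

print("Linear fit R^2:", r2_score(y, lin.predict(X)))   # poor: the data is a sine curve
print("Tree fit   R^2:", r2_score(y, tree.predict(X)))  # much better piecewise fit
```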
Now, the question is: when do you use linear regression vs. Decision Trees? I guess the Quora answer here would do a better job than me at explaining the difference between them and their applications. Let me quote it for you:
Let’s suppose you are trying to predict income.
The predictor variables that are available are education, age, and city.
Now in a linear regression model, you have an equation with these three attributes. Fine. You’d expect higher degrees of education, higher “age” and larger cities to be associated with higher income.
But what about a PhD who is 40 years old and living in Scranton, Pennsylvania? Is he likely to earn more than a BS holder who is 35 and living on the Upper West Side of NYC? Maybe not. Maybe education totally loses its predictive power in a city like Scranton? Maybe age is a very ineffective, weak variable in a city like NYC?
This is where decision trees are handy. The tree can split by city and you get to use a different set of variables for each city. Maybe Age will be a strong second-level split variable in Scranton, but it might not feature at all in the NYC branch of the tree. Education may be a stronger variable in NYC.
Decision Trees, be it Random Forest or GBM, handle messier data and messier relationships better than regression models. And there is seldom a dataset in the real world where relationships are not messy. No wonder you will seldom see a linear regression model outperforming RF or GBM.
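To make that income example concrete, here is a small toy sketch of my own (hypothetical data, not from the Quora answer): the payoff from education is made to depend on the city, and the tree ensemble picks that up without any hand-crafted interaction terms.

```python
# Toy version of the income example: education pays off strongly in "NYC",
# much less in "Scranton", while age matters more in Scranton. The numbers
# and city names are made up purely for illustration.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.RandomState(1)
n = 2000
city = rng.choice(["NYC", "Scranton"], size=n)
education = rng.randint(0, 4, size=n)   # 0=HS, 1=BS, 2=MS, 3=PhD
age = rng.randint(22, 65, size=n)

income = np.where(city == "NYC",
                  40 + 20 * education + 0.2 * age,   # education dominates in NYC
                  30 + 3 * education + 0.8 * age)    # age dominates in Scranton
income = income + rng.normal(scale=5, size=n)

# One-hot encode city; no explicit education*city or age*city interaction terms.
X = pd.get_dummies(pd.DataFrame({"city": city, "education": education, "age": age}))

lin = LinearRegression().fit(X, income)
rf = RandomForestRegressor(n_estimators=200, random_state=1).fit(X, income)

print("Linear (no interaction terms) R^2:", r2_score(income, lin.predict(X)))
print("Random forest                 R^2:", r2_score(income, rf.predict(X)))
```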
So, this is the main idea behind tree-based models (Decision Tree Regression) and ensemble-based models (Random Forest Regression / Gradient Boosting Regression / Extreme Gradient Boosting Regression).
The following is the link to the repo. Check out the Day 3 implementation.
I used the Boston Housing Data to train all the different regression models available in sklearn.
These were my R-squared results on the training and testing data (30% test split); you can find them there. The repo is open for contributions: I invite newcomers to improve my models, apply some feature engineering, or do some hyper-parameter tuning using Grid Search to make them even better.
As you can see, GBM/RF are performing the best.
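If you want to try the Grid Search idea mentioned above, a minimal sketch could look like the following. The parameter ranges are illustrative rather than tuned, and the California housing dataset is only a stand-in for the repo's own data loading.

```python
# Minimal sketch of hyper-parameter tuning for a gradient boosting regressor
# with GridSearchCV; parameter grid and dataset are placeholders.
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = fetch_california_housing(return_X_y=True)  # stand-in for the housing data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [2, 3, 4],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(GradientBoostingRegressor(random_state=42),
                      param_grid, cv=5, scoring="r2")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Test R^2:", search.score(X_test, y_test))
```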
Below are the links to almost all the regression techniques in scikit-learn.
1. Ordinary least squares Linear Regression:
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
2. Linear least squares with l2 regularization:
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html
3. Linear Model trained with L1 prior as regularizer (aka the Lasso):
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html
4. Support Vector Regression:
https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html
5. A decision tree regressor:
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html
6. A random forest regressor:
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
7. Linear model fitted by minimizing a regularized empirical loss with SGD:
SGD stands for Stochastic Gradient Descent: the gradient of the loss is estimated one sample at a time, and the model is updated along the way with a decreasing strength schedule (aka learning rate).
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html
8. Gradient Boosting for regression:
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html
Feel Free to Criticize!
See you in the next!!