Machine Learning : 'Regression' - Day 1

Shivam Panchal

Data Scientist | Machine Learning Engineer

Published: February 22, 2018

Hi everyone!

This is the first post in my new series of articles, which will focus on machine learning algorithms and their practical implementation in Python using real-world data sets.

I assume that you have basic knowledge of Python programming and have used the basic libraries such as NumPy, SciPy, pandas, Matplotlib and scikit-learn (sklearn).

If you haven't, I am attaching the links here for you!

Numpy: https://bit.ly/2k4igLd

Scipy: https://bit.ly/1oyJqYk

Pandas: https://bit.ly/2qs1lAJ

Matplotlib: https://bit.ly/2EMuVNG

Sklearn: https://bit.ly/2j049C4

I wanted to provide a quick introduction to building models in Python, and what better way to start than one of the very basic models, linear regression?

This will be the first post about machine learning and I plan to write about more complex models in the future. Stay tuned! But for right now, let’s focus on linear regression.

I want to cover the concept of linear regression, but mainly its implementation in Python.

Here we go!!

Chapter 1 : Regression

Linear regression is a statistical model that examines the linear relationship between two (simple linear regression) or more (multiple linear regression) variables: a dependent variable and one or more independent variables. A linear relationship basically means that when an independent variable increases or decreases, the dependent variable changes by a proportional amount.

A linear relationship can be positive (independent variable goes up, dependent variable goes up) or negative (independent variable goes up, dependent variable goes down). Like I said, I will focus on the implementation of regression models in Python, so I don’t want to delve too much into the math under the regression hood, but I will write a little bit about it.

A Little Bit About the Math

A relationship between variables Y and X is represented by this equation:

$$Y'_i = m X_i + b$$

In this equation, Y is the dependent variable, the variable we are trying to predict or estimate; X is the independent variable, the variable we are using to make predictions; m is the slope of the regression line, which represents the effect X has on Y (Y changes by m for every one-unit increase in X); and b is the Y-intercept, the value of Y when X is 0.

This is Simple Linear Regression (SLR). In an SLR model, the slope and Y-intercept are estimated from the data; furthermore, we don’t need the relationship between X and Y to be exactly linear. SLR models also account for the errors in the data (also known as residuals). I won’t go too much into it now, maybe in a later post, but residuals are basically the differences between the true value of Y and the predicted/estimated value of Y. It is important to note that in a linear regression, we are trying to predict a continuous variable.

In a regression model, we are trying to minimize these errors by finding the “line of best fit”: the regression line for which the errors are smallest. On a scatter plot of the data, that means making the distances between the data points and the fitted line as close to zero as possible. In ordinary least squares, this is done by minimizing the mean squared error (MSE) or, equivalently, the sum of squared errors (SSE), also called the “residual sum of squares” (RSS), but this might be beyond the scope of this blog post :-)
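To make the slope, intercept, residuals and squared-error quantities above concrete, here is a minimal sketch using NumPy. The five data points are made up for illustration, and the closed-form formulas are the standard least-squares estimates for a single feature; this is not code from the Boston project below.

```python
import numpy as np

# Toy data, invented for illustration: y is roughly 2*x + 1 with small noise.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.0])

# Closed-form least-squares estimates for one feature:
#   m = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))**2)
#   b = mean(y) - m * mean(x)
x_mean, y_mean = x.mean(), y.mean()
m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
b = y_mean - m * x_mean

y_hat = m * x + b              # predicted values on the fitted line
residuals = y - y_hat          # true value minus predicted value
sse = np.sum(residuals ** 2)   # sum of squared errors (a.k.a. RSS)
mse = sse / len(x)             # mean squared error

print("slope m = %.3f, intercept b = %.3f" % (m, b))
print("SSE = %.4f, MSE = %.4f" % (sse, mse))
```

Any other line through these points, for example y = 2x + 1, gives a larger SSE than the fitted one, which is what "line of best fit" means here.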

In most cases, we will have more than one independent variable; it can be as few as two and up to hundreds (or theoretically even thousands) of variables. In those cases we will use a Multiple Linear Regression (MLR) model. The regression equation is pretty much the same as the simple regression equation, just with more variables:

$$Y'_i = b_0 + b_1 X_{1i} + b_2 X_{2i}$$
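As a quick worked example of the multiple regression equation, the sketch below evaluates it for three observations with two features each. The coefficient values and the data are hypothetical, picked only so the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical coefficients, chosen only for illustration.
b0 = 1.5                      # intercept
b = np.array([2.0, -0.5])     # b1 and b2

# Three observations, each with two features (X1, X2).
X = np.array([
    [1.0, 4.0],
    [2.0, 0.0],
    [3.0, 2.0],
])

# Y'_i = b0 + b1*X1_i + b2*X2_i, computed for every row at once.
y_pred = b0 + X @ b
print(y_pred)
```

The matrix product handles any number of features, so the same line works whether there are two variables or two hundred.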

This concludes the math portion of this post :) Ready to get to implementing it in Python?

Project 1: Predicting Boston Housing Prices

Here we perform a simple regression analysis on the Boston housing data, exploring a simple type of linear regression model.

I will use the Boston Housing data set, which contains information about housing values in the suburbs of Boston. This data set was originally taken from the StatLib library, which is maintained at Carnegie Mellon University, and is now available on the UCI Machine Learning Repository. The UCI Machine Learning Repository contains many interesting data sets; I encourage you to go through it.

We will use sklearn to load the Boston data, as it ships with a bunch of useful data sets to practice on, and we will also import the LinearRegression model from sklearn. You can also code your own linear regression model as a function or class in Python, and it's easy.

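# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this import needs scikit-learn < 1.2.
from sklearn.datasets import load_boston
data = load_boston()   # Bunch with .data, .target and .feature_names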
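On the remark that you can code linear regression yourself as a class: here is one minimal sketch of what that might look like. The class name SimpleLinearRegressor is hypothetical, it solves ordinary least squares with np.linalg.lstsq, and it is tried on noise-free toy data rather than the Boston data. It only mimics the fit/predict shape of sklearn's LinearRegression and is not how sklearn is implemented.

```python
import numpy as np

class SimpleLinearRegressor:
    """Hypothetical minimal ordinary-least-squares regressor (not sklearn's class)."""

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        # Prepend a column of ones so the intercept is learned like any coefficient.
        A = np.column_stack([np.ones(len(X)), X])
        # lstsq returns the coefficients that minimise the sum of squared errors.
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        self.intercept_ = coef[0]
        self.coef_ = coef[1:]
        return self

    def predict(self, X):
        return self.intercept_ + np.asarray(X, dtype=float) @ self.coef_

# Noise-free toy data generated from y = 3 + 2*x1 - x2, so the fit should recover it.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 3.0]])
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1]

model = SimpleLinearRegressor().fit(X, y)
print(model.intercept_, model.coef_)
print(model.predict(np.array([[4.0, 1.0]])))
```

Because the toy data has no noise, the learned intercept and coefficients should match the generating values up to floating-point error.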

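Plot a histogram of the quantity to predict: price

import matplotlib.pyplot as plt
# Jupyter/IPython magic that renders figures inline; drop this line in a plain .py script
%matplotlib inline
plt.style.use('bmh')

# Distribution of the target (median home value, in $1000s)
plt.figure(figsize=(15, 6))
plt.hist(data.target)
plt.xlabel('price ($1000s)')
plt.ylabel('count')
plt.tight_layout()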

Plot each feature against the price (one scatter plot per feature)

for index, feature_name in enumerate(data.feature_names):
    plt.figure(figsize=(4, 3))
    plt.scatter(data.data[:, index], data.target)
    plt.ylabel('Price', size=15)
    plt.xlabel(feature_name, size=15)
    plt.tight_layout()

Prediction

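from sklearn.model_selection import train_test_split
# Hold out 25% of the rows (the default) for testing; the split is random
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target)

from sklearn.linear_model import LinearRegression
clf = LinearRegression()            # ordinary least squares
clf.fit(X_train, y_train)           # learn the coefficients from the training rows
predicted = clf.predict(X_test)     # predictions for the held-out rows
expected = y_test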

plt.figure(figsize=(15, 6))
plt.scatter(expected, predicted)
plt.plot([0, 50], [0, 50], '--k')
plt.axis('tight')
plt.xlabel('True price ($1000s)')
plt.ylabel('Predicted price ($1000s)')
plt.tight_layout()
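The same split, fit and predict workflow can be tried on data where the true coefficients are known, which makes it easy to check that the model recovers them. The sketch below uses synthetic data (an assumption for illustration, not the Boston data) and passes random_state so the split is the same on every run.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the Boston data):
# y = 4 + 3*x1 - 2*x2 plus small Gaussian noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 4.0 + 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.5, size=200)

# random_state fixes the shuffle so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)
predicted = model.predict(X_test)

print("intercept:", model.intercept_)
print("coefficients:", model.coef_)
print("R^2 on held-out data:", model.score(X_test, y_test))
```

The fitted intercept_ and coef_ should land close to 4, 3 and -2, and the held-out R^2 should be close to 1 because the noise is small relative to the signal.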

Print the root-mean-square (RMS) error

import numpy as np
print("RMS: %r " % np.sqrt(np.mean((predicted - expected) ** 2)))

RMS: 5.3588890300591467
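To see what the RMS figure measures, here is the same calculation wrapped in a small helper and applied to four made-up prices. The function name rmse and the numbers are illustrative only.

```python
import numpy as np

def rmse(expected, predicted):
    """Root-mean-square error between two equal-length sequences."""
    expected = np.asarray(expected, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((predicted - expected) ** 2)))

# Made-up prices in $1000s, for illustration only.
true_prices = [20.0, 30.0, 25.0, 40.0]
predicted_prices = [22.0, 27.0, 25.0, 44.0]

# Errors are 2, -3, 0, 4; their squares are 4, 9, 0, 16; the mean is 7.25.
print("RMS: %.4f" % rmse(true_prices, predicted_prices))
```

Squaring before averaging means the one prediction that is off by 4 contributes more than the two smaller errors combined, so RMS penalizes large misses more heavily than a plain average of absolute errors would.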

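So, this was the simplest implementation of linear regression to predict house prices and validate the model. Since prices are in $1000s, an RMS error of about 5.36 corresponds to roughly $5,360; the exact value changes from run to run because train_test_split shuffles the data randomly. In the next post, I will come up with a stronger predictive model for the same problem.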

Stay Tuned!! Surprises Coming!!

Feel free to criticize.

Akash Kandpal

Assistant Manager at State Bank of India

7 years ago

" %matplotlib inline " afaik is used in jupyter notebooks only .

Akash Kandpal

Assistant Manager at State Bank of India

7 years ago

One thing which I liked about this post is the practicality , keep them coming.

1 reaction

Kumar Rishav

Software Engineer at Wayfair | Ex Accenture Japan

7 years ago

Sir, the link that you have mentioned above i.e, the Boston Housing Data Set and the UCI Machine Learning repository is not working

Anjali S.

Co-Founder

7 years ago

Really Helpful.... Thanks

1 reaction

Nikhil Kumar Jha

Applied Scientist @ Audible, Amazon | Data Science

7 years ago

Really informative. Great job.

1 reaction
