Machine Learning : 'Regression' - Day 1

Hi everyone!

This is the first post in my new series of articles, which will focus on machine learning algorithms and their practical implementation in Python using real-world data sets.

I assume that you have basic knowledge of Python programming and have used the basic libraries such as NumPy, SciPy, pandas, Matplotlib and scikit-learn (sklearn).

If you haven't, I am attaching the links here for you!

Numpy: https://bit.ly/2k4igLd

Scipy: https://bit.ly/1oyJqYk

Pandas: https://bit.ly/2qs1lAJ

Matplotlib: https://bit.ly/2EMuVNG

Sklearn: https://bit.ly/2j049C4

I wanted to provide a quick introduction to building models in Python, and what better way to start than one of the very basic models, linear regression?

This will be the first post about machine learning and I plan to write about more complex models in the future. Stay tuned! But for right now, let’s focus on linear regression.

I want to focus on the concept of linear regression and mainly on the implementation of it in Python. 

Here we go!!

Chapter 1 : Regression

Linear regression is a statistical model that examines the linear relationship between two (Simple Linear Regression) or more (Multiple Linear Regression) variables: a dependent variable and one or more independent variables. A linear relationship means that when an independent variable increases or decreases, the dependent variable changes by a proportional amount, either in the same direction or in the opposite direction.

A linear relationship can be positive (independent variable goes up, dependent variable goes up) or negative (independent variable goes up, dependent variable goes down), or there may be no clear linear relationship at all. Like I said, I will focus on the implementation of regression models in Python, so I don’t want to delve too much into the math under the regression hood, but I will write a little bit about it.
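To make the positive and negative cases concrete, here is a small sketch that measures the direction of a relationship with Pearson's correlation coefficient. The data is made up for illustration (the slopes, noise level and seed are my assumptions, not values from any real data set).

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed so the sketch is repeatable
x = np.linspace(0, 10, 50)              # hypothetical independent variable
noise = rng.normal(scale=1.0, size=x.size)

y_positive = 2.0 * x + 1.0 + noise      # y tends to rise as x rises
y_negative = -2.0 * x + 1.0 + noise     # y tends to fall as x rises

# np.corrcoef returns a 2x2 correlation matrix; entry [0, 1] is Pearson's r.
r_positive = np.corrcoef(x, y_positive)[0, 1]
r_negative = np.corrcoef(x, y_negative)[0, 1]

print("positive relationship, r =", round(r_positive, 3))
print("negative relationship, r =", round(r_negative, 3))
```

A value of r close to +1 suggests a strong positive linear relationship, a value close to -1 a strong negative one, and a value near 0 little or no linear relationship.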

A Little Bit About the Math

A relationship between variables Y and X is represented by this equation:

Y'_i = m X_i + b

In this equation, Y is the dependent variable, the variable we are trying to predict or estimate; X is the independent variable, the variable we are using to make predictions; m is the slope of the regression line and represents the effect X has on Y; b is the Y-intercept, the value of Y when X is 0.

This is Simple Linear Regression (SLR). In an SLR model, we build a model based on data: the slope and Y-intercept are derived from the data. Furthermore, we don’t need the relationship between X and Y to be exactly linear. SLR models also account for the errors in the data (also known as residuals). I won’t go too much into it now, maybe in a later post, but residuals are basically the differences between the true value of Y and the predicted/estimated value of Y. It is important to note that in a linear regression, we are trying to predict a continuous variable.

In a regression model, we try to minimize these errors by finding the “line of best fit”: the regression line for which the errors are minimal. In other words, we want the distances between the observed data points and the line to be as close to zero as possible. In ordinary least squares, this means minimizing the mean squared error (MSE) or, equivalently, the sum of squared errors (SSE), also called the “residual sum of squares” (RSS), but this might be beyond the scope of this blog post :-)
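For readers who want to see the "line of best fit" idea in numbers, here is a minimal sketch of the closed-form least-squares estimates for the slope m and intercept b, followed by the residuals, SSE and MSE. The five data points are hypothetical toy values I made up, and the variable names are my own.

```python
import numpy as np

# Hypothetical toy data, not taken from any real data set.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form least-squares estimates for Y' = mX + b.
x_mean, y_mean = x.mean(), y.mean()
m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
b = y_mean - m * x_mean

predicted = m * x + b          # points on the fitted line
residuals = y - predicted      # true value minus predicted value
sse = np.sum(residuals ** 2)   # sum of squared errors (RSS)
mse = sse / x.size             # mean squared error

print("slope m:", m)
print("intercept b:", b)
print("SSE:", sse, "MSE:", mse)
```

Any other choice of m and b would give a larger SSE on these points, which is what "best fit" means here.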

In most cases, we will have more than one independent variable; there can be as few as two and up to hundreds (or theoretically even thousands) of variables. In those cases we will use a Multiple Linear Regression (MLR) model. The regression equation is pretty much the same as the simple regression equation, just with more variables:


Y'_i = b_0 + b_1 X_{1i} + b_2 X_{2i}
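As a quick illustration of how this equation is evaluated, the sketch below computes the predictions for three observations as a single matrix product. The coefficient values and the observations are hypothetical numbers chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical coefficients: b0 (intercept), b1, b2.
coefficients = np.array([1.0, 2.0, -0.5])

# Each row is one observation i: (X1i, X2i).
X = np.array([
    [1.0, 4.0],
    [2.0, 2.0],
    [3.0, 0.0],
])

# Prepend a column of ones so the intercept b0 is handled by the same product.
design = np.column_stack([np.ones(len(X)), X])
y_pred = design @ coefficients   # one predicted Y' per row

print(y_pred)
```

For the first row this is 1.0 + 2.0 * 1.0 - 0.5 * 4.0 = 1.0, which matches the equation term by term.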

This concludes the math portion of this post :) Ready to get to implementing it in Python?

Project 1: Predicting Boston Housing Prices

Here we perform a simple regression analysis on the Boston housing data, fitting a basic linear regression model.

I will use the Boston Housing data set, which contains information about housing values in the suburbs of Boston. This data set was originally taken from the StatLib library, which is maintained at Carnegie Mellon University, and is now available on the UCI Machine Learning Repository. The UCI Machine Learning Repository contains many interesting data sets, and I encourage you to go through it.

We will use sklearn to import the Boston data, as it ships with a bunch of useful data sets to practice on, and we will also import the LinearRegression model from sklearn. You can also code your own linear regression model as a function or class in Python, and it's easy.
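As an aside on coding the model yourself, here is one possible sketch of a tiny least-squares regressor written as a class. The class name SimpleLinearModel and the toy data are hypothetical; this is not sklearn's implementation, only an illustration that uses NumPy's least-squares solver.

```python
import numpy as np


class SimpleLinearModel:
    """Hypothetical minimal least-squares regressor (not sklearn's class)."""

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        # A column of ones lets the solver estimate the intercept as well.
        design = np.column_stack([np.ones(len(X)), X])
        # lstsq finds the coefficients that minimize the sum of squared errors.
        solution, *_ = np.linalg.lstsq(design, y, rcond=None)
        self.intercept_ = solution[0]
        self.coef_ = solution[1:]
        return self

    def predict(self, X):
        return np.asarray(X, dtype=float) @ self.coef_ + self.intercept_


# Usage on noise-free toy data generated from y = 3 + 2*x1 - x2.
X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 3.0]])
y_demo = 3.0 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1]
model = SimpleLinearModel().fit(X_demo, y_demo)

print("intercept:", model.intercept_)
print("coefficients:", model.coef_)
print("prediction for (1, 1):", model.predict([[1.0, 1.0]]))
```

I named the attributes intercept_ and coef_ to mirror what sklearn's LinearRegression exposes, so switching between the two should feel familiar.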


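# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this import needs scikit-learn < 1.2.
from sklearn.datasets import load_boston
data = load_boston()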

Plot a histogram of the quantity to predict: price

import matplotlib.pyplot as plt
# Jupyter/IPython magic: only valid in a notebook; remove it in a plain script.
%matplotlib inline
plt.style.use('bmh')

plt.figure(figsize=(15, 6))
plt.hist(data.target)
plt.xlabel('price ($1000s)')
plt.ylabel('count')
plt.tight_layout()


Plot a scatter plot of each feature against price

for index, feature_name in enumerate(data.feature_names):
    plt.figure(figsize=(4, 3))
    plt.scatter(data.data[:, index], data.target)
    plt.ylabel('Price', size=15)
    plt.xlabel(feature_name, size=15)
    plt.tight_layout()

Prediction

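from sklearn.model_selection import train_test_split
# Hold out part of the data (25% by default) to evaluate the fitted model
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target)

from sklearn.linear_model import LinearRegression
clf = LinearRegression()
clf.fit(X_train, y_train)         # estimate coefficients on the training set
predicted = clf.predict(X_test)   # predict prices for the held-out rows
expected = y_test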

plt.figure(figsize=(15, 6))
plt.scatter(expected, predicted)
plt.plot([0, 50], [0, 50], '--k')
plt.axis('tight')
plt.xlabel('True price ($1000s)')
plt.ylabel('Predicted price ($1000s)')
plt.tight_layout()


Print the root-mean-square (RMS) error

import numpy as np
print("RMS: %r " % np.sqrt(np.mean((predicted - expected) ** 2)))

RMS: 5.3588890300591467 
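One thing to keep in mind is that train_test_split shuffles the data randomly unless random_state is set, so the RMS value will change from run to run. The sketch below shows this on synthetic stand-in data from make_regression (an assumption on my part, so it runs without the Boston data), and cross-checks the manual RMS formula against sklearn's mean_squared_error. The helper name rms_for_split is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the Boston data set).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)


def rms_for_split(seed):
    """Fit on one seeded split and return (manual RMS, library-based RMS)."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed)
    model = LinearRegression().fit(X_train, y_train)
    predicted = model.predict(X_test)
    manual = np.sqrt(np.mean((predicted - y_test) ** 2))
    library = np.sqrt(mean_squared_error(y_test, predicted))
    return manual, library


manual_a, library_a = rms_for_split(seed=42)
manual_b, _ = rms_for_split(seed=42)   # same seed, same split
manual_c, _ = rms_for_split(seed=7)    # different seed, different split

print("seed 42:", manual_a, "(library check:", library_a, ")")
print("seed 42 again:", manual_b)
print("seed 7:", manual_c)
```

Passing a fixed random_state is a simple way to make a reported error reproducible when you share results.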


So, this was the simplest implementation of linear regression to predict house prices and validate the model. Since prices are in $1000s, an RMS error of about 5.36 corresponds to roughly $5,360. In the next post, I will come up with a stronger predictive model for the same problem.

Stay Tuned!! Surprises Coming!!

Feel free to criticize.
