Machine Learning : 'Regression' - Day 1
Hi everyone!
This is the first post in my new series of articles, which will focus on machine learning algorithms and their practical implementation in Python using real-world data sets.
I assume that you have basic knowledge of Python programming and have used the basic libraries such as NumPy, SciPy, pandas, Matplotlib and scikit-learn (sklearn).
If you haven't, I am attaching the links here for you!
Numpy: https://bit.ly/2k4igLd
Scipy: https://bit.ly/1oyJqYk
Pandas: https://bit.ly/2qs1lAJ
Matplotlib: https://bit.ly/2EMuVNG
Sklearn: https://bit.ly/2j049C4
I wanted to provide a quick introduction to building models in Python, and what better way to start than with one of the most basic models, linear regression?
I plan to write about more complex models in the future, so stay tuned! For now, let's focus on linear regression: the concept and, mainly, its implementation in Python.
Here we go!!
Chapter 1 : Regression
Linear regression is a statistical model that examines the linear relationship between two (Simple Linear Regression) or more (Multiple Linear Regression) variables: a dependent variable and one or more independent variables. A linear relationship basically means that when an independent variable increases (or decreases), the dependent variable changes by a proportional amount.
A linear relationship can be positive (independent variable goes up, dependent variable goes up) or negative (independent variable goes up, dependent variable goes down), or there may be no clear linear relationship at all. As I said, I will focus on the implementation of regression models in Python, so I don't want to delve too much into the math under the regression hood, but I will write a little bit about it.
A Little Bit About the Math
A relationship between variables Y and X is represented by this equation:
Y'_i = mX_i + b
In this equation, Y'_i is the predicted value of the dependent variable Y (the variable we are trying to predict or estimate) for observation i; X is the independent variable (the variable we are using to make predictions); m is the slope of the regression line, which represents the effect X has on Y; and b is the Y-intercept, the value of Y when X is 0.
This is Simple Linear Regression (SLR). In an SLR model, we build a model based on data: the slope and Y-intercept are derived from the data. Furthermore, we don't need the relationship between X and Y to be exactly linear. SLR models also account for the errors in the data (also known as residuals). I won't go too much into it now, maybe in a later post, but residuals are basically the differences between the true value of Y and the predicted/estimated value of Y. It is important to note that in a linear regression, we are trying to predict a continuous variable.
In a regression model, we try to minimize these errors by finding the "line of best fit", the regression line for which the errors are minimal. In a scatter plot of the data, that means keeping the distances between the data points and the line as close to zero as possible. This is equivalent to minimizing the mean squared error (MSE) or the sum of squared errors (SSE), also called the "residual sum of squares" (RSS), but this might be beyond the scope of this blog post :-)
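To make the "line of best fit" idea a bit more concrete, here is a minimal sketch in plain Python. It uses the standard closed-form least-squares formulas for the slope and intercept; the function name fit_simple_linear_regression and the five data points are made up purely for illustration and are not part of the Boston example later in this post.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (m, b) that minimize the sum of squared residuals for y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # The best-fit line always passes through the point (mean_x, mean_y).
    b = mean_y - m * mean_x
    return m, b


# Made-up, roughly linear data (illustrative only).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

m, b = fit_simple_linear_regression(xs, ys)
predictions = [m * x + b for x in xs]
# Residuals: true value minus predicted value.
residuals = [y - p for y, p in zip(ys, predictions)]
sse = sum(r ** 2 for r in residuals)  # sum of squared errors (RSS)
mse = sse / len(xs)                   # mean squared error

print("slope=%.2f intercept=%.2f SSE=%.2f MSE=%.2f" % (m, b, sse, mse))
```

For this toy data the fitted line is roughly Y = 0.6X + 2.2. Any other slope or intercept would give a larger SSE, which is what "best fit" means here.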
In most cases, we will have more than one independent variable: there can be as few as two and up to hundreds (or theoretically even thousands) of variables. In those cases we use a Multiple Linear Regression (MLR) model. The regression equation is pretty much the same as the simple regression equation, just with more variables:
Y'_i = b_0 + b_1 X_{1i} + b_2 X_{2i}
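As a quick illustration of the multiple regression equation, the sketch below evaluates it with NumPy for a handful of observations. The coefficient values and the small data matrix are made up for the example; in a real model they would be estimated from the data.

```python
import numpy as np

# Hypothetical coefficients for a two-feature model (made-up values).
b0 = 1.0                   # intercept
b = np.array([2.0, -0.5])  # b1 and b2

# Each row is one observation i; the columns are X1 and X2.
X = np.array([
    [1.0, 4.0],
    [2.0, 2.0],
    [3.0, 0.0],
])

# Intercept plus b1*X1 plus b2*X2, computed for every row at once.
y_pred = b0 + X @ b
print(y_pred)  # [1. 4. 7.]
```

Adding more independent variables just means adding more columns to X and more entries to b; the expression itself stays the same.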
This concludes the math portion of this post :) Ready to get to implementing it in Python?
Project 1: Predicting Boston Housing Prices
Here we perform a simple regression analysis on the Boston housing data, exploring a simple linear regression model.
I will use the Boston Housing data set, which contains information about housing values in the suburbs of Boston. This data set was originally taken from the StatLib library, which is maintained at Carnegie Mellon University, and is now available on the UCI Machine Learning Repository. The UCI repository contains many interesting data sets, and I encourage you to go through it.
We will use sklearn to import the Boston data, as it ships with a bunch of useful data sets to practice on, and we will also import the LinearRegression model from sklearn. You can also code your own linear regression model as a function or class in Python, and it's fairly easy.
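If you are curious what a hand-written version might look like, here is a minimal sketch of a linear regression class built on NumPy's least-squares solver. The class name SimpleLinearModel and the toy data are hypothetical; this is not how sklearn implements LinearRegression internally, only an illustration that borrows its fit/predict naming.

```python
import numpy as np


class SimpleLinearModel:
    """Hypothetical least-squares linear regression (not scikit-learn's code)."""

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        # Prepend a column of ones so the intercept is learned with the slopes.
        design = np.column_stack([np.ones(len(X)), X])
        # Solve for the coefficients that minimize the sum of squared residuals.
        coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
        self.intercept_ = coeffs[0]
        self.coef_ = coeffs[1:]
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        return self.intercept_ + X @ self.coef_


# Made-up data generated exactly from y = 1 + 2*x1 - 0.5*x2.
X = np.array([[1.0, 4.0], [2.0, 2.0], [3.0, 0.0], [4.0, 1.0], [0.0, 3.0]])
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1]

model = SimpleLinearModel().fit(X, y)
print(model.intercept_, model.coef_)
print(model.predict(np.array([[5.0, 2.0]])))
```

Because the toy data has no noise, the fit should recover an intercept close to 1 and coefficients close to 2 and -0.5. With real data such as the Boston set, the recovered values would only be estimates.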
# Note: load_boston was removed in scikit-learn 1.2, so this needs an older version.
from sklearn.datasets import load_boston

data = load_boston()
Plot a histogram of the quantity to predict: price
import matplotlib.pyplot as plt
# Jupyter notebook magic; leave this line out in a plain Python script
%matplotlib inline

plt.style.use('bmh')
plt.figure(figsize=(15, 6))
plt.hist(data.target)
plt.xlabel('price ($1000s)')
plt.ylabel('count')
plt.tight_layout()
Plot each feature against the price as a scatter plot
for index, feature_name in enumerate(data.feature_names):
    plt.figure(figsize=(4, 3))
    plt.scatter(data.data[:, index], data.target)
    plt.ylabel('Price', size=15)
    plt.xlabel(feature_name, size=15)
    plt.tight_layout()
Prediction
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Hold out part of the data (25% by default) for testing
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target)

# Fit the model on the training set
clf = LinearRegression()
clf.fit(X_train, y_train)

# Predict prices for the held-out test set
predicted = clf.predict(X_test)
expected = y_test

# Compare predictions with true prices; the dashed line marks a perfect prediction
plt.figure(figsize=(15, 6))
plt.scatter(expected, predicted)
plt.plot([0, 50], [0, 50], '--k')
plt.axis('tight')
plt.xlabel('True price ($1000s)')
plt.ylabel('Predicted price ($1000s)')
plt.tight_layout()
Print the root-mean-square (RMS) error
import numpy as np

print("RMS: %r " % np.sqrt(np.mean((predicted - expected) ** 2)))
RMS: 5.3588890300591467
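To see what the RMS number actually measures, here is a small worked example with made-up prices (not taken from the Boston data). It applies the same NumPy expression as above to four values, so the arithmetic can be checked by hand.

```python
import numpy as np

# Made-up true and predicted prices in $1000s (illustrative only).
expected = np.array([20.0, 30.0, 25.0, 15.0])
predicted = np.array([23.0, 26.0, 25.0, 15.0])

errors = predicted - expected  # [ 3., -4.,  0.,  0.]
mse = np.mean(errors ** 2)     # (9 + 16 + 0 + 0) / 4 = 6.25
rms = np.sqrt(mse)             # square root of 6.25 = 2.5

print("MSE: %.2f, RMS: %.2f" % (mse, rms))  # MSE: 6.25, RMS: 2.50
```

An RMS of 2.5 here means the predictions are off by about $2,500 in a root-mean-square sense. Also keep in mind that train_test_split shuffles the data randomly by default, so the RMS you get on the Boston data will probably differ a little from run to run.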
So, this was the simplest implementation of linear regression to predict house prices and validate the model. In the next post, I will come up with a stronger predictive model for the same problem.
Stay Tuned!! Surprises Coming!!
Feel free to criticize.
Assistant Manager at State Bank of India
7 years ago: "%matplotlib inline", AFAIK, is used in Jupyter notebooks only.
Assistant Manager at State Bank of India
7 years ago: One thing I liked about this post is the practicality. Keep them coming.
Software Engineer at Wayfair | Ex Accenture Japan
7 years ago: Sir, the links that you mentioned above, i.e. the Boston Housing Data Set and the UCI Machine Learning Repository, are not working.
Co-Founder
7 years ago: Really helpful... Thanks
Applied Scientist @ Audible, Amazon | Data Science
7 years ago: Really informative. Great job.