INTERPRETING POLYNOMIAL REGRESSION
Srikant Kumar
Advanced Data Scientist @ Honeywell Technology | Transforming Data for Effective Decisions
This is my fourth article in the Machine Learning series. This article requires prior knowledge of Linear Regression. If you don’t know about Linear Regression or need a brush-up, please go through the previous articles in this series.
Let’s quickly recap what we studied in the last article.
- Regression analysis involves identifying the relationship between a dependent variable and one or more independent variables.
- It is used to study the relationship between two or more variables that are related causally.
- The simple linear regression model allows us to study the relationship between two continuous numeric variables.
- Linear regression requires the relation between the dependent variable and the independent variable to be linear.
- Multiple linear regression is used to explain the relationship between one continuous dependent variable and two or more independent variables.
This article concentrates on the polynomial regression model, which is useful when there is reason to believe that the relationship between two variables is curvilinear. We will fit a polynomial regression model whose parameters are estimated by the least squares method, then evaluate it using some of the common indicators of regression-model accuracy. All calculations are performed in Python.
Regression analysis involves identifying the relationship between a dependent variable and one or more independent variables. It is one of the most important statistical tools and is extensively used in almost all sciences. It is especially used in business and economics to study the relationship between two or more variables that are related causally.
Before going for a definition of polynomial regression, let's have a look at the figures below.
Linear regression requires the relationship between the dependent variable and the independent variable to be linear. So if the data follows a straight-line pattern like the first figure, we can apply linear regression to it. But what if the data looks like the second figure?
Here we can see that the data is not linear, i.e. it is curvilinear. Can linear models be used to fit such non-linear data? How can we generate a curve that best captures the data shown above? We will answer these questions in this article.
Can linear models be used to fit such kind of non-linear data?
The answer is no; we cannot use linear models to fit this kind of data.
How can we generate a curve that best captures the data as shown above?
By using polynomial regression, we can generate a curve that best captures the data shown above.
Now, I think it is clear when to use linear regression and when to use polynomial regression.
Now let's define polynomial regression.
Polynomial regression is very similar to multiple linear regression: both fit a model with several coefficients. The difference is that multiple linear regression uses two or more distinct independent variables, while polynomial regression uses only one independent variable, raised to successively higher powers.
The equation of polynomial regression is
y = b₀ + b₁x + b₂x² + ... + bₙxⁿ
The next question on your mind would be: why do we call it polynomial linear regression?
We call polynomial regression linear because the model is linear in its coefficients. The power of x increases term by term, but the x values are just input data; the coefficients b₀, b₁, ..., bₙ, which are what the model estimates, all appear with power 1. That is why polynomial regression is still a linear model.
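To see this concretely, here is a minimal sketch (on synthetic data invented for illustration, not the housing dataset) showing that fitting a polynomial is just ordinary least squares on a design matrix whose columns are powers of x:

```python
import numpy as np

# Synthetic curvilinear data (illustrative only): y = 2 + 3x - 0.5x^2
x = np.linspace(0, 10, 50)
y = 2 + 3 * x - 0.5 * x ** 2

# Design matrix with columns [1, x, x^2]: the model is linear in the coefficients
X_design = np.vander(x, N=3, increasing=True)

# Ordinary least squares, exactly as in linear regression
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(np.round(coef, 4))  # recovers [2, 3, -0.5]
```

The solver never needs to know that the columns happen to be powers of the same variable; it treats them like any set of features, which is the sense in which the model is "linear".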
To understand the need for polynomial regression, let's get our hands dirty.
Our first step is to import the packages and load the data.
# Importing Packages for data loading, visualization and preprocessing
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Loading Training and Testing Data
data=pd.read_csv('./Datasets/PolynomialRegression/HousingData.csv')
You can download the data from here.
Now we have successfully loaded the dataset. Let's check the data:
data
Let's start data preprocessing by checking for missing values:
data.isnull().sum()
Our data has two columns, 'Purchase time passed(1990)' and 'Pricing'. Let's divide the data into feature and target: 'Purchase time passed(1990)' is the feature and 'Pricing' is our target.
y = data[['Pricing']]
X = data[['Purchase time passed(1990)']]
Now let's divide the data into training and testing sets using train_test_split:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
Now transform the training data with a polynomial model of degree 3:
from sklearn.preprocessing import PolynomialFeatures
model = PolynomialFeatures(degree=3)
X_train = model.fit_transform(X_train)
# Use transform (not fit_transform) so the test data gets the same mapping
X_test = model.transform(X_test)
We have now successfully transformed the single feature into degree-3 polynomial features.
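If you are curious what the transform actually produced, a tiny example (with made-up values, not the housing data) shows the columns PolynomialFeatures generates from a single feature:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One feature, three sample values (made up for illustration)
X_small = np.array([[1.0], [2.0], [3.0]])

# Degree-3 expansion yields the columns [1, x, x^2, x^3]
poly = PolynomialFeatures(degree=3)
X_poly = poly.fit_transform(X_small)
print(X_poly)
# [[ 1.  1.  1.  1.]
#  [ 1.  2.  4.  8.]
#  [ 1.  3.  9. 27.]]
```

Each original value x becomes a row [1, x, x², x³], and these four columns are what the linear model will fit coefficients to.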
Now it's time to implement linear regression on the transformed features:
from sklearn.linear_model import LinearRegression
lg = LinearRegression()
lg.fit(X_train, y_train)
Our model is trained. Now let's predict on the test data:
y_pred = lg.predict(X_test)
We have successfully trained our model, and it has predicted values for the test data. Now it's time to check the performance of our model.
In regression, there are three commonly used metrics to measure a model's performance:
- Mean Absolute Error
MAE is the average of the absolute differences between the predicted values and the observed values. MAE is a linear score, which means all individual differences are weighted equally in the average. For example, an error of 10 contributes exactly twice as much as an error of 5.
- Mean Square Error
The mean square error (MSE) is just like the MAE, but it squares the differences before averaging them instead of taking their absolute values.
- R² (R-squared) score
The R² score is also known as the coefficient of determination. It summarizes the explanatory power of the regression model and is computed from the sums-of-squares terms. It describes the proportion of the variance of the dependent variable explained by the regression model. If the regression model is 'perfect', R² is 1; if it explains none of the variance, R² is 0 (and it can even be negative for models that fit worse than predicting the mean).
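All three metrics are available in sklearn.metrics; here is a quick sketch on small made-up vectors (values chosen purely for illustration):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up observed and predicted values, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.0, 8.0, 9.5])

mae = mean_absolute_error(y_true, y_hat)  # mean of |error| = 0.5
mse = mean_squared_error(y_true, y_hat)   # mean of squared error = 0.375
r2 = r2_score(y_true, y_hat)              # 1 - SS_res/SS_tot = 0.925
print(mae, mse, r2)
```

You can verify MAE by hand: the absolute errors are 0.5, 0, 1, 0.5, and their mean is 0.5.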
Here, we use the R² score to measure the model's performance.
First, import r2_score:
from sklearn.metrics import r2_score
Now calculate the R² score:
r2_score(y_test, y_pred)
The R² score is 0.97, which is close to 1, so we can say that our model is performing very well.
Now let's change the degree to 4 and then 2, and observe what happens.
Let's start with degree 4:
# Re-split so we transform the original feature, not the already-transformed matrix
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = PolynomialFeatures(degree=4)
X_train = model.fit_transform(X_train)
X_test = model.transform(X_test)
from sklearn.linear_model import LinearRegression
lg = LinearRegression()
lg.fit(X_train,y_train)
# Predicting the test data
y_pred = lg.predict(X_test)
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)
With degree 4 we get an r2_score of 0.34, which is very poor, i.e. our model performs badly with degree 4.
Now let's try degree 2:
# Again re-split so we transform the original feature, not the already-transformed matrix
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = PolynomialFeatures(degree=2)
X_train = model.fit_transform(X_train)
X_test = model.transform(X_test)
from sklearn.linear_model import LinearRegression
lg = LinearRegression()
lg.fit(X_train,y_train)
# Predicting the test data
y_pred = lg.predict(X_test)
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)
With degree 2 we get an r2_score of -4748.73, which is again a very poor result: a negative R² means the model fits worse than simply predicting the mean.
How do we choose an optimal model? To answer this question we need to understand the bias vs variance trade-off.
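As a first step in that direction, one simple (if rough) way to compare degrees is to score each candidate on held-out data. The sketch below uses synthetic data invented for illustration, not the housing dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score

# Synthetic curvilinear data with noise (illustrative only)
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 4, 80)).reshape(-1, 1)
y = 1 + 2 * x - x ** 2 + 0.3 * x ** 3 + rng.normal(0, 0.5, x.shape)

x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.2, random_state=0)

# Score each degree on the held-out split: too low underfits, too high overfits
scores = {}
for degree in range(1, 6):
    poly = PolynomialFeatures(degree=degree)
    lr = LinearRegression().fit(poly.fit_transform(x_tr), y_tr)
    scores[degree] = r2_score(y_te, lr.predict(poly.transform(x_te)))
    print(degree, round(scores[degree], 3))
```

On data generated from a cubic, the held-out R² should improve noticeably once the degree reaches 3; cross-validation would give a more robust version of the same idea.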
End Notes
This article is getting long, so I am going to stop here. In the next part we will discuss bias, variance, under-fitting, over-fitting, and the best-fit model. I hope I was able to explain the basic concept of polynomial regression with minimal maths, along with how to implement it and optimize it further to improve your model. Get your hands dirty by solving some problems, and if you face any difficulties while implementing it, feel free to write in the comments section.
Did you find this article helpful? Please share your opinions / thoughts in the comments section below.