Understanding Polynomial Linear Regression
Kunal Mehta
Global Data Platform Head | Product Owner | Associate Director | Data Analytics | Google Analytics | Adobe Analytics | Google Cloud Platform | Machine Learning | Data Science | Data Engineering | Speaker
Whenever we talk about regression, people more often than not assume simple linear regression, an equation that looks something like this:
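In one common notation (the symbol names here are an assumption; any consistent naming works), with y as the dependent variable, x_1 as the independent variable, b_0 as the intercept and b_1 as the slope, simple linear regression can be written as:

```latex
y = b_0 + b_1 x_1
```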
My colleague Mayank Jain has covered the topic of linear regression in much detail, and you can read it by clicking here. So, what are we doing here? In linear regression, your data, when the independent variable is plotted against the dependent variable, would look something like this:
But you know in your heart of hearts that this is not very common. More often than not, your data actually looks something like this:
Data which has some curve, or maybe some other weird shape:
As you can see from these graphs, there is definitely a correlation, even a strong one, between the dependent and independent variables. In such cases, though, it is not the best idea to fit a straight line through these points and expect solid predictions. So, rather than aiming for a straight line, we have to aim for a curved line that follows the pattern in the data. Such a curve is exactly what polynomial regression aims to create.
A generic Polynomial linear regression equation looks something like this:
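In one common notation (the symbol names are an assumption), with y as the dependent variable, x_1 as the single independent variable, b_0 through b_n as the coefficients and n as the chosen degree, the equation can be written as:

```latex
y = b_0 + b_1 x_1 + b_2 x_1^2 + \dots + b_n x_1^n
```

The model is still called "linear" because it is linear in the coefficients b_0, ..., b_n, even though it is not linear in x_1. That is why an ordinary linear regression can be fitted on the powers of x_1.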
OK, now to illustrate this theory better, let’s take an example. Please feel free to download the data by clicking on the link here. If you plot these 2 variables on a line chart, this is what you would get:
If we fit a simple linear regression to this kind of data, our regression line will look something like this:
As you might be able to see, our linear regression line is not really the best fit for our purpose. So, let’s see how we can create a polynomial linear regression equation and measure its performance.
To implement this in Python, we use the PolynomialFeatures class from scikit-learn’s sklearn.preprocessing module to create the dataset. We initialize a PolynomialFeatures object and set its degree, that is, the highest power of the X variable to generate. Its fit_transform() method then transforms the X variable, as in the code below:
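from sklearn.preprocessing import PolynomialFeatures
# X must be a 2-D feature matrix of shape (n_samples, n_features)
poly_reg = PolynomialFeatures(degree=2)
# Expand X into the columns X^0, X^1 and X^2
X_poly = poly_reg.fit_transform(X)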
If you notice, for now we are using the 2nd degree polynomial.
As an example, your transformed data set will look like this:
As you can see, each value of the X variable now appears in 3 powers. For a row where X = 89:
X^0 = 1
X^1 = 89
X^2 = 7921
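To see this expansion without downloading the data, here is a minimal sketch that passes a single made-up observation (X = 89) through PolynomialFeatures. The names X_demo, poly_demo and X_demo_poly are hypothetical and are not part of the original code.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One hypothetical observation with X = 89, shaped (n_samples, n_features)
X_demo = np.array([[89]])

# degree=2 produces the columns X^0, X^1 and X^2
poly_demo = PolynomialFeatures(degree=2)
X_demo_poly = poly_demo.fit_transform(X_demo)

# One row with three columns: 1, 89 and 7921
print(X_demo_poly)
```

The first column of ones is the bias term that PolynomialFeatures adds by default (include_bias=True).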
Now let’s run our linear regression model on this data set by using the code below:
from sklearn.linear_model import LinearRegression
# Fit an ordinary linear regression on the expanded polynomial features
lin_reg_2 = LinearRegression()
lin_reg_2.fit(X_poly, y)
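The two steps, expanding the features and then fitting the regression, can be sketched end to end as follows. The data here is made up (an exact quadratic) and stands in for the downloadable data set, and every name ending in _demo is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (assumption): y is an exact quadratic in X
X_demo = np.arange(1, 11).reshape(-1, 1)  # shape (10, 1)
y_demo = 3 + 2 * X_demo.ravel() + 0.5 * X_demo.ravel() ** 2

# Step 1: expand X into the columns X^0, X^1, X^2
poly_reg_demo = PolynomialFeatures(degree=2)
X_demo_poly = poly_reg_demo.fit_transform(X_demo)

# Step 2: fit a plain linear regression on the expanded columns
lin_reg_demo = LinearRegression()
lin_reg_demo.fit(X_demo_poly, y_demo)

# New inputs must go through the same transformer before predict()
X_new = np.array([[12]])
y_new = lin_reg_demo.predict(poly_reg_demo.transform(X_new))
print(y_new)
```

Note that the transformer is fitted on the raw X only. Its transform() method is then reused for any new values you want to predict.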
Let us now plot the line chart again to see how well the curve fits.
If you notice, this curve looks much closer to what we are looking for, as if it is almost superimposed on our data.
Now let’s try changing the degree to 3 and see whether that fits a bit better. Notice that in the code below, degree is 3.
from sklearn.preprocessing import PolynomialFeatures
# Same transformation as before, now up to the 3rd power of X
poly_reg = PolynomialFeatures(degree=3)
X_poly = poly_reg.fit_transform(X)
Again, let’s have a look at a small example of the transformed data set.
If you notice, this data set has 4 columns, calculated as below for a row where X = 89:
X^0 = 1
X^1 = 89
X^2 = 7921
X^3 = 704969
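For a single input column, the degree-3 expansion is just X raised to the powers 0 through 3. As a quick cross-check of the arithmetic that does not depend on scikit-learn, here is a minimal sketch using NumPy's vander function. The variable names x and powers are hypothetical.

```python
import numpy as np

# One hypothetical observation with X = 89
x = np.array([89])

# N=4 columns in increasing order of power: x^0, x^1, x^2, x^3
powers = np.vander(x, N=4, increasing=True)

# One row with four columns: 1, 89, 7921 and 704969
print(powers)
```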
Now we will again fit the regression using the code below:
from sklearn.linear_model import LinearRegression
# Fit an ordinary linear regression on the degree-3 polynomial features
lin_reg_2 = LinearRegression()
lin_reg_2.fit(X_poly, y)
Once we plot our regression line on top of our data, this is what it looks like:
If you look closely, it’s extremely difficult to distinguish our regression line from the data line. It seems that the 3rd degree polynomial linear regression is the best fit for the data at hand. If I check the score of this model, it actually comes to 0.9999, which is insane, but this was done on dummy data, which is why such high numbers are expected.
At this point you must be wondering how to practice this yourself. So, I have a business problem for you: the question and the data set are at the link here. Let’s see if you can work on this business problem and come up with a solution. So, what say, challenge accepted?
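To compare degrees side by side, here is a minimal sketch that fits degrees 1, 2 and 3 and prints each model's score. It assumes the score in question is the R² returned by scikit-learn's score() method. The data is synthetic (a cubic trend plus small noise), not the article's data set, so the numbers will differ from the 0.9999 above. All names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (assumption): a cubic trend plus small noise
rng = np.random.default_rng(0)
X_demo = np.linspace(1, 10, 50).reshape(-1, 1)
y_demo = 0.5 * X_demo.ravel() ** 3 - 2 * X_demo.ravel() + rng.normal(0, 1, 50)

scores = {}
for degree in (1, 2, 3):
    # The pipeline chains the feature expansion and the linear regression
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(X_demo, y_demo)
    # score() returns R^2, here computed on the same data used for fitting
    scores[degree] = model.score(X_demo, y_demo)

for degree, r2 in scores.items():
    print(f"degree {degree}: R^2 = {r2:.4f}")
```

Because these scores are computed on the same data the models were fitted on, a higher degree can never score lower here. A fairer comparison would score the models on data they were not trained on.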