How to predict the future
Simple Linear Regression
In this article, I am going to show you how to predict the future.
Okay, so I am being a little cavalier here. Perhaps I should say - I am going to show you how to predict the future value of a variable based on its past interactions with one other variable, on the condition that they have a linear relationship. Not so snappy, but a little less misleading.
Predicting the future has captivated mankind since the dawn of time. In fact, you could say that predicting the future is the great difference between mankind and the rest of the animal kingdom. Not only can we extrapolate future events from those of the past, but we have the ability to put our predictions of the future back through our brains and extrapolate from those - abstract thought.
In this article, I am going to show you how we can use Simple Linear Regression to make predictions from data.
“The best predictor of future behavior is … past behavior”
What is Simple Linear Regression?
Simple Linear Regression is the calculation of a trend from two linearly related variables, one of which is hypothesised to explain the behaviour of the other.
As an example, take hours of study and scores in an exam. One would imagine that the time you put into studying affects your exam result. We then say that these variables, x = hours of study and y = exam result, have a linear relationship.
The increase in score can be explained, and therefore predicted, by the hours of study. If I can build a linear model on past data, you should be able to give me the hours of study which a student has put in, I can run it through my model, and I can make a prediction of the score which they will get.
You will often see this depicted as a scatterplot of the data points, with a nice straight line going through them at an angle.
There are a few different methods for creating a Simple Linear Regression model using machine learning. In this article I will show you three of them.
I will show you how we can create a SLR model using vanilla Python on some dummy data, how we can use something called Stochastic Gradient Descent to continually update our model until we find 'the sweet spot', and how we can stir in some Linear Algebra with the Normal Equation.
Before we dive into the code, let's take a look at the maths behind the curtain.
Maths of the Simple Linear Regression model.
In the example of the student's hours of study and his or her exam grade, we have a model into which we can plug the hours of study and get a predicted grade out. Our model follows a mathematical formula which goes a little something like this:
ŷ = B0 + B1·x
On the left-hand side is our predicted grade, ŷ. On the far right we have our hours of study, x, and we also have B0 and B1 - these are our Coefficients. The Coefficients are the values which allow us to do our magic. So, all we need to do is find what values our Coefficients should take; should be easy, right?
First we need to know B1 - how steep our line should be or, put another way, how much it should slope by. To calculate the Slope, we use this formula:
B1 = Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)²
If you are unfamiliar with Sigma Notation, let me unpack the above:
- xᵢ = each data point representing a person's hours of study
- x̄ = the average hours of study
- yᵢ = each data point representing that person's grade
- ȳ = the average grade
- So, for every data point we subtract x̄ from xᵢ, do the same thing with the ys, and multiply these two differences together, then sum the results.
- We then divide this by the sum of the squared differences between each xᵢ and x̄.
Now let's solve for B0, which is our Intercept. Remember when I said that we normally see a nice straight line drawn through our values? Well, that straight line will at some point cross our Y axis, but it could cross high up or low down. The point at which it crosses our Y axis is our Intercept. We can calculate this by:
B0 = ȳ - B1·x̄
Plugging these values into our formula, we now have our model for the Simple Linear Regression.
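To make the two formulas concrete, here is a quick sketch in Python which computes B1 and B0 exactly as written above. The study hours and grades here are made-up numbers purely for illustration:
# Made-up data purely for illustration
study_hours = [1, 2, 3, 4, 5]
grades = [52, 58, 65, 70, 78]

x_bar = sum(study_hours) / len(study_hours)   # average hours of study
y_bar = sum(grades) / len(grades)             # average grade

# B1 = sum((xi - x_bar) * (yi - y_bar)) / sum((xi - x_bar) ** 2)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(study_hours, grades)) / \
    sum((x - x_bar) ** 2 for x in study_hours)

# B0 = y_bar - B1 * x_bar
b0 = y_bar - b1 * x_bar

print("Slope (B1):", b1, " Intercept (B0):", b0)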
Okay, so hopefully you are still with me and that all made sense. Now, as we are ardent machine learners, we are going to want to fire up our IDE and see how we can create and run this model in code.
Simple Linear Regression in Vanilla Python
First off, to make explicit what is going on, I’m going to take this to the metal and code this in as transparent a way as possible.
import matplotlib.pyplot as plt

# Create some dummy data (the ys are the squares of the xs, so the relationship is not perfectly linear)
xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 16, 25]

# Function to calculate our Coefficients
def basic_linear_regression1(x, y):
    length = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_x_squared = sum(map(lambda a: a ** 2, x))
    sum_of_products = sum([x[i] * y[i] for i in range(length)])
    m = (sum_of_products - (sum_x * sum_y) / length) / \
        (sum_x_squared - ((sum_x ** 2) / length))
    b = (sum_y - m * sum_x) / length
    return m, b

# Get our Coefficients
m, b = basic_linear_regression1(xs, ys)

# Use our Coefficients to create our regression line
regression_line = [(m * x + b) for x in xs]

# Create a new value to test our model
new_x = 8

# Make a prediction
predict_y = m * new_x + b
print("Slope: ", m, " Intercept: ", b, " x: ", new_x, " Yhat: ", predict_y)

# Plot our regression line
plt.plot(xs, regression_line, color="g", lw=2, alpha=0.5)

# Plot our original data
plt.scatter(xs, ys)

# Plot our prediction
plt.scatter(new_x, predict_y, s=20, color='r', alpha=1)
plt.show()
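If you run this, the dummy data gives a slope of 6.0 and an intercept of -7.0, so the prediction for new_x = 8 comes out at 6 × 8 - 7 = 41. Notice that the line cannot pass exactly through every point, because the ys here are the squares of the xs rather than a perfectly linear function of them.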
Stochastic Gradient Descent
Now that we have a good intuition of the model and have worked through coding it up in vanilla Python, let's take a look at another method for finding the best Coefficients.
So, Simple Linear Regression is something called a Convex Problem. To understand this, we need to understand one other key point about our model.
On drawing our line through our variables, we want the line to be as close to as many of our points as possible. Imagine that our data already fell so tightly along a line that, when we created our model and drew our regression line on top, it overlapped with the actual data. It would be clear to see that this would be a very precise model, which describes our data perfectly (too perfectly?). It is extremely unlikely that you would see a fit this close, but the idea is that the closer we can get, the better our model is. Conversely, the further away our data points are from our line, the worse our model is.
The distances between each point and our linear regression line are called residuals. Like the name implies, they are the bits of noise left over between our nice regression line and our data. They expose to the viewer the imprecision of our model and, if you imagine that we are paid based on the fit of our model, these residuals cost us money. This is the cost that we want to minimise.
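To put a number on that cost, here is a small sketch which measures the residuals of any candidate line y = m·x + b against the dummy data from earlier (the sum_squared_residuals helper is just my own illustration, not something from the code above):
xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 16, 25]

def sum_squared_residuals(m, b, xs, ys):
    residuals = [y - (m * x + b) for x, y in zip(xs, ys)]  # vertical gap from each point to the line
    return sum(r ** 2 for r in residuals)                  # the cost we want to minimise

print(sum_squared_residuals(6.0, -7.0, xs, ys))  # the line we fitted earlier: cost = 14.0
print(sum_squared_residuals(1.0, 0.0, xs, ys))   # a worse line: cost = 584.0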
Now, I mentioned before that Simple Linear Regression was a Convex problem. If we keep wiggling our line around, there would eventually be a ‘sweet spot’ at which our cost would be at a minimum. You can imagine trying to tune in an analogue radio or TV. A little to the left, to the right….and okay, “don’t move”. This is like our cost function, wiggle, wiggle, good!
So we know that there exists a sweet spot at which our cost will be at a minimum, but how do we find it? This is where the convexity comes into play. It turns out that if we plot all those wiggles on a graph, they trace out a shape like the letter U, or a bowl. At the bottom is the sweet spot, and the sides get steeper as we move away from it. Now you can see why we call it '... Gradient Descent' - our method is to walk down the sides of the U, or bowl, until we find this minimum cost or sweet spot.
The Stochastic part you can just read as - "we are going to take tiny, tentative steps. After each step, we are going to take a look around and see if we are at a lower point in the bowl than we were before". If we keep getting lower, we are heading in the right direction; if we start climbing back up, we have gone too far, and we take smaller steps back down in the direction we came.
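In symbols, each of those tiny steps updates our two Coefficients like this (α is the size of the step, m is the number of data points, and the sums run over all of them - in the code below the Coefficients are called t0 and t1):
B0 := B0 - α · (1/m) · Σ((B0 + B1·xᵢ) - yᵢ)
B1 := B1 - α · (1/m) · Σ((B0 + B1·xᵢ) - yᵢ) · xᵢ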
Back in Python, this is how we can run Stochastic Gradient Descent to minimise the cost of our troublesome residuals and find our optimal Coefficients.
import numpy as np
from sklearn.datasets import make_regression
import matplotlib.pyplot as plt

def gradient_descent(alpha, x, y, ep=0.0001, max_iter=10000):
    converged = False
    iter = 0
    m = x.shape[0]  # number of samples

    # initial theta
    t0 = np.random.random(x.shape[1])
    t1 = np.random.random(x.shape[1])

    # total error, J(theta)
    J = sum([(t0 + t1 * x[i] - y[i]) ** 2 for i in range(m)])

    # Iterate Loop
    while not converged:
        # for each training sample, compute the gradient (d/d_theta j(theta))
        grad0 = 1.0 / m * sum([(t0 + t1 * x[i] - y[i]) for i in range(m)])
        grad1 = 1.0 / m * sum([(t0 + t1 * x[i] - y[i]) * x[i] for i in range(m)])

        # update the theta_temp
        temp0 = t0 - alpha * grad0
        temp1 = t1 - alpha * grad1

        # update theta
        t0 = temp0
        t1 = temp1

        # mean squared error
        e = sum([(t0 + t1 * x[i] - y[i]) ** 2 for i in range(m)])

        if abs(J - e) <= ep:
            print('Converged, iterations: ', iter, '!!!')
            converged = True

        J = e        # update error
        iter += 1    # update iter

        if iter == max_iter:
            print('Max iterations exceeded!')
            converged = True

    return t0, t1

if __name__ == '__main__':
    x, y = make_regression(n_samples=100, n_features=1, n_informative=1,
                           random_state=0, noise=35)
    print('x.shape = %s y.shape = %s' % (x.shape, y.shape))

    alpha = 0.01  # learning rate
    ep = 0.01     # convergence criteria

    # call gradient descent, and get intercept(=theta0) and slope(=theta1)
    theta0, theta1 = gradient_descent(alpha, x, y, ep, max_iter=1000)

    # plot the data and the fitted line
    y_predict = theta0 + theta1 * x
    plt.plot(x, y, 'o')
    plt.plot(x, y_predict, 'k-')
    plt.show()
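One practical note: the learning rate alpha does a lot of work here. Too large, and each step overshoots the bottom of the bowl so the cost starts climbing instead of falling; too small, and progress is so slow that you hit max_iter long before reaching the sweet spot. It is worth trying a few values if the cost does not seem to settle.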
Normal Equation
The above code is a little verbose; thankfully, we can use a bit of Linear Algebra to make this cost minimisation far more compact, and ultimately to solve it directly with the Normal Equation.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression

# Use sklearn to create some random values for us
xs, ys = make_regression(n_samples=100, n_features=1, n_informative=1,
                         random_state=0, noise=35)

# Our Gradient Descent function to minimise our cost function - baby steps!
def gradient_Descent_Normal_Equation(x, y, theta, alpha, m, numIterations):
    xTrans = x.transpose()
    for i in range(0, numIterations):
        hypothesis = np.dot(x, theta)
        loss = hypothesis - y
        gradient = np.dot(xTrans, loss) / m
        theta = theta - alpha * gradient
    return theta

# Get the number of rows and features (m = rows = 100, n = features = 1)
m, n = np.shape(xs)

# Maximum number of iterations if we have not already converged (found our sweet spot)
numIterations = 100000

# This is the size of the 'little steps' we will take to find our optimum (sweet spot)
alpha = 0.0005

# Starting point for our gradient descent
theta = np.ones(n)

# Our fitted Coefficient(s)
theta = gradient_Descent_Normal_Equation(xs, ys, theta, alpha, m, numIterations)
print(theta)

# Plot our original data and the fitted line
plt.scatter(xs, ys)
plt.plot(xs, np.dot(xs, theta), 'k-')
plt.show()
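For completeness, here is a minimal sketch of the Normal Equation itself - the closed-form solution theta = (XᵀX)⁻¹ Xᵀ y - which jumps straight to the cost-minimising Coefficients with no stepping at all. This snippet is my own illustration rather than part of the code above; the column of ones is added so that we also recover an intercept:
import numpy as np
from sklearn.datasets import make_regression

# Generate the same kind of data as above
xs, ys = make_regression(n_samples=100, n_features=1, n_informative=1,
                         random_state=0, noise=35)

# Add a column of ones so the model can learn an intercept (B0) as well as a slope (B1)
X = np.hstack([np.ones((xs.shape[0], 1)), xs])

# Normal Equation: theta = (X^T X)^-1 X^T y
theta = np.linalg.inv(X.T @ X) @ (X.T @ ys)

print("Intercept (B0):", theta[0], " Slope (B1):", theta[1])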
Hopefully, by now you have a better idea of what Simple Linear Regression is, how it works, and have come away with a few methods to apply it in Python.
However, you may be asking: what if my hypothesis involves more than one explanatory variable? For example, perhaps we think that, yes, hours of study is a predictor of exam grade, but the grade is also explained by hours of sleep. Well, that is out of scope for this introduction to Linear Regression; in a future post I will venture into the world of Multivariate Linear Regression, and show you how we can use Python to do that too.