Gradient Boosting: Introduction, Implementation, and Mathematics behind it.
A beginner-friendly introduction and an implementation in Python.
Introduction
Gradient Boosting is powerful and one of the most important topics in ensemble learning. It is a type of Boosting ensemble. I explained Boosting in our last two articles—AdaBoost: Introduction, Implementation, and Mathematics behind it and Boosting: Introduction.
If you haven’t already done so, I would highly recommend that you read them. They will help you understand more about Boosting and, ultimately, help you understand Gradient Boosting more deeply.
Now with that said, let's just quickly recap.
Boosting is an ensemble technique that combines multiple weak learners to form a strong learner. Weak learners are learners/models that perform only slightly better than a random guess.
Similar to AdaBoost, Gradient Boosting combines the weak learners sequentially to form a strong learner that performs better, but how it does so is different.
How does it actually work?
In a nutshell, Gradient Boosting works like this:
Take the data → initialize the model with a constant value → compute the residuals → train a model to predict the residuals → update the predictions using the predicted residuals → compute new residuals from the updated predictions and repeat.
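Here is a minimal sketch of that loop in Python, assuming squared-error loss and scikit-learn decision trees as the weak learners (the function names and default values are illustrative, not a reference implementation):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_trees=100, learning_rate=0.1):
    """Toy gradient boosting regressor with squared-error loss."""
    # Initialize with a constant value: the mean minimizes squared error.
    f0 = y.mean()
    pred = np.full_like(y, f0, dtype=float)
    trees = []
    for _ in range(n_trees):
        # Pseudo-residuals = negative gradient of the loss = y - current prediction.
        residuals = y - pred
        # Train a small tree to predict the residuals.
        tree = DecisionTreeRegressor(max_depth=2)
        tree.fit(X, residuals)
        trees.append(tree)
        # Update the predictions with a shrunken step.
        pred += learning_rate * tree.predict(X)
    return f0, trees

def gradient_boost_predict(X, f0, trees, learning_rate=0.1):
    pred = np.full(X.shape[0], f0, dtype=float)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred
```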
Before we start breaking down the steps of Gradient Boosting, let’s start with the housing price data that we will be working on.
Here, Bedroom and Bathroom are the features and Price is the target. We will use the features to predict the price of a house.
Let’s break down the steps of Gradient Boosting now.
Remember that in the Boosting method, we have to start with a Base Learner. In Gradient Boosting, before we create the Base Learner, we have to go through a couple of steps.
Initialize the model with an initial constant value/initial guess
In regression, there are many loss functions we can use. We will use Mean Squared Error (MSE) here.
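For concreteness, one common way to write this loss (scaled by ½ so its derivative comes out clean, which is the form consistent with the residual we derive later) is:

$$L(y_i, \hat{y}) = \frac{1}{2}\,(y_i - \hat{y})^2 \qquad \ldots \text{(i)}$$

Over the whole dataset, the total loss is just the sum of this term for every house.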
Here, y is the original value and ŷ is the predicted value.
We want to find a value of ŷ here, since ŷ is the value we do not know.
Now, equation (i) tells us that we have to find the value of ŷ that minimizes the loss function.
We know that, in order to find the minimum of a function, we have to take its first derivative and set it equal to zero.
In our data above, y is the Price and we need to find ŷ. For simplicity, we will omit the 000 from the price, so instead of 313000 we will write 313.
Let’s find the initial constant/guess value, i.e. ŷ. Here is the calculation of the value.
One concept that you should remember is that the first derivative of a function helps us find a minimum: we set it to zero and solve.
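As a sketch of that calculation with the squared-error loss, setting the derivative with respect to ŷ to zero gives:

$$\frac{d}{d\hat{y}} \sum_{i=1}^{n} \frac{1}{2}(y_i - \hat{y})^2 = -\sum_{i=1}^{n}(y_i - \hat{y}) = 0 \;\Rightarrow\; \hat{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$

In other words, the initial constant is simply the average of the Price column.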
Upon calculation, we found that the initial constant/guess value ŷ is 1013.
Great. Here is our updated table.
Calculate the Pseudo Residuals (r)
This is one of the important steps in this algorithm.
In this step, we find a residual (r) given by the formula:
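A standard way to write this pseudo-residual (the negative gradient of the loss with respect to the current prediction) is:

$$r_{im} = -\left[\frac{\partial L\big(y_i, F(x_i)\big)}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)}$$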
One important concept to understand here is that F(xi) is the ŷ value (the prediction) of the previous model.
Let’s make this equation easier to understand.
Now let’s find what that partial derivative is.
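Working it out for our squared-error loss:

$$\frac{\partial}{\partial F(x_i)}\,\frac{1}{2}\big(y_i - F(x_i)\big)^2 = -\big(y_i - F(x_i)\big), \qquad \text{so} \qquad r_i = y_i - F(x_i) = y_i - \hat{y}_i$$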
Upon calculation, you can see that the residual (r) is nothing but (y - ŷ).
After calculating our first residual values (r1), our table looks like this:
Good, we successfully calculated our first residual (r).
Now, we use this value of r as the target and our features (bedroom and bathroom) to train a Decision Tree. ← This will be our next Base Learner.
Train a Base Learner h_m(x)
In this step, we train our first tree, where r is the dependent variable and bedroom and bathroom are the independent features.
After training the Decision Tree, we use it to predict a value of r for each house.
Note: The predicted values here are just examples (they may not reflect the true values).
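As a quick sketch of this step in Python (the bedroom/bathroom values and prices below are placeholders, not the exact numbers from the table above):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Placeholder data: [bedrooms, bathrooms] and prices in thousands.
X = np.array([[2, 1], [3, 2], [4, 3]])
y = np.array([313.0, 420.0, 550.0])

f0 = y.mean()    # initial constant prediction from step 1
r = y - f0       # pseudo-residuals from step 2

# Train the base learner h_m(x) on the residuals.
h1 = DecisionTreeRegressor(max_depth=1)
h1.fit(X, r)
r_hat = h1.predict(X)   # predicted residuals
```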
Find a value of Gamma (γ) that minimizes the loss for our model
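Written out in the usual gradient boosting notation, the quantity we are after is:

$$\gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L\big(y_i,\; F_{m-1}(x_i) + \gamma\, h_m(x_i)\big)$$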
Plugging our MSE loss into this equation, γ_m becomes:
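$$\gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} \frac{1}{2}\Big(y_i - \big(F_{m-1}(x_i) + \gamma\, h_m(x_i)\big)\Big)^2$$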
Now we find the value of γ_m that minimizes this loss, similar to how we did in step 1 (by taking the first derivative with respect to γ and setting it to zero).
Here we got the optimal value of γ as 0.
Update the model and Find the Prediction.
After finding the optimal value of γ, we update the model to get the new prediction.
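The update rule is typically written with a learning rate ν (shrinkage) that scales each step; in the hand calculation above you can take ν = 1:

$$F_m(x) = F_{m-1}(x) + \nu\,\gamma_m\, h_m(x)$$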
After updating the model and plugging in the values, we get the next predictions.
Here, we got 1013 for all 3 data points. (Note: this is just an example, and it is skewed from the real values.)
Now, we go back to step 2 and repeat the process as many times as we want (typically we set the number of decision trees in advance).
Although the mathematics seems a little intimidating, the coding part is not hard at all.
Here is the implementation of Gradient Boosting Regressor in Scikit-Learn.
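A minimal sketch of that usage, with a tiny placeholder dataset and illustrative (not prescriptive) hyperparameters:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Placeholder data: [bedrooms, bathrooms] -> price in thousands.
X = np.array([[2, 1], [3, 2], [4, 3], [3, 1], [5, 3]])
y = np.array([313.0, 420.0, 550.0, 360.0, 610.0])

model = GradientBoostingRegressor(
    n_estimators=100,      # how many decision trees (weak learners) to train
    learning_rate=0.1,     # how much each tree contributes to the prediction
    loss="squared_error",  # the loss function (MSE case; older sklearn versions used 'ls')
    max_depth=3,
)
model.fit(X, y)
print(model.predict([[3, 2]]))
```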
It’s that easy to implement in Python. We just have to choose a learning rate, a loss function, and how many decision trees (weak learners) we want to train.
Here is the example of Gradient Boosting used for House Price prediction.
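Here is one way such a house-price example might look, using scikit-learn’s built-in California housing dataset as a stand-in:

```python
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Load a public housing dataset and split it into train and test sets.
data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

model = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1, max_depth=3)
model.fit(X_train, y_train)

preds = model.predict(X_test)
print("Test MSE:", mean_squared_error(y_test, preds))
```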
I hope you learned about Gradient Boosting Regression and the mathematics behind it.
We will talk about Gradient Boosting Classification next.
This is the 8th article in the series Forming a strong foundation. Here are the links to the previous articles: