Gradient Boosting: Introduction, Implementation, and Mathematics behind it - For Classification
A detailed, beginner-friendly introduction and an implementation in Python.
Gradient Boosting (GB) is an ensemble technique that can be used for both Regression and Classification. In our last article, we talked about Gradient Boosting in the context of Regression. You can find the article here. In this article, we will talk about Gradient Boosting in the context of classification.
Before we talk about GB, let's talk about why it is an important algorithm.
There are a couple of reasons. Three of the main reasons are:
For all of these reasons, GB is considered one of the best algorithms out there.
Before we dive into the sea, let's recap some concepts.
Recap
Boosting
Boosting is an ensemble technique where weak models are added sequentially to create a stronger model. Weak models are usually models whose predictive power is only slightly better than a coin flip.
With each iteration (each new learner), the weak models are expected to get better and better. Typically, the weak models are based on Decision Trees.
In Gradient Boosting, the first model f_0(x) is always a simple model, and the other models are decision trees (f_1(x) through f_n(x)), where n is the total number of weak base learners.
And the final model F(x) is the combination of all those models.
Note: We will work on Binary Classification with a Bank Churn Dataset in this article. The goal is to predict whether a customer continues with their account or closes it (i.e., churns).
There is a Kaggle Notebook that goes with this article. I would suggest you go through that as well.
Let's get started.
Steps in Gradient Boosting
1. Initial Model (f_0(x))
This is the first step of the algorithm. Just as in the regression problem, we calculate an initial prediction in classification problems as well.
The initial model is always a simple model. In order to calculate its predictions, we need two things:
The odds is the ratio of the total number of 1s to the total number of 0s in our target. For our dataset, it is the number of 1s in the Exited column divided by the number of 0s in the Exited column.
From the data, we found the counts of 1s and 0s:
Total number of 1s = 34,921
Total number of 0s = 130,113
Now the odds and the log(odds) will be:
odds = 34,921 / 130,113 ≈ 0.268
log(odds) = log(0.268) ≈ -1.32
Now the probability corresponding to this log(odds) will be:
p_0 = e^log(odds) / (1 + e^log(odds)) ≈ 0.21
This tells us that the probability of Exited being 1 is 0.21.
So, if we go by a threshold of 0.5, then 0.21 corresponds to 0.
Note: in classification, we usually calculate a probability; if the probability is greater than 0.5 the prediction is 1, and if it is less than 0.5 the prediction is 0.
Hence, the prediction of our initial model is 0.
Meaning, no matter what the input is, the prediction will always be 0 (customer will not close their account).
This is not a correct prediction for everyone, since some customers do close their accounts under some circumstances.
So we need to fix this and move our predictions towards the correct ones. This is where Step 2 starts.
Note: You can also calculate the probabilities by passing normalize=True to the value_counts() function.
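For example, here is a minimal sketch of this step in Python, assuming the data is loaded into a pandas DataFrame called df with the binary target column Exited (both names are placeholders matching the article's dataset):

```python
import numpy as np
import pandas as pd

# counts of 0s and 1s in the target column
counts = df["Exited"].value_counts()
n_ones, n_zeros = counts[1], counts[0]                # 34,921 and 130,113 in our data

odds = n_ones / n_zeros                               # ~0.268
log_odds_0 = np.log(odds)                             # initial log(odds), ~ -1.32
p_0 = np.exp(log_odds_0) / (1 + np.exp(log_odds_0))   # initial probability, ~0.2116

# the same probability can be read directly with normalize=True
print(df["Exited"].value_counts(normalize=True))
```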
2. Calculate Residuals
In order to fix the problem, we first calculate the residuals.
A residual is nothing but the true value minus the predicted probability.
In our case: residual_1 = value of Exited − probability from Step 1.
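A minimal sketch of this step, reusing df and p_0 from the snippet above:

```python
# residual_1 = actual label - probability predicted by the initial model f_0
df["residual_1"] = df["Exited"] - p_0
# rows with Exited = 0 get about -0.2116, rows with Exited = 1 get about 0.7884
```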
Note: the residuals are largest for the observations the previous model got most wrong. Now we take these residual values and build a new model on them. This is where we start building a Decision Tree model.
3. Train Weak learners
We usually use a Decision Tree as the weak learner here.
The input of our decision tree will be the features of our data,
and the target will be the residuals that we calculated (residual_1).
Here is a sneak peek at the data.
Because our target (residual_1) is a continuous value, we train a Decision Tree Regression model here.
Since we are training a weak model, we will set the depth of the tree to 3.
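A sketch of this training step with scikit-learn, assuming X holds the feature columns (the exact preprocessing is in the notebook, so X here is just a placeholder):

```python
from sklearn.tree import DecisionTreeRegressor

# first weak learner f_1(x): a shallow regression tree fit on the residuals
tree_1 = DecisionTreeRegressor(max_depth=3, random_state=42)
tree_1.fit(X, df["residual_1"])
```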
After training, this is what the decision tree of our weak model 1 (f_1(x)) looks like:
Now, we need a little trick here.
Remember, to get a prediction from the model we have to calculate the log(odds) and then the probability from that log(odds).
However, right now we only have the predicted residuals of model 1.
To convert these residuals into a log(odds) value for each leaf, we have a formula:
leaf output = (sum of the residuals in the leaf) / (sum of previous probability × (1 − previous probability) over the samples in the leaf)
Please note that the residuals in the formula are the residual_1 values of the samples in the leaf (not the residuals predicted by the tree).
Calculate log(odds)
Let's quickly look at our model: the leaf nodes at the bottom are 2, 3, 5, and 6.
Now we have to calculate the log(odds) for these leaf nodes (2, 3, 5, and 6).
In order to do that, we need to know which samples fall into each of them. From the decision tree, we can see that leaf node 2 has 52,739 samples, leaf 3 has 71,577, leaf 5 has 24,635, and leaf 6 has 16,083.
scikit-learn has a handy trick that helps us see which leaf each sample falls into.
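The trick is the tree's apply() method, which returns the id of the leaf node each sample lands in. A minimal sketch, reusing tree_1 and X from above:

```python
# for every row in X, apply() returns the id of the leaf node it falls into
leaf_node_ids_1 = tree_1.apply(X)

# unique leaf ids and how many samples land in each one
print(np.unique(leaf_node_ids_1, return_counts=True))
```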
After getting this output, we can plug it into the formula to calculate the log(odds).
For simplicity, let's calculate the log(odds) of leaf 3.
We have the previous probability (p_0), the residuals (residual_1), and the leaf node that each sample falls into (leaf_node_ids_1).
Now we can use all of these values to calculate the log(odds).
For leaf node 3, we have 71,577 samples. The residuals of those samples are -0.211599 (for customers with Exited = 0) or 0.788401 (for customers with Exited = 1), and the previous probability is 0.211599.
Plugging these into the formula, we get -0.947. So,
leaf output (log(odds) contribution) for leaf 3 = -0.947
Now,
we add this leaf output to the previous log(odds) from the initial model. The sum will be our log(odds) for weak model 1.
-2.267 is the updated log(odds) for leaf node 3.
Similarly, we can calculate the updated log(odds) for leaf nodes 2, 5, and 6 as well. We get:
log(odds) for leaf 2 = -1.299
log(odds) for leaf 5 = 1.187
log(odds) for leaf 6 = -1.012
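Here is a rough sketch of this leaf-value calculation in code, reusing df, p_0, log_odds_0, and leaf_node_ids_1 from the earlier snippets:

```python
leaf_outputs = {}    # log(odds) contribution of each leaf (the formula above)
new_log_odds = {}    # updated log(odds) = initial log(odds) + leaf output

for leaf in np.unique(leaf_node_ids_1):
    mask = leaf_node_ids_1 == leaf
    residuals = df.loc[mask, "residual_1"]
    # numerator: sum of the residuals in the leaf
    # denominator: sum of p_prev * (1 - p_prev); p_0 is the same for every sample here
    gamma = residuals.sum() / (p_0 * (1 - p_0) * mask.sum())
    leaf_outputs[leaf] = gamma
    new_log_odds[leaf] = log_odds_0 + gamma
```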
Now, let's calculate the probabilities with the formula:
probability = e^log(odds) / (1 + e^log(odds))
The prediction probabilities will be:
probability for leaf 2 = 0.214
probability for leaf 3 = 0.094
probability for leaf 5 = 0.766
probability for leaf 6 = 0.267
4. Updating the Predictions
Now, depending on the probabilities we calculated, we update our predictions.
If we go by the 0.5 threshold, any probability greater than 0.5 gives us 1 and anything less than 0.5 gives us 0.
This means our leaf nodes 2, 3, and 6 will predict 0,
and leaf node 5 will predict 1.
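A minimal sketch of this step, converting the updated log(odds) from the previous snippet into probabilities and 0/1 predictions:

```python
def log_odds_to_probability(log_odds):
    # the sigmoid: p = e^log(odds) / (1 + e^log(odds))
    return np.exp(log_odds) / (1 + np.exp(log_odds))

for leaf, log_odds in new_log_odds.items():
    prob = log_odds_to_probability(log_odds)
    prediction = int(prob > 0.5)   # 1 = customer churns, 0 = customer stays
    print(f"leaf {leaf}: probability {prob:.3f} -> prediction {prediction}")
```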
This is the output of our combined model (f_0 and f_1).
Here are the predictions when we combine the two models.
You can see that the prediction for the last observation is incorrect.
In order to fix that, we repeat steps 2-4 again and again (this is our step 5), adding more and more weak models.
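Putting steps 2-4 together, here is a rough, illustrative sketch of the whole training loop (not the exact notebook code); it assumes X is a feature matrix and y is the 0/1 Exited column as a NumPy array:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_sketch(X, y, n_estimators=10, learning_rate=0.1, max_depth=3):
    """Toy gradient boosting for binary classification (training only)."""
    p = y.mean()                                       # step 1: initial probability
    log_odds = np.full(len(y), np.log(p / (1 - p)))    # initial log(odds) for every sample
    trees = []

    for _ in range(n_estimators):
        prob = np.exp(log_odds) / (1 + np.exp(log_odds))
        residuals = y - prob                           # step 2: residuals

        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)                         # step 3: train a weak learner
        trees.append(tree)

        leaf_ids = tree.apply(X)
        for leaf in np.unique(leaf_ids):               # convert residuals to log(odds) per leaf
            mask = leaf_ids == leaf
            gamma = residuals[mask].sum() / (prob[mask] * (1 - prob[mask])).sum()
            log_odds[mask] += learning_rate * gamma    # step 4: update the predictions

    return trees, log_odds
```

To score new data you would also need to store the leaf outputs of each tree; that bookkeeping is left out to keep the sketch short.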
Here is the result of our final predictions:
Bonus
You can also add a hyper-parameter called the learning rate (λ) when adding the new log(odds) to the previous one.
The value of λ controls how quickly the log(odds) are updated: a value closer to 1 means bigger steps, and a value closer to 0 means smaller, more gradual steps.
The value of λ ranges from 0 to 1.
Great, this is all we need to understand about Gradient Boosting for classification.
Important
All these concepts aside, you can directly use the GradientBoostingClassifier from scikit-learn to do all of these steps for you.
Now that you have some intuition about how Gradient Boosting works, please go through the scikit-learn documentation to see the parameters that you can fine-tune.
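A minimal example of the scikit-learn estimator, reusing X and the Exited column from earlier (the hyper-parameter values are just illustrative):

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, df["Exited"], test_size=0.2, random_state=42
)

gbc = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42
)
gbc.fit(X_train, y_train)
print("test accuracy:", gbc.score(X_test, y_test))
```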
Good Luck!
We will talk about eXtreme Gradient Boosting in our next article.
Until then, keep learning.
This is the 9th article in the series Forming a Strong Foundation. Here are the links to the previous articles: