Gradient Boosting: Introduction, Implementation, and Mathematics behind it - For Classification

A detailed beginner friendly introduction and an implementation in Python.

Gradient Boosting (GB) is an ensemble technique that can be used for both regression and classification. In our last article, we talked about Gradient Boosting in the context of regression. You can find the article here. In this article, we will talk about Gradient Boosting in the context of classification.

Before we talk about GB, let's talk about why GB is an important algorithm.

There are a couple of reasons. Three of the main ones are:

  1. Resilient to Overfitting
  2. Good at choosing important features
  3. Adaptive Learning - It corrects the errors of the previous model and learns adaptively.
  4. One more thing I would like to add as a bonus: GB gives us high accuracy (for both regression and classification).

For all of these reasons, GB is considered one of the best algorithms out there.


Before we dive into the sea, let's recap some concepts.

Recap

Boosting

Boosting is an ensemble technique where weak models are added sequentially to create a stronger model. Weak models are usually models whose predictive power is similar to (or only slightly better than) a coin flip.

Base Learners in Ensemble Learning

With each iteration, the weak learners are expected to get better and better. Typically, the weak models are based on Decision Trees.


In Gradient Boosting, the first model f_0(x) is always a simple model. The other models are decision trees (f_1(x) through f_n(x)), where n is the total number of weak base learners.

And the final model F(x) is the combination of all those models.
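Written out, and using λ for the learning rate (more on that in the Bonus section at the end), the combined model looks roughly like this:

    F(x) = f_0(x) + λ·f_1(x) + λ·f_2(x) + ... + λ·f_n(x)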

Note: We will work on binary classification with a Bank Churn dataset in this article. The goal is to predict whether a customer continues with their account or closes it (i.e., churns).

There is a Kaggle Notebook that goes with this article. I would suggest you go through that as well.

Let's get started.

Steps in Gradient Boosting

  1. Initial Model (f_0(x))
  2. Calculate Residuals
  3. Train Weak Learners to predict the residuals
  4. Update the Predictions
  5. Repeat steps 2-4.

1. Initial Model (f_0(x))

This is the first step of the algorithm. Similar to the regression problem, we calculate an initial prediction in classification problems as well.

The initial model is always a simple model. In order to calculate the prediction of the initial model, we need two things:

  1. log(odds)
  2. probability from the log(odds)

The odds is the ratio of the total number of 1s to the total number of 0s in our target. For our dataset, it is the ratio of the total number of 1s in the Exited column to the total number of 0s in the Exited column.

From the data, we found the number of 1s and 0s:

Total number of 1 = 34921

Total number of 0 = 130113

Now the odds and log(odds) will be:

odds = 34921 / 130113 ≈ 0.268

log(odds) = ln(0.268) ≈ -1.32

Now the probability from the log(odds) will be:

probability = e^(log(odds)) / (1 + e^(log(odds))) ≈ 0.268 / 1.268 ≈ 0.21

This tells us that the probability of Exited being 1 is 0.21.

So, if we go by the threshold of 0.5, then 0.21 corresponds to 0.

Note: In classification, we often calculate a probability; if the probability is > 0.5 the prediction is 1, and if it is < 0.5 the prediction is 0.

Hence, the prediction of our initial model is 0.

Meaning, no matter what the input is, the prediction will always be 0 (customer will not close their account).

This is not a correct prediction, since some customers will close their accounts under certain circumstances.

So we need to fix this and take our prediction towards the correct ones. This is where our Step 2 starts.


Note: You can also calculate the probabilities by passing normalize=True to the value_counts() function.
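As a rough sketch of Step 1 in code (assuming the data is loaded into a pandas DataFrame called df with the target column Exited; the file name below is hypothetical):

    import numpy as np
    import pandas as pd

    df = pd.read_csv("bank_churn.csv")              # hypothetical file name

    counts = df["Exited"].value_counts()            # number of 0s and 1s
    odds = counts[1] / counts[0]                    # 34921 / 130113 ≈ 0.268
    log_odds = np.log(odds)                         # initial log(odds) ≈ -1.32
    p0 = np.exp(log_odds) / (1 + np.exp(log_odds))  # initial probability ≈ 0.21

    # or simply:
    p0 = df["Exited"].value_counts(normalize=True)[1]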


2. Calculate Residuals

In order to fix the problem, we first calculate the residuals.

A residual is nothing but the true value minus the predicted probability.

In our case: residual_1 = Exited - probability from Step 1.
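Continuing the same sketch, the residuals for every observation can be computed in one line (p0 is the initial probability from Step 1):

    # residual_1 = true label minus the initial predicted probability
    df["residual_1"] = df["Exited"] - p0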

Note: The residuals are larger (in magnitude) for the observations the initial model got wrong. Now we take these residual values and create a new model. This is where we start building a Decision Tree model.

3. Train Weak Learners

We usually use a Decision Tree as the weak learner here.

The inputs to our decision tree will be the features of our data,

and the target will be the residuals that we calculated (residual_1).

Here is a sneak peek of the data.

Because our target (residual_1) is continuous, we train a Decision Tree Regressor here.

Since we are training a weak model, we will set the depth of the tree to 3.
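Here is a minimal sketch of this step, assuming X holds the (already numeric, encoded) feature columns of df; the exact feature preparation is in the accompanying notebook:

    from sklearn.tree import DecisionTreeRegressor

    X = df.drop(columns=["Exited", "residual_1"])   # assumed feature set
    # a shallow tree (depth 3) acts as our weak learner f_1(x)
    tree_1 = DecisionTreeRegressor(max_depth=3, random_state=42)
    tree_1.fit(X, df["residual_1"])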

After training, this is what the decision tree for our weak model 1 (f_1(x)) looks like:

Now,

We need a little trick here.

Remember, we have to calculate the log(odds) and then the probability from the log(odds) to get a prediction from the model.

However, here we only have the predicted residuals of model 1.

To convert these residuals to log(odds), we have a formula. For each leaf node of the tree:

log(odds) of the leaf = (sum of residuals in the leaf) / (sum of previous probability × (1 - previous probability) over the samples in the leaf)

Please note that the residuals here are residual_1 (not the predicted residuals).

Calculate log(odds)


Let's quickly look at our model; the leaf nodes at the bottom are 2, 3, 5, and 6.

Now we have to calculate the log(odds) for leaf nodes 2, 3, 5, and 6.

In order to do that, we need to know all of the samples that fall into them. From the decision tree, we can see that leaf node 2 has 52,739 samples, leaf 3 has 71,577 samples, leaf 5 has 24,635 samples, and leaf 6 has 16,083 samples.

There is a little trick in scikit-learn that helps us see which leaf each sample falls into.
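One way to do this (my assumption about which trick is meant) is the apply() method of a fitted tree, which returns the id of the leaf node each sample ends up in:

    # leaf node id for every sample (2, 3, 5, or 6 for this tree)
    df["leaf_node_ids_1"] = tree_1.apply(X)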

After getting this output, we can plug it into the formula to calculate the log(odds).

For simplicity, let's calculate the log(odds) of leaf 3.


We have the previous probability (p_0), the residuals (residual_1), and the leaf node each sample falls into (leaf_node_ids_1).

Now we can use all of these values to calculate the log(odds).

For leaf node 3, we have 71,577 samples in total; the residual values look like -0.211599 (or 0.788401 for customers who exited), and the previous probability is 0.211599.

We get -0.947. So,

log(odds) for 3 = -0.947        
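Here is a sketch of that calculation using the leaf formula from above; p0 is the constant 0.211599, since the initial model predicts the same probability for every customer:

    # samples that fall into leaf 3
    in_leaf_3 = df["leaf_node_ids_1"] == 3

    numerator = df.loc[in_leaf_3, "residual_1"].sum()   # sum of residuals in the leaf
    denominator = (p0 * (1 - p0)) * in_leaf_3.sum()     # sum of p_prev * (1 - p_prev)
    leaf_3_log_odds = numerator / denominator           # ≈ -0.947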

Now,

We add this log(odds) to the previous log(odds) from the initial model (about -1.32). The sum will be the updated log(odds) after weak model 1.

-2.267 is the updated log(odds) for leaf node 3.

Similarly, we can calculate the values for leaf nodes 2, 5, and 6 as well. We get:

log(odds) for 2 = -1.299

log(odds) for 5 = 1.187

log(odds) for 6 = -1.012

Now, let's calculate the probabilities with the formula:

probability = e^(log(odds)) / (1 + e^(log(odds)))

The prediction probabilities will be:

probability for leaf 2 = 0.214

probability for leaf 3 = 0.094

probability for leaf 5 = 0.766

probability for leaf 6 = 0.267
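You can verify these numbers quickly; the log(odds) values below are the updated ones from above:

    import numpy as np

    log_odds_per_leaf = {2: -1.299, 3: -2.267, 5: 1.187, 6: -1.012}
    for leaf, log_odds in log_odds_per_leaf.items():
        prob = np.exp(log_odds) / (1 + np.exp(log_odds))
        print(leaf, round(prob, 3))   # 0.214, 0.094, 0.766, 0.267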

4. Updating the Predictions

Now depending on those probabilities we calculated, we will update our predictions.

If we go by the 0.5 threshold, then any probability above 0.5 gives us 1 and any probability below 0.5 gives us 0.

Meaning, our leaf nodes 2, 3, and 6 will predict 0.

And leaf node 5 will predict 1.

This is the output of our combined model f_0 and f_1.

Here are our predictions when we combine the two models.

You can see that the prediction for the last observation is incorrect.

In order to fix that, we repeat Steps 2-4 again and again (which is our Step 5) and add more and more models.
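Putting Steps 1-4 into a loop, here is a rough from-scratch sketch of the whole procedure (not the notebook's exact code; it assumes the same numeric feature matrix X as before):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    y = df["Exited"].values
    learning_rate = 0.1
    n_trees = 100

    # Step 1: initial log(odds), the same for every sample
    log_odds = np.full(len(y), np.log(y.sum() / (len(y) - y.sum())))

    for m in range(n_trees):
        prob = np.exp(log_odds) / (1 + np.exp(log_odds))
        residuals = y - prob                               # Step 2: residuals

        tree = DecisionTreeRegressor(max_depth=3)          # Step 3: weak learner
        tree.fit(X, residuals)

        leaf_ids = tree.apply(X)
        for leaf in np.unique(leaf_ids):                   # Step 4: leaf-wise log(odds) update
            mask = leaf_ids == leaf
            gamma = residuals[mask].sum() / (prob[mask] * (1 - prob[mask])).sum()
            log_odds[mask] += learning_rate * gamma

    final_prob = np.exp(log_odds) / (1 + np.exp(log_odds))
    final_pred = (final_prob > 0.5).astype(int)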

Here is the result of our final predictions:

Bonus

You can also add a hyperparameter called the learning rate (λ) while adding the two log(odds).

The value of λ lets us move the log(odds) faster or slower, depending on its value.

The value of λ ranges from 0 to 1.
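Concretely, for a sample that lands in a given leaf of the new tree, the update becomes:

    new log(odds) = previous log(odds) + λ × log(odds) of that leaf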


Great, this is all we need to understand about Gradient Boosting for Classification.


Important

All these concepts aside, you can directly use the GradientBoostingClassifier from scikit-learn to do all of these steps for you.

Now that you have some intuition about how Gradient Boosting works, please go through the scikit-learn documentation to see the parameters that you can fine-tune.
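Here is a minimal sketch of the library version, assuming the same X and y as above; the parameter values are just illustrative:

    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    gb = GradientBoostingClassifier(
        n_estimators=100,    # number of weak learners
        learning_rate=0.1,   # the λ from the Bonus section
        max_depth=3,         # depth of each weak tree
    )
    gb.fit(X_train, y_train)
    print(gb.score(X_test, y_test))   # accuracy on held-out data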

Good Luck!


We will talk about eXtreme Gradient Boosting in our next article.

Until then, keep learning.


This is the 9th article in the series Forming a Strong Foundation. Here are the links to the previous articles:

  1. Why Should I Learn from the Beginning?
  2. Linear Regression: Introduction
  3. Regression: Evaluation Metrics/Loss Functions
  4. Decision Tree: Introduction
  5. Random Forest: Introduction & Implementation in Python
  6. Boosting: Introduction
  7. AdaBoost: Introduction, Implementation and Mathematics behind it.
  8. Gradient Boosting: Introduction, Implementation, and Mathematics behind it - For Regression


