AdaBoost: Introduction, Implementation and Mathematics behind it.

A beginner-friendly introduction and an implementation in Python

Introduction

AdaBoost is one of the first ensemble learning techniques to be widely used. It was published in 1995 and has remained popular ever since.

AdaBoost is a short form of Adaptive Boosting. Understanding AdaBoost is important because it helps us create a foundation for understanding the other boosting methods that follow and are widely used nowadays, like XGBoost.

AdaBoost combines multiple weak learners sequentially to create a single strong learner that performs better than any of the individual learners.


Each base learner focuses on the mistakes made by the previous base learner and tries to minimize the error. In other words, it tries to increase the Performance.

From the figure above:

Base Learner 2 will focus on Base Learner 1's mistakes and try to improve the overall model's performance.


Let's start with the implementation of the AdaBoost Algorithm.

This is it. This block of Python code is all you need to start with AdaBoost for a classification problem.
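A minimal sketch of such a block, assuming scikit-learn (the load_breast_cancer dataset below is only a stand-in for your own X and y):

```python
# A minimal AdaBoost classifier in scikit-learn.
# load_breast_cancer is only a stand-in for your own X and y.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The weak learner is a depth-1 decision tree (a decision stump).
# On scikit-learn versions older than 1.2 the parameter is named base_estimator.
model = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=50,  # number of weak learners
    random_state=42,
)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```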

Notice that we are using a Decision Tree inside AdaBoost while creating a model.

AdaBoost Implementation in Python

Here is the Kaggle Notebook, where I have implemented a complete ML workflow with AdaBoost, from importing data to creating a model and testing it. Please go through it once if you haven't already.

How does it actually work?

In a nutshell, AdaBoost works like this:

It takes the data → gives equal sample weights to all observations → creates a decision stump → updates the weights based on that stump's predictions → gives more weight to the incorrectly predicted observations → uses those incorrectly predicted observations (along with the others) in the next learner → repeats the process.

Let's break it down in detail.


1. Initially, the algorithm starts by taking all the observations and giving them equal weights.

Here, each observation is given the same weight. Since we have five observations, that weight is 1/5 = 0.2 for each of them.
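As a tiny illustration of this step (assuming NumPy), the initial sample weights would be:

```python
import numpy as np

n_observations = 5
weights = np.full(n_observations, 1 / n_observations)  # every observation starts at 1/5 = 0.2
print(weights)  # [0.2 0.2 0.2 0.2 0.2]
```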


2. We create our first weak learner using a Decision Tree. In AdaBoost, a decision tree with a depth of 1 is used.

This decision tree with depth 1 is called a Decision Stump.

A Decision Stump is a weak learner because a single split cannot capture all the structure in our data and, on its own, it generalizes poorly.

Here, a decision stump is created for each feature (age, sex, cp), and then, based on the value of the splitting criterion (entropy, Gini impurity, etc.), we choose our first decision stump.

For example, if we have 3 features, then we will have 3 decision stumps. We train all of them, and whichever stump has the lowest entropy/Gini impurity is chosen as our initial base learner.
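Here is a rough sketch of that idea, assuming scikit-learn and a small hypothetical dataset with the features age, sex, and cp (in practice, a single stump fitted on all features at once finds the best split by itself, so this loop is purely illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: three features (age, sex, cp) and a binary target.
X = np.array([[25, 0, 1],
              [40, 1, 2],
              [35, 0, 0],
              [50, 1, 3],
              [60, 0, 2]])
y = np.array([0, 1, 1, 1, 0])
feature_names = ["age", "sex", "cp"]

def weighted_leaf_impurity(stump):
    """Weighted Gini impurity of the stump's leaves (lower means a purer split)."""
    tree = stump.tree_
    leaves = [i for i in range(tree.node_count) if tree.children_left[i] == -1]
    total = tree.weighted_n_node_samples[0]
    return sum(tree.impurity[i] * tree.weighted_n_node_samples[i] for i in leaves) / total

# Fit one depth-1 stump per feature and keep the one with the purest split.
stumps = []
for j, name in enumerate(feature_names):
    stump = DecisionTreeClassifier(max_depth=1, criterion="gini").fit(X[:, [j]], y)
    stumps.append((weighted_leaf_impurity(stump), name, stump))

best_impurity, best_feature, best_stump = min(stumps, key=lambda t: t[0])
print(f"First base learner splits on '{best_feature}' (impurity {best_impurity:.3f})")
```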


3. At this step, we calculate the Total Error and Performance of the stump.

If the red-colored observations are incorrectly predicted by our base learner, then our Total Error is simply the sum of the weights of those incorrectly predicted observations.


Now, the Performance (also denoted α, alpha) of the stump is calculated by: Performance = ½ × ln((1 − Total Error) / Total Error).

Putting in a Total Error of 0.4 (two misclassified observations, each with weight 0.2), we get a Performance value of ½ × ln(0.6 / 0.4) ≈ 0.2027.
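A quick check of that number (the pattern of mistakes below is assumed; it is the one consistent with a Total Error of 0.4):

```python
import numpy as np

weights = np.full(5, 0.2)                                    # the equal starting weights
misclassified = np.array([False, True, False, True, False])  # hypothetical mistakes by the stump

total_error = weights[misclassified].sum()                   # 0.4
performance = 0.5 * np.log((1 - total_error) / total_error)  # alpha
print(total_error, round(performance, 4))                    # 0.4 0.2027
```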


4. Now, we update the weight of each observation. To update the weights, we use two different formulas: one for correctly predicted observations and one for incorrectly predicted observations.

Updated weight for:

  1. Incorrectly predicted observations: new weight = old weight × e^α

Let's see an example: 0.2 × e^0.2027 ≈ 0.245.


  2. Correctly predicted observations: new weight = old weight × e^(−α)

Let's see an example: 0.2 × e^(−0.2027) ≈ 0.163.

As you can see, the weight of an incorrectly predicted observation increases from 0.2 to about 0.245, while the weight of a correctly predicted observation decreases to about 0.163.

The reason is to increase the weights of the incorrectly predicted observations so that they are more likely to be passed on to the next base learner.
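A small sketch of this update, continuing with the same hypothetical mistakes and the α ≈ 0.2027 from above:

```python
import numpy as np

weights = np.full(5, 0.2)
misclassified = np.array([False, True, False, True, False])  # same hypothetical mistakes
alpha = 0.2027                                               # performance of the stump

# Incorrectly predicted: weight * e^(+alpha) -> the weight grows
# Correctly predicted:   weight * e^(-alpha) -> the weight shrinks
new_weights = np.where(misclassified, weights * np.exp(alpha), weights * np.exp(-alpha))
print(np.round(new_weights, 3))  # [0.163 0.245 0.163 0.245 0.163]
```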

5. Now, we have to find a Normalized Weight to ensure the total weight equals 1. Then, we calculate the Cumulative Normalized Weight.

As you can see in the table above, the Normalized Weight is calculated by dividing each updated weight by the sum of all the updated weights,

and the Cumulative Normalized Weight is just the running sum of the normalized weights.

The idea behind creating the Cumulative Normalized Weight is to create different bins: each observation's bin runs from the previous cumulative value up to its own cumulative value.

From the table above, we can see that we have five bins, one per observation.

According to this bin list, a New Dataset is created that the next Base Learner can use.
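Continuing with the same hypothetical numbers, here is a sketch of the normalization and the cumulative weights that define the bins:

```python
import numpy as np

new_weights = np.array([0.163, 0.245, 0.163, 0.245, 0.163])  # updated weights from the previous step

normalized = new_weights / new_weights.sum()  # normalized weights now sum to 1
cumulative = np.cumsum(normalized)            # upper edge of each observation's bin

# Each observation's bin runs from the previous cumulative value up to its own.
lower_edges = np.concatenate(([0.0], cumulative[:-1]))
for i, (lo, hi) in enumerate(zip(lower_edges, cumulative), start=1):
    print(f"Observation {i}: bin [{lo:.3f}, {hi:.3f})")
```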


6. Creating a New Dataset

This is where it gets interesting. To create a new dataset, we find random values between 0 and 1. The total number of random values should be equal to the total number of observations.

We have 5 observations, so we will choose five random values. Let's say they are,

[0.25, 0.4, 0.5, 0.35, 0.82]

Most of the time, the incorrectly classified observations will be selected. Since the bins that contain incorrectly predicted observations cover a larger range, the probability of a random value landing in them is higher.

Now, we see which bins those values fall into.

After looking at the bins table, we get the observation numbers [1, 1, 2, 1, 3].

Now our new dataset will be:

Based on this data, we will repeat the process and proceed with the next Base Learner.
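A sketch of this resampling step, assuming NumPy and the hypothetical bin edges computed above (because those bins are hypothetical, the resulting indices differ from the [1, 1, 2, 1, 3] in the original table; note that observation 2, one of the misclassified ones in our running example, is picked three times, which is exactly the behaviour described above):

```python
import numpy as np

# Hypothetical bin edges from the cumulative normalized weights computed above.
cumulative = np.array([0.167, 0.417, 0.583, 0.834, 1.0])
random_values = np.array([0.25, 0.4, 0.5, 0.35, 0.82])  # the five random draws from the text

# np.searchsorted tells us which bin each random value falls into (0-based).
chosen = np.searchsorted(cumulative, random_values)
print(chosen + 1)  # 1-based observation numbers -> [2 2 3 2 4] with these hypothetical bins

# The new dataset is just those rows of the original data:
# new_X, new_y = X[chosen], y[chosen]   # the next base learner trains on this resample
```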


When to Stop?

The main question is: when do we stop? How many learners should we create? It's totally up to us, but scikit-learn uses 50 as the default.

This means that in scikit-learn, there will be 50 base learners by default.

After the training stops, we will have 50 weak learners (decision stumps) and their Performance values (α), which are also called the weak learners' weights.

Now, when new data is passed to the model, all the learners make their own predictions. That means we will have 50 predictions.
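A sketch of inspecting those pieces in scikit-learn after training (attribute names are scikit-learn's; the dataset is again only a stand-in):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier

X, y = load_breast_cancer(return_X_y=True)   # stand-in data, as in the first snippet
model = AdaBoostClassifier(n_estimators=50, random_state=42).fit(X, y)

print(len(model.estimators_))        # 50 decision stumps (training may stop earlier if one fits perfectly)
print(model.estimator_weights_[:3])  # their performance values (alpha), i.e. the weak learners' weights

# Each stump makes its own prediction, so a single observation gets 50 votes.
stump_votes = [stump.predict(X[:1])[0] for stump in model.estimators_]
print(len(stump_votes))              # 50
```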



How to get the final prediction value?

We can get the final prediction value from the formula: F(x) = sign(α1·f1(x) + α2·f2(x) + … + αM·fM(x)), i.e., the sign of the weighted sum of the weak learners' predictions.

Please note that we are discussing a binary classifier here (labels -1 and 1).
We can adapt this for regression and multi-class problems. AdaBoost.MH, SAMME, and AdaBoost.R2 are some of the algorithms adapted for multi-class and regression problems.

Example:

Let's say that:

we have 3 base learners f1, f2, and f3,

their performances/weights are α1, α2, and α3, with values 0.2, 0.4, and 0.6 respectively,

and our input x is 0.5.

Then, the prediction of our overall model will be F(0.5) = sign(0.2·f1(0.5) + 0.4·f2(0.5) + 0.6·f3(0.5)).
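A tiny numeric sketch of that combination; the individual stump outputs f1(0.5), f2(0.5), f3(0.5) are not specified above, so the ±1 votes below are assumed purely for illustration:

```python
import numpy as np

alphas = np.array([0.2, 0.4, 0.6])    # performance / weights of f1, f2, f3
votes = np.array([1, -1, 1])          # hypothetical outputs f1(0.5), f2(0.5), f3(0.5), each in {-1, +1}

weighted_sum = np.dot(alphas, votes)  # 0.2*1 + 0.4*(-1) + 0.6*1 = 0.4
prediction = np.sign(weighted_sum)    # final class: +1
print(round(weighted_sum, 2), prediction)
```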

Advantages of AdaBoost

  1. Less prone to overfitting
  2. Reduces Bias
  3. Increases accuracy of the model


I hope this article was helpful for you in understanding AdaBoost and how it works.

We will talk about Gradient Boosting next.


This is the 7th article in the series Forming a Strong Foundation. Here are the links to the previous articles:

  1. Why Should I Learn from the Beginning?
  2. Linear Regression: Introduction
  3. Regression: Evaluation Metrics/Loss Functions
  4. Decision Tree: Introduction
  5. Random Forest: Introduction & Implementation in Python
  6. Boosting: Introduction


References and More Resources:

Freund, Y., & Schapire, R. E. (1999). A Short Introduction to Boosting. Journal of Japanese Society for Artificial Intelligence, 14(5), 771–780. https://cs.baylor.edu/~sloan/files/boosting.pdf

Jainvidip. (2024, July 4). Understanding AdaBoost. Medium. https://medium.com/@jainvidip/understanding-adaboost-a3e68c72ac83

Krish Naik. (2019, August 31). What is AdaBoost (BOOSTING TECHNIQUES) [Video]. YouTube. https://www.youtube.com/watch?v=NLRO1-jp5F8

Data Science Wizards. (2023, July 8). Understanding the AdaBoost Algorithm. Medium. https://medium.com/@datasciencewizards/understanding-the-adaboost-algorithm-2e9344d83d9b

AdaBoost. Wikipedia. https://en.wikipedia.org/wiki/AdaBoost
