The Strength of Weak Learnability


1. Introduction

I have decided to devote my next set of posts to the Boosting algorithms - AdaBoost (Adaptive Boosting), Gradient Boosting and XGBoost (eXtreme Gradient Boosting). Needless to say, there are several excellent blogs, YouTube videos and textbooks on the topic. However, these notes are meant to reinforce my understanding of the family of boosting algorithms - I'm more than happy if the content is useful to other data science enthusiasts around the globe.

Let us start from the roots!

2. Boosting – Background and Philosophy

Strong Learners vs Weak Learners

Complex problems in areas of Artificial Intelligence (AI) may be solved using “strong learners” such as neural network models. However, models such as deep neural networks (DNNs) require (a) more learning parameters, depending upon the size of the network, (b) more data to make the model learn well and predict more generically, and (c) higher hardware requirements. No doubt solving such problems is the direction in which we're moving, especially in the area of Deep Learning research. However, we will often come across situations wherein we have limitations: limitations in terms of the data available for training a model, limitations in terms of the hardware available, and so on. It is in such instances that “weak learners” come to our rescue!

Let us now revisit the abstract and the takeaway of the paper by Robert E. Schapire titled “The Strength of Weak Learnability”. The paper dates back to 1990(!) but its takeaway is very relevant even today; I have also attached the paper. Schapire states in the paper: “A concept class is learnable (or, strongly learnable) if, given access to a source of examples of the unknown concept, the learner with high probability is able to output a hypothesis that is correct on all but an arbitrarily small fraction of the instances. A concept class is weakly learnable if the learner can produce a hypothesis that performs only slightly better than random guessing.”

Above, "hypothesis" roughly means an equation (a model) together with its weights/parameters.

Takeaway of the paper:

In the paper, it is shown that the two notions of learnability are equivalent. In other words, if a problem can be solved by a weak learner (even only slightly better than chance), it can also be solved strongly, and vice versa. The notion of weak learnability was introduced by Kearns and Valiant (1988, 1989), who left open the question of whether the notions of strong and weak learnability are equivalent. The question was termed the “hypothesis boosting” problem, since showing that the notions are equivalent requires a method of boosting the low accuracy of a weak learning hypothesis.

In the “hypothesis boosting” mechanism, instead of constructing one hypothesis we construct more than one hypothesis, have them all make predictions, and then go by the majority vote. Combining the hypotheses results in a stronger prediction, which improves accuracy and thus makes the notion of "weak learnability" equivalent to strong learnability.
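To make the idea concrete, here is a minimal sketch (not taken from Schapire's paper) that trains several decision stumps on random subsets of a toy dataset and combines them by majority vote; the dataset, number of stumps and sampling scheme are purely illustrative choices.

```python
# Combine weak hypotheses (decision stumps) by majority vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
stumps = []
for _ in range(25):
    # Each weak hypothesis sees a different random subset of the training data.
    idx = rng.choice(len(X_train), size=len(X_train) // 2, replace=False)
    stumps.append(DecisionTreeClassifier(max_depth=1).fit(X_train[idx], y_train[idx]))

# Majority vote over the weak hypotheses gives the combined, "boosted" prediction.
votes = np.array([s.predict(X_test) for s in stumps])   # shape: (n_stumps, n_test)
majority = (votes.mean(axis=0) > 0.5).astype(int)

print(f"single stump accuracy:  {stumps[0].score(X_test, y_test):.3f}")
print(f"majority-vote accuracy: {(majority == y_test).mean():.3f}")
```

Typically the majority vote is noticeably more accurate than any single stump, which is the essence of hypothesis boosting.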

3. Algorithms based on the concept of Hypothesis Boosting

3.1 Random Forests

In the content above, I have discussed the notion of strong and weak learnability and how these concepts inspired the “hypothesis boosting” mechanism, which essentially means combining more than one hypothesis to boost the accuracy of an otherwise single, weak hypothesis for solving a learning problem.

Let us now, in this and the subsequent sections, look at some of these algorithms in some detail: Random Forests, Adaptive Boosting (AdaBoost), Gradient Boosting and XGBoost (eXtreme Gradient Boosting).

Decision Trees and Random Forests:

Decision trees are the building blocks of Random Forests – a good source for the details on the Decision Trees is the MIT Lecture by Tamara Broderick (https://youtu.be/ZOiBe-nrmc4).

Since Decision Trees are sensitive to the data they are trained on, the Random Forests algorithm randomly samples from the data set and uses each sample (each sample normally being of the same size N) to train a separate decision tree.


Another very important difference to highlight between Random Forests and Decision Trees is that in a regular Decision Tree algorithm the ordering of the features selected at each split of the tree is based on minimum entropy (the term “entropy” denoting the measure of randomness in the prediction); that is, we select first those features which result in the maximum “Information Gain”. Contrary to this, in Random Forests we randomly select/sample the features for each tree.

Thus, in Random Forests the trees are trained on random samples of the data set as well as on a random distribution of features amongst the trees, and the class with the majority vote amongst all the trees is used for the prediction.
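As a quick illustration, here is a minimal sketch using scikit-learn's RandomForestClassifier, where bootstrap sampling gives each tree its random data sample and `max_features` controls the random feature subset considered at each split; the dataset and hyperparameter values are illustrative only.

```python
# Random Forest: bootstrap samples per tree + random feature subsets per split.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,      # number of trees voting on the final class
    max_features="sqrt",   # random subset of features considered at each split
    bootstrap=True,        # each tree trained on a random sample (with replacement)
    random_state=0,
)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))
```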

Why does randomness help in Random Forests?

The random distribution of data samples amongst the trees, as well as the random distribution of features, proves powerful in getting good accuracy: the higher the randomness (i.e. the more trees there are), the better the confidence in the prediction.

This can be treated as analogous to a Monte Carlo analysis. As an example, suppose we are using a random number generator to produce a number between 0 and 100 and betting on the number being between 0 and 60. The greater the number of times you play the game, the greater the number of wins, and a Monte Carlo simulation of such a scenario shows exactly this: the more draws you make, the more closely the observed win fraction approaches the true probability. This is the situation with the randomness in Random Forests: the more trees there are, the richer the random distribution of features and data set samples, and thus the higher the confidence in the prediction.
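A tiny simulation of the betting analogy (with illustrative numbers) shows the win fraction settling down as the number of plays grows:

```python
# Monte Carlo sketch of the betting analogy: draw a number in [0, 100] and
# count a "win" whenever it lands between 0 and 60.
import numpy as np

rng = np.random.default_rng(42)
for n_plays in (10, 100, 1_000, 100_000):
    draws = rng.integers(0, 101, size=n_plays)   # random integer in [0, 100]
    wins = (draws <= 60).sum()                   # bet: the number is between 0 and 60
    print(f"{n_plays:>7} plays -> win fraction {wins / n_plays:.3f}")
```

With more plays, the observed win fraction converges towards the true probability, just as the vote of a Random Forest stabilises as more trees are added.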

3.2 Adaptive Boosting

From the above content, it can generally be said that weak learners can be combined to make a strong prediction. Let us now look at another algorithm, “Adaptive Boosting” (AdaBoost), which is also based on the notion of combining weak learners so that together they become equivalent to a “strong learner”.

In the case of Random Forests, the weak learner was a full-sized Decision Tree, and it was mentioned in section 3.1 above that we randomly sample both the data set and the features amongst the trees. Each decision tree has an equal say in the final decision.

In Adaptive Boosting we make the process more efficient. In Adaptive Boosting the overall model is split into N weak models; let us consider that each weak learner is a decision tree. Each decision tree is a decision stump (see the figure below) with a single split, and N here refers to the number of features. Thus, we create a decision stump for every feature, and the first decision stump chosen will be the one with the lowest measure of randomness (Entropy / Gini Index).

All the data samples are assigned weights, and to start with all data samples are weighted equally. Once we have trained on these data samples, we then assign higher weights to those samples that are misclassified. Now the second decision stump (or weak model) will focus more on those samples which were misclassified. The sample weights, and the importance (the "say") of each hypothesis, are updated in every round of training, and the final combined classifier is used to make a stronger decision.

As can be interpreted from the above paragraphs and the section on Random Forests, in Random Forests it is possible to parallelize the jobs on a multiprocessor machine because each decision tree is built independently of the others. In AdaBoost, it is clear from the above explanation that the order of the stumps does matter: the errors the first stump makes influence the weights of the training samples used for the second stump, because each stump is built by taking into account the mistakes of the preceding stump.


Figure: Decision Stump (source: Wikipedia)
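For completeness, here is a minimal AdaBoost sketch using scikit-learn's AdaBoostClassifier, whose default weak learner is exactly such a decision stump; the dataset and the number of boosting rounds are illustrative choices.

```python
# AdaBoost: a sequence of decision stumps, each reweighting the training samples
# that the previous stumps misclassified.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ada = AdaBoostClassifier(n_estimators=100, random_state=0)  # 100 sequential stumps
ada.fit(X_train, y_train)
print("test accuracy:", ada.score(X_test, y_test))
```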


3.3 Gradient Boosting

The above sections emphasized the concept of strong and weak learners [https://lnkd.in/eBgXxJzr], followed by a discussion of some algorithms comprising a series of weak learners, such as Random Forests [https://lnkd.in/eiFFrVSM] and AdaBoost [https://lnkd.in/eBgj3uYS].

In fact, Adaptive Boosting discussed in section 3.2 above may be considered a special case of “Gradient Boosting”. In Adaptive Boosting the weighting of the data points follows an exponential curve, which means that if a training sample is misclassified by a weak learner, we weight it heavily so that the next weak learner focuses on getting it right. The weighting of the data points in AdaBoost is shown below:

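Assuming the standard AdaBoost formulation, each stump $t$ is given an "amount of say" $\alpha_t$ based on its total error $E_t$, and the sample weights are then scaled exponentially (and re-normalized to sum to 1) so that misclassified samples count more in the next round:

$$\alpha_t = \frac{1}{2}\ln\!\left(\frac{1 - E_t}{E_t}\right)$$

$$w_i \leftarrow w_i \cdot e^{+\alpha_t} \ \text{(misclassified samples)}, \qquad w_i \leftarrow w_i \cdot e^{-\alpha_t} \ \text{(correctly classified samples)}$$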

The total error is the summation of all the sample weights of misclassified data points.

Contrary to the above approach for calculating the weights of the misclassified samples, in Gradient Boosting the model learns through the gradient. The gradient is used to minimize the loss function and thereby "learn" the corrections. In each round of training the error (prediction vs. the truth) is computed, and the gradient is calculated as the partial derivative of the loss function. Since in gradient boosting we are combining the predictions of multiple models, the contributions are additive and accumulate through the training of each tree.
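As a rough sketch of this idea for squared-error loss (where the negative gradient is simply the residual between the truth and the current prediction), one can fit each new tree to the residuals and add its scaled prediction to the ensemble; the dataset, learning rate and tree depth below are illustrative choices.

```python
# Minimal gradient-boosting loop for squared-error loss: each round fits a small
# regression tree to the residuals (the negative gradient) and adds it in.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

learning_rate = 0.1
prediction = np.full_like(y, y.mean(), dtype=float)  # start from the mean prediction
trees = []

for _ in range(100):
    residuals = y - prediction                        # negative gradient of 1/2*(y - F)^2
    tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)     # additive update of the ensemble
    trees.append(tree)

print("RMSE after boosting:", np.sqrt(np.mean((y - prediction) ** 2)))
```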

Having the corrections learnt through the gradient allows far more flexibility than updating the weights exponentially as in Adaptive Boosting.

One of the caveats of gradient boosting is that the training process becomes slower because of the gradient computation, especially when the training data is enormous; this part is taken care of in XGBoost.

3.4 XGBoost (eXtreme Gradient Boosting)

XGBoost is Gradient Boosting with some performance enhancements. XGBoost stores training data in compressed sparse column (CSC) format, uses second-order gradients when optimizing the loss function, and uses L1 and L2 regularization to prevent overfitting and improve generalization. Training can also be parallelized/distributed across clusters.
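Here is a minimal training sketch, assuming the `xgboost` Python package is installed, with the L1/L2 regularization terms exposed as `reg_alpha` and `reg_lambda`; all hyperparameter values are illustrative rather than tuned.

```python
# XGBoost classifier with explicit L1/L2 regularization on the leaf weights.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = XGBClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=4,
    reg_lambda=1.0,   # L2 regularization
    reg_alpha=0.0,    # L1 regularization
    random_state=0,
)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```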

As such, XGBoost has been a cornerstone in competitive machine learning, being the technique used to win and recommended by winners. For example, here is what some Kaggle competition winners have said:

"As the winner of an increasing amount of Kaggle competitions, XGBoost showed us again to be a great all-round algorithm worth having in your toolbox."

-- Dato Winners’ Interview

"When in doubt, use xgboost."

-- Avito Winner’s Interview

About csc format: https://scipy-lectures.org/advanced/scipy_sparse/csc_matrix.html
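A small example of what the CSC layout looks like in SciPy (values stored column by column, with per-column pointers):

```python
# Compressed sparse column (CSC) format: only the non-zero values are stored,
# column by column, which is the kind of layout XGBoost exploits for sparse data.
import numpy as np
from scipy.sparse import csc_matrix

dense = np.array([[0, 0, 3],
                  [4, 0, 0],
                  [0, 5, 0]])
sparse = csc_matrix(dense)

print(sparse.data)     # non-zero values, stored column by column: [4 5 3]
print(sparse.indices)  # row index of each stored value
print(sparse.indptr)   # where each column starts in `data`
```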


4. List of tutorials (Jupyter notebooks)

Here is a set of simple tutorials that I compiled in Jupyter notebooks as an insight exercise, with some parametric studies.

  1. Exercise 1: Training a Decision Tree (Salary Dataset)
  2. Exercise 2: Training a Decision Tree (Titanic Dataset)
  3. Exercise 3: Training a Random Forest (Digits Dataset)
  4. Exercise 4: Training a Random Forest (Iris Dataset)
  5. 01_Introduction to Gradient Boosting
  6. 02_Introduction to XGBoost
  7. 03_Developing an XGBoost Model (Pima Indians Diabetes Dataset)
  8. 04_Monitor Performance and Early Stopping of XGBoost Model (Pima Indians Diabetes Dataset)
  9. 05_Feature Importance with XGBoost (Pima Indians Diabetes Dataset)
  10. 06_Configuring an XGBoost Model (Pima Indians Diabetes Dataset)
  11. 07_Hyperparameter Tuning with XGBoost Model (Pima Indians Diabetes Dataset)

Github: https://lnkd.in/eUgjwgTs

Disclaimer:

I have compiled these tutorials as a self-learning/insight exercise. The code may NOT be in the best form.

References:

  1. Robert E. Schapire, The Strength of Weak Learnability (1990)
  2. Jason Brownlee, XGBoost With Python
  3. https://towardsdatascience.com/understanding-random-forest-58381e0602d2
  4. https://machinelearningmastery.com/boosting-and-adaboost-for-machine-learning/
  5. https://www.analyticsvidhya.com/blog/2021/09/adaboost-algorithm-a-complete-guide-for-beginners/
  6. https://lnkd.in/eAB4g8nJ
  7. Several web sources
