The Strength of Weak Learnability


1. Introduction

I have decided to devote my next set of posts to the Boosting algorithms - AdaBoost (Adaptive Boosting), Gradient Boosting and XGBoost (eXtreme Gradient Boosting). Needless to say, there are several excellent blogs, YouTube videos and textbooks on the topic. However, these notes are meant to reinforce my understanding of the family of boosting algorithms - I'm more than happy if the content is useful to other data science enthusiasts around the globe.

Let us start from the roots!

2. Boosting – Background and Philosophy

Strong Learners vs Weak Learners

Complex problems in areas of Artificial Intelligence (AI) may be solved using “strong learners” such as neural network models. However, models such as deep neural networks (DNNs) require (a) more learning parameters, depending upon the size of the network, (b) more data to make the model learn well and predict more generically, and (c) higher hardware requirements. No doubt solving such problems is the direction in which we're moving, especially in the area of Deep Learning research. However, we will often come across situations wherein we have limitations: limitations in terms of the data available for training a model, limitations in terms of the hardware available, and so on. It is in such instances that “weak learners” come to our rescue!

Let us now revisit the abstract and the takeaway of the paper by Robert E. Schapire titled “The Strength of Weak Learnability”. The paper dates back to 1990(!) but its takeaway is very relevant even today; I have also attached the paper. Schapire states in the paper: “A concept class is learnable (or, strongly learnable) if, given access to a source of examples of the unknown concept, the learner with high probability is able to output a hypothesis that is correct on all but an arbitrarily small fraction of the instances. A concept class is weakly learnable if the learner can produce a hypothesis that performs only slightly better than random guessing.”

Above, "hypothesis" roughly means an equation (a model) together with its weights/parameters.

Takeaway of the paper:

In the paper, it is shown that the two notions of learnability are equivalent. In other words, if a problem can be solved by a weak learner (even only slightly better than chance), it can also be solved strongly, and vice versa. The notion of weak learnability was introduced by Kearns and Valiant (1988, 1989), who left open the question of whether the notions of strong and weak learnability are equivalent. The question was termed the “hypothesis boosting” problem, since showing that the notions are equivalent requires a method of boosting the low accuracy of a weak learning hypothesis.

In the “hypothesis boosting” mechanism, instead of constructing one hypothesis we construct more than one hypothesis, have them all make predictions, and then go by the majority vote. Combining the hypotheses results in a stronger prediction, which improves accuracy and thus makes the notion of "weak learnability" equivalent to strong learnability.
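To make the idea concrete, here is a minimal sketch (not taken from Schapire's paper) that trains several decision stumps on random subsets of a toy dataset and combines them by majority vote; the dataset, number of stumps and sampling scheme are purely illustrative choices.

```python
# Combine weak hypotheses (decision stumps) by majority vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
stumps = []
for _ in range(25):
    # Each weak hypothesis sees a different random subset of the training data.
    idx = rng.choice(len(X_train), size=len(X_train) // 2, replace=False)
    stumps.append(DecisionTreeClassifier(max_depth=1).fit(X_train[idx], y_train[idx]))

# Majority vote over the weak hypotheses gives the combined, "boosted" prediction.
votes = np.array([s.predict(X_test) for s in stumps])   # shape: (n_stumps, n_test)
majority = (votes.mean(axis=0) > 0.5).astype(int)

print(f"single stump accuracy:  {stumps[0].score(X_test, y_test):.3f}")
print(f"majority-vote accuracy: {(majority == y_test).mean():.3f}")
```

Typically the majority vote is noticeably more accurate than any single stump, which is the essence of hypothesis boosting.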

3. Algorithms based on the concept of Hypothesis Boosting

3.1 Random Forests

In the content above, I have discussed the notion of strong and weak learnability and how these concepts inspired the “hypothesis boosting” mechanism, which essentially means combining more than one hypothesis to boost the accuracy of an otherwise single, weak hypothesis for solving a learning problem.

Let us now, in this and the subsequent sections, look at some of these algorithms in some detail: Random Forests, Adaptive Boosting (AdaBoost), Gradient Boosting and XGBoost (eXtreme Gradient Boosting).

Decision Trees and Random Forests:

Decision trees are the building blocks of Random Forests – a good source for the details on the Decision Trees is the MIT Lecture by Tamara Broderick (https://youtu.be/ZOiBe-nrmc4).

Since Decision Trees are sensitive to the data they are trained on, the Random Forests algorithm randomly samples from the data set and uses each sample (each sample normally being of the same size N) to train a separate decision tree.


Another very important difference to highlight between Random Forests and Decision Trees is that in a regular Decision Tree algorithm the ordering of the features selected at each split of the tree is based on minimum entropy (the term “entropy” denoting the measure of randomness in the prediction); that is, we select first those features which result in the maximum “Information Gain”. Contrary to this, in Random Forests we randomly select/sample the features for each tree.

Thus, in Random Forests the trees are trained on random samples of the data set as well as on a random distribution of features amongst the trees, and the class with the majority vote amongst all the trees is used for the prediction.
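As a quick illustration, here is a minimal sketch using scikit-learn's RandomForestClassifier, where bootstrap sampling gives each tree its random data sample and `max_features` controls the random feature subset considered at each split; the dataset and hyperparameter values are illustrative only.

```python
# Random Forest: bootstrap samples per tree + random feature subsets per split.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,      # number of trees voting on the final class
    max_features="sqrt",   # random subset of features considered at each split
    bootstrap=True,        # each tree trained on a random sample (with replacement)
    random_state=0,
)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))
```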

Why does randomness help in Random Forests?

The random distribution of data samples amongst the trees, as well as the random distribution of features, proves powerful in getting good accuracy: the higher the randomness (i.e. the more trees there are), the better the confidence in the prediction.

This can be treated as analogous to a Monte Carlo analysis. As an example, suppose we are using a random number generator to produce a number between 0 and 100 and betting on the number being between 0 and 60. The greater the number of times you play the game, the greater the number of wins, and a Monte Carlo simulation of such a scenario shows exactly this: the more draws you make, the more closely the observed win fraction approaches the true probability. This is the situation with the randomness in Random Forests: the more trees there are, the richer the random distribution of features and data set samples, and thus the higher the confidence in the prediction.
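A tiny simulation of the betting analogy (with illustrative numbers) shows the win fraction settling down as the number of plays grows:

```python
# Monte Carlo sketch of the betting analogy: draw a number in [0, 100] and
# count a "win" whenever it lands between 0 and 60.
import numpy as np

rng = np.random.default_rng(42)
for n_plays in (10, 100, 1_000, 100_000):
    draws = rng.integers(0, 101, size=n_plays)   # random integer in [0, 100]
    wins = (draws <= 60).sum()                   # bet: the number is between 0 and 60
    print(f"{n_plays:>7} plays -> win fraction {wins / n_plays:.3f}")
```

With more plays, the observed win fraction converges towards the true probability, just as the vote of a Random Forest stabilises as more trees are added.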

3.2 Adaptive Boosting

From the above content, it can generally be said that weak learners can be combined to make a strong prediction. Let us now look at another algorithm, “Adaptive Boosting” (AdaBoost), which is also based on the notion of combining weak learners so that together they become equivalent to a “strong learner”.

In the case of Random Forests, the weak learner was a full-sized Decision Tree, and it was mentioned in section 3.1 above that we randomly sample both the data set and the features amongst the trees. Each decision tree has an equal say in the final decision.

In Adaptive Boosting we make the process more efficient. In Adaptive Boosting the overall model is split into N weak models; let us consider that each weak learner is a decision tree. Each decision tree is a decision stump (see the figure below) with a single split, and N here refers to the number of features. Thus, we create a decision stump for every feature, and the first decision stump chosen will be the one with the lowest measure of randomness (Entropy / Gini Index).

All the data samples are assigned weights, and to start with all data samples are weighted equally. Once we have trained on these data samples, we then assign higher weights to those samples that are misclassified. Now the second decision stump (or weak model) will focus more on those samples which were misclassified. The sample weights, and the importance (the "say") of each hypothesis, are updated in every round of training, and the final combined classifier is used to make a stronger decision.

As can be interpreted from the above paragraphs and the section on Random Forests, in Random Forests it is possible to parallelize the jobs on a multiprocessor machine because each decision tree is built independently of the others. In AdaBoost, it is clear from the above explanation that the order of the stumps does matter: the errors the first stump makes influence the weights of the training samples used for the second stump, because each stump is built by taking into account the mistakes of the preceding stump.


Figure: Decision Stump (source: Wikipedia)
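For completeness, here is a minimal AdaBoost sketch using scikit-learn's AdaBoostClassifier, whose default weak learner is exactly such a decision stump; the dataset and the number of boosting rounds are illustrative choices.

```python
# AdaBoost: a sequence of decision stumps, each reweighting the training samples
# that the previous stumps misclassified.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ada = AdaBoostClassifier(n_estimators=100, random_state=0)  # 100 sequential stumps
ada.fit(X_train, y_train)
print("test accuracy:", ada.score(X_test, y_test))
```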


3.3 Gradient Boosting

The above sections emphasized the concept of strong and weak learners [https://lnkd.in/eBgXxJzr], followed by a discussion of some algorithms comprising a series of weak learners, such as Random Forests [https://lnkd.in/eiFFrVSM] and AdaBoost [https://lnkd.in/eBgj3uYS].

In fact, Adaptive Boosting discussed in section 3.2 above may be considered a special case of “Gradient Boosting”. In Adaptive Boosting the weighting of the data points follows an exponential curve, which means that if a training sample is misclassified by a weak learner, we weight it heavily so that the next weak learner focuses on getting it right. The weighting of the data points in AdaBoost is shown below:

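Assuming the standard AdaBoost formulation, each stump $t$ is given an "amount of say" $\alpha_t$ based on its total error $E_t$, and the sample weights are then scaled exponentially (and re-normalized to sum to 1) so that misclassified samples count more in the next round:

$$\alpha_t = \frac{1}{2}\ln\!\left(\frac{1 - E_t}{E_t}\right)$$

$$w_i \leftarrow w_i \cdot e^{+\alpha_t} \ \text{(misclassified samples)}, \qquad w_i \leftarrow w_i \cdot e^{-\alpha_t} \ \text{(correctly classified samples)}$$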

The total error is the summation of all the sample weights of misclassified data points.

Contrary to the above approach for calculating the weights of the misclassified samples, in Gradient Boosting the model learns through the gradient. The gradient is used to minimize the loss function and thereby "learn" the corrections. In each round of training the error (prediction vs. the truth) is computed, and the gradient is calculated as the partial derivative of the loss function. Since in gradient boosting we are combining the predictions of multiple models, the contributions are additive and accumulate through the training of each tree.
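As a rough sketch of this idea for squared-error loss (where the negative gradient is simply the residual between the truth and the current prediction), one can fit each new tree to the residuals and add its scaled prediction to the ensemble; the dataset, learning rate and tree depth below are illustrative choices.

```python
# Minimal gradient-boosting loop for squared-error loss: each round fits a small
# regression tree to the residuals (the negative gradient) and adds it in.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

learning_rate = 0.1
prediction = np.full_like(y, y.mean(), dtype=float)  # start from the mean prediction
trees = []

for _ in range(100):
    residuals = y - prediction                        # negative gradient of 1/2*(y - F)^2
    tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)     # additive update of the ensemble
    trees.append(tree)

print("RMSE after boosting:", np.sqrt(np.mean((y - prediction) ** 2)))
```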

Having the corrections learnt through the gradient allows far more flexibility than updating the weights exponentially as in Adaptive Boosting.

One of the caveats of gradient boosting is that the training process becomes slower because of the gradient computation, especially when the training data is enormous; this part is taken care of in XGBoost.

3.4 XGBoost (eXtreme Gradient Boosting)

XGBoost is Gradient Boosting with some performance enhancements. XGBoost stores training data in compressed sparse column (CSC) format, uses second-order gradients when optimizing the loss function, and uses L1 and L2 regularization to prevent overfitting and improve generalization. Training can also be parallelized/distributed across clusters.
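Here is a minimal training sketch, assuming the `xgboost` Python package is installed, with the L1/L2 regularization terms exposed as `reg_alpha` and `reg_lambda`; all hyperparameter values are illustrative rather than tuned.

```python
# XGBoost classifier with explicit L1/L2 regularization on the leaf weights.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = XGBClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=4,
    reg_lambda=1.0,   # L2 regularization
    reg_alpha=0.0,    # L1 regularization
    random_state=0,
)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```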

As such, XGBoost has been a cornerstone in competitive machine learning, being the technique used to win and recommended by winners. For example, here is what some Kaggle competition winners have said:

"As the winner of an increasing amount of Kaggle competitions, XGBoost showed us again to be a great all-round algorithm worth having in your toolbox."

-- Dato Winners’ Interview

"When in doubt, use xgboost."

-- Avito Winner’s Interview

About csc format: https://scipy-lectures.org/advanced/scipy_sparse/csc_matrix.html
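A small example of what the CSC layout looks like in SciPy (values stored column by column, with per-column pointers):

```python
# Compressed sparse column (CSC) format: only the non-zero values are stored,
# column by column, which is the kind of layout XGBoost exploits for sparse data.
import numpy as np
from scipy.sparse import csc_matrix

dense = np.array([[0, 0, 3],
                  [4, 0, 0],
                  [0, 5, 0]])
sparse = csc_matrix(dense)

print(sparse.data)     # non-zero values, stored column by column: [4 5 3]
print(sparse.indices)  # row index of each stored value
print(sparse.indptr)   # where each column starts in `data`
```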


4. List of tutorials (Jupyter notebooks)

Here is a set of simple tutorials that I compiled in Jupyter notebooks as an insight exercise, with some parametric studies.

  1. Exercise 1: Training a Decision Tree (Salary Dataset)
  2. Exercise 2: Training a Decision Tree (Titanic Dataset)
  3. Exercise 3: Training a Random Forest (Digits Dataset)
  4. Exercise 4: Training a Random Forest (Iris Dataset)
  5. 01_Introduction to Gradient Boosting
  6. 02_Introduction to XGBoost
  7. 03_Developing an XGBoost Model (Pima Indians Diabetes Dataset)
  8. 04_Monitor Performance and Early Stopping of XGBoost Model (Pima Indians Diabetes Dataset)
  9. 05_Feature Importance with XGBoost (Pima Indians Diabetes Dataset)
  10. 06_Configuring an XGBoost Model (Pima Indians Diabetes Dataset)
  11. 07_Hyperparameter Tuning with XGBoost Model (Pima Indians Diabetes Dataset)

Github: https://lnkd.in/eUgjwgTs

Disclaimer:

I have compiled these tutorials as a self-learning/insight exercise. The code may NOT be in the best form.

References:

  1. Robert E. Schapire, The Strength of Weak Learnability (1990)
  2. Jason Brownlee, XGBoost With Python
  3. https://towardsdatascience.com/understanding-random-forest-58381e0602d2
  4. https://machinelearningmastery.com/boosting-and-adaboost-for-machine-learning/
  5. https://www.analyticsvidhya.com/blog/2021/09/adaboost-algorithm-a-complete-guide-for-beginners/
  6. https://lnkd.in/eAB4g8nJ
  7. Several web sources
