Bootstrap Aggregating (Bagging)
We discussed bootstrapping in my previous article. Building on that, we can take a look at another popular machine learning algorithm: Bootstrap Aggregating, also called Bagging. As described on Wikipedia, Bagging is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression.
Before we dive into Bagging, we need to know what an ensemble algorithm is. There is a Chinese saying, “the wisdom of the masses exceeds that of the wisest individual”, or more simply, “two heads are better than one”. That is the core idea of ensemble algorithms: multiple weak classifiers can be better than one strong classifier (I say classifier instead of learner or another term because this article focuses on classification rather than regression). Specifically, assume there is a certain degree of difference between the weak classifiers (for example, different algorithms, or the same algorithm with different parameters), so they produce different classification results. In other words, the weak classifiers make different mistakes, especially on data points close to the decision boundary. If we can “somehow” combine the results from those weak classifiers to get a more reasonable boundary and reduce the overall error, we should be able to achieve better classification results. That “somehow” is where the ensemble algorithm comes in.
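To get a quick feel for why voting helps, here is a small back-of-the-envelope calculation. It assumes the weak classifiers make their mistakes independently, which real classifiers never fully do, so treat the numbers as an upper bound on the benefit rather than a guarantee. The function name and the 70% accuracy figure are just illustrative choices.

```python
from math import comb

def majority_vote_accuracy(p, n):
    """Probability that a majority vote of n independent classifiers,
    each correct with probability p, is correct (n odd)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

# Classifiers that are each right only 70% of the time, but that err
# independently, already beat any single one of them once we vote:
print(majority_vote_accuracy(0.7, 1))   # 0.700 (a single classifier)
print(majority_vote_accuracy(0.7, 3))   # ~0.784
print(majority_vote_accuracy(0.7, 11))  # ~0.922
```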
As mentioned above, Bagging is one such ensemble meta-algorithm. Given a training data set D of size n, Bagging generates m new training sets Di, each of size n’, by sampling from D with replacement (bootstrap sampling). It then uses these m new training sets to train m weak classifiers, and finally combines their outputs by voting for classification (or by averaging the values for regression). That is why the algorithm is called bootstrap aggregating. The weak classifiers can be any algorithm/model, such as a decision tree, an SVM, etc., and they don’t all have to be the same within one Bagging ensemble. Different weak classifiers will return different results, and the result with the most votes becomes the final output of Bagging. A minimal sketch of this procedure is shown below.
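The sketch below hand-rolls the bootstrap-train-vote loop using scikit-learn decision trees as the weak classifiers. The make_classification dataset is just a stand-in for your own data D, and choices like m = 25 and max_depth = 3 are arbitrary; scikit-learn also ships a ready-made BaggingClassifier that packages the same idea.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

m = 25                       # number of bootstrap samples / weak classifiers
n = len(X_train)             # each D_i has the same size n' = n here
rng = np.random.default_rng(0)

classifiers = []
for _ in range(m):
    idx = rng.integers(0, n, size=n)           # draw n indices with replacement
    clf = DecisionTreeClassifier(max_depth=3)  # a deliberately weak tree
    clf.fit(X_train[idx], y_train[idx])
    classifiers.append(clf)

# Aggregate by majority vote. Labels here are 0/1, so the winning class is
# simply whichever side more than half of the trees vote for.
votes = np.array([clf.predict(X_test) for clf in classifiers])  # shape (m, n_test)
y_pred = (votes.mean(axis=0) > 0.5).astype(int)
print("bagged accuracy:", np.mean(y_pred == y_test))
```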
If we tweak Bagging a little, we get another very powerful classifier called Random Forest. Compared to Bagging, Random Forest has the following unique features (a short example follows the list):
- The weak classifiers are all decision trees. That’s why it is a forest.
- For the decision trees, when selecting a split point, the learning algorithm is limited to a random subset of features to search, instead of searching all features and all feature values (if you are not familiar with the decision tree algorithm, I have a previous article that talks about it). That’s why it is called random.
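Here is a brief sketch using scikit-learn’s ready-made implementation, which bundles both tweaks: the weak classifiers are decision trees, and max_features controls the random subset of features each split is allowed to search. Again the dataset and parameter values are only placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # number of trees (the weak classifiers)
    max_features="sqrt",  # each split searches only sqrt(n_features) random features
    random_state=0,
)
print("cross-validated accuracy:", cross_val_score(forest, X, y, cv=5).mean())
```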
Why is it helpful to split the decision trees on a random subset of features? Because Bagging prefers its weak classifiers to be uncorrelated, or at least weakly correlated. Otherwise, if all the classifiers give similar predictions, the Bagging ensemble behaves about the same as a single weak classifier. Unfortunately, the decision trees within a Bagging ensemble are likely to be correlated because the decision tree algorithm is greedy: they all choose which variable to split on using a greedy criterion that minimizes error. So there is a good chance that all the decision trees within a Bagging ensemble share a lot of structural similarities and in turn produce highly correlated predictions. Therefore, we force each split to consider only a random sample of the features, which reduces the correlation between the decision trees. The sketch below shows one rough way to observe that effect for yourself.
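One rough way to check the decorrelation argument is to build two forests on the same data, one whose trees may search every feature at each split (which is essentially plain bagged trees) and one restricted to a random subset, and then compare how strongly the individual trees’ predictions agree. The helper function below and the choice of 50 trees are mine, purely for illustration; run it on your own data to see the numbers.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def mean_pairwise_correlation(forest):
    """Average Pearson correlation between the per-tree predictions."""
    preds = np.array([tree.predict(X_test) for tree in forest.estimators_])
    corr = np.corrcoef(preds)
    upper = np.triu_indices_from(corr, k=1)   # each pair of trees once
    return corr[upper].mean()

bagged_trees = RandomForestClassifier(
    n_estimators=50, max_features=None, random_state=0).fit(X_train, y_train)
random_forest = RandomForestClassifier(
    n_estimators=50, max_features="sqrt", random_state=0).fit(X_train, y_train)

print("bagged trees :", mean_pairwise_correlation(bagged_trees))
print("random forest:", mean_pairwise_correlation(random_forest))
```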