Bagging, Random Forest and AdaBoost
In the previous article, we discussed ensemble learning and the Voting ensemble. I suggest reading that article first before proceeding.
Bagging, also known as Bootstrap Aggregating, is an ensemble technique that aims to improve the accuracy and robustness of a predictive model. It combines the predictions of multiple individual models (base learners) to make a more accurate prediction than any single model.
How does it work?
Bootstrap Sampling: Bagging starts by creating subsets of the training data (row sampling with replacement), so the same row can occur multiple times in a subset. Each subset is usually the same size as the original data, although a smaller size can also be used.
Base model training: A base model is trained independently on each subset. The base model can be a simple one such as a Decision Tree, or another ML algorithm.
Prediction Aggregation: After the base models are trained, their predictions are combined into a final ensemble prediction. In classification, the final prediction is chosen by majority voting; in regression, it is the average of the predictions from the individual models.
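The three steps above can be sketched in plain Python. This is only a minimal illustration, not a production implementation: the toy 1-D dataset, the threshold "stump" used as base learner, and the helper names (bootstrap_sample, train_stump, bagging_predict) are all assumptions made for this example.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Row sampling with replacement: same size as rows, duplicates allowed."""
    return [rng.choice(rows) for _ in rows]

def train_stump(rows):
    """Toy base learner (assumed for illustration): choose the threshold on x
    that makes the fewest mistakes with the rule 'predict 1 when x > threshold'."""
    best_errors, best_threshold = None, None
    for threshold, _ in rows:
        errors = sum(int(x > threshold) != y for x, y in rows)
        if best_errors is None or errors < best_errors:
            best_errors, best_threshold = errors, threshold
    return best_threshold

def bagging_predict(thresholds, x):
    """Classification: aggregate the base model outputs by majority vote."""
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]

rng = random.Random(0)
data = [(x, int(x > 5)) for x in range(11)]   # hypothetical data: label 1 when x > 5

# One independently trained base model per bootstrap sample.
models = [train_stump(bootstrap_sample(data, rng)) for _ in range(25)]

print(bagging_predict(models, 0), bagging_predict(models, 10))
```

For regression, the same structure applies, except that the final step would average the base model outputs instead of counting votes.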
Types of Bagging
Bagging helps reduce the variance of the model.
Random Forest Algorithm: It is based on bagging and is used for both classification and regression. The base models are decision trees, and the data is sampled both row-wise (with replacement) and column-wise (a random subset of features).
Each bootstrap sample contains only about 2/3 of the distinct rows, and the square root of N (where N is the number of features) is the usual starting point for how many columns to choose. You can clearly see this in the above image.
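The "about 2/3" figure can be checked with a quick simulation. The sketch below is an illustration under assumptions: the row count of 10,000 and the helper names are made up for the example, and the square-root rule is treated as a common default rather than a fixed requirement.

```python
import math
import random

def unique_fraction(n_rows, rng):
    """Fraction of distinct rows that land in one bootstrap sample of size n_rows."""
    sample = [rng.randrange(n_rows) for _ in range(n_rows)]
    return len(set(sample)) / n_rows

def features_per_split(n_features):
    """Square-root rule for the number of candidate columns (at least 1)."""
    return max(1, math.isqrt(n_features))

rng = random.Random(42)
n_rows = 10_000                            # hypothetical dataset size
simulated = unique_fraction(n_rows, rng)
expected = 1 - (1 - 1 / n_rows) ** n_rows  # tends to 1 - 1/e, roughly 0.632

print(round(simulated, 3), round(expected, 3))
print(features_per_split(16))              # 4 candidate columns out of 16
```

The exact share is closer to 63% than to 2/3, but "about two thirds" is the usual rule of thumb.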
Why does Random Forest perform well?
Let's take the example of a dataset with the independent features
Age | Salary | Commute distance
and the dependent feature Attrition (whether the person will leave the job or not).
As you can see, salary will always be an important factor here. In a single decision tree, it will mostly be the root node. But in Random Forest, the features are also sampled, so there is a possibility that the salary feature will not be available to some base models (decision trees). Training then happens on the other features, which also contribute significantly to the final prediction. So we can say that Random Forest captures the patterns in the data from various angles. That's why it gives good accuracy.
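To make this concrete, here is a small sketch of random feature selection on the three features from the example. It assumes one candidate feature per draw (the floor of the square root of 3) and 3000 draws; both numbers are choices made for illustration only.

```python
import random

features = ["Age", "Salary", "Commute distance"]
k = 1                       # assumed: floor(sqrt(3)) candidate features per draw
rng = random.Random(7)

# Each draw stands for one place where a tree picks its candidate features.
draws = [rng.sample(features, k) for _ in range(3000)]
salary_share = sum("Salary" in d for d in draws) / len(draws)

print(round(salary_share, 2))   # close to 1/3
```

In roughly two out of three draws Salary is not a candidate at all, so Age and Commute distance get a chance to shape the trees.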
Difference between bagging and Random Forest
In bagging, we can use different algorithms as base models. Each base model uses all available features, which can lead to a high degree of correlation between base models.
In Random Forest, only decision trees are used as base models. Each tree is built in parallel from a sample of the main data, and the trees are independent of each other. It adds an extra layer of randomness by selecting a random subset of features at each node of the decision tree. This feature selection decorrelates the trees, which reduces overfitting and improves generalization.
OOB (Out of Bag): It refers to the data points that are not included in the sample used to train a given base model. We can use them as a validation dataset. This offers the advantage of a nearly unbiased estimate of model performance: because the data is unseen by that base model, it shows how well the model generalizes.
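As a rough sketch of how the out-of-bag rows can be identified for one base model, assuming rows are addressed by index and the helper name bootstrap_with_oob is hypothetical:

```python
import random

def bootstrap_with_oob(n_rows, rng):
    """Return (in-bag row indices, out-of-bag row indices) for one base model."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]   # with replacement
    oob = sorted(set(range(n_rows)) - set(in_bag))            # rows never drawn
    return in_bag, oob

rng = random.Random(3)
in_bag, oob = bootstrap_with_oob(20, rng)

print("in-bag:", sorted(in_bag))
print("out-of-bag:", oob)
```

Each base model would then be scored only on its own out-of-bag rows, and those scores can be combined into a validation estimate without holding out a separate dataset.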
Boosting: It improves the performance of weak models by combining them into a strong predictive model. It works by sequentially training a series of weak models and assigning more weight to the examples that were wrongly predicted by the previous model. The final prediction is a weighted combination of the predictions from all the weak models.
AdaBoost (Adaptive Boosting): In Random Forest, decision trees of any desired depth are used, but in AdaBoost, a tree with one root node and two leaves, known as a stump, is used. Together they form a forest of stumps. Each stump uses only a single feature out of n features, so on its own it is not great at making predictions. Hence stumps are known as weak learners.
Note: Credit for the AdaBoost explanation goes to StatQuest with Josh Starmer and this vlog.
Steps in the AdaBoost Algorithm
In this dataset, we have to predict whether a person has heart disease or not based on the input features.
Step 1. Assign sample weights to records:
Initially all samples have equal weight, so each sample gets a weight of 1 divided by the total number of records. There are 8 records, so each sample has a weight of 1/8.
Step 2. Create stumps in the forest:
The first stump is created by taking chest pain as the root node. For persons having heart disease, 3 records are classified correctly and 2 incorrectly. For persons not having heart disease, 2 records are classified correctly and 1 incorrectly. Similarly, create stumps for Blocked Arteries and Patient Weight.
Their respective Gini indexes are also calculated. The Gini index is lowest for weight, at 0.2, which signifies that the fewest records are classified incorrectly when weight is set as the root node.
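The Gini index of a stump can be computed as the size-weighted average of the Gini impurity of its two leaves. The sketch below applies this to the chest pain counts quoted above, under the assumption that the (3, 2) and (2, 1) counts can be read as the class counts in the two leaves; the function names are made up for this example.

```python
def gini_leaf(counts):
    """Gini impurity of one leaf, given the count of records per class."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_stump(leaves):
    """Gini index of a stump: leaf impurities weighted by leaf size."""
    n = sum(sum(leaf) for leaf in leaves)
    return sum(sum(leaf) / n * gini_leaf(leaf) for leaf in leaves)

# Assumed reading of the chest pain stump: one leaf with 3 vs 2 records,
# the other leaf with 2 vs 1 records.
chest_pain = [(3, 2), (2, 1)]
score = gini_stump(chest_pain)

print(round(score, 3))   # about 0.467, higher (worse) than the 0.2 quoted for weight
```

The stump with the lowest Gini index is picked as the first stump in the forest.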
Step 3. Calculate the performance say: To calculate the performance say, we check how many records are incorrectly classified. In the weight column, the record at index 4, which is 167, is classified incorrectly.
Error = sum of sample weights of misclassified samples = 1/8
The performance say is quite high, which indicates that the accuracy of the stump is good.
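For reference, the standard AdaBoost formula for the performance say is 0.5 * ln((1 - error) / error). The sketch below plugs in the error of 1/8 from above; the small eps guard against an error of exactly 0 or 1 is an addition for this example, not something taken from the walkthrough.

```python
import math

def performance_say(total_error, eps=1e-10):
    """Standard AdaBoost amount of say: 0.5 * ln((1 - error) / error).
    eps keeps the logarithm finite when the error is exactly 0 or 1 (assumption)."""
    total_error = min(max(total_error, eps), 1 - eps)
    return 0.5 * math.log((1 - total_error) / total_error)

say = performance_say(1 / 8)
print(round(say, 3))   # 0.973

# An error of 0.5 (no better than a coin flip) gives a say of 0.
print(performance_say(0.5))
```

The say grows as the error approaches 0 and becomes negative once the error exceeds 0.5.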
Step 4. Update the sample weights: After calculating the error rate, we reduce the error in the next stump by modifying the weights of the records. The records that were misclassified get increased weights, and the correctly classified records get decreased weights. The formula to calculate the updated weights for correctly and incorrectly classified records is given below.
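In the standard AdaBoost update, a misclassified record's weight is multiplied by e^(say) and a correctly classified record's weight by e^(-say). Here is a minimal sketch with 8 records and an error of 1/8; treating the misclassified record as the fourth one (zero-based position 3) is an assumption made for this example.

```python
import math

say = 0.5 * math.log((1 - 1 / 8) / (1 / 8))   # performance say for error = 1/8
weights = [1 / 8] * 8                         # initial equal weights
misclassified = {3}                           # assumed position of the wrong record

updated = [
    w * math.exp(say if i in misclassified else -say)
    for i, w in enumerate(weights)
]

print([round(w, 4) for w in updated])
```

The misclassified record rises from 0.125 to about 0.331, while each correctly classified record drops to about 0.047.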
We have calculated the weights for the correctly and incorrectly classified samples, so let's create the table again.
Step 5. Normalize the weights: The values in the original sample weight column sum to 1, so we need to ensure that the newly updated weights also sum to 1. To do that, we divide each updated weight by the sum of all updated weights. This is known as normalizing the weights.
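Normalization is a single division per record. In the sketch below, the input values 0.0472 and 0.3307 are rounded numbers that follow from the standard update with an error of 1/8; they are assumed here for illustration, as is the position of the larger weight.

```python
def normalize(weights):
    """Scale the weights so that they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]

# Assumed updated weights: seven correctly classified records and one
# misclassified record (placed at zero-based position 3).
updated = [0.0472, 0.0472, 0.0472, 0.3307, 0.0472, 0.0472, 0.0472, 0.0472]
normalized = normalize(updated)

print([round(w, 3) for w in normalized])
print(round(sum(normalized), 6))
```

After normalization the misclassified record carries about half of the total weight, and the other seven records share the rest.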
Remove the unnecessary columns from the dataset.
Step 6. Creation of a new dataset as per bucket: Now we need to prepare the dataset that will be used for the new stump. As record 4 was incorrectly classified in stump 1, we need to make sure that it is correctly classified in stump 2.
So we create a dataset in which record 4 appears more often than any other record. To achieve this, we create buckets of ranges and pull the record whose bucket contains each randomly generated value.
Now we run a loop that generates random values between 0 and 1 and, based on each value, pick the matching bucket. For example, if we run the loop for the first time and get 0.52, we can see from the table that it falls in bucket 5, so we take record no. 5 and insert it into the new dataset. We continue in the same way for 8 iterations and, for each random value, pick the record no. associated with the bucket we obtain. The newly created dataset will consist mostly of record no. 4, as it was incorrectly classified.
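The bucket lookup can be sketched with a running sum of the normalized weights, where each record owns the range between two consecutive cumulative values. The weights and the zero-based position of the heavy record below are assumptions for illustration and do not reproduce the table in the walkthrough.

```python
import bisect
import itertools
import random

def resample_by_weight(weights, n_draws, rng):
    """Pick record positions with probability proportional to their weights."""
    upper_bounds = list(itertools.accumulate(weights))   # bucket upper edges
    picks = []
    for _ in range(n_draws):
        r = rng.random()                                  # value in [0, 1)
        idx = bisect.bisect_right(upper_bounds, r)        # bucket containing r
        picks.append(min(idx, len(weights) - 1))          # guard against rounding
    return picks

# Assumed normalized weights: position 3 holds the misclassified record.
weights = [0.5 / 7] * 3 + [0.5] + [0.5 / 7] * 4
rng = random.Random(11)

new_dataset_rows = resample_by_weight(weights, 8, rng)    # 8 loops, as in the text
print(new_dataset_rows)
```

Because the misclassified record owns the widest bucket, it is expected to be drawn several times, which pushes the next stump to get it right.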
We now go back and repeat step 1, assigning the default weight of 1/8 to each record of the new dataset. In this manner, we keep constructing stumps until we reach a stage where all the records are predicted correctly. The performance say of a stump will be very high when it classifies all the records in its dataset correctly.
In this way, we have the learners ready along with their respective performance say values, which indicate how powerful each stump is. When we pass a new test record to this combination of models, each model generates its own output stating whether the person may have heart disease or not. Consider the case below where, for a given record, a few of the learners classify the record as having heart disease (1) and a few classify it as having no heart disease (0). We sum up the performance say of the models classifying the record as heart disease (1), and do the same for the models classifying it as no heart disease (0). The class with the highest total performance say is taken as the output for the given record.
In the above image, the total performance say for heart disease is greater than that for no heart disease. Hence we conclude that, for the given record, the person has a high chance of having heart disease.
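The final vote can be sketched as a sum of performance say values per class. The stump outputs and say values below are hypothetical numbers chosen for illustration; they are not the values from the image.

```python
from collections import defaultdict

def weighted_vote(stump_outputs):
    """Sum the performance say per predicted class and return the winning class."""
    totals = defaultdict(float)
    for label, say in stump_outputs:
        totals[label] += say
    winner = max(totals, key=totals.get)
    return winner, dict(totals)

# Hypothetical (prediction, performance say) pairs for one test record.
stump_outputs = [(1, 0.97), (1, 0.63), (0, 0.41), (0, 0.32), (0, 0.20)]
label, totals = weighted_vote(stump_outputs)

print(label, totals)
```

Three of the five stumps vote for class 0 here, yet class 1 wins because its two stumps carry more say, which is the difference between this weighted vote and a plain majority vote.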
So this is the entire flow of constructing the models from scratch and getting them ready to test against a new dataset, based on a majority voting policy weighted by performance say.
In the next article, I will talk about Gradient Boosting and XGBoost, so don't forget to read it.
Thank you for reading this article.
Special thanks to Nitish Singh and Krish Naik.