Notes - Decision Trees, Random Forests, Bagging, Boosting (AdaBoost, XGBoost), Stacking
credit - medium.com


Decision Trees

·       They require very little data preparation

·       Don’t require feature scaling or centering at all

·       Gini measures impurity. A node is pure (Gini = 0) if all training instances it applies to belong to the same class

·       Gini impurity is slightly faster to compute than entropy, so it is a good default

·       Decision trees make very few assumptions about the training data

·       Parametric – The model has a predetermined number of parameters, so its degrees of freedom are limited

·       Non-parametric – The number of parameters is not determined prior to training, so the model structure is free to stick closely to the data. Decision Trees are non-parametric models

·       Regularization parameters for Decision tree

o  max_depth – Maximum depth of the decision tree

o  min_samples_split – Minimum number of samples a node must have before it can be split

o  min_samples_leaf – Minimum number of samples a leaf node must have

o  min_weight_fraction_leaf – Same as min_samples_leaf, but expressed as a fraction of the total number of weighted instances

o  max_leaf_nodes – Maximum number of leaf nodes

o  max_features – Maximum number of features that are evaluated for splitting at each node

o  Increasing the min_* hyperparameters or reducing the max_* hyperparameters will regularize the model (see the sketch at the end of this section)

·       Another option is pruning unnecessary nodes after the tree is trained

·       Regression – The predicted value for each region is always the average target value of the instances in that region

·       Regression splitting – the tree splits the training set in a way that minimizes the MSE

·       They love orthogonal decision boundaries (all splits are perpendicular to an axis). This makes the algorithm sensitive to training set rotation

·       PCA often results in a better orientation of the training data

·       They are very sensitive to small variations in the training data

·       A node’s Gini impurity is generally lower than its parent’s. This is due to the CART algorithm’s cost function, which splits each node in a way that minimizes the weighted sum of its children’s Gini impurities
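
Below is a minimal sketch (assuming scikit-learn; the moons dataset and the hyperparameter values are illustrative, not from the notes) of how the regularization hyperparameters above constrain a Decision Tree:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Unregularized tree: grows until every leaf is pure, so it tends to overfit
free_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Regularized tree: larger min_* values and smaller max_* values constrain it
reg_tree = DecisionTreeClassifier(
    max_depth=4,            # limit the depth of the tree
    min_samples_split=10,   # a node needs at least 10 samples to be split
    min_samples_leaf=5,     # every leaf must keep at least 5 samples
    max_leaf_nodes=16,      # cap the total number of leaves
    random_state=42,
).fit(X_train, y_train)

print("unregularized test accuracy:", free_tree.score(X_test, y_test))
print("regularized test accuracy:", reg_tree.score(X_test, y_test))
```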


Ensemble Learning and Random Forests

·       A group of predictors is called an ensemble

·       Train a group of Decision Tree classifiers, each on a different random subset of the training set

·       Even if each classifier is a weak learner, the ensemble can still be a strong learner provided there are a sufficient number of weak learners and they are sufficiently diverse

·       Ensemble methods work best when predictors are as independent of one another as possible

·       Bagging – If we use the same training algorithm for every predictor and train them on different random subsets of the training set, with replacement, it is called Bagging. If sampling is performed without replacement, it is called pasting.

o  The net result of the ensemble is that it has a similar bias but a lower variance than a single predictor trained on the original training set

o  Training and prediction can be done in parallel, so they scale well across CPU cores or servers

o  For bagging, the bias ends up slightly higher and the variance lower than for pasting

o  BaggingClassifier also supports sampling the features, via max_features and bootstrap_features (see the sketch after this list)

o  Sampling both training instances and features is called the Random Patches method; sampling only the features (while keeping all training instances) is called the Random Subspaces method

o  n_jobs = -1 means it will use all the cores available
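
A minimal sketch (assuming scikit-learn; the dataset and hyperparameter values are illustrative) of bagging with both instance and feature sampling. BaggingClassifier's default base estimator is a Decision Tree, so none needs to be passed explicitly:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

bag_clf = BaggingClassifier(
    n_estimators=500,         # number of Decision Trees in the ensemble
    max_samples=100,          # each tree sees 100 training instances...
    bootstrap=True,           # ...sampled with replacement (bagging, not pasting)
    max_features=10,          # also sample 10 features per tree...
    bootstrap_features=True,  # ...with replacement (Random Patches when combined with the above)
    n_jobs=-1,                # use all available CPU cores
    random_state=42,
)
bag_clf.fit(X, y)
```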

           

Random forest

·       Using random thresholds for each feature, rather than searching for the best possible thresholds, is called Extremely Randomized Trees (Extra-Trees). This trades a bit more bias for lower variance, and training is much faster

·       A Random Forest measures feature importance by looking at how much the tree nodes that use a feature reduce impurity on average (across all trees in the forest), scaling the results so that the sum of all importances equals 1 (see the sketch below)
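
A minimal sketch (assuming scikit-learn; the iris dataset is just an example) of reading those impurity-based feature importances:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42)
rnd_clf.fit(iris.data, iris.target)

# feature_importances_ is already scaled so the values sum to 1
for name, score in zip(iris.feature_names, rnd_clf.feature_importances_):
    print(f"{name}: {score:.3f}")
```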


Boosting

·       Ensemble method that can combine several weak learners into a strong learner

·       Predictors are trained sequentially, each trying to correct its predecessor

·       Adaboost

o  Adds predictors to the ensemble, gradually making it better

o  Predictors have different weights, depending on their accuracy on the weighted training set

o  Training cannot be fully parallelized (or only partially), since each predictor depends on its predecessor

o  The weighted error rate is computed on the training set. The predictor’s weight is computed using the weighted error rate. The more accurate the predictor, the higher its weight. Next, the instance weights are calculated, which boosts the weight of the misclassified instances

o  A new predictor is trained using the updated weights

o  The process continues until the desired number of predictors is reached or a perfect predictor is found

o  AdaBoost simply computes the predictions of all the predictors and weighs them using the predictor weights αj. The predicted class is the one that receives the majority of weighted votes

o  Decision stumps – A decision stump is a decision tree with max_depth 1

o  A stump is just one decision node and two leaves; AdaBoost combines many of them into a forest of stumps

o  Stumps are weak learners

o  Some stumps have more say in the final classification than others. Weight (amount of say) for a stump = ½ × log((1 − total error) / total error). The total error will be between 0 and 1. The weight will be high when errors are low, zero for random guessing (total error = 0.5), and negative if the errors are very high (see the sketch after this list)

o  The order is important in a forest of stumps for AdaBoost – the errors made by a stump influence the future stumps

o  The formula for increasing the sample weight of incorrectly classified samples: new weight = sample weight × e^(amount of say)

o  If the amount of say is high, then the sample weight is accordingly scaled. If the amount of say is less (the stump did not do a good job of classifying) then the sample weight will only be a little larger than the old one

o  The sample weights of correctly classified instances are decreased using: new weight = sample weight × e^(−amount of say)

o  Normalize the new sample weights so that they add up to 1. These weights are used by the next stump. The sample weights can be used to calculate weighted Gini indexes to determine which variable the next stump should split on. Alternatively, a new collection of samples containing duplicate copies of the samples with the largest sample weights can be prepared: we pick random numbers and add samples to the new collection until it is the same size as the original.

o  All samples in the new collection are then given equal weights, as at the start

o  The process is repeated to find the stump that does the best job classifying the new collection of samples

o  The final prediction is made by considering the weights (amount of say) of stumps and their prediction.  For example, if the sum of weights for “yes” is higher than the sum of weights for “No” (for all the stumps) then the final prediction will be “yes”.
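
A toy NumPy sketch of a single AdaBoost round using the formulas above. The stump's predictions are hard-coded for illustration; this is not a full AdaBoost implementation (scikit-learn's AdaBoostClassifier uses decision stumps by default):

```python
import numpy as np

y_true = np.array([1, 1, 1, -1, -1, -1, 1, -1])
y_pred = np.array([1, 1, -1, -1, -1, -1, 1, 1])  # the stump misclassifies 2 samples
w = np.full(len(y_true), 1 / len(y_true))         # start with equal sample weights

total_error = w[y_pred != y_true].sum()           # weighted error rate of the stump
amount_of_say = 0.5 * np.log((1 - total_error) / total_error)

# Boost the weights of misclassified samples, shrink the rest, then normalize
w = np.where(y_pred != y_true, w * np.exp(amount_of_say), w * np.exp(-amount_of_say))
w /= w.sum()

print("amount of say:", amount_of_say)
print("new sample weights:", w)
```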



Gradient Boosting

·       Instead of tweaking the instance weights at every iteration, it fits each new predictor to the residual errors made by the previous predictor

·       This applies to Regression as well which is called – (Gradient Boosted Regression Trees – GBRT)

o  It starts by making a single leaf. This represents an initial guess for the values of the samples (the average of the target values)

o  Pseudo-residuals are calculated as (observed − predicted). They come from the squared-error loss ½ (observed − predicted)², which is chosen because its derivative with respect to the prediction is easy to work with

o  Then Gradient boost builds a tree (usually larger than a stump, but the size is restricted). This tree will use the independent variables to predict the residuals. (In practice, the number of leaves is set from 8 to 32). By restricting the total number of leaves, we get fewer leaves than residuals.

o  The residuals that fall in the same leaf are averaged

o  Then the first leaf (from the initial step) is combined with the tree for prediction

o  Gradient boost scales all the trees by the same amount

o  To avoid over-fitting, a learning rate (between 0 and 1) is selected to scale the contribution from the new tree. This will result in a small step in the right direction

o  Empirical evidence shows that taking a lot of small steps in right direction results in better predictions on test data i.e lower variance

o  This process is repeated until it reaches the number of trees we asked for, or until adding additional trees does not significantly reduce the size of the residuals (see the sketch after this list)
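
A toy sketch (assuming scikit-learn and NumPy; the data and hyperparameter values are illustrative) of the GBRT loop described above: start from the average, then repeatedly fit a small tree to the pseudo-residuals and add its learning-rate-scaled predictions:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = X.ravel() ** 2 + rng.normal(0, 0.5, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # the initial leaf: the average target value
trees = []

for _ in range(100):
    residuals = y - prediction                      # pseudo-residuals for squared loss
    tree = DecisionTreeRegressor(max_leaf_nodes=8)  # restricted tree size
    tree.fit(X, residuals)
    prediction += learning_rate * tree.predict(X)   # a small step in the right direction
    trees.append(tree)

print("training MSE:", np.mean((y - prediction) ** 2))
```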


·       Gradient Boost – Classification

o  Start with a leaf – The initial prediction for every individual is the log(odds)

o  Convert the log(odds) to probability

o  Classify everyone in the training dataset based on the probability

o  Calculate the pseudo residuals – (Observed – predicted) values

o  Calculate the output values for each leaf. As the values are based on log(odds) and probability we cannot just add the values like in regression.

o  The formula for transformation = ∑Residuals / ∑ [previous probability * (1 – previous probability)]

o  Combine the output from the initial leaf and the new tree with learning rate for the prediction

o  This will provide the new log(odds) for the prediction. Convert this into probability

o  Repeat this process until the specified number of trees is reached or the residuals become very small (see the sketch after this list)
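
A toy NumPy sketch of the bookkeeping above: converting log(odds) to a probability, computing the pseudo-residuals, and applying the leaf output formula (the leaf assignment here is arbitrary, purely for illustration):

```python
import numpy as np

y = np.array([1, 1, 0, 1, 0, 0])              # observed classes
log_odds = np.log(y.mean() / (1 - y.mean()))  # initial leaf: log(odds) of the data
p = 1 / (1 + np.exp(-log_odds))               # convert log(odds) to a probability

residuals = y - p                             # pseudo-residuals (observed - predicted)

# Output value for a leaf that (say) holds the first three residuals
leaf = residuals[:3]
leaf_p = np.full(3, p)                        # previous probabilities for those samples
output = leaf.sum() / (leaf_p * (1 - leaf_p)).sum()

learning_rate = 0.1
new_log_odds = log_odds + learning_rate * output  # updated log(odds) for that leaf
new_p = 1 / (1 + np.exp(-new_log_odds))           # back to a probability
print(output, new_p)
```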


·       Shrinkage - If you set it to a low value, such as 0.1, you will need more trees in the ensemble to fit the training set, but the predictions will usually generalize better. This is a regularization technique called shrinkage.

·       Stochastic Gradient Boosting – each tree is trained on a random fraction of the training instances (e.g., 25% when the subsample hyperparameter is set to 0.25)

·       Extreme Gradient Boosting (xgboost) – Optimized implementation of the gradient boosting algorithm

o  Excellent execution speed and model performance

o  It uses gradient descent to minimize the loss when adding new models

xgboost for Regression:

o  Make an initial prediction. The default is 0.5, for both regression and classification

o  All the residuals go to the leaf. Calculate the quality or similarity score for the residuals: Similarity = (sum of residuals)² / (number of residuals + λ), where λ (lambda) is a regularization parameter

o  We see if we can do a better job clustering similar residuals if we split them into two groups

o  We take a threshold for the node and calculate the similarities for the leaves. If the residuals in a node are very different, they cancel each other out and the similarity score is relatively small.

o  To quantify how much better the leaves cluster similar residuals than the root, we calculate the gain of splitting the residuals into two groups

o  Gain = Left (similarity) + Right (similarity) – Root (similarity)

o  We compare the gain calculated for the other thresholds. Higher gain is better, so the threshold with the highest gain is selected for splitting the observations

o  The default is to create a tree with depth 6.

o  The tree is pruned based on its Gain values

o  We start by picking up a value (gamma).

o  We calculate the difference between the gain of the lowest branch in the tree and the value of gamma. If the result is negative, we remove the branch. We then work our way up the tree in the same manner

o  When lambda is greater than zero, the similarity scores are smaller. The amount of decrease is inversely proportional to the number of residuals in the node.

o  When lambda is greater than zero it is easier to prune leaves because the values for the gain are smaller

o  Setting gamma to zero does not turn pruning off (a branch’s gain can still be negative when lambda is greater than zero)

o  Output value = sum of residuals / (Number of residuals + lambda)

o  The prediction will be the sum of initial prediction and product of learning rate and output value for the tree

o  The learning rate in XGBoost is called eta, and its default value is 0.3

o  If we keep building trees in this manner, the residuals keep shrinking (see the sketch below)
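
A minimal sketch (assuming the xgboost Python package; the data and hyperparameter values are illustrative) mapping the ideas above onto XGBRegressor's constructor:

```python
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(500, 4))
y = X[:, 0] ** 2 + rng.normal(0, 0.5, size=500)

model = XGBRegressor(
    n_estimators=100,
    max_depth=6,        # default tree depth
    learning_rate=0.3,  # eta, default 0.3
    reg_lambda=1.0,     # lambda in the similarity and output-value formulas
    gamma=0.0,          # minimum gain required to keep a split (pruning threshold)
    base_score=0.5,     # the initial prediction
)
model.fit(X, y)
print(model.predict(X[:5]))
```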


xgboost for Classification:

o  Make a prediction – 0.5

o  The residuals tell us how good the prediction is

o  Similarity score = (sum of residuals)² / (sum of [previous probability × (1 − previous probability)] + λ)

o  Calculate the similarity scores using the residuals

o  Choose a threshold, calculate the similarities for the leaf on left and right

o  Calculate the gain as in XGBoost regression

o  There is also a threshold on the minimum amount of cover required in each leaf (the min_child_weight parameter in xgboost)

o  The cover is defined as the denominator of the similarity score minus lambda

o  Pruning works as in XGBoost regression, using gamma

o  The output value for a leaf = sum of the residuals / (sum of [previous probability × (1 − previous probability)] + λ)

o  To calculate a prediction, take the log(odds) of the initial prediction and add the output of the tree scaled by the learning rate

o  Convert the log odds to probability

o  Continue this until we reach the maximum number of trees or the residuals become very small (see the sketch below)
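
A similar sketch (assuming the xgboost Python package; the dataset is illustrative) for classification. min_child_weight is xgboost's knob for the cover threshold mentioned above:

```python
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

clf = XGBClassifier(
    n_estimators=100,
    learning_rate=0.3,
    reg_lambda=1.0,
    gamma=0.0,
    min_child_weight=1,  # minimum cover required in a leaf
)
clf.fit(X, y)
print(clf.predict_proba(X[:3]))  # probabilities, i.e. the converted log(odds)
```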

xgboost Optimizations:

o  Uses a greedy algorithm to build trees (each split is chosen by its gain, without worrying about how the leaves will be split later). It makes a decision without looking ahead to see if it is the absolute best choice in the long term, which lets it build the tree relatively quickly

o  Use quantiles as candidate thresholds to split the observations

o  Increasing the number of quantiles will provide more accurate predictions

o  To increase speed, only the quantiles are tested as candidate thresholds. The approximate greedy algorithm uses about 33 quantiles (see the sketch after this list)

o  Sketch algorithms are used to find the approximate answers – Weighted quantile sketch. The sum of the weights in each quantile will be the same. The instances will have different weights according to the cover metric

o  The advantage of using the weighted quantile sketch is that we get smaller quantiles when we need them

o  It uses parallel learning to split up the dataset so that multiple computers can work on it at the same time

o  Sparsity-aware split finding tells us how to build trees with missing data

o  Cache-aware access – XGBOOST puts gradients and hessians in the cache so that it can rapidly calculate similarity scores and output values

o  Blocks for out of core computation – The data stored in the hard drive is compressed. xgboost will spend a little bit of CPU time uncompressing the data. This way it avoids spending a lot of time accessing the hard drive

o  If more than one hard drive is available, it uses sharding for faster access
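
A short sketch (assuming the xgboost Python package) of how some of these optimizations surface as constructor options; the parameter names are real xgboost options, the chosen values are illustrative:

```python
from xgboost import XGBRegressor

model = XGBRegressor(
    tree_method="approx",  # approximate split finding using (weighted) quantile sketches
    n_jobs=-1,             # build trees using all available CPU cores
)
```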

Stacking

·       Instead of using Hard / Soft voting, this uses a model (called a blender or meta learner) to predict the aggregation

·       The training set is split into two subsets. The first subset is used to train the predictors in the first layer. Those predictors then make predictions on the second subset, and these predictions, together with the target values, are used to train the blender (see the sketch below)

·       Multiple layers of blenders can be created in this way to get the final predictions
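
A toy sketch (assuming scikit-learn and NumPy; the model choices are illustrative) of the hold-out stacking scheme described above: layer-1 predictors train on the first subset, and the blender trains on their predictions for the second subset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_sub1, X_sub2, y_sub1, y_sub2 = train_test_split(X, y, test_size=0.5, random_state=42)

# Layer 1: train each predictor on the first subset
layer1 = [RandomForestClassifier(random_state=42), SVC(probability=True, random_state=42)]
for predictor in layer1:
    predictor.fit(X_sub1, y_sub1)

# Their predictions on the held-out second subset become the blender's features
blend_features = np.column_stack([p.predict_proba(X_sub2)[:, 1] for p in layer1])

# The blender (meta learner) learns how to aggregate the layer-1 predictions
blender = LogisticRegression().fit(blend_features, y_sub2)
```

scikit-learn's StackingClassifier implements a similar idea, but it builds the blender's training set with cross-validated predictions rather than a single hold-out split.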

This note was prepared using the book Hands-On Machine Learning with Scikit-Learn and TensorFlow and the StatQuest YouTube videos

