BxD Primer Series: Bagging Ensemble Models

Hey there!

Welcome to the BxD Primer Series, where we cover topics such as machine learning models, neural nets, GPT, ensemble models, and hyper-automation in a ‘one-post-one-topic’ format. Today’s post is on Bagging Ensemble Models. Let’s get started:

The What:

Bagging (short for Bootstrap aggregating) works by randomly selecting a subset of training data (with replacement) and training a model on that subset. This process is repeated multiple times, with each iteration producing a new model. Then, all trained models are combined by averaging their predictions (in regression or probabilistic problems) or taking a majority vote (in classification problems).

Bagging reduces overfitting and increases the stability and accuracy of the final model. It helps improve the performance of models that are prone to high variance, such as decision trees.

The number of models to include in the ensemble, also known as the bagging size, is an important parameter to tune. A larger number of models generally produces a more accurate ensemble, but at the cost of increased complexity and training time.
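The bootstrap-then-vote procedure described above can be sketched in a few lines. This is a minimal illustration (the synthetic dataset, number of models, and use of scikit-learn decision trees as base learners are all assumptions for the example, not from the post):

```python
import numpy as np
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy classification dataset, purely for illustration
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

rng = np.random.default_rng(0)
n_models = 25
models = []
for _ in range(n_models):
    # Bootstrap sample: draw n rows WITH replacement
    idx = rng.integers(0, len(X), size=len(X))
    models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Majority vote across the ensemble for each point
votes = np.stack([m.predict(X) for m in models])   # shape: (n_models, n_samples)
bagged_pred = np.apply_along_axis(
    lambda col: Counter(col).most_common(1)[0][0], 0, votes)
```

Each tree sees a different resampled view of the data, and the vote smooths out the individual trees' high-variance mistakes.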


Bagging is used with a variety of machine learning algorithms, including decision trees, random forests, K-nearest neighbors (KNN), support vector machines (SVM), and neural networks.

We will cover random forests in more detail below:

Random Forests:

Random forests are an extension of the decision tree algorithm (check our editions on decision trees here and here) that uses the concepts of bagging and feature randomization. They are widely used for a variety of applications, including classification, regression, and feature selection.

The name "random forest" comes from the fact that each decision tree in the ensemble is trained on a random subset of data and a random subset of features.

To build a random forest:

  • Create a large number of decision trees, each trained on a different bootstrap sample of the training data. This reduces the variance of the model and improves its generalization performance.
  • At each split of a decision tree, instead of considering all available features, randomly select a subset of features to consider. This introduces additional randomness into the model, which further reduces overfitting.

To make a prediction, simply aggregate the predictions of all decision trees in the ensemble:

  • For classification problems, use a majority vote to determine the final prediction.
  • For regression problems, average the predictions to get the final prediction.
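Both aggregation rules above can be checked directly against scikit-learn's random forests (the datasets and forest sizes below are illustrative assumptions). For regression, the forest's prediction is exactly the average of its trees' predictions; for classification, we can recover a majority vote from the individual trees:

```python
import numpy as np
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classification: majority vote over the individual trees
Xc, yc = make_classification(n_samples=200, n_features=10, random_state=1)
clf = RandomForestClassifier(n_estimators=50, random_state=1).fit(Xc, yc)
tree_votes = np.stack([t.predict(Xc) for t in clf.estimators_])
majority = (tree_votes.mean(axis=0) > 0.5).astype(int)  # binary vote

# Regression: the forest's prediction is the mean of the trees' predictions
Xr, yr = make_regression(n_samples=200, n_features=10, random_state=1)
reg = RandomForestRegressor(n_estimators=50, random_state=1).fit(Xr, yr)
tree_means = np.stack([t.predict(Xr) for t in reg.estimators_]).mean(axis=0)
```

(One nuance: scikit-learn's classifier actually averages class probabilities rather than hard votes, so the two can differ slightly on borderline points.)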

Advantages of random forests:

  • Ability to handle high-dimensional data with many features.
  • Robust to noise and outliers in data, which makes them a good choice for real-world applications.

Disadvantages of random forests:

  • Because a random forest is an ensemble of decision trees, it can be difficult to interpret how the model is making its predictions.
  • Computationally expensive to train.

Main hyper-parameters of random forests, typically tuned using grid search, random search, or Bayesian optimization with k-fold validation:

  1. n_estimators: The number of decision trees in the forest.
  2. max_depth: The maximum depth of each decision tree.
  3. max_features: The maximum number of features to consider at each split.
  4. min_samples_split: The minimum number of samples required to split an internal node.
  5. min_samples_leaf: The minimum number of samples required to be at a leaf node.
  6. criterion: The function used to measure the quality of a split (e.g. Gini impurity or entropy).
  7. min_impurity_decrease: The minimum impurity decrease required to split an internal node.
  8. ccp_alpha: The complexity parameter used for post-pruning the decision trees to reduce overfitting.
  9. max_leaf_nodes: The maximum number of leaf nodes in each decision tree.
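A tuning run over a few of the hyper-parameters above might look like the sketch below, using grid search with k-fold validation (the dataset and the particular grid values are illustrative assumptions; real searches would span wider ranges):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=12, random_state=0)

# Small illustrative grid over a subset of the hyper-parameters listed above
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [None, 5],
    "max_features": ["sqrt", 0.5],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,                  # k-fold validation with k=5
    scoring="accuracy",
)
search.fit(X, y)
best = search.best_params_  # the combination with the highest CV accuracy
```

`RandomizedSearchCV` has the same interface and is usually preferred when the grid gets large.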

Using Out-of-Bag as Substitute for Validation Set:

When we create a bagging ensemble, we randomly select a subset of the training data (with replacement) to train each model in the ensemble. On average, each bootstrap sample leaves out roughly one-third of the data points. For a given model, these unused data points are referred to as its "out-of-bag" (OOB) samples.

The out-of-bag error (OOB error) is calculated by evaluating the performance of each model in the ensemble on the out-of-bag samples that were not used to train that particular model. By averaging the performance of all the models on their respective out-of-bag samples, we can get an estimate of the generalization error of the entire ensemble model.

  • It allows us to evaluate the performance of the model without the need for a separate validation set. This is useful when the dataset is small.
  • It can also be used to tune the hyper-parameters of a bagging ensemble. For example, we can vary the number of estimators in the ensemble and choose the value that gives the lowest out-of-bag error.
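In scikit-learn, setting `oob_score=True` makes the forest score each training point using only the trees that did not see it, giving exactly this free validation estimate. A sketch (the dataset and the estimator counts being compared are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# oob_score=True evaluates each sample with only the trees
# whose bootstrap sample did not contain it
scores = {}
for n in (25, 100, 300):
    rf = RandomForestClassifier(n_estimators=n, oob_score=True,
                                random_state=0).fit(X, y)
    scores[n] = rf.oob_score_   # accuracy on out-of-bag samples
```

Comparing `scores` across values of `n_estimators` is the OOB-based tuning described in the second bullet above.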

Estimating Feature Importance:

Several methods can be used to rank features by their contribution to the predictive performance of the ensemble model. The most common methods are:

• Mean Decrease Impurity (MDI): Commonly used for tree-based ensemble models such as random forests and gradient boosting machines. MDI calculates the total reduction in impurity (e.g., Gini impurity or entropy) achieved by splitting on a particular feature, averaged over all trees in the ensemble. Features with higher MDI scores are considered more important.
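For scikit-learn forests, MDI is exposed directly as `feature_importances_`, which is normalized to sum to 1. A sketch (the synthetic dataset, where only the first three columns are informative, is an assumption for the example):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# 3 informative features among 10; shuffle=False keeps them in columns 0-2
X, y = make_classification(n_samples=400, n_features=10, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

mdi = rf.feature_importances_       # impurity-based importances, sum to 1
ranking = np.argsort(mdi)[::-1]     # feature indices, most important first
```

A known caveat of MDI is that it is computed on the training data and tends to inflate the importance of high-cardinality features.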

• Permutation Feature Importance (PFI): A permutation-based method that randomly shuffles the values of each feature in the test set and measures the resulting decrease in model performance. This process is repeated for each feature. The idea is that if a feature is important, shuffling its values should lead to a significant drop in accuracy. Features with higher PFI scores are considered more important.

The PFI score of a feature is the baseline accuracy minus the average accuracy over the shuffled repetitions:

PFI(X_j) = Acc − (1/m) · Σ_{i=1..m} Acc_i(X_j)

Where,

  • X_j is the j’th feature
  • m is the number of test samples
  • Acc is the accuracy of the model on the (unshuffled) test set
  • Acc_i(X_j) is the accuracy of the model on the test set when the values of feature j are randomly shuffled for sample i
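scikit-learn implements this shuffle-and-score loop as `permutation_importance`; it repeats the shuffle `n_repeats` times per feature and reports the mean accuracy drop. A sketch (dataset and split are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# shuffle=False keeps the 3 informative features in columns 0-2
X, y = make_classification(n_samples=400, n_features=8, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature in the test set 10 times and record the
# average drop in accuracy relative to the unshuffled baseline
result = permutation_importance(rf, X_te, y_te, n_repeats=10,
                                random_state=0, scoring="accuracy")
pfi = result.importances_mean   # higher drop = more important feature
```

Unlike MDI, PFI is computed on held-out data, so it reflects importance for generalization rather than for fitting the training set.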

• SHapley Additive exPlanations (SHAP): A game-theoretic method that assigns an importance score to each feature based on that feature's contribution to the prediction for each instance in the test set. SHAP values can be used to explain the predictions of individual instances, as well as to estimate feature importance at a global level.

SHAP can also provide additional insight into how each feature contributes to the model's output, which is useful for understanding the model's behavior and identifying areas for improvement. Read the detailed interpretation and calculation here.

• Drop-Column Importance: This method estimates feature importance by training the ensemble model with and without each feature and comparing the difference in performance. The idea is that a drop in performance when a feature is removed indicates the importance of that feature.
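Drop-column importance is simple to implement but expensive, since it retrains the model once per feature. A sketch using cross-validated accuracy as the performance measure (the dataset and the choice of cross-validation are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=6, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)

def cv_accuracy(features):
    """Mean 5-fold CV accuracy of a fresh forest on the given columns."""
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    return cross_val_score(rf, features, y, cv=5).mean()

baseline = cv_accuracy(X)
# Retrain without each column; the bigger the drop, the more important it is
drops = [baseline - cv_accuracy(np.delete(X, j, axis=1))
         for j in range(X.shape[1])]
```

Because every feature requires a full retraining, this method is usually reserved for models with a modest number of features.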

In general, it is a good idea to use multiple methods to estimate feature importance and compare the results when determining the final ranking.

Time for you to support:

  1. Reply to this article with your question
  2. Forward/Share to a friend who can benefit from this
  3. Chat on Substack with BxD (here)
  4. Engage with BxD on LinkedIn (here)

In upcoming posts, we will cover three more ensemble models: Boosting, Ensemble of Experts, and Bayesian Model Averaging.

Let us know your feedback!

Until then,

Have a great time!

#businessxdata #bxd #Bagging #Ensemble #timeseries #primer
