Using Random Forest to Learn Imbalanced Data

One of the most common difficulties in classification is imbalanced data.

Many practical classification problems are imbalanced; i.e., at least one of the classes constitutes only a very small minority of the data. For such problems, the interest usually leans towards the correct classification of the “rare” class (i.e., the “positive” class). Examples include fraud detection, network intrusion, and rare disease diagnosis.

The Problem with Existing Classification Algorithms

The most commonly used classification algorithms do not work well for imbalanced data because they aim to minimize the overall error rate rather than paying special attention to the positive class.

Handling Imbalanced Data

  • Cost-sensitive learning: assign a high cost to misclassifying the minority class and minimize the overall cost.
  • Sampling techniques: down-sample the majority class, over-sample the minority class, or both (a sketch follows below).
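As a rough illustration of the sampling route, here is a minimal sketch using scikit-learn's resample utility; the synthetic 95/5 dataset and the exact splits are assumptions made for the example, not part of any particular problem.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.utils import resample

# Synthetic 95% / 5% imbalanced data (an assumption for the example).
X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05],
                           random_state=42)
X_maj, y_maj = X[y == 0], y[y == 0]   # majority class
X_min, y_min = X[y == 1], y[y == 1]   # minority class

# Down-sample the majority class to the minority-class size ...
X_maj_dn, y_maj_dn = resample(X_maj, y_maj, replace=False,
                              n_samples=len(y_min), random_state=42)
X_down = np.vstack([X_maj_dn, X_min])
y_down = np.concatenate([y_maj_dn, y_min])

# ... or over-sample the minority class up to the majority-class size.
X_min_up, y_min_up = resample(X_min, y_min, replace=True,
                              n_samples=len(y_maj), random_state=42)
X_up = np.vstack([X_maj, X_min_up])
y_up = np.concatenate([y_maj, y_min_up])
```

The cost-sensitive route is illustrated later, in the Weighted Random Forest section.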

Solving Data Imbalance using Random Forest

We will apply both of these techniques to handle imbalanced data with Random Forest:

  • Incorporating class weights into the RF classifier, making it cost-sensitive so that misclassifying the minority class is penalized more heavily (Weighted Random Forest).
  • Combining the sampling technique with the ensemble idea: down-sample the majority class and grow each tree on a more balanced data set (Balanced Random Forest).

Random Forest

Random forest (Breiman, 2001) is an ensemble of unpruned classification or regression trees, induced from bootstrap samples of the training data, using random feature selection in the tree induction process. Prediction is made by aggregating the predictions of the ensemble (majority vote for classification, averaging for regression). Random forest generally exhibits a substantial performance improvement over single-tree classifiers such as CART and C4.5.

However, similar to most classifiers, RF can also suffer from the curse of learning from an extremely imbalanced training data set. As it is constructed to minimize the overall error rate, it will tend to focus more on the prediction accuracy of the majority class, which often results in poor accuracy for the minority class.
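To make this concrete, here is a minimal baseline sketch on synthetic 95/5 data (an assumption for the example); the classification report typically shows much lower recall for the minority class than for the majority class.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic 95% / 5% imbalanced data (an assumption for the example).
X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=42)

rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
# Recall for class 1 (the minority) is usually much lower than for class 0.
print(classification_report(y_test, rf.predict(X_test), digits=3))
```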

To handle imbalanced data with Random Forest, we can use two variations of the algorithm:

Balanced Random Forest

Random Forest induces each constituent tree from a bootstrap sample of the training data. In learning extremely imbalanced data, there is a significant probability that a bootstrap sample contains few or even none of the minority class, resulting in a tree with poor performance for predicting the minority class.

A naive way of fixing this problem is to use a stratified bootstrap; i.e., sample with replacement from within each class. This still does not solve the imbalance problem entirely.
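A minimal sketch of such a stratified bootstrap (the helper below is hypothetical, written for illustration rather than taken from any library):

```python
import numpy as np

def stratified_bootstrap(y, rng):
    """Sample indices with replacement within each class, so every class
    keeps its original count in the bootstrap sample."""
    idx = []
    for cls in np.unique(y):
        members = np.where(y == cls)[0]
        idx.append(rng.choice(members, size=len(members), replace=True))
    return np.concatenate(idx)

rng = np.random.default_rng(0)
y = np.array([0] * 950 + [1] * 50)    # 95% / 5% imbalance
sample = stratified_bootstrap(y, rng)
print(np.bincount(y[sample]))         # still 950 majority / 50 minority cases
```

The class proportions are preserved, but each bootstrap sample is still just as imbalanced as the original data.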

Over-Sampling vs. Down-Sampling

Making the class priors equal, either by down-sampling the majority class or over-sampling the minority class, is usually more effective with respect to a given performance measure, and down-sampling appears to have an edge over over-sampling. However, down-sampling the majority class may result in a loss of information, since a large part of the majority class is never used.

The Balanced Random Forest (BRF) algorithm therefore induces each tree of the ensemble on balanced, down-sampled data.

Balanced Random Forest (BRF) algorithm

  1. For each iteration in Random Forest, draw a bootstrap sample from the minority class. Randomly draw the same number of cases, with replacement, from the majority class.
  2. Induce a classification tree from the data to maximum size, without pruning. The tree is induced with the CART algorithm, with the following modification: At each node, instead of searching through all variables for the optimal split, only search through a set of m randomly selected variables.
  3. Repeat the two steps above the desired number of times. Aggregate the predictions of the ensemble and make the final prediction (a sketch follows below).
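Here is a minimal sketch of these three steps built from scikit-learn decision trees; the helper names are made up for illustration, and class labels are assumed to be small non-negative integers.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def balanced_random_forest(X, y, n_trees=100, m="sqrt", random_state=0):
    """Sketch of BRF: each tree is grown on a balanced bootstrap sample."""
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    min_idx = np.where(y == classes[np.argmin(counts)])[0]
    maj_idx = np.where(y == classes[np.argmax(counts)])[0]
    trees = []
    for _ in range(n_trees):
        # Step 1: bootstrap the minority class, then draw the same number
        # of majority-class cases with replacement.
        boot = np.concatenate([
            rng.choice(min_idx, size=len(min_idx), replace=True),
            rng.choice(maj_idx, size=len(min_idx), replace=True),
        ])
        # Step 2: grow an unpruned CART tree, searching only m randomly
        # selected variables at each split (max_features handles this).
        tree = DecisionTreeClassifier(max_features=m,
                                      random_state=int(rng.integers(1 << 31)))
        tree.fit(X[boot], y[boot])
        trees.append(tree)
    return trees

def brf_predict(trees, X):
    # Step 3: aggregate the per-tree predictions by majority vote
    # (assumes integer class labels).
    votes = np.stack([t.predict(X) for t in trees]).astype(int)
    return np.array([np.bincount(col).argmax() for col in votes.T])
```

The imbalanced-learn package also provides a ready-made BalancedRandomForestClassifier that follows the same recipe.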

Weighted Random Forest

Another approach to make Random Forest more suitable for learning from extremely imbalanced data follows the idea of cost-sensitive learning.

Since the RF classifier tends to be biased towards the majority class, we shall place a heavier penalty on misclassifying the minority class. We assign a weight to each class, with the minority class given larger weight (i.e., higher misclassification cost).

The class weights are incorporated into the RF algorithm in two places (a sketch follows the list below):

  • In the tree induction procedure, class weights are used to weight the Gini criterion for finding splits.
  • In the terminal nodes of each tree, class weights are again taken into consideration. The class prediction of each terminal node is determined by “weighted majority vote”; i.e., the weighted vote of a class is the weight for that class times the number of cases for that class at the terminal node.
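In scikit-learn, the class_weight parameter of RandomForestClassifier is a close analogue: the weights enter the impurity computation used to find splits and the weighted sample counts in the leaves, though the final aggregation differs slightly from the weighted vote described above. A minimal sketch, where the 1:10 weighting is an arbitrary choice for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic 95% / 5% imbalanced data (an assumption for the example).
X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05],
                           random_state=42)

# Give the minority class (label 1) ten times the weight of the majority
# class; class_weight="balanced" would derive weights from class frequencies.
wrf = RandomForestClassifier(n_estimators=200,
                             class_weight={0: 1, 1: 10},
                             random_state=42)
wrf.fit(X, y)
```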

The final class prediction for RF is then determined by aggregating the weighted vote from each individual tree, where the weights are average weights in the terminal nodes. Class weights are an essential tuning parameter to achieve desired performance. The out-of-bag estimate of the accuracy from RF can be used to select weights.
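A minimal sketch of selecting the minority-class weight from out-of-bag estimates; the candidate grid and the use of OOB minority-class recall rather than plain OOB accuracy are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic 95% / 5% imbalanced data (an assumption for the example).
X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05],
                           random_state=42)

best_w, best_recall = None, -1.0
for w in [1, 5, 10, 25, 50]:              # candidate minority-class weights
    rf = RandomForestClassifier(n_estimators=200,
                                class_weight={0: 1, 1: w},
                                oob_score=True, random_state=42)
    rf.fit(X, y)
    # Out-of-bag predictions for the training cases; no separate
    # validation set is needed.
    oob_pred = rf.oob_decision_function_.argmax(axis=1)
    recall = (oob_pred[y == 1] == 1).mean()   # OOB recall of the minority class
    if recall > best_recall:
        best_w, best_recall = w, recall

print(f"selected minority weight: {best_w} (OOB minority recall {best_recall:.3f})")
```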
