Using Random Forest to Learn Imbalanced Data

One of the most common difficulties in classification is imbalanced data.

Many practical classification problems are imbalanced; i.e., at least one of the classes constitutes only a very small minority of the data. For such problems, the interest usually leans towards the correct classification of the “rare” class (i.e., the “positive” class). Examples include fraud detection, network intrusion, and rare disease diagnosis.

The Problem with Existing Classification Algorithms

The most commonly used classification algorithms do not work well for imbalanced data because they aim to minimize the overall error rate rather than paying special attention to the positive class.

Handling Imbalanced Data

  • Cost-sensitive learning: assign a high cost to misclassifying the minority class and minimize the overall cost.
  • Sampling techniques: down-sample the majority class, over-sample the minority class, or both (a sketch follows below).
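As a rough illustration of the sampling route, here is a minimal sketch using scikit-learn's resample utility; the synthetic 95/5 dataset and the exact splits are assumptions made for the example, not part of any particular problem.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.utils import resample

# Synthetic 95% / 5% imbalanced data (an assumption for the example).
X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05],
                           random_state=42)
X_maj, y_maj = X[y == 0], y[y == 0]   # majority class
X_min, y_min = X[y == 1], y[y == 1]   # minority class

# Down-sample the majority class to the minority-class size ...
X_maj_dn, y_maj_dn = resample(X_maj, y_maj, replace=False,
                              n_samples=len(y_min), random_state=42)
X_down = np.vstack([X_maj_dn, X_min])
y_down = np.concatenate([y_maj_dn, y_min])

# ... or over-sample the minority class up to the majority-class size.
X_min_up, y_min_up = resample(X_min, y_min, replace=True,
                              n_samples=len(y_maj), random_state=42)
X_up = np.vstack([X_maj, X_min_up])
y_up = np.concatenate([y_maj, y_min_up])
```

The cost-sensitive route is illustrated later, in the Weighted Random Forest section.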

Solving Data Imbalance using Random Forest

We will apply both of these techniques to handle imbalanced data with Random Forest:

  • Incorporating class weights into the RF classifier, making it cost-sensitive so that misclassifying the minority class is penalized more heavily (Weighted Random Forest).
  • Combining the sampling technique with the ensemble idea: down-sample the majority class and grow each tree on a more balanced data set (Balanced Random Forest).

Random Forest

Random forest (Breiman, 2001) is an ensemble of unpruned classification or regression trees, induced from bootstrap samples of the training data, using random feature selection in the tree induction process. Prediction is made by aggregating the predictions of the ensemble (majority vote for classification, averaging for regression). Random forest generally exhibits a substantial performance improvement over single-tree classifiers such as CART and C4.5.

However, similar to most classifiers, RF can also suffer from the curse of learning from an extremely imbalanced training data set. As it is constructed to minimize the overall error rate, it will tend to focus more on the prediction accuracy of the majority class, which often results in poor accuracy for the minority class.
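To make this concrete, here is a minimal baseline sketch on synthetic 95/5 data (an assumption for the example); the classification report typically shows much lower recall for the minority class than for the majority class.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic 95% / 5% imbalanced data (an assumption for the example).
X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=42)

rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
# Recall for class 1 (the minority) is usually much lower than for class 0.
print(classification_report(y_test, rf.predict(X_test), digits=3))
```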

To handle imbalanced data with Random Forest, we can use two variations of the algorithm:

Balanced Random Forest

Random Forest induces each constituent tree from a bootstrap sample of the training data. In learning extremely imbalanced data, there is a significant probability that a bootstrap sample contains few or even none of the minority class, resulting in a tree with poor performance for predicting the minority class.

A naive way of fixing this problem is to use a stratified bootstrap; i.e., sample with replacement from within each class. This still does not solve the imbalance problem entirely.
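A minimal sketch of such a stratified bootstrap (the helper below is hypothetical, written for illustration rather than taken from any library):

```python
import numpy as np

def stratified_bootstrap(y, rng):
    """Sample indices with replacement within each class, so every class
    keeps its original count in the bootstrap sample."""
    idx = []
    for cls in np.unique(y):
        members = np.where(y == cls)[0]
        idx.append(rng.choice(members, size=len(members), replace=True))
    return np.concatenate(idx)

rng = np.random.default_rng(0)
y = np.array([0] * 950 + [1] * 50)    # 95% / 5% imbalance
sample = stratified_bootstrap(y, rng)
print(np.bincount(y[sample]))         # still 950 majority / 50 minority cases
```

The class proportions are preserved, but each bootstrap sample is still just as imbalanced as the original data.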

Over-Sampling vs. Down-Sampling

Making the class priors equal, either by down-sampling the majority class or over-sampling the minority class, is usually more effective with respect to a given performance measure, and down-sampling appears to have an edge over over-sampling. However, down-sampling the majority class may result in a loss of information, since a large part of the majority class is never used.

The Balanced Random Forest (BRF) algorithm therefore induces each tree of the ensemble on balanced, down-sampled data.

Balanced Random Forest (BRF) algorithm

  1. For each iteration in Random Forest, draw a bootstrap sample from the minority class. Randomly draw the same number of cases, with replacement, from the majority class.
  2. Induce a classification tree from the data to maximum size, without pruning. The tree is induced with the CART algorithm, with the following modification: At each node, instead of searching through all variables for the optimal split, only search through a set of m randomly selected variables.
  3. Repeat the two steps above the desired number of times. Aggregate the predictions of the ensemble and make the final prediction (a sketch follows below).
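Here is a minimal sketch of these three steps built from scikit-learn decision trees; the helper names are made up for illustration, and class labels are assumed to be small non-negative integers.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def balanced_random_forest(X, y, n_trees=100, m="sqrt", random_state=0):
    """Sketch of BRF: each tree is grown on a balanced bootstrap sample."""
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    min_idx = np.where(y == classes[np.argmin(counts)])[0]
    maj_idx = np.where(y == classes[np.argmax(counts)])[0]
    trees = []
    for _ in range(n_trees):
        # Step 1: bootstrap the minority class, then draw the same number
        # of majority-class cases with replacement.
        boot = np.concatenate([
            rng.choice(min_idx, size=len(min_idx), replace=True),
            rng.choice(maj_idx, size=len(min_idx), replace=True),
        ])
        # Step 2: grow an unpruned CART tree, searching only m randomly
        # selected variables at each split (max_features handles this).
        tree = DecisionTreeClassifier(max_features=m,
                                      random_state=int(rng.integers(1 << 31)))
        tree.fit(X[boot], y[boot])
        trees.append(tree)
    return trees

def brf_predict(trees, X):
    # Step 3: aggregate the per-tree predictions by majority vote
    # (assumes integer class labels).
    votes = np.stack([t.predict(X) for t in trees]).astype(int)
    return np.array([np.bincount(col).argmax() for col in votes.T])
```

The imbalanced-learn package also provides a ready-made BalancedRandomForestClassifier that follows the same recipe.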

Weighted Random Forest

Another approach to make Random Forest more suitable for learning from extremely imbalanced data follows the idea of cost-sensitive learning.

Since the RF classifier tends to be biased towards the majority class, we shall place a heavier penalty on misclassifying the minority class. We assign a weight to each class, with the minority class given larger weight (i.e., higher misclassification cost).

The class weights are incorporated into the RF algorithm in two places (a sketch follows the list below):

  • In the tree induction procedure, class weights are used to weight the Gini criterion for finding splits.
  • In the terminal nodes of each tree, class weights are again taken into consideration. The class prediction of each terminal node is determined by “weighted majority vote”; i.e., the weighted vote of a class is the weight for that class times the number of cases for that class at the terminal node.
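In scikit-learn, the class_weight parameter of RandomForestClassifier is a close analogue: the weights enter the impurity computation used to find splits and the weighted sample counts in the leaves, though the final aggregation differs slightly from the weighted vote described above. A minimal sketch, where the 1:10 weighting is an arbitrary choice for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic 95% / 5% imbalanced data (an assumption for the example).
X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05],
                           random_state=42)

# Give the minority class (label 1) ten times the weight of the majority
# class; class_weight="balanced" would derive weights from class frequencies.
wrf = RandomForestClassifier(n_estimators=200,
                             class_weight={0: 1, 1: 10},
                             random_state=42)
wrf.fit(X, y)
```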

The final class prediction for RF is then determined by aggregating the weighted vote from each individual tree, where the weights are average weights in the terminal nodes. Class weights are an essential tuning parameter to achieve desired performance. The out-of-bag estimate of the accuracy from RF can be used to select weights.
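A minimal sketch of selecting the minority-class weight from out-of-bag estimates; the candidate grid and the use of OOB minority-class recall rather than plain OOB accuracy are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic 95% / 5% imbalanced data (an assumption for the example).
X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05],
                           random_state=42)

best_w, best_recall = None, -1.0
for w in [1, 5, 10, 25, 50]:              # candidate minority-class weights
    rf = RandomForestClassifier(n_estimators=200,
                                class_weight={0: 1, 1: w},
                                oob_score=True, random_state=42)
    rf.fit(X, y)
    # Out-of-bag predictions for the training cases; no separate
    # validation set is needed.
    oob_pred = rf.oob_decision_function_.argmax(axis=1)
    recall = (oob_pred[y == 1] == 1).mean()   # OOB recall of the minority class
    if recall > best_recall:
        best_w, best_recall = w, recall

print(f"selected minority weight: {best_w} (OOB minority recall {best_recall:.3f})")
```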
