Handling the class imbalance problem in machine learning

Why read this?

Classification is one of the most important tasks in inductive and machine learning. Suppose a trained model predicts the wrong class. That is an error, and it not only hurts the model's accuracy; the real-life impact of the error can be high as well.

A class-imbalanced dataset can be a cause of such errors. If you are interested in the causes of these errors and in ways to handle them, this document will help you.

Technical explanation

The class imbalance problem in machine/statistical learning is the observation that some binary classification algorithms do not perform well when the proportion of class-0 to class-1 instances is very skewed.

Class-imbalanced datasets occur in many real-world applications where the class distribution of the data is highly skewed.


Reasons for class imbalance

The class imbalance problem becomes meaningful only if one or both of the following two conditions hold:

  • Class distribution imbalance - It happens when the class distribution in the test data differs from that of the training data. Consider the following example of a fraud detection model (refer to the diagram below). Instances of fraud happen once per 200 transactions in this dataset, so in the true distribution about 0.5% of the data is positive. Notice the imbalance between these two classes (positive and negative).
[Figure: fraud dataset in which about 0.5% of transactions are positive (fraud)]
  • Cost imbalance - It happens when the cost of different types of error (false positives and false negatives in binary classification) is not the same. For example, if carrying a bomb is the positive class, then missing a terrorist who carries a bomb onto a flight is much more expensive than searching an innocent person.


Handling the class distribution imbalance problem

With so few positives relative to negatives (as in the fraud dataset diagram above), the training process will spend most of its time on negative examples and not learn enough from positive ones. For example, if your batch size is 128, many batches will contain no positive examples, so the gradients will be less informative.
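A quick back-of-the-envelope check of that claim, assuming the 0.5% positive rate from the fraud example above and independent sampling:

# Probability that a random batch of 128 contains zero positives,
# assuming a 0.5% positive rate and independently drawn examples.
p_positive = 0.005
batch_size = 128
print((1 - p_positive) ** batch_size)  # ~0.53: about half of all batches

So roughly every second batch carries no signal about the positive class at all.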

To solve this, downsampling and upsampling techniques help (refer to the picture below). The basic idea is to increase the relative representation of the smaller class; a minimal sketch follows the picture.

[Figure: downsampling the majority class vs. upsampling the minority class]
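Here is a minimal sketch of the upsampling side in plain NumPy; upsample_minority is an illustrative helper for this article, not a library function:

import numpy as np

rng = np.random.default_rng(42)

def upsample_minority(X, y, minority_label=1):
    """Randomly repeat minority-class rows (sampling with replacement)
    until both classes have the same number of examples."""
    minority = np.flatnonzero(y == minority_label)
    majority = np.flatnonzero(y != minority_label)
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    idx = np.concatenate([majority, minority, extra])
    rng.shuffle(idx)
    return X[idx], y[idx]

# Tiny demo: 6 negatives and 2 positives become a balanced 6/6 split.
X = np.arange(16).reshape(8, 2)
y = np.array([0, 0, 0, 0, 0, 0, 1, 1])
Xb, yb = upsample_minority(X, y)
print(np.bincount(yb))  # [6 6]

Downsampling works the same way in reverse: randomly drop majority-class rows instead of repeating minority-class ones.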

Note that the upsampling/downsampling approach can't be used if the minority class is too small (refer to the outliers in the diagram below). Outlier detection algorithms are designed for such rare events, and this document discusses a few of them.

You can also refer to my write-up here for details on the outlier detection approach; a minimal sketch follows the diagram below.

[Figure: minority-class instances so rare that they appear as outliers]
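As a hedged sketch of that route, using scikit-learn's IsolationForest on synthetic data (the contamination value here is an assumption for the demo, not a recommendation):

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))  # bulk of "normal" transactions
X[:5] += 6                      # a handful of rare, extreme events

# contamination is the assumed fraction of outliers in the data
detector = IsolationForest(contamination=0.005, random_state=0)
labels = detector.fit_predict(X)   # -1 flags outliers
print(np.flatnonzero(labels == -1))

Unlike a classifier, the detector needs no positive labels at all; it only learns what "normal" looks like.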

 

Handling the cost imbalance problem

Cost-sensitive learning is a subfield of machine learning that takes the costs of prediction errors (and potentially other costs) into account when training a model. The goal of cost-sensitive learning is to minimise the cost of a model on a training dataset.

In cost-sensitive learning, instead of each instance being either correctly or incorrectly classified, each class (or instance) is given a misclassification cost. Thus, instead of trying to optimise accuracy, the problem is to minimise the total misclassification cost (refer to the table below for a two-class model).

[Table: two-class cost matrix with entries C(0,0), C(0,1), C(1,0) and C(1,1)]

Note that here cost refers to the penalty associated with an incorrect prediction, as shown in the example above [C(0,1) and C(1,0)].

For example, we might assign no cost to correct predictions in each class, a cost of 5 for False Positives and a cost of 88 for False Negatives.

                   | Actual Negative | Actual Positive
Predicted Negative |        0        |       88
Predicted Positive |        5        |        0
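As a minimal sketch of scoring predictions against such a cost matrix (plain NumPy; total_misclassification_cost is an illustrative helper for this article):

import numpy as np

# Cost matrix indexed as COST[predicted, actual], using the numbers above:
# row 0 = predicted negative (correct, false negative),
# row 1 = predicted positive (false positive, correct).
COST = np.array([[0, 88],
                 [5,  0]])

def total_misclassification_cost(y_true, y_pred):
    """Sum the cost of every prediction under the cost matrix."""
    return int(COST[np.asarray(y_pred), np.asarray(y_true)].sum())

# One false negative (88) plus one false positive (5) plus one correct (0):
print(total_misclassification_cost([1, 0, 0], [0, 1, 0]))  # 93

This is the quantity a cost-sensitive learner minimises in place of the plain error count.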


Cost-Sensitive ML Algorithms

Machine learning algorithms are rarely developed specifically for cost-sensitive learning.

Instead, the wealth of existing machine learning algorithms can be modified to make use of the cost matrix.

For neural networks, the backpropagation algorithm can be updated to weight misclassification errors in proportion to the importance of each class; such models are referred to as weighted neural networks or cost-sensitive neural networks. This allows the model to pay more attention to examples from the minority class than the majority class in datasets with a severely skewed class distribution (see the diagram below for a hint, courtesy of the IEEE paper linked here; a simplified sketch follows the diagram).

[Figure: cost-sensitive (weighted) neural network training, from the IEEE paper]
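As a hedged sketch of the same weighting idea, using scikit-learn's LogisticRegression rather than a neural network (the weights 5 and 88 simply echo the cost example above):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic dataset with roughly 0.5% positives, mirroring the fraud example.
X, y = make_classification(n_samples=2000, weights=[0.995], random_state=0)

# class_weight scales each class's errors in the loss, so the rare positive
# class (cost 88) gets far more attention than the negative class (cost 5).
clf = LogisticRegression(class_weight={0: 5, 1: 88}, max_iter=1000)
clf.fit(X, y)

The same principle carries over to neural networks by scaling each example's loss term with its class weight before backpropagation.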


Point to remember


There can be many types of cost, as listed below:

  • Cost of misclassification errors (or prediction errors more generally).
  • Cost of tests or evaluation.
  • Cost of teacher or labeling.
  • Cost of intervention or changing the system from which observations are drawn.
  • Cost of unwanted achievements or outcomes from intervening.
  • Cost of computation or computational complexity.
  • Cost of cases or data collection.
  • Cost of human-computer interaction or framing the problem and using software to fit and use a model.
  • Cost of instability or variance, known as concept drift.

The above list highlights that the misclassification cost discussed for imbalanced classification is just one of the range of costs that the broader field of cost-sensitive learning might consider.


References
Thanks to these helping hands
https://stats.stackexchange.com/questions/247871/what-is-the-root-cause-of-the-class-imbalance-problem
https://cling.csd.uwo.ca/papers/cost_sensitive.pdf
https://machinelearningmastery.com/cost-sensitive-learning-for-imbalanced-classification/
https://images.app.goo.gl/MzfZ4mSvXfjZEoQx7
https://machinelearningmastery.com/cost-sensitive-neural-network-for-imbalanced-classification/
https://arxiv-export-lb.library.cornell.edu/pdf/1508.03422
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.8285&rep=rep1&type=pdf
https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
https://ieeexplore.ieee.org/document/9189349
https://developers.google.com/machine-learning/data-prep/construct/sampling-splitting/imbalanced-data
https://arxiv-export-lb.library.cornell.edu/pdf/1801.10269
https://images.app.goo.gl/ED6tvGGMAmo6568y7
https://www.dhirubhai.net/posts/dpkumar_anomalydetection-cybersecurity-machinelearningmodels-activity-6767977267718184960-5xbh
