Handling class imbalance problem in machine learning
Deepak Kumar
Propelling AI To Reinvent The Future || Mentor|| Leader || Innovator || Machine learning Specialist || Distributed architecture | IoT | Cloud Computing
Why read this?
Classification is one of the most important tasks in inductive and machine learning. When a trained model predicts the wrong class, that error hurts not only the measured accuracy of the model; its real-life impact can also be high.
A class-imbalanced dataset can be one cause of such errors. If you are interested in the causes of these errors and in how to handle them, this document is for you.
Technical explanation
The class imbalance problem in machine/statistical learning is the observation that some binary classification algorithms do not perform well when the proportion of class-0 to class-1 examples is heavily skewed.
Class-imbalanced datasets occur in many real-world applications where the class distribution of the data is highly uneven.
Reasons for class imbalance
The imbalanced class problem becomes meaningful only if one or both of the two assumptions below are not true:
- Class distribution imbalance - This happens when the class distribution in the test data differs from that of the training data. Consider the following example of a fraud detection model (refer to the diagram below). Instances of fraud happen once per 200 transactions in this dataset, so in the true distribution about 0.5% of the data is positive. Notice the imbalance between these two classes (positive and negative).
- Cost imbalance - This happens when the cost of different types of error (false positive and false negative in binary classification) is not the same. For example, if carrying a bomb is the positive class, then missing a terrorist who carries a bomb onto a flight is much more expensive than searching an innocent person.
Handling class distribution imbalance problem
With so few positives relative to negatives (as in the fraud dataset diagrammed above), the model will spend most of its training time on negative examples and not learn enough from positive ones. For example, if your batch size is 128, many batches will contain no positive examples, so the gradients will be less informative.
To solve this, downsampling and upsampling techniques (refer to the picture below) help. The basic idea is to increase the proportion of the smaller class seen during training.
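As an illustration, here is a minimal NumPy sketch of both techniques. The toy dataset, the 4:1 downsampling ratio, and the helper names are made up for this example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced dataset: 1000 negatives, 5 positives (about 0.5% positive).
X = rng.normal(size=(1005, 3))
y = np.array([0] * 1000 + [1] * 5)

def downsample_majority(X, y, ratio=4, seed=0):
    """Keep all minority examples and a random subset of the majority
    class so that majority:minority is at most ratio:1."""
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    keep = rng.choice(majority, size=min(len(majority), ratio * len(minority)),
                      replace=False)
    idx = np.concatenate([minority, keep])
    rng.shuffle(idx)
    return X[idx], y[idx]

def upsample_minority(X, y, seed=0):
    """Duplicate minority examples (sampling with replacement) until
    both classes are the same size."""
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    extra = rng.choice(minority, size=len(majority), replace=True)
    idx = np.concatenate([majority, extra])
    rng.shuffle(idx)
    return X[idx], y[idx]

X_dn, y_dn = downsample_majority(X, y)
X_up, y_up = upsample_minority(X, y)
print(np.bincount(y_dn))  # [20  5]  - majority reduced to 4x minority
print(np.bincount(y_up))  # [1000 1000]  - minority duplicated to parity
```

If you downsample, it is common to also upweight the kept majority examples during training so the model's predicted probabilities stay calibrated.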
Note that the upsampling/downsampling approach can't be used if the minority class is too small (refer to the outliers in the diagram below). For such rare events, outlier detection algorithms have been designed; this document talks about a few of them.
You can also refer to my write-up here for details on the outlier detection approach.
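As a sketch of one such algorithm, scikit-learn's Isolation Forest isolates anomalies with random splits; points that become isolated quickly get low scores. The toy data and the contamination value below are assumptions for illustration:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))  # bulk of the data
rare = rng.uniform(low=6.0, high=8.0, size=(5, 2))      # 5 rare events far away
X = np.vstack([normal, rare])

# contamination = the expected fraction of outliers (an assumption here,
# roughly 5 out of 505 samples).
clf = IsolationForest(contamination=0.01, random_state=0).fit(X)
labels = clf.predict(X)  # +1 = inlier, -1 = outlier
print(int((labels == -1).sum()))
```

Because the detector learns what "normal" looks like instead of needing many labeled positives, it still works when the minority class has only a handful of examples.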
Handling cost imbalance problem
Cost-sensitive Learning is a subfield of machine learning that takes the costs of prediction errors (and potentially other costs) into account when training a machine learning model. The goal of cost-sensitive learning is to minimize the cost of a model on a training dataset.
In cost-sensitive learning, instead of each instance simply being correctly or incorrectly classified, each class (or instance) is given a misclassification cost. Thus, instead of trying to optimise accuracy, the problem is to minimise the total misclassification cost (refer to the table below for a two-class model).
Note that cost here refers to the penalty associated with an incorrect prediction, as in the table below [C(0,1) and C(1,0)].
For example, we might assign no cost to correct predictions in each class, a cost of 5 for False Positives and a cost of 88 for False Negatives.
|                    | Actual Negative | Actual Positive |
|--------------------|-----------------|-----------------|
| Predicted Negative | 0               | 88              |
| Predicted Positive | 5               | 0               |
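For example, the total misclassification cost implied by this table can be computed by multiplying the confusion-matrix counts element-wise with the cost matrix. The labels below are invented for illustration:

```python
import numpy as np

# Cost matrix from the table above, indexed as cost[predicted, actual]:
# a false negative (predict 0, actual 1) costs 88, a false positive costs 5.
cost = np.array([[0, 88],
                 [5, 0]])

y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 1, 0, 0, 1, 0])

# Confusion counts indexed the same way as the cost matrix.
counts = np.zeros((2, 2), dtype=int)
for p, a in zip(y_pred, y_true):
    counts[p, a] += 1

total_cost = int((counts * cost).sum())
print(total_cost)  # 1 FP * 5 + 1 FN * 88 = 93
```

A cost-sensitive learner would pick the model (or decision threshold) that minimises this total, even if that model has lower plain accuracy.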
Cost-Sensitive ML Algorithms
Machine learning algorithms are rarely developed specifically for cost-sensitive learning.
Instead, the wealth of existing machine learning algorithms can be modified to make use of the cost matrix.
For neural networks, the backpropagation algorithm can be updated to weight misclassification errors in proportion to the importance of each class; such models are referred to as weighted neural networks or cost-sensitive neural networks. This has the effect of letting the model pay more attention to examples from the minority class than the majority class in datasets with a severely skewed class distribution (refer to the diagram below as a hint, thanks to the IEEE paper here).
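A minimal sketch of this idea, using a single-layer logistic model trained by gradient descent in NumPy as a stand-in for a full neural network. The dataset and the inverse-frequency weighting scheme are assumptions for illustration; the key line is where per-class weights scale each example's gradient:

```python
import numpy as np

rng = np.random.default_rng(1)

# Imbalanced toy data: 950 negatives around (-1, -1), 50 positives around (2, 2).
X = np.vstack([rng.normal(-1.0, 1.0, size=(950, 2)),
               rng.normal(2.0, 1.0, size=(50, 2))])
y = np.array([0] * 950 + [1] * 50)

# Class weights inversely proportional to class frequency, so the rare
# positive class contributes as much total gradient as the negatives.
w_pos = len(y) / (2 * (y == 1).sum())
w_neg = len(y) / (2 * (y == 0).sum())
sample_w = np.where(y == 1, w_pos, w_neg)

theta = np.zeros(2)
bias = 0.0
lr = 0.1
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(X @ theta + bias)))
    grad = sample_w * (p - y)          # weighted error term in backprop
    theta -= lr * (X.T @ grad) / len(y)
    bias -= lr * grad.mean()

pred = (1.0 / (1.0 + np.exp(-(X @ theta + bias))) >= 0.5).astype(int)
recall = (pred[y == 1] == 1).mean()
print(round(recall, 2))
```

Without the weights, the optimiser can satisfy the loss by predicting the majority class almost everywhere; with them, misclassifying a positive costs roughly 19 times as much as misclassifying a negative, so minority recall improves.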
Point to remember
There can be many types of cost, for example:
- Cost of misclassification errors (or prediction errors more generally).
- Cost of tests or evaluation.
- Cost of teacher or labeling.
- Cost of intervention, or changing the system from which observations are drawn.
- Cost of unwanted achievements or outcomes from intervening.
- Cost of computation or computational complexity.
- Cost of cases or data collection.
- Cost of human-computer interaction, or framing the problem and using software to fit and use a model.
- Cost of instability or variance, known as concept drift.
The above list highlights that the misclassification cost discussed earlier for imbalanced classification is just one of the range of costs that the broader field of cost-sensitive learning might consider.
Reference
Thanks to these helping hands
- https://stats.stackexchange.com/questions/247871/what-is-the-root-cause-of-the-class-imbalance-problem
- https://cling.csd.uwo.ca/papers/cost_sensitive.pdf
- https://machinelearningmastery.com/cost-sensitive-learning-for-imbalanced-classification/
- https://images.app.goo.gl/MzfZ4mSvXfjZEoQx7
- https://machinelearningmastery.com/cost-sensitive-neural-network-for-imbalanced-classification/
- https://arxiv-export-lb.library.cornell.edu/pdf/1508.03422
- https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.8285&rep=rep1&type=pdf
- https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
- https://ieeexplore.ieee.org/document/9189349
- https://developers.google.com/machine-learning/data-prep/construct/sampling-splitting/imbalanced-data
- https://arxiv-export-lb.library.cornell.edu/pdf/1801.10269
- https://images.app.goo.gl/ED6tvGGMAmo6568y7
- https://www.dhirubhai.net/posts/dpkumar_anomalydetection-cybersecurity-machinelearningmodels-activity-6767977267718184960-5xbh