Handling class imbalance problem in machine learning
Deepak Kumar
Propelling AI To Reinvent The Future || Mentor|| Leader || Innovator || Machine learning Specialist || Distributed architecture | IoT | Cloud Computing
Why read this?
Classification is one of the most important tasks in inductive and machine learning. When a trained model predicts the wrong class, that error hurts not only the measured accuracy of the model; its real-life impact can also be high.
A class-imbalanced dataset can be one cause of such errors. If you are interested in the causes of these errors and in how to handle them, this document is for you.
Technical explanation
The class imbalance problem in machine/statistical learning is the observation that some binary classification algorithms do not perform well when the proportion of class-0 to class-1 examples is heavily skewed.
Class-imbalanced datasets occur in many real-world applications where the class distribution of the data is highly uneven.
Reasons for class imbalance
The imbalanced class problem becomes meaningful only if one or both of the two assumptions below are not true:
- Class distribution imbalance - This happens when the class distribution in the test data differs from that of the training data. Consider the following example of a fraud detection model (refer to the diagram below). Instances of fraud happen once per 200 transactions in this dataset, so in the true distribution about 0.5% of the data is positive. Notice the imbalance between these two classes (positive and negative).
- Cost imbalance - This happens when the cost of different types of error (false positive and false negative in binary classification) is not the same. For example, if carrying a bomb is the positive class, then missing a terrorist who carries a bomb onto a flight is much more expensive than searching an innocent person.
Handling class distribution imbalance problem
With so few positives relative to negatives (as in the fraud dataset diagrammed above), the model will spend most of its training time on negative examples and not learn enough from positive ones. For example, if your batch size is 128, many batches will contain no positive examples, so the gradients will be less informative.
To solve this, downsampling and upsampling techniques (refer to the picture below) help. The basic idea is to increase the proportion of the smaller class seen during training.
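As an illustration, here is a minimal NumPy sketch of both techniques. The toy dataset, the 4:1 downsampling ratio, and the helper names are made up for this example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced dataset: 1000 negatives, 5 positives (about 0.5% positive).
X = rng.normal(size=(1005, 3))
y = np.array([0] * 1000 + [1] * 5)

def downsample_majority(X, y, ratio=4, seed=0):
    """Keep all minority examples and a random subset of the majority
    class so that majority:minority is at most ratio:1."""
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    keep = rng.choice(majority, size=min(len(majority), ratio * len(minority)),
                      replace=False)
    idx = np.concatenate([minority, keep])
    rng.shuffle(idx)
    return X[idx], y[idx]

def upsample_minority(X, y, seed=0):
    """Duplicate minority examples (sampling with replacement) until
    both classes are the same size."""
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    extra = rng.choice(minority, size=len(majority), replace=True)
    idx = np.concatenate([majority, extra])
    rng.shuffle(idx)
    return X[idx], y[idx]

X_dn, y_dn = downsample_majority(X, y)
X_up, y_up = upsample_minority(X, y)
print(np.bincount(y_dn))  # [20  5]  - majority reduced to 4x minority
print(np.bincount(y_up))  # [1000 1000]  - minority duplicated to parity
```

If you downsample, it is common to also upweight the kept majority examples during training so the model's predicted probabilities stay calibrated.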
Note that the upsampling/downsampling approach can't be used if the minority class is too small (refer to the outliers in the diagram below). For such rare events, outlier detection algorithms have been designed; this document talks about a few of them.
You can also refer to my write-up here for details on the outlier detection approach.
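As a sketch of one such algorithm, scikit-learn's Isolation Forest isolates anomalies with random splits; points that become isolated quickly get low scores. The toy data and the contamination value below are assumptions for illustration:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))  # bulk of the data
rare = rng.uniform(low=6.0, high=8.0, size=(5, 2))      # 5 rare events far away
X = np.vstack([normal, rare])

# contamination = the expected fraction of outliers (an assumption here,
# roughly 5 out of 505 samples).
clf = IsolationForest(contamination=0.01, random_state=0).fit(X)
labels = clf.predict(X)  # +1 = inlier, -1 = outlier
print(int((labels == -1).sum()))
```

Because the detector learns what "normal" looks like instead of needing many labeled positives, it still works when the minority class has only a handful of examples.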
Handling cost imbalance problem
Cost-sensitive Learning is a subfield of machine learning that takes the costs of prediction errors (and potentially other costs) into account when training a machine learning model. The goal of cost-sensitive learning is to minimize the cost of a model on a training dataset.
In cost-sensitive learning, instead of each instance simply being correctly or incorrectly classified, each class (or instance) is given a misclassification cost. Thus, instead of trying to optimise accuracy, the problem is to minimise the total misclassification cost (refer to the table below for a two-class model).
Note that cost here refers to the penalty associated with an incorrect prediction, as in the table below [C(0,1) and C(1,0)].
For example, we might assign no cost to correct predictions in each class, a cost of 5 for False Positives and a cost of 88 for False Negatives.
|                    | Actual Negative | Actual Positive |
|--------------------|-----------------|-----------------|
| Predicted Negative | 0               | 88              |
| Predicted Positive | 5               | 0               |
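For example, the total misclassification cost implied by this table can be computed by multiplying the confusion-matrix counts element-wise with the cost matrix. The labels below are invented for illustration:

```python
import numpy as np

# Cost matrix from the table above, indexed as cost[predicted, actual]:
# a false negative (predict 0, actual 1) costs 88, a false positive costs 5.
cost = np.array([[0, 88],
                 [5, 0]])

y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 1, 0, 0, 1, 0])

# Confusion counts indexed the same way as the cost matrix.
counts = np.zeros((2, 2), dtype=int)
for p, a in zip(y_pred, y_true):
    counts[p, a] += 1

total_cost = int((counts * cost).sum())
print(total_cost)  # 1 FP * 5 + 1 FN * 88 = 93
```

A cost-sensitive learner would pick the model (or decision threshold) that minimises this total, even if that model has lower plain accuracy.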
Cost-Sensitive ML Algorithms
Machine learning algorithms are rarely developed specifically for cost-sensitive learning.
Instead, the wealth of existing machine learning algorithms can be modified to make use of the cost matrix.
For neural networks, the backpropagation algorithm can be updated to weight misclassification errors in proportion to the importance of each class; such models are referred to as weighted neural networks or cost-sensitive neural networks. This has the effect of letting the model pay more attention to examples from the minority class than the majority class in datasets with a severely skewed class distribution (refer to the diagram below as a hint, thanks to the IEEE paper here).
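A minimal sketch of this idea, using a single-layer logistic model trained by gradient descent in NumPy as a stand-in for a full neural network. The dataset and the inverse-frequency weighting scheme are assumptions for illustration; the key line is where per-class weights scale each example's gradient:

```python
import numpy as np

rng = np.random.default_rng(1)

# Imbalanced toy data: 950 negatives around (-1, -1), 50 positives around (2, 2).
X = np.vstack([rng.normal(-1.0, 1.0, size=(950, 2)),
               rng.normal(2.0, 1.0, size=(50, 2))])
y = np.array([0] * 950 + [1] * 50)

# Class weights inversely proportional to class frequency, so the rare
# positive class contributes as much total gradient as the negatives.
w_pos = len(y) / (2 * (y == 1).sum())
w_neg = len(y) / (2 * (y == 0).sum())
sample_w = np.where(y == 1, w_pos, w_neg)

theta = np.zeros(2)
bias = 0.0
lr = 0.1
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(X @ theta + bias)))
    grad = sample_w * (p - y)          # weighted error term in backprop
    theta -= lr * (X.T @ grad) / len(y)
    bias -= lr * grad.mean()

pred = (1.0 / (1.0 + np.exp(-(X @ theta + bias))) >= 0.5).astype(int)
recall = (pred[y == 1] == 1).mean()
print(round(recall, 2))
```

Without the weights, the optimiser can satisfy the loss by predicting the majority class almost everywhere; with them, misclassifying a positive costs roughly 19 times as much as misclassifying a negative, so minority recall improves.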
Point to remember
There can be many types of cost, for example:
- Cost of misclassification errors (or prediction errors more generally).
- Cost of tests or evaluation.
- Cost of teacher or labeling.
- Cost of intervention, or changing the system from which observations are drawn.
- Cost of unwanted achievements or outcomes from intervening.
- Cost of computation or computational complexity.
- Cost of cases or data collection.
- Cost of human-computer interaction, or framing the problem and using software to fit and use a model.
- Cost of instability or variance, known as concept drift.
The above list highlights that the misclassification cost discussed earlier for imbalanced classification is just one of the range of costs that the broader field of cost-sensitive learning might consider.
Reference
Thanks to these helping hands
- https://stats.stackexchange.com/questions/247871/what-is-the-root-cause-of-the-class-imbalance-problem
- https://cling.csd.uwo.ca/papers/cost_sensitive.pdf
- https://machinelearningmastery.com/cost-sensitive-learning-for-imbalanced-classification/
- https://images.app.goo.gl/MzfZ4mSvXfjZEoQx7
- https://machinelearningmastery.com/cost-sensitive-neural-network-for-imbalanced-classification/
- https://arxiv-export-lb.library.cornell.edu/pdf/1508.03422
- https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.8285&rep=rep1&type=pdf
- https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
- https://ieeexplore.ieee.org/document/9189349
- https://developers.google.com/machine-learning/data-prep/construct/sampling-splitting/imbalanced-data
- https://arxiv-export-lb.library.cornell.edu/pdf/1801.10269
- https://images.app.goo.gl/ED6tvGGMAmo6568y7
- https://www.dhirubhai.net/posts/dpkumar_anomalydetection-cybersecurity-machinelearningmodels-activity-6767977267718184960-5xbh