The most common evaluation metrics for classification problems are accuracy, precision, recall, and F1-score. However, these metrics can be misleading or insensitive to the minority class on imbalanced data. For instance, accuracy can be high even when the model predicts the majority class for every instance, and precision and recall both depend on the chosen decision threshold. The F1-score, the harmonic mean of precision and recall, may not capture the trade-off between them adequately.

Alternative metrics for imbalanced data include the ROC curve, which plots the true positive rate (TPR) against the false positive rate (FPR) across threshold values, and the AUC, which summarizes it as the area under that curve; a higher AUC indicates a model that better distinguishes between the classes. Similarly, the precision-recall curve plots precision against recall across threshold values, and average precision summarizes the area under it; a higher average precision indicates a model that predicts the minority class more accurately. Additionally, Cohen's kappa measures the agreement between the model's predictions and the actual labels, corrected for chance agreement; a higher kappa indicates a model that predicts both classes reliably.
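A minimal sketch of how these metrics can be computed with scikit-learn is shown below. The synthetic dataset, the 95/5 class ratio, and the logistic regression model are illustrative assumptions, not part of the original discussion; any classifier that produces probability scores would work the same way.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score,
                             average_precision_score, cohen_kappa_score)

# Synthetic imbalanced dataset: roughly 95% majority class, 5% minority class.
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)                # hard labels at the default 0.5 threshold
y_score = model.predict_proba(X_test)[:, 1]   # scores for threshold-free metrics

# Threshold-dependent metrics (can look deceptively good on imbalanced data).
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred, zero_division=0))
print("recall   :", recall_score(y_test, y_pred))
print("F1       :", f1_score(y_test, y_pred))

# Threshold-free and chance-corrected alternatives.
print("ROC AUC  :", roc_auc_score(y_test, y_score))
print("avg prec :", average_precision_score(y_test, y_score))
print("kappa    :", cohen_kappa_score(y_test, y_pred))
```

Note that ROC AUC and average precision are computed from the predicted scores rather than the hard labels, which is what makes them independent of any particular threshold.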