What are the best techniques for undersampling in machine learning?
Undersampling is a resampling technique for dealing with imbalanced data in machine learning. Data is imbalanced when some classes are heavily overrepresented relative to others, which can bias models toward the majority class and hurt their performance. Undersampling reduces the size of the majority class by randomly or strategically removing samples so that the classes become more balanced. In this article, you will learn about some of the best techniques for undersampling in machine learning and how to apply them in your projects.
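The simplest baseline is plain random undersampling. Here is a minimal sketch using imbalanced-learn's `RandomUnderSampler`; it assumes the imbalanced-learn package is installed and uses a synthetic toy dataset rather than real data:

```python
# Minimal random undersampling sketch (assumes imbalanced-learn is installed).
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

# Toy imbalanced dataset: roughly 90% class 0, 10% class 1.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# Randomly drop majority-class samples until both classes have the same size.
rus = RandomUnderSampler(random_state=42)
X_bal, y_bal = rus.fit_resample(X, y)
print("Before:", Counter(y), "After:", Counter(y_bal))
```

Random undersampling is fast, but because it discards samples blindly it can throw away informative examples; the techniques below remove samples more selectively.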
- Cluster-based undersampling: This technique balances classes with a clustering algorithm while preserving the structure of the majority class. It groups similar majority-class samples and then keeps one representative per cluster, so the reduced dataset stays diverse (see the first sketch after this list).
- Edited nearest neighbors: This method cleans the data by removing samples whose class label disagrees with most of their nearest neighbors. Dropping these noisy or borderline samples sharpens the class boundary and can improve your model's accuracy (see the second sketch after this list).
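A minimal sketch of cluster-based undersampling using scikit-learn's KMeans, assuming a binary NumPy dataset where class 0 is the majority; the helper `cluster_undersample` and its parameters are illustrative names, not a standard API:

```python
# Sketch: shrink the majority class to one real representative per cluster.
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.metrics import pairwise_distances_argmin_min

def cluster_undersample(X, y, majority_label=0, n_clusters=None, random_state=42):
    maj_mask = (y == majority_label)
    X_maj, X_min = X[maj_mask], X[~maj_mask]
    y_maj, y_min = y[maj_mask], y[~maj_mask]

    # Default: as many clusters as there are minority samples,
    # so the classes end up roughly balanced.
    if n_clusters is None:
        n_clusters = len(X_min)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(X_maj)

    # Keep the real majority sample closest to each cluster centroid.
    rep_idx, _ = pairwise_distances_argmin_min(km.cluster_centers_, X_maj)
    X_bal = np.vstack([X_maj[rep_idx], X_min])
    y_bal = np.concatenate([y_maj[rep_idx], y_min])
    return X_bal, y_bal

# Usage on a toy imbalanced dataset.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_bal, y_bal = cluster_undersample(X, y)
print("Before:", Counter(y), "After:", Counter(y_bal))
```

Picking the nearest real sample to each centroid (rather than the synthetic centroid itself) keeps every retained point an actual observation from your dataset.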
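For edited nearest neighbors, imbalanced-learn provides an implementation. A minimal sketch, assuming the imbalanced-learn package is installed and again using a synthetic toy dataset:

```python
# Sketch: remove samples misclassified by their nearest neighbors (ENN).
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import EditedNearestNeighbours

# Toy imbalanced dataset: roughly 90% class 0, 10% class 1.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# Drop majority samples whose label disagrees with their 3 nearest neighbors.
enn = EditedNearestNeighbours(n_neighbors=3)
X_clean, y_clean = enn.fit_resample(X, y)
print("Before:", Counter(y), "After:", Counter(y_clean))
```

Note that ENN does not force the classes to exactly equal sizes; it only prunes inconsistent samples, so it is often combined with another undersampling or oversampling step.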