What is SMOTE, and how is it helping Citizen Machine Learning Consultants?
Picture Credit: https://www.turing.com/

#SMOTE #dataimbalance #dataanalysis #datascientists #machinelearning #artificialintelligence #iot #ADASYN #imbalance #syntheticdata #dataclassification

What is imbalanced data?

The term imbalanced data frequently refers to classification tasks in which the classes are not represented equally.

As an illustration, suppose you have a binary classification problem with 100 instances, 80 of which are labeled Class-1 and the remaining 20 Class-2.

The ratio of Class-1 to Class-2 instances in this dataset is 4:1, making it an example of an imbalanced dataset.
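A quick sketch in Python of those class counts (the labels and seed are purely illustrative):

```python
import numpy as np
from collections import Counter

# The 80/20 example above: 100 labels, 80 of "Class-1" and 20 of "Class-2".
rng = np.random.default_rng(seed=0)
y = np.array(["Class-1"] * 80 + ["Class-2"] * 20)
rng.shuffle(y)

counts = Counter(y)
print(counts)                                 # Counter({'Class-1': 80, 'Class-2': 20})
print(counts["Class-1"] / counts["Class-2"])  # 4.0 -> the 4:1 imbalance ratio
```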

There are two broad ways of dealing with imbalanced datasets: adapt the machine learning algorithm itself, or balance the classes in the training data before supplying it to the algorithm (essentially a data preprocessing step). The latter method is generally favored because of its wider applicability and adaptability; in addition, it often takes longer to improve an algorithm than it does to produce the necessary samples. Both approaches, however, remain active subjects of research. We will go into great detail about the SMOTE technique in this article.
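A rough sketch of the two approaches, assuming scikit-learn and the imbalanced-learn package are available (the toy data set, model choice, and parameters are illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import RandomOverSampler

# Imbalanced toy data: roughly 90% majority class, 10% minority class.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Approach 1: adapt the algorithm, for example by reweighting the classes.
clf_weighted = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# Approach 2: rebalance the training data as a preprocessing step
# (shown here with plain random oversampling; SMOTE is the smarter variant discussed below).
X_res, y_res = RandomOverSampler(random_state=0).fit_resample(X, y)
clf_resampled = LogisticRegression(max_iter=1000).fit(X_res, y_res)
```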

What is SMOTE?

SMOTE stands for Synthetic Minority Oversampling Technique. The approach was proposed in a 2002 article published in the Journal of Artificial Intelligence Research. SMOTE is an improved approach for handling imbalanced data in classification problems. Data is said to be imbalanced when the observed frequencies of a categorical variable differ greatly across its possible values: typically there are many observations of one kind and few of another.

As an illustration, consider a data set about the sales of a brand-new extreme sports product. To keep things simple, let's say the website sells to two different types of customers, hikers and sky divers, and we record whether each visitor purchases the new product. Imagine we want to build a classification model that forecasts, from customer data, whether a visitor will buy the product. Only a small portion of visitors to e-commerce sites actually make purchases; the majority only browse merchandise. Because of the large number of non-buyers and the small number of buyers, our data set will be imbalanced.

What is the problem with imbalanced Data?

You can see from the data example that we have 30 website visitors: ten are Hikers and twenty are Sky Divers. The aim is to build a machine learning model that can forecast whether a visitor will make a purchase.

The only independent variable in this case is whether the visitor is a Sky Diver or a Hiker. Let's explore two incredibly basic models as a thought experiment:

  • A model that includes the "Sky Diver vs Hiker" variable
  • A model without the "Sky Diver vs Hiker" variable

I won't get into the details of various machine learning methods here; instead, let's examine the validity of using the independent variable to predict purchasers logically.

Only 5% of Sky Divers purchase, compared to 10% of Hikers. This information suggests that Hikers are more likely to purchase than Sky Divers. This, however, has no bearing on the model's ability to predict whether a visitor will "buy" or "not buy."

The only thing a model can realistically do in this situation is predict "not buy" for each of the 30 visitors. Sky Divers are more likely not to buy than to buy, and the same is true for Hikers, so the only sensible prediction for every individual is "not buy."

The problematic part of this situation is that the model's prediction of "not buy" is correct in 28 out of 30 cases. That translates to an accuracy of 28 out of 30, or roughly 93%! We have just created a model that appears to be very accurate but is actually useless because of imbalanced data.
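A quick sanity check of that figure, sketched in Python with the numbers from the example above (1 Hiker buyer, 1 Sky Diver buyer, 28 non-buyers):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# The 30 visitors from the example: 2 buyers (1 Hiker, 1 Sky Diver) and 28 non-buyers.
y_true = np.array([1] * 2 + [0] * 28)

# A "model" that always predicts "not buy".
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))  # ~0.933 -> about 93% accuracy, yet it never finds a buyer
```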

How does the SMOTE algorithm work?

SMOTE is an algorithm that adds artificial data points to the actual data points to accomplish data augmentation. SMOTE can be viewed as an improved form of oversampling or as a particular data augmentation procedure. With SMOTE, you avoid producing duplicate data points and instead produce synthetic data points that are marginally different from the original data points.

Following is how the SMOTE algorithm operates:

  • A random sample is drawn from the minority class.
  • For that observation, find its k nearest neighbors.
  • Select one of those neighbors and compute the vector between the current data point and the chosen neighbor.
  • Multiply this vector by a random number between 0 and 1.
  • Add the result to the current data point to produce the synthetic data point.

Actually, this procedure resembles the data point being slightly moved in the direction of a neighbor. By doing this, you can ensure that your synthetic data point does not exactly duplicate an existing data point and that it also does not deviate significantly from the known observations in your minority class.
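Translating those steps into code, here is a minimal from-scratch sketch (the function name, parameters, and data are purely illustrative; in practice you would use a tested implementation such as imbalanced-learn's SMOTE):

```python
import numpy as np

def smote_sample(X_minority: np.ndarray, k: int = 5, rng=None) -> np.ndarray:
    """Generate one synthetic minority sample following the steps above (illustrative sketch)."""
    rng = rng or np.random.default_rng()

    # 1. Pick a random minority-class point.
    x = X_minority[rng.integers(len(X_minority))]

    # 2. Find its k nearest neighbours (excluding the point itself) by Euclidean distance.
    dists = np.linalg.norm(X_minority - x, axis=1)
    neighbour_ids = np.argsort(dists)[1:k + 1]

    # 3. Choose one neighbour and compute the difference vector.
    neighbour = X_minority[rng.choice(neighbour_ids)]
    diff = neighbour - x

    # 4.-5. Scale the difference by a random factor in [0, 1] and add it to x,
    #       i.e. place the synthetic point on the line segment between the two points.
    return x + rng.random() * diff

# Example: 20 minority points in 2-D, one synthetic point generated from them.
X_min = np.random.default_rng(0).normal(size=(20, 2))
print(smote_sample(X_min, k=5, rng=np.random.default_rng(1)))
```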

The construction of new minority-class samples forms the core of SMOTE, and the logic behind it is straightforward. Plain oversampling leads to overfitting because repeated instances tighten the decision boundary around them. Instead of repeating the same samples, SMOTE creates new ones. The original SMOTE paper (referenced above) demonstrates that, to a machine learning algorithm, these newly constructed instances are not exact replicas, which softens the decision boundary and helps the algorithm approximate the true hypothesis more closely.

SMOTE does, however, come with both advantages and limitations.

Advantages:

  • Because synthetic examples are generated rather than replicas of existing instances, the overfitting caused by random oversampling is reduced.
  • There is no information loss.
  • It is easy to use and comprehend.

Improvement areas in SMOTE:

  • SMOTE does not account for the possibility that nearby examples may belong to other classes when generating synthetic examples. As a result, it can increase class overlap and introduce noise.
  • SMOTE is not very effective for high-dimensional data.

Oversampling algorithms based on SMOTE

1- SMOTE: The Synthetic Minority Oversampling Technique (SMOTE) algorithm uses a k-nearest-neighbors (KNN) approach: it selects the K nearest neighbors of a minority sample and generates synthetic samples in the space between them. The algorithm computes the difference between a feature vector and one of its nearest neighbors, multiplies that difference by a random number between 0 and 1, and adds the result back to the feature vector. Many other algorithms have been derived from this pioneering SMOTE algorithm.
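A typical usage, sketched with the imbalanced-learn library (the toy data set and parameter values are illustrative, and the package is assumed to be installed):

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Imbalanced toy data: roughly 5% minority class.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)
print(Counter(y))  # heavily skewed towards class 0

# Oversample the minority class with SMOTE (k_neighbors=5 is the default).
X_res, y_res = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)
print(Counter(y_res))  # classes are now balanced
```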

2- ADASYN: The concept behind ADAptive SYNthetic (ADASYN) sampling is to use K nearest neighbors to generate minority data samples in accordance with their distributions. Without making any assumptions about the underlying distribution of the data, the algorithm adaptively changes the distribution. Euclidean distance is used for the KNN algorithm. The primary distinction between ADASYN and SMOTE is that the former uses a density distribution as a criterion to automatically determine the number of synthetic samples to generate for each minority sample, adaptively adjusting the weights of the different minority samples to account for skewed distributions, whereas the latter produces an equal number of synthetic samples for each original minority sample.
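A comparable sketch with imbalanced-learn's ADASYN implementation (again, the data and parameters are illustrative assumptions):

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import ADASYN

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)

# ADASYN decides per minority sample how many synthetic points to create,
# focusing on samples that are harder to learn (those surrounded by the majority class).
X_res, y_res = ADASYN(n_neighbors=5, random_state=42).fit_resample(X, y)
print(Counter(y), Counter(y_res))
```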

3- ANS: Adaptive Neighbor Synthetic (ANS) dynamically adjusts the number of neighbors required for oversampling around different minority regions. For a given dataset, the algorithm removes the SMOTE parameter K and assigns a different number of neighbors to each positive instance. The technique is parameter-free because every parameter is set automatically within the algorithm.

4- Borderline-SMOTE: Borderline-SMOTE generates synthetic samples along the boundary between the minority and majority classes. This also helps to separate the minority and majority classes.
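imbalanced-learn also ships a Borderline-SMOTE variant; a minimal sketch (illustrative data and settings):

```python
from sklearn.datasets import make_classification
from imblearn.over_sampling import BorderlineSMOTE

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)

# Borderline-SMOTE only oversamples minority points that lie near the class boundary.
X_res, y_res = BorderlineSMOTE(kind="borderline-1", random_state=42).fit_resample(X, y)
```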

5- Safe-Level SMOTE: The safe level of an instance is the number of positive (minority) instances among its k nearest neighbors. When an instance's safe level is close to 0, it is almost noise; when it is close to k, the instance is regarded as safe. By taking the safe-level ratio of instances into account, each synthetic instance is generated in a safe position. In contrast, SMOTE and Borderline-SMOTE may produce synthetic instances in undesirable places, such as overlapping regions and noise regions.

6- DBSMOTE: The Density-Based Synthetic Minority Oversampling Technique builds on DBSCAN, a clustering method. The DBSCAN algorithm first finds the minority-class clusters, and DBSMOTE then creates synthetic instances along the shortest path between each positive instance and a pseudo-centroid of its minority-class cluster.

Conclusion

We have seen that the SMOTE algorithm is an effective solution for imbalanced data in classification problems. SMOTE is a clever alternative to plain oversampling because it generates synthetic data points that resemble the original ones rather than simply duplicating the minority class.


Sources:

DataCamp

Towards Data Science
