What is SMOTE, and how is it helping Citizen Machine Learning Consultants?
Picture Credit: https://www.turing.com/

#SMOTE #dataimbalance #dataanalysis #datascientists #machinelearning #artificialintelligence #iot #ADASYN #imbalance #syntheticdata #dataclassification

What is imbalanced data?

The term imbalanced data frequently refers to classification tasks in which the classes are not represented equally.

As an illustration, suppose you have a binary classification problem with 100 instances, 80 of which are labeled Class-1 and the remaining 20 Class-2.

The ratio of Class-1 to Class-2 instances in this dataset is 4:1, making it an example of an imbalanced dataset.
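A quick sketch in Python of those class counts (the labels and seed are purely illustrative):

```python
import numpy as np
from collections import Counter

# The 80/20 example above: 100 labels, 80 of "Class-1" and 20 of "Class-2".
rng = np.random.default_rng(seed=0)
y = np.array(["Class-1"] * 80 + ["Class-2"] * 20)
rng.shuffle(y)

counts = Counter(y)
print(counts)                                 # Counter({'Class-1': 80, 'Class-2': 20})
print(counts["Class-1"] / counts["Class-2"])  # 4.0 -> the 4:1 imbalance ratio
```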

There are two broad ways of dealing with imbalanced datasets: adapt the machine learning algorithm itself, or balance the classes in the training data before supplying it to the algorithm (essentially a data preprocessing step). The latter method is generally favored because of its wider applicability and adaptability; in addition, it often takes longer to improve an algorithm than it does to produce the necessary samples. Both approaches, however, remain active subjects of research. We will go into great detail about the SMOTE technique in this article.
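A rough sketch of the two approaches, assuming scikit-learn and the imbalanced-learn package are available (the toy data set, model choice, and parameters are illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import RandomOverSampler

# Imbalanced toy data: roughly 90% majority class, 10% minority class.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Approach 1: adapt the algorithm, for example by reweighting the classes.
clf_weighted = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# Approach 2: rebalance the training data as a preprocessing step
# (shown here with plain random oversampling; SMOTE is the smarter variant discussed below).
X_res, y_res = RandomOverSampler(random_state=0).fit_resample(X, y)
clf_resampled = LogisticRegression(max_iter=1000).fit(X_res, y_res)
```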

What is SMOTE?

SMOTE stands for Synthetic Minority Oversampling Technique. The approach was proposed in a 2002 article published in the Journal of Artificial Intelligence Research. SMOTE is an improved approach for handling imbalanced data in classification problems. Data is said to be imbalanced when the observed frequencies of a categorical variable differ greatly across its possible values: typically there are many observations of one kind and few of another.

As an illustration, consider a data set about the sales of a brand-new extreme sports product. To keep things simple, let's say the website sells to two different types of customers, hikers and sky divers, and we record whether each visitor purchases the new product. Imagine we want to build a classification model that forecasts, from customer data, whether a visitor will buy the product. Only a small portion of visitors to e-commerce sites actually make purchases; the majority only browse merchandise. Because of the large number of non-buyers and the small number of buyers, our data set will be imbalanced.

What is the problem with imbalanced Data?

You can see from the data example that we have 30 website visitors: ten are Hikers and twenty are Sky Divers. The aim is to build a machine learning model that can forecast whether a visitor will make a purchase.

The only independent variable in this case is whether the visitor is a Sky Diver or a Hiker. Let's explore two incredibly basic models as a thought experiment:

  • A model that includes the "Sky Diver vs Hiker" variable
  • A model without the "Sky Diver vs Hiker" variable

I won't get into the details of various machine learning methods here; instead, let's examine the validity of using the independent variable to predict purchasers logically.

Only 5% of Sky Divers purchase, compared to 10% of Hikers. This information suggests that Hikers are more likely to purchase than Sky Divers. This, however, has no bearing on the model's ability to predict whether a visitor will "buy" or "not buy."

The only thing a model can realistically do in this situation is predict "not buy" for each of the 30 visitors. Sky Divers are more likely not to buy than to buy, and the same is true for Hikers, so the only sensible prediction for every individual is "not buy."

The problematic part of this situation is that the model's prediction of "not buy" is correct in 28 out of 30 cases. That translates to an accuracy of 28 out of 30, or roughly 93%! We have just created a model that appears to be very accurate but is actually useless because of imbalanced data.
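A quick sanity check of that figure, sketched in Python with the numbers from the example above (1 Hiker buyer, 1 Sky Diver buyer, 28 non-buyers):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# The 30 visitors from the example: 2 buyers (1 Hiker, 1 Sky Diver) and 28 non-buyers.
y_true = np.array([1] * 2 + [0] * 28)

# A "model" that always predicts "not buy".
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))  # ~0.933 -> about 93% accuracy, yet it never finds a buyer
```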

How does the SMOTE algorithm work?

SMOTE is an algorithm that adds artificial data points to the actual data points to accomplish data augmentation. SMOTE can be viewed as an improved form of oversampling or as a particular data augmentation procedure. With SMOTE, you avoid producing duplicate data points and instead produce synthetic data points that are marginally different from the original data points.

Following is how the SMOTE algorithm operates:

  • A random sample is drawn from the minority class.
  • For that observation, find its k nearest neighbors.
  • Select one of those neighbors and compute the vector between the current data point and the chosen neighbor.
  • Multiply this vector by a random number between 0 and 1.
  • Add the result to the current data point to produce the synthetic data point.

Actually, this procedure resembles the data point being slightly moved in the direction of a neighbor. By doing this, you can ensure that your synthetic data point does not exactly duplicate an existing data point and that it also does not deviate significantly from the known observations in your minority class.
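Translating those steps into code, here is a minimal from-scratch sketch (the function name, parameters, and data are purely illustrative; in practice you would use a tested implementation such as imbalanced-learn's SMOTE):

```python
import numpy as np

def smote_sample(X_minority: np.ndarray, k: int = 5, rng=None) -> np.ndarray:
    """Generate one synthetic minority sample following the steps above (illustrative sketch)."""
    rng = rng or np.random.default_rng()

    # 1. Pick a random minority-class point.
    x = X_minority[rng.integers(len(X_minority))]

    # 2. Find its k nearest neighbours (excluding the point itself) by Euclidean distance.
    dists = np.linalg.norm(X_minority - x, axis=1)
    neighbour_ids = np.argsort(dists)[1:k + 1]

    # 3. Choose one neighbour and compute the difference vector.
    neighbour = X_minority[rng.choice(neighbour_ids)]
    diff = neighbour - x

    # 4.-5. Scale the difference by a random factor in [0, 1] and add it to x,
    #       i.e. place the synthetic point on the line segment between the two points.
    return x + rng.random() * diff

# Example: 20 minority points in 2-D, one synthetic point generated from them.
X_min = np.random.default_rng(0).normal(size=(20, 2))
print(smote_sample(X_min, k=5, rng=np.random.default_rng(1)))
```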

The construction of new minority-class samples forms the core of SMOTE, and the logic behind it is straightforward. Plain oversampling leads to overfitting because repeated instances tighten the decision boundary around them. Instead of repeating the same samples, SMOTE creates new ones. The original SMOTE paper (referenced above) demonstrates that, to a machine learning algorithm, these newly constructed instances are not exact replicas, which softens the decision boundary and helps the algorithm approximate the true hypothesis more closely.

SMOTE does, however, come with both advantages and limitations.

Advantages:

  • Because synthetic examples are generated rather than replicas of existing instances, the overfitting caused by random oversampling is reduced.
  • There is no information loss.
  • It is easy to use and comprehend.

Improvement areas in SMOTE:

  • SMOTE does not account for the possibility that nearby examples may belong to other classes when generating synthetic examples. As a result, it can increase class overlap and introduce noise.
  • SMOTE is not very effective for high-dimensional data.

Oversampling algorithms based on SMOTE

1- SMOTE: The Synthetic Minority Oversampling Technique (SMOTE) algorithm uses a k-nearest-neighbors (KNN) approach: it selects the K nearest neighbors of a minority sample and generates synthetic samples in the space between them. The algorithm computes the difference between a feature vector and one of its nearest neighbors, multiplies that difference by a random number between 0 and 1, and adds the result back to the feature vector. Many other algorithms have been derived from this pioneering SMOTE algorithm.
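A typical usage, sketched with the imbalanced-learn library (the toy data set and parameter values are illustrative, and the package is assumed to be installed):

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Imbalanced toy data: roughly 5% minority class.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)
print(Counter(y))  # heavily skewed towards class 0

# Oversample the minority class with SMOTE (k_neighbors=5 is the default).
X_res, y_res = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)
print(Counter(y_res))  # classes are now balanced
```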

2- ADASYN: The concept behind ADAptive SYNthetic (ADASYN) sampling is to use K nearest neighbors to generate minority data samples in accordance with their distributions. Without making any assumptions about the underlying distribution of the data, the algorithm adaptively changes the distribution. Euclidean distance is used for the KNN algorithm. The primary distinction between ADASYN and SMOTE is that the former uses a density distribution as a criterion to automatically determine the number of synthetic samples to generate for each minority sample, adaptively adjusting the weights of the different minority samples to account for skewed distributions, whereas the latter produces an equal number of synthetic samples for each original minority sample.
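A comparable sketch with imbalanced-learn's ADASYN implementation (again, the data and parameters are illustrative assumptions):

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import ADASYN

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)

# ADASYN decides per minority sample how many synthetic points to create,
# focusing on samples that are harder to learn (those surrounded by the majority class).
X_res, y_res = ADASYN(n_neighbors=5, random_state=42).fit_resample(X, y)
print(Counter(y), Counter(y_res))
```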

3- ANS: Adaptive Neighbor Synthetic (ANS) dynamically adjusts the number of neighbors required for oversampling around different minority regions. For a given dataset, the algorithm removes the SMOTE parameter K and assigns a different number of neighbors to each positive instance. The technique is parameter-free because every parameter is set automatically within the algorithm.

4- Borderline-SMOTE: Borderline-SMOTE generates synthetic samples along the boundary between the minority and majority classes. This also helps to separate the minority and majority classes.
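imbalanced-learn also ships a Borderline-SMOTE variant; a minimal sketch (illustrative data and settings):

```python
from sklearn.datasets import make_classification
from imblearn.over_sampling import BorderlineSMOTE

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)

# Borderline-SMOTE only oversamples minority points that lie near the class boundary.
X_res, y_res = BorderlineSMOTE(kind="borderline-1", random_state=42).fit_resample(X, y)
```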

5- Safe-Level SMOTE: The safe level of an instance is the number of positive (minority) instances among its k nearest neighbors. When an instance's safe level is close to 0, it is almost noise; when it is close to k, the instance is regarded as safe. By taking the safe-level ratio of instances into account, each synthetic instance is generated in a safe position. In contrast, SMOTE and Borderline-SMOTE may produce synthetic instances in undesirable places, such as overlapping regions and noise regions.

6- DBSMOTE: The Density-Based Synthetic Minority Oversampling Technique builds on DBSCAN, a clustering method. The DBSCAN algorithm first finds the minority-class clusters, and DBSMOTE then creates synthetic instances along the shortest path between each positive instance and a pseudo-centroid of its minority-class cluster.

Conclusion

We have seen that the SMOTE algorithm is an effective solution for imbalanced data in classification problems. SMOTE is a clever alternative to plain oversampling because it generates synthetic data points that resemble the original ones rather than simply duplicating the minority class.


Sources:

DataCamp

Towards Data Science
