
What is Imbalanced data?

Many real-world classification datasets are highly skewed: there is a majority class with many examples and a minority class with few. Such a dataset is called an imbalanced dataset. The ratio between the sizes of these classes is called the imbalance ratio.
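As a small illustration of the imbalance ratio, the sketch below counts the classes in a made-up label list. The labels are hypothetical, and expressing the ratio as majority count divided by minority count is one common convention, assumed here.

```python
from collections import Counter

# Hypothetical labels: 0 is the majority class, 1 is the minority class
labels = [0] * 95 + [1] * 5

counts = Counter(labels)
majority_count = max(counts.values())
minority_count = min(counts.values())

# Imbalance ratio expressed as majority : minority
imbalance_ratio = majority_count / minority_count
print(imbalance_ratio)  # 19.0
```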

Issue with Imbalanced data

A model trained on such imbalanced data is biased towards the majority class, which has more examples. The greater the imbalance ratio, the stronger this bias. For example, if about 99% of the samples belong to the majority class, a model can reach very high accuracy while hardly ever predicting the minority class. Hence, to get unbiased results, we need to balance the dataset.
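To make the problem concrete, here is a minimal sketch with made-up labels. The "model" is a hypothetical stand-in that always predicts the majority class; it is not a trained classifier.

```python
# Hypothetical test set: 990 majority (0) and 10 minority (1) examples
y_true = [0] * 990 + [1] * 10

# A "model" that ignores its input and always predicts the majority class
y_pred = [0] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on the minority class: true positives / all actual positives
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(t == 1 for t in y_true)
recall = true_positives / actual_positives

print(accuracy)  # 0.99
print(recall)    # 0.0
```

Accuracy looks excellent while not a single minority example is found, which is why recall is the metric checked later in this article.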

Handling Biased or Imbalanced Dataset

We first need to balance the dataset. Resampling is commonly used to reduce the bias. There are two ways of resampling:

Oversampling (increasing the number of data points in the minority class)
Undersampling (reducing the number of data points in the majority class)

In oversampling, the simplest way to enlarge the minority class is to duplicate randomly chosen data points, which may lead to overfitting.

In undersampling, the simplest way to shrink the majority class is to remove randomly chosen data points, which may lead to loss of relevant information.
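Both of these simple strategies can be sketched by hand with NumPy. The toy arrays and variable names below are hypothetical, and this is only an illustration of the idea, not how any particular library implements it.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy data: 8 majority rows (label 0) and 2 minority rows (label 1)
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)

majority_idx = np.flatnonzero(y == 0)
minority_idx = np.flatnonzero(y == 1)

# Random oversampling: draw minority rows with replacement until the classes match
extra_idx = rng.choice(
    minority_idx, size=len(majority_idx) - len(minority_idx), replace=True
)
over_idx = np.concatenate([majority_idx, minority_idx, extra_idx])
X_over, y_over = X[over_idx], y[over_idx]

# Random undersampling: keep a random subset of majority rows, without replacement
kept_idx = rng.choice(majority_idx, size=len(minority_idx), replace=False)
under_idx = np.concatenate([kept_idx, minority_idx])
X_under, y_under = X[under_idx], y[under_idx]

print(np.bincount(y_over))   # [8 8]
print(np.bincount(y_under))  # [2 2]
```

The oversampled set contains repeated copies of the same two minority rows, and the undersampled set has thrown away six majority rows, which mirrors the overfitting and information-loss risks described above.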

Let us see the implementation of both using Python code. We require the Python package imbalanced-learn, which has to be installed.

pip install imbalanced-learn

Oversampling Methods

SMOTE (Synthetic Minority Oversampling TEchnique)

In this method we synthesize new data points for the minority class based on those already present.
It picks a minority point at random and finds the k nearest neighbours of that point within the minority class.
A synthetic data point is then generated on the line segment between the chosen point and one of its neighbours.
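The interpolation step can be sketched in a few lines of NumPy. This is a simplified illustration of the idea under assumed toy data and an assumed value of k; the helper name smote_point is hypothetical, and the imbalanced-learn SMOTE class used later handles the full procedure.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical minority-class points in two dimensions
X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.7]])
k = 3  # number of nearest neighbours to consider (assumed value)


def smote_point(X_min, k, rng):
    """Create one synthetic point between a random minority point and a neighbour."""
    i = rng.integers(len(X_min))
    base = X_min[i]
    # Distances from the chosen point to every minority point
    dist = np.linalg.norm(X_min - base, axis=1)
    # Position 0 of the sorted order is the point itself, so skip it
    neighbour_idx = np.argsort(dist)[1:k + 1]
    neighbour = X_min[rng.choice(neighbour_idx)]
    # Interpolate at a random position on the segment base -> neighbour
    gap = rng.random()
    return base + gap * (neighbour - base)


synthetic = np.array([smote_point(X_min, k, rng) for _ in range(4)])
print(synthetic.shape)  # (4, 2)
```

Because each synthetic point lies between two existing minority points, it stays inside the region the minority class already occupies rather than being an exact duplicate.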

Undersampling Methods

Tomek Links- Tlinks

This method relies on the nearest-neighbour rule. The algorithm finds pairs of instances from opposite classes that are each other's nearest neighbours (such a pair is called a Tomek link).
The algorithm then removes the majority instance of each pair, which clears the border between the majority and minority classes.
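The pairing rule can be sketched directly with NumPy on a tiny hypothetical dataset. This is a brute-force illustration only; imbalanced-learn offers a TomekLinks class in imblearn.under_sampling for real use.

```python
import numpy as np

# Hypothetical 1-D toy data; label 0 is the majority class, 1 the minority class
X = np.array([[0.0], [1.2], [2.0], [2.9], [3.1], [5.0]])
y = np.array([0, 0, 0, 0, 1, 1])

# Pairwise distances, with the diagonal masked so a point is not its own neighbour
dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
np.fill_diagonal(dist, np.inf)
nearest = dist.argmin(axis=1)

# A Tomek link: i and j are each other's nearest neighbour and have different labels.
# Only the majority member (label 0) of each link is marked for removal.
tomek_majority = [
    i for i, j in enumerate(nearest)
    if nearest[j] == i and y[i] != y[j] and y[i] == 0
]

keep = np.setdiff1d(np.arange(len(y)), tomek_majority)
X_res, y_res = X[keep], y[keep]

print(tomek_majority)      # [3]
print(np.bincount(y_res))  # [3 2]
```

Only the majority point at 2.9, which sits right next to the minority point at 3.1, is removed. Unlike random undersampling, this does not balance the classes; it only cleans the boundary.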

Cluster Centroid based undersampling

This method uses the KMeans algorithm. The algorithm identifies homogeneous clusters of majority data points and replaces them with the cluster centroids.
The new majority samples are the N cluster centroids obtained by fitting the KMeans algorithm with N clusters.
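A minimal sketch of this idea using scikit-learn's KMeans is shown below. The toy data is synthetic, and setting N to the minority count is an assumption made here so that the classes end up balanced; imbalanced-learn also provides a ClusterCentroids class in imblearn.under_sampling.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Hypothetical toy data: 40 majority points and 5 minority points
X_maj = rng.normal(loc=0.0, scale=1.0, size=(40, 2))
X_min = rng.normal(loc=3.0, scale=0.5, size=(5, 2))

# Fit KMeans with N clusters on the majority class only;
# N is set to the minority count so the classes end up balanced
n_clusters = len(X_min)
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=1).fit(X_maj)

# The N centroids become the new majority samples
X_maj_new = kmeans.cluster_centers_

X_res = np.vstack([X_maj_new, X_min])
y_res = np.array([0] * len(X_maj_new) + [1] * len(X_min))

print(np.bincount(y_res))  # [5 5]
```

Note that the new majority samples are centroids, so they are generally not rows from the original dataset.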

Python Implementation

Here we will use the Pima Indians Diabetes dataset from Kaggle to see the implementation of both SMOTE and RandomUnderSampler (which randomly picks samples from the targeted class, with or without replacement).

Imports

# To help with reading and manipulation of data
import numpy as np
from numpy import array
import pandas as pd

# To split the data
from sklearn.model_selection import train_test_split

# To build a decision tree model
from sklearn.tree import DecisionTreeClassifier

# To get different performance metrics
import sklearn.metrics as metrics
from sklearn.metrics import recall_score

# To undersample and oversample the data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

Load and view Data

# Assumption: the Kaggle CSV is saved locally under this hypothetical file name,
# with the target column named "class" as in the code that follows
pdata = pd.read_csv("pima-indians-diabetes.csv")
# Viewing the first few rows and the shape of the data
print(pdata.head())
print(pdata.shape)

Splitting the Dataset and viewing the counts

X = pdata.drop(["class"], axis=1)
y = pdata["class"]
# splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)
# Seeing the proportion of each class in the train and test sets
print(y_train.value_counts(normalize=True))
print("*" * 80)
print(y_test.value_counts(normalize=True))
print("*" * 80)

Building a normal Decision Tree Model and checking the Recall score

dtree = DecisionTreeClassifier(random_state=1, max_depth=4)
# training the decision tree model with the original (imbalanced) training set
dtree.fit(X_train, y_train)
# Predicting the target for train and test set
pred_train = dtree.predict(X_train)
pred_test = dtree.predict(X_test)
# Checking recall score on the original train set and the test set
print(recall_score(y_train, pred_train))
print(recall_score(y_test, pred_test))

Oversampling train data using SMOTE

sm = SMOTE( sampling_strategy="auto", random_state=1, k_neighbors=5)
X_train_over , y_train_over = sm.fit_resample(X_train,y_train)
print("Before OverSampling, count of label '1': {}".format(sum(y_train == 1)))
print("Before OverSampling, count of label '0': {} \n".format(sum(y_train == 0)))

print("After OverSampling, count of label '1': {}".format(sum(y_train_over == 1)))
print("After OverSampling, count of label '0': {} \n".format(sum(y_train_over == 0)))

print("After OverSampling, the shape of train_X: {}".format(X_train_over.shape))
print("After OverSampling, the shape of train_y: {} \n".format(y_train_over.shape))

Building a Decision Tree Model with Oversampled Data and checking the Recall score

dtree1 = DecisionTreeClassifier(random_state=1, max_depth=4)
# training the decision tree model with oversampled training set
dtree1.fit(X_train_over, y_train_over)
# Predicting the target for train and test set
pred_train = dtree1.predict(X_train_over)
pred_test = dtree1.predict(X_test)
# Checking recall score on the oversampled train set and the test set
print(recall_score(y_train_over, pred_train))
print(recall_score(y_test, pred_test))

Undersampling train data using RandomUnderSampler

rm = RandomUnderSampler(sampling_strategy=1,
    random_state=1)
X_train_un,y_train_un = rm.fit_resample(X_train,y_train)
print("Before Under Sampling, count of label '1': {}".format(sum(y_train == 1)))
print("Before Under Sampling, count of label '0': {} \n".format(sum(y_train == 0)))

print("After Under Sampling, count of label '1': {}".format(sum(y_train_un == 1)))
print("After Under Sampling, count of label '0': {} \n".format(sum(y_train_un == 0)))

print("After Under Sampling, the shape of train_X: {}".format(X_train_un.shape))
print("After Under Sampling, the shape of train_y: {} \n".format(y_train_un.shape))

Building a Decision Tree Model with Undersampled Data and checking the Recall score

dtree2 = DecisionTreeClassifier(random_state=1, max_depth=4)
# training the decision tree model with undersampled training set
dtree2.fit(X_train_un, y_train_un)
# Predicting the target for train and test set
pred_train = dtree2.predict(X_train_un)
pred_test = dtree2.predict(X_test)
# Checking recall score on the undersampled train set and the test set
print(recall_score(y_train_un, pred_train))
print(recall_score(y_test, pred_test))

Observations

We can see a change in the recall score after oversampling and undersampling of the data.

The undersampled version gave the best Recall score when compared to the remaining two decision tree models.

Performance can be improved further by hyperparameter tuning.
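As one possible way to do this, the sketch below runs a grid search over decision tree settings with recall as the scoring metric. It uses synthetic data from make_classification as a stand-in for the real dataset, and the grid values are hypothetical choices for illustration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the real dataset (an assumption)
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

# Hypothetical search grid; the values are illustrative only
param_grid = {"max_depth": [2, 4, 6], "min_samples_leaf": [1, 5, 10]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid,
    scoring="recall",  # optimise the same metric checked in this article
    cv=5,
)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.best_score_)
```

The same search could be run on the oversampled or undersampled training set in place of X_train and y_train.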

Summary

Thus we saw how imbalanced data can be handled by oversampling and undersampling, along with their Python implementations.

Thanks for your time.

References:

https://datascience.aero/predicting-improbable-part-1-imbalanced-data-problem/

https://www.geeksforgeeks.org/imbalanced-learn-module-in-python/

https://www.kdnuggets.com/2016/08/learning-from-imbalanced-classes.html/2

Handling Imbalanced Datasets by Oversampling and Undersampling with Python Implementation

Lakshmi Prabha Ramesh

IoT Operations Management| Machine Learning | Data Science | Cybersecurity
