Handling Imbalanced Datasets by Oversampling and Undersampling with Python Implementation

What is Imbalanced data?

In general, most classification datasets are highly skewed: one class (the majority class) has far more examples than the other (the minority class). Such a dataset is called an imbalanced dataset, and the ratio between the class sizes is called the imbalance ratio.

Issue with Imbalanced data

A model trained on such data is biased towards the majority class, which has more examples, and the greater the imbalance ratio, the stronger this bias. For example, if 99% of the samples belong to the majority class, a classifier that always predicts the majority class reaches 99% accuracy while never detecting the minority class, so accuracy alone looks deceptively high. Hence, to get unbiased results, we need to balance the dataset.

Handling Biased or Imbalanced Dataset

We first need to balance the dataset. To do so, resampling techniques are commonly used to reduce the bias. There are two ways of resampling:

  1. Oversampling (increasing the number of data points in the minority class)
  2. Undersampling (reducing the number of data points in the majority class)

In oversampling, the simplest way to grow the minority class is to duplicate random minority data points, which may lead to overfitting.

In undersampling, the simplest way to shrink the majority class is to remove random majority data points, which may lead to loss of relevant information.
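
To make both ideas concrete before turning to imbalanced-learn, here is a minimal pandas sketch on a toy DataFrame (the frame and column names are purely illustrative): random duplication implements the naive oversampling described above, random removal the naive undersampling.

import pandas as pd

# Toy frame for illustration only: 8 majority (0) and 2 minority (1) rows
df = pd.DataFrame({"feature": range(10), "class": [0] * 8 + [1] * 2})
minority = df[df["class"] == 1]
majority = df[df["class"] == 0]

# Naive oversampling: duplicate random minority rows until the classes match
oversampled = pd.concat([majority, minority.sample(len(majority), replace=True, random_state=1)])

# Naive undersampling: keep only a random subset of majority rows
undersampled = pd.concat([majority.sample(len(minority), random_state=1), minority])

print(oversampled["class"].value_counts())
print(undersampled["class"].value_counts())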

Let us see the implementation of both using Python. We need the imbalanced-learn package, which can be installed with pip:

pip install imbalanced-learn        

Oversampling Methods

SMOTE (Synthetic Minority Oversampling TEchnique)

  • In this method we synthesize new minority-class data points based on those already present.
  • The algorithm picks a minority point at random and finds its k-nearest minority neighbours.
  • Synthetic data points are then generated along the line segments between the chosen point and its neighbours (see the sketch after this list).
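
As a rough illustration of that interpolation step (not the exact internals of imbalanced-learn's SMOTE; the point values here are made up), a single synthetic sample can be generated like this:

import numpy as np

rng = np.random.default_rng(1)

x_i = np.array([2.0, 3.0])          # randomly chosen minority point (illustrative values)
x_neighbour = np.array([4.0, 5.0])  # one of its k-nearest minority neighbours

# The synthetic point lies somewhere on the line segment between the two
lam = rng.uniform(0, 1)
x_synthetic = x_i + lam * (x_neighbour - x_i)
print(x_synthetic)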

Undersampling Methods

Tomek Links (T-Links)

  • This method uses the nearest-neighbour idea: it finds pairs of instances from opposite classes that are each other's nearest neighbours (Tomek links).
  • The algorithm then removes the majority-class instance of each pair, which also cleans up the border between the majority and minority classes (see the sketch after this list).
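
imbalanced-learn provides a TomekLinks sampler; a minimal sketch on a synthetic imbalanced dataset (generated here only for illustration) could look like this:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import TomekLinks

# Synthetic imbalanced dataset for illustration
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=1)

tl = TomekLinks()
X_res, y_res = tl.fit_resample(X, y)
print(Counter(y), Counter(y_res))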

Cluster Centroid based undersampling

  • This method uses the KMeans algorithm. It identifies homogeneous clusters of majority data points and replaces them with the cluster centroids.
  • The new majority samples are the N cluster centroids obtained by fitting KMeans with N clusters (see the sketch after this list).
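
Similarly, imbalanced-learn offers a ClusterCentroids sampler that applies this idea; a minimal sketch (again on a synthetic dataset used purely for illustration):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import ClusterCentroids

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=1)

cc = ClusterCentroids(random_state=1)
X_res, y_res = cc.fit_resample(X, y)
print(Counter(y), Counter(y_res))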


Python Implementation

Here we will use the Pima Indians Diabetes dataset from Kaggle to see implementations of both SMOTE and RandomUnderSampler (which under-samples the majority class by randomly picking samples, with or without replacement).

Imports

# To help with reading and manipulation of data
import numpy as np
from numpy import array
import pandas as pd

# To split the data
from sklearn.model_selection import train_test_split

# To build a decision tree model
from sklearn.tree import DecisionTreeClassifier

# To get different performance metrics
import sklearn.metrics as metrics
from sklearn.metrics import recall_score
   
# To undersample and oversample the data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler        

Load and view Data
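
The snippet that reads the data is not shown in the original article; a minimal sketch is given below, assuming the Kaggle CSV has been saved locally as pima-indians-diabetes.csv and that the target column is named "class" (both the file name and the column name are assumptions; adjust them to your copy of the dataset).

# Reading the dataset (file name assumed; adjust the path to your local copy)
pdata = pd.read_csv("pima-indians-diabetes.csv")

# Viewing the first few rows and the class distribution
print(pdata.head())
print(pdata["class"].value_counts())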



Splitting the Dataset and viewing the counts

X= pdata.drop(["class"],axis=1)
y= pdata["class"]
# splitting the dataset
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.30,random_state=1)
# Checking the class proportions in the train and test sets
print(y_train.value_counts(1))
print("*" * 80)
print(y_test.value_counts(1))
print("*" * 80)        


Building a normal Decision Tree Model and checking the Recall score

dtree1 = DecisionTreeClassifier(random_state=1, max_depth=4)
# training the decision tree model with the original (imbalanced) training set
dtree1.fit(X_train, y_train)
# Predicting the target for train and validation set
pred_train = dtree1.predict(X_train)
pred_test = dtree1.predict(X_test)
# Checking recall score on the original train and validation set
print(recall_score(y_train, pred_train))
print(recall_score(y_test, pred_test))

Oversampling train data using SMOTE

sm = SMOTE(sampling_strategy="auto", random_state=1, k_neighbors=5)
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
print("Before OverSampling, count of label '1': {}".format(sum(y_train == 1)))
print("Before OverSampling, count of label '0': {} \n".format(sum(y_train == 0)))

print("After OverSampling, count of label '1': {}".format(sum(y_train_over == 1)))
print("After OverSampling, count of label '0': {} \n".format(sum(y_train_over == 0)))

print("After OverSampling, the shape of train_X: {}".format(X_train_over.shape))
print("After OverSampling, the shape of train_y: {} \n".format(y_train_over.shape))        

Building a Decision Tree Model with Oversampled Data and checking the Recall score

dtree2 = DecisionTreeClassifier(random_state=1, max_depth=4)
# training the decision tree model with oversampled training set
dtree2.fit(X_train_over, y_train_over)
# Predicting the target for train and validation set
pred_train = dtree2.predict(X_train_over)
pred_test = dtree2.predict(X_test)
# Checking recall score on oversampled train and validation set
print(recall_score(y_train_over, pred_train))
print(recall_score(y_test, pred_test))

Undersampling train data using RandomUnderSampler

rm = RandomUnderSampler(sampling_strategy=1, random_state=1)
X_train_un,y_train_un = rm.fit_resample(X_train,y_train)
print("Before Under Sampling, count of label '1': {}".format(sum(y_train == 1)))
print("Before Under Sampling, count of label '0': {} \n".format(sum(y_train == 0)))

print("After Under Sampling, count of label '1': {}".format(sum(y_train_un == 1)))
print("After Under Sampling, count of label '0': {} \n".format(sum(y_train_un == 0)))

print("After Under Sampling, the shape of train_X: {}".format(X_train_un.shape))
print("After Under Sampling, the shape of train_y: {} \n".format(y_train_un.shape))        

Building a Decision Tree Model with Undersampled Data and checking the Recall score


dtree3 = DecisionTreeClassifier(random_state=1, max_depth=4)
# training the decision tree model with undersampled training set
dtree3.fit(X_train_un, y_train_un)
# Predicting the target for train and validation set
pred_train = dtree3.predict(X_train_un)
pred_test = dtree3.predict(X_test)
# Checking recall score on undersampled train and validation set
print(recall_score(y_train_un, pred_train))
print(recall_score(y_test, pred_test))

Observations

We can see that the recall score changes after oversampling and undersampling the data.

The undersampled version gave the best recall score of the three decision tree models.

Performance can be improved further with hyperparameter tuning.
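
As one possible next step, the sketch below tunes the decision tree on the undersampled training set with scikit-learn's GridSearchCV, using recall as the scoring metric; the parameter grid is only an example, not a recommendation.

from sklearn.model_selection import GridSearchCV

# Example grid; the values are illustrative
param_grid = {"max_depth": [3, 4, 5, 6], "min_samples_leaf": [1, 5, 10]}

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid,
    scoring="recall",
    cv=5,
)
grid.fit(X_train_un, y_train_un)

print(grid.best_params_)
print(recall_score(y_test, grid.best_estimator_.predict(X_test)))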

Summary

Thus we saw how imbalanced data can be handled by oversampling and undersampling, along with Python implementations of both.

Thanks for your time.

References:

https://datascience.aero/predicting-improbable-part-1-imbalanced-data-problem/

https://www.geeksforgeeks.org/imbalanced-learn-module-in-python/

https://www.kdnuggets.com/2016/08/learning-from-imbalanced-classes.html/2

