Ways to Detect and Remove Outliers in a Dataset

When working on a data science project, what should you look for? What is the most important part of the exploratory data analysis (EDA) phase? Some checks, if skipped during EDA, can undermine later statistical or machine learning modelling. One of them is finding outliers. In this post we will look at what an outlier is, why it is important to identify outliers, and which methods can be used to detect them. Don't worry, we won't just go through the theory; we will also do some coding and plotting of the data.

What is an outlier?

An outlier is a data point that lies far from the other observations in a data set, that is, a point that falls outside the overall distribution of the data.


What are the criteria to identify an outlier?

  • A data point that lies more than 1.5 times the interquartile range (IQR) above the 3rd quartile or below the 1st quartile
  • A data point that lies more than 3 standard deviations from the mean. We can compute a z score for each point and flag the point when the absolute z score exceeds the chosen threshold, commonly 3; some analysts use a stricter cutoff of 2

What are the reasons for an outlier to exist in a dataset?

  • Variability in the data
  • An experimental or measurement error

What is the impact of having outliers in a dataset?

  • They can distort statistical analysis, for example by skewing estimates
  • They can have a significant impact on the mean and the standard deviation
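
To make the second point concrete, here is a minimal sketch using Python's standard-library statistics module. The sample values are made up for illustration, and the median is included only as a point of comparison, since it is far less sensitive to extreme values.

```python
import statistics

# Hypothetical sample: five typical readings, then the same readings plus one extreme value
clean = [10, 12, 11, 13, 12]
with_outlier = clean + [100]

# Mean and (population) standard deviation before and after adding the outlier
mean_clean = statistics.mean(clean)
mean_outlier = statistics.mean(with_outlier)
std_clean = statistics.pstdev(clean)
std_outlier = statistics.pstdev(with_outlier)

# The median, shown for comparison only
median_clean = statistics.median(clean)
median_outlier = statistics.median(with_outlier)

print("mean:  ", mean_clean, "->", mean_outlier)
print("std:   ", std_clean, "->", std_outlier)
print("median:", median_clean, "->", median_outlier)
```

In this sample a single extreme value more than doubles the mean and inflates the standard deviation many times over, while the median stays at 12.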

Various ways of finding outliers

  • Using scatter plots
  • Using a box plot
  • Using the z score
  • Using the interquartile range (IQR)

# Using a box plot
# dataset is assumed to be a list or array of numeric values

import seaborn as sns

sns.boxplot(data=dataset, width=0.5)

Detecting outliers using the Z score

  • The Z score measures how many standard deviations an observation lies from the mean
  • Formula for Z score = (Observation - Mean) / Standard Deviation

z = (X - μ) / σ

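import numpy as np

def detect_outliers(data, threshold=3):
    # Collect the values whose absolute z score exceeds the threshold.
    # The list is created inside the function so repeated calls do not accumulate results.
    outliers = []
    mean = np.mean(data)
    std = np.std(data)
    for i in data:
        z_score = (i - mean) / std
        if np.abs(z_score) > threshold:
            outliers.append(i)
    return outliers

# dataset is assumed to be a list or array of numeric values
outlier_pt = detect_outliers(dataset)

outlier_pt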
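
The same check can also be written without an explicit loop. The following is a sketch, not the only way to do it: the function name zscore_outliers and the sample values are hypothetical, and it uses the population standard deviation, which is NumPy's default.

```python
import numpy as np

def zscore_outliers(data, threshold=3.0):
    """Return the values whose absolute z score exceeds the threshold."""
    values = np.asarray(data, dtype=float)
    std = values.std()  # population standard deviation (ddof=0)
    if std == 0:
        # All values are identical, so no point can be an outlier
        return values[:0]
    z_scores = (values - values.mean()) / std
    return values[np.abs(z_scores) > threshold]

# Hypothetical sample: 20 ordinary readings and one extreme value
sample = [10, 11, 12, 13, 14] * 4 + [100]
print(zscore_outliers(sample))
```

With this sample, only the value 100 is flagged. One caveat: with the population standard deviation, the largest possible absolute z score in a sample of n points is sqrt(n - 1), so a threshold of 3 can never flag anything when the sample has 10 or fewer points.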

Interquartile Range (IQR)

The IQR is the difference between the 75th and 25th percentile values of a dataset.

Steps

  1. Arrange the data in increasing order
  2. Calculate the first quartile (q1) and the third quartile (q3)
  3. Find the interquartile range: IQR = q3 - q1
  4. Find the lower bound: q1 - 1.5 * IQR
  5. Find the upper bound: q3 + 1.5 * IQR

Any value that lies below the lower bound or above the upper bound is an outlier.

## Perform all the steps of IQR

import numpy as np

# Sorting follows step 1 above; np.percentile itself does not require sorted input
dataset = sorted(dataset)

quantile1, quantile3 = np.percentile(dataset, [25, 75])

print(quantile1, quantile3)

## Find the IQR

iqr_value = quantile3 - quantile1

print(iqr_value)

## Find the lower bound value and the upper bound value

lower_bound_val = quantile1 - (1.5 * iqr_value)

upper_bound_val = quantile3 + (1.5 * iqr_value)

print(lower_bound_val, upper_bound_val)

# Any value that lies outside of the lower and upper bound is an outlier
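
The bounds computed above only identify outliers, while the title of this post also promises removing them. One possible way to do that is sketched below with a boolean mask. The helper names iqr_bounds and split_outliers, the multiplier parameter k, and the sample values are all assumptions made for this example.

```python
import numpy as np

def iqr_bounds(data, k=1.5):
    """Return (lower, upper) bounds as q1 - k*IQR and q3 + k*IQR."""
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def split_outliers(data, k=1.5):
    """Split the data into (kept values, removed outliers) using the IQR bounds."""
    values = np.asarray(data, dtype=float)
    lower, upper = iqr_bounds(values, k)
    is_outlier = (values < lower) | (values > upper)
    return values[~is_outlier], values[is_outlier]

# Hypothetical sample with one very large and one very small value
sample = [10, 12, 11, 13, 12, 11, 14, 100, -40]

kept, removed = split_outliers(sample)
print("bounds: ", iqr_bounds(sample))
print("kept:   ", kept)
print("removed:", removed)
```

For this sample the bounds come out as 8.0 and 16.0, so 100 and -40 are removed and the other seven values are kept. Whether removal is appropriate depends on why the outlier exists: a measurement error can usually be dropped, while genuine variability in the data may deserve a closer look first.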




