Ways to Detect and Remove Outliers in a Dataset

Amit Kumar

AI Engineer | Gen AI | Agentic AI | LLM | RAG | Machine Learning | Computer Vision | NLP | Deep Learning |

Published: August 13, 2020

When working on a data science project, what do you look for? What is the most important part of the EDA phase? Certain steps, if skipped during EDA, can affect later statistical or machine learning modelling. One of them is finding outliers. In this post we will try to understand what an outlier is, why it is important to identify outliers, and which methods can be used to detect them. Don't worry, we won't just go through the theory; we will do some coding and plotting of the data too.

What is an outlier?

An outlier is a data point that is distant from the other observations in a dataset. In other words, it is a data point that lies outside the overall distribution of the dataset.

What are the criteria to identify an outlier?

A data point that lies more than 1.5 times the interquartile range above the 3rd quartile or below the 1st quartile
A data point that lies more than 3 standard deviations from the mean. We can use a z-score for this: a point is flagged if its absolute z-score is greater than 3

Why do outliers exist in a dataset?

Variability in the data
An experimental measurement error

What are the impacts of having outliers in a dataset?

They can distort the results of statistical analysis
They may have a significant impact on the mean and the standard deviation
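To make the effect on the mean and the standard deviation concrete, here is a minimal sketch. The sample values are hypothetical and chosen only for illustration; the median is printed alongside for comparison.

```python
import numpy as np

# Hypothetical sample: nine typical readings plus one extreme value.
clean = np.array([10, 12, 11, 13, 12, 11, 10, 12, 13])
with_outlier = np.append(clean, 100)

# Compare summary statistics with and without the extreme value.
for label, values in (("without outlier", clean), ("with outlier", with_outlier)):
    print(f"{label}: mean={values.mean():.2f}, "
          f"std={values.std():.2f}, median={np.median(values):.2f}")
```

In this sample the single extreme value moves the mean from about 11.56 to 20.40, while the median stays at 12.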

Various ways of finding outliers

Using scatter plots
Using a box plot
Using the z-score
Using the interquartile range (IQR)

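# Using a box plot
# `dataset` is assumed to be a list or array of numeric values defined earlier

import seaborn as sns

# Pass the values as x so the call works across seaborn versions
sns.boxplot(x=dataset, width=0.5)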

Detecting outliers using the z-score

Formula for the z-score = (Observation - Mean) / Standard Deviation

z = (X - μ) / σ

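import numpy as np

def detect_outliers(data):
    # Start with an empty list on every call so results do not accumulate
    outliers = []
    threshold = 3
    mean = np.mean(data)
    std = np.std(data)
    for i in data:
        # Distance of the observation from the mean, in standard deviations
        z_score = (i - mean) / std
        if np.abs(z_score) > threshold:
            outliers.append(i)
    return outliers

# `dataset` is assumed to be a list or array of numeric values
outlier_pt = detect_outliers(dataset)
outlier_pt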
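The same z-score rule can also be written without an explicit loop. The sketch below is one possible vectorized variant, not the only way to do it; the function name `zscore_outliers`, the guard for zero standard deviation, and the sample values are assumptions added for illustration.

```python
import numpy as np

def zscore_outliers(data, threshold=3.0):
    """Return the values whose absolute z-score exceeds `threshold`."""
    values = np.asarray(data, dtype=float)
    std = values.std()  # population standard deviation, as np.std computes by default
    if std == 0:
        # All values are identical, so nothing can be flagged.
        return np.array([])
    z_scores = (values - values.mean()) / std
    return values[np.abs(z_scores) > threshold]

# Hypothetical sample: twenty readings near 10 and one extreme value.
sample = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11,
          10, 9, 11, 10, 12, 10, 9, 11, 10, 10, 60]
print(zscore_outliers(sample))
```

One thing to keep in mind: with very small samples the extreme value inflates the standard deviation so much that no point can exceed a threshold of 3, so this rule tends to be more useful on larger datasets.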

Interquartile Range (IQR)

The IQR is the difference between the 75th percentile (q3) and the 25th percentile (q1) of a dataset

Steps

Arrange the data in increasing order
Calculate the first quartile (q1) and the third quartile (q3)
Find the interquartile range: IQR = q3 - q1
Find the lower bound: q1 - 1.5 * IQR
Find the upper bound: q3 + 1.5 * IQR

Anything that lies outside the lower and upper bounds is an outlier

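## Perform all the steps of IQR

import numpy as np

# Keep the sorted result; np.percentile does not need sorted input,
# but sorting makes the values easier to inspect
dataset = sorted(dataset)

quantile1, quantile3 = np.percentile(dataset, [25, 75])
print(quantile1, quantile3)

## Find the IQR
iqr_value = quantile3 - quantile1
print(iqr_value)

## Find the lower bound value and the higher bound value
lower_bound_val = quantile1 - (1.5 * iqr_value)
upper_bound_val = quantile3 + (1.5 * iqr_value)
print(lower_bound_val, upper_bound_val)

# Any value that lies outside of lower and upper bound is an outlier
outliers = [x for x in dataset if x < lower_bound_val or x > upper_bound_val]
print(outliers)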
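Once the bounds are known, removing the outliers is a matter of keeping only the values inside them. The following is a minimal sketch under stated assumptions: the function name `remove_outliers_iqr`, the parameter `k`, and the sample values are hypothetical, and dropping points is only one option (whether it is appropriate depends on why the outliers exist).

```python
import numpy as np

def remove_outliers_iqr(data, k=1.5):
    """Split `data` into (kept, removed) using the k * IQR rule."""
    values = np.asarray(data, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # True for values inside the bounds, False for outliers.
    inside = (values >= lower) & (values <= upper)
    return values[inside], values[~inside]

# Hypothetical sample: nine typical readings plus one extreme value.
sample = [10, 12, 11, 13, 12, 11, 10, 12, 13, 100]
kept, removed = remove_outliers_iqr(sample)
print("kept:", kept)
print("removed:", removed)
```

Returning the removed values alongside the kept ones makes it easy to inspect what was dropped before committing to the cleaned data.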
