Ways to Detect and Remove Outliers in a Dataset
Amit Kumar
AI Engineer | Gen AI | Agentic AI | LLM | RAG | Machine Learning | Computer Vision | NLP | Deep Learning |
While working on a data science project, what do you look for? What is the most important part of the EDA phase? Certain steps, if skipped during EDA, can affect later statistical or machine learning modelling. One of them is finding “outliers”. In this post we will try to understand what an outlier is, why it is important to identify outliers, and which methods can be used to detect them. Don’t worry, we won’t just go through the theory; we will do some coding and plotting of the data too.
What is an outlier?
An outlier is a data point that is distant from the other observations in a data set, that is, a point that lies outside the overall distribution of the data.
What are the criteria to identify an outlier?
- A data point that lies more than 1.5 times the interquartile range (IQR) above the 3rd quartile or below the 1st quartile
- A data point that lies more than 3 standard deviations from the mean. We can compute a z-score for each point and flag it when its absolute value exceeds the chosen threshold (3 in the code in this post; 2 is sometimes used as a stricter cut-off)
What are the reasons for an outlier to exist in a dataset?
- Variability in the data
- An experimental measurement error
What are the impacts of having outliers in a dataset?
- They can distort the results of our statistical analysis
- They can have a significant impact on the mean and the standard deviation
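To make the impact on the mean and the standard deviation concrete, here is a minimal sketch using Python's standard `statistics` module. The sample values are made up for illustration and are not taken from any real dataset; the median is printed alongside only for comparison.

```python
import statistics

# Hypothetical sample values, made up for illustration only.
clean = [10, 12, 11, 13, 12, 11, 10, 12]
with_outlier = clean + [100]  # the same data with one extreme value appended

for label, values in [("without outlier", clean), ("with outlier", with_outlier)]:
    mean = statistics.mean(values)
    median = statistics.median(values)
    std = statistics.pstdev(values)  # population standard deviation
    print(f"{label}: mean={mean:.2f}, median={median:.2f}, std={std:.2f}")
```

A single extreme value pulls the mean upward and inflates the standard deviation, while the median barely moves.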
Various ways of finding outliers:
- Using scatter plots
- Using a box plot
- Using the z-score
- Using the interquartile range (IQR)
# Using a box plot (dataset is a 1-D sequence of numeric values)
import matplotlib.pyplot as plt
import seaborn as sns
sns.boxplot(x=dataset, width=0.5)
plt.show()
Detecting outliers using the z-score
- Formula for the z-score = (Observation - Mean) / Standard Deviation
z = (X - μ) / σ
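import numpy as np

def detect_outliers(data, threshold=3):
    """Return the values whose absolute z-score exceeds the threshold."""
    outliers = []  # local list, so repeated calls do not accumulate results
    mean = np.mean(data)
    std = np.std(data)
    for i in data:
        z_score = (i - mean) / std
        if np.abs(z_score) > threshold:
            outliers.append(i)
    return outliers

outlier_pt = detect_outliers(dataset)
outlier_pt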
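Once outliers are detected, the same z-score rule can be used to remove them. The following is only a sketch of one possible approach: the function name `remove_outliers_zscore` and the sample values are hypothetical, and it uses a NumPy boolean mask instead of an explicit loop.

```python
import numpy as np

def remove_outliers_zscore(data, threshold=3.0):
    """Split data into (kept, removed) arrays using the absolute z-score."""
    values = np.asarray(data, dtype=float)
    std = values.std()  # population standard deviation, like np.std(data)
    if std == 0:
        # All values are identical, so nothing can be flagged as an outlier.
        return values, values[:0]
    z_scores = (values - values.mean()) / std
    is_outlier = np.abs(z_scores) > threshold
    return values[~is_outlier], values[is_outlier]

# Hypothetical sample data, made up for illustration only.
sample = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
          12, 10, 11, 13, 12, 11, 12, 10, 13, 95]
kept, removed = remove_outliers_zscore(sample)
print("removed:", removed)
print("remaining count:", len(kept))
```

Keep in mind that the mean and standard deviation are themselves affected by extreme values, so with several large outliers in a small sample some of them may stay below the threshold.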
Interquartile Range (IQR)
The IQR is the difference between the 75th and 25th percentile values in a dataset.
Steps
- Arrange the data in increasing order
- Calculate the first (q1) and third (q3) quartiles
- Find the interquartile range: IQR = q3 - q1
- Find the lower bound: q1 - 1.5 * IQR
- Find the upper bound: q3 + 1.5 * IQR
Anything that lies outside the lower and upper bounds is an outlier.
## Perform all the steps of IQR
import numpy as np
dataset = sorted(dataset)  # sorted() returns a new list, so keep the result
quantile1, quantile3 = np.percentile(dataset, [25, 75])
print(quantile1, quantile3)
## Find the IQR
iqr_value = quantile3 - quantile1
print(iqr_value)
## Find the lower bound value and the higher bound value
lower_bound_val = quantile1 - (1.5 * iqr_value)
upper_bound_val = quantile3 + (1.5 * iqr_value)
print(lower_bound_val, upper_bound_val)
# Any value that lies outside of lower and upper bound is an outlier
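The IQR bounds can also be used to actually separate the outliers from the rest of the data. Below is a minimal sketch under some assumptions: the helper name `split_by_iqr`, its parameter `k`, and the sample values are all hypothetical, and the quartiles come from `np.percentile` with its default interpolation.

```python
import numpy as np

def split_by_iqr(data, k=1.5):
    """Return (inliers, outliers, (lower, upper)) using the k * IQR rule."""
    values = np.asarray(data, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    inside = (values >= lower) & (values <= upper)
    return values[inside], values[~inside], (lower, upper)

# Hypothetical sample data, made up for illustration only.
sample = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
          12, 10, 11, 13, 12, 11, 12, 10, 13, 95]
inliers, outliers, (lower, upper) = split_by_iqr(sample)
print("bounds:", lower, upper)
print("outliers:", outliers)
```

Whether to drop the flagged values or investigate them first depends on the data; a point outside the bounds may also be a legitimate extreme observation.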