Ways to detect outliers that every data scientist should know

Amir T.

Marketing Products Senior Data Analyst

Published: November 26, 2022

An outlier is a data point that differs significantly from other observations. Outliers can distort a feature's distribution and noticeably degrade the results of ML work, so we need to detect them and form a strategy for dealing with them.

How does an outlier pop up?

The appearance of such observations can be caused by:

Differences in measurement methods, for example, a change in the sensitivity of a sensor;
Experimental errors, where the outlier is the result of a mistake made while running the experiment;
The introduction of new processes;
Errors during the data collection or data handling stage;
Genuine variance in the observations.

Depending on the nature of the outliers, you may keep them or exclude them. For example, outliers caused by experimental errors should usually be removed.

What are the types of outliers?

There are 3 types of outliers:

1. Global: Also known as a point outlier. This observation lies far outside the rest of the dataset. For example, in a class where all students are about the same age, there is a record of a student aged 500 years.

2. Conditional: Also known as a contextual outlier. These observations are considered anomalous only given their context. For example, when the economic performance of a country falls sharply during a global economic crisis, lower rates become the norm for some time, although the same rates would be anomalous outside the crisis.

3. Collective: A set of observations that are close to each other and share similar anomalous values. A subset of points is considered anomalous if, taken together, its values deviate significantly from the entire dataset, even though the individual data points are not anomalous in either a contextual or a global sense.

Why is it important to identify outliers?

Machine learning algorithms are sensitive to the range and distribution of values. Outliers can mislead ML models, leading to longer training times, lower accuracy, and ultimately worse results. However, not all ML work is affected by outliers; for some algorithms you can safely ignore them.

Outlier-sensitive algorithms: Linear Regression, Logistic Regression, Support Vector Machine
Outlier-robust algorithms: tree-based algorithms such as decision trees and random forests, which are largely insensitive to extreme feature values

From a business perspective, you should aim to understand why an outlier exists before deciding whether to remove it. For example, if a feature represents the height of a person and one of the observations contains, instead of a number, a string with a weird value like “abc cm”, that value cannot be a valid height, so it is safe to drop it.
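As a minimal sketch of the height example, the snippet below tries to read each raw entry as a number and drops the ones that cannot be parsed. The function name `parse_height_cm` and the sample values are hypothetical and only illustrate the idea.

```python
def parse_height_cm(raw):
    """Return the height as a float, or None if the entry is not a number."""
    try:
        return float(raw)
    except (TypeError, ValueError):
        # Entries such as "abc cm" cannot be a height, so mark them as invalid.
        return None

# Hypothetical raw column: three valid heights and one corrupted entry.
raw_heights = ["172", "165.5", "abc cm", "180"]

parsed = [parse_height_cm(r) for r in raw_heights]
clean_heights = [h for h in parsed if h is not None]
print(clean_heights)
```

Keeping the invalid entries as `None` before filtering makes it easy to count how many records were dropped and to investigate where they came from.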

How to detect an outlier?

You can easily spot outliers by utilizing different types of visuals:

1. Boxplot

Here’s what the boxplot shows:

The median is the value of the element at the centre of the ranked series. Note that the median is less influenced by outliers than the arithmetic mean, which is why the boxplot displays the median at its centre.
The upper quartile (Q3, or the 75th percentile) is the value above which only 25% of the values lie. The lower quartile (Q1, or the 25th percentile) is the value below which only 25% of the values lie.
The interquartile range (IQR) is the difference between the upper and lower quartiles (Q3 - Q1). 50% of the values lie within this range. For example, if the range is narrow, the members of the group largely agree in their assessments. If it is wide, there is no homogeneous opinion.

Based on the above, you can typically flag as outliers the values that fall below “25th percentile minus 1.5 x IQR” or above “75th percentile plus 1.5 x IQR”, as shown in the picture above.
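The rule can be sketched in a few lines of Python. This is an illustration rather than a reference implementation: it flags values below Q1 - 1.5 x IQR or above Q3 + 1.5 x IQR, uses `statistics.quantiles` from the standard library with the "inclusive" method (other quartile conventions give slightly different fences), and the function name and sample ages are made up for the example.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, outliers) using the k * IQR rule."""
    # quantiles(..., n=4) returns the three cut points Q1, median, Q3.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return lower_fence, upper_fence, outliers

# Hypothetical class ages with one impossible record.
ages = [21, 22, 22, 23, 23, 24, 24, 25, 500]
lower, upper, flagged = iqr_outliers(ages)
print(lower, upper, flagged)  # 19.0 27.0 [500]
```

Because the fences are built from quartiles, the extreme value of 500 barely moves them, which is what makes this rule robust.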

2. Histogram

Histograms aggregate numerical data into evenly spaced groups called bins and display the frequency of occurrence of values in each of the bins. A histogram is created using a number field or a percentage/ratio field. Histograms help answer questions such as: What is the distribution of values and how often do they appear in the data set?

By increasing and decreasing the number of bins, you can influence how your data is analysed. Although the data itself does not change, its appearance may change. Choosing the right number of bins is important to correctly interpret patterns in the data. Too few bins can hide some patterns, and too many can exaggerate the value of small, acceptable data changes. The correct number of bins will reveal patterns that are invisible when using, for example, large bins.
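To make the effect of the bin count concrete, here is a small sketch that counts the same values with two different numbers of evenly spaced bins. The helper `histogram_counts` and the sample data are hypothetical, and the helper assumes the data contain at least two distinct values.

```python
def histogram_counts(values, n_bins):
    """Count values in n_bins evenly spaced bins spanning min..max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # assumes hi > lo
    counts = [0] * n_bins
    for v in values:
        # The maximum value belongs to the last bin rather than a new one.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts

# Hypothetical data: a tight group of small values plus one value of 20.
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]

coarse = histogram_counts(data, 2)   # two wide bins
fine = histogram_counts(data, 19)    # nineteen bins of width 1
print(coarse)
print(fine)
```

With two bins the value 20 sits in a bin on its own, but the shape of the main group is hidden. With nineteen bins the main group shows its shape, and a long run of empty bins separates it from the isolated value.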

3. Scatterplot

A scatterplot shows how the elements of a set are distributed across two variables. The values of the independent variable are plotted along the X axis, and the values of the dependent variable along the Y axis.

The patterns displayed on the scatterplots allow you to see different types of correlation. Points that are significantly removed from the general cluster/correlation line of points are called outliers.
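A scatterplot is read by eye, but the idea of a point being far from the correlation line can also be sketched numerically. The snippet below is one possible approach rather than a method from the article: it fits a straight line by ordinary least squares and finds the point with the largest distance from that line. The function name and the data are invented for illustration.

```python
from statistics import mean

def residuals_from_line(xs, ys):
    """Fit y = intercept + slope * x by least squares; return the residuals."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Hypothetical data that roughly follows y = 2x, except for the fifth point.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 7.8, 30.0, 12.1]

res = residuals_from_line(xs, ys)
# Index of the point that lies farthest from the fitted line.
farthest = max(range(len(res)), key=lambda i: abs(res[i]))
print(farthest, xs[farthest], ys[farthest])
```

Note that the outlier itself pulls the fitted line towards it, so on real data a robust fit or a visual check is a sensible complement to this kind of calculation.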

4. Z-score

The z-score, also referred to as the standard score, describes where a value sits in the distribution of the data relative to the mean. It indicates how many standard deviations a given value lies below or above the mean.

The value of z can be read on the bell curve, where for a normal distribution nearly all z-scores range from -3 standard deviations (the far left tail of the curve) to +3 standard deviations (the far right tail of the curve). In most cases, values with a z-score greater than +3 or less than -3 are identified as outliers.
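A minimal sketch of the z-score rule, assuming the population standard deviation and a threshold of 3; the function names and the sample readings are hypothetical.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Return the z-score of each value: (value - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    return [(v - mu) / sigma for v in values]

def z_score_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    return [v for v, z in zip(values, z_scores(values)) if abs(z) > threshold]

# Hypothetical readings: nineteen values around 10 and one value of 50.
readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11,
            10, 9, 12, 10, 11, 10, 9, 11, 10, 50]

flagged = z_score_outliers(readings)
print(flagged)  # [50]
```

The mean and standard deviation are themselves inflated by the outlier, so this rule needs enough data to work: with the population standard deviation the largest possible absolute z-score in a sample of n values is the square root of n - 1, which means no value can exceed 3 when there are 10 or fewer points.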

How do I deal with outliers?

Once you have detected the outliers in your dataset, you have the following 3 options:

Remove outliers. Typically it is fine to drop an outlier if you have a really good sense of the range the data should fall in. For people’s ages, for example, you can safely drop values that are outside that range.
Change the value of the outlier, e.g. replace it with the mean value or with a maximum cap value such as the 90th percentile.
Keep it. You shouldn’t always aim to drop outliers. If, for example, 20%-40% of your data are flagged as outliers, they should not necessarily be treated as outliers; instead, you should look further into them.
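The first two options can be sketched as follows. This is an illustration under assumptions: the valid age range of 0 to 120 is an example choice, the cap uses the 90th percentile computed with `statistics.quantiles` and the "inclusive" method, and the function names and data are hypothetical.

```python
from statistics import quantiles

def drop_out_of_range(values, low, high):
    """Option 1: remove values outside a range known to be valid."""
    return [v for v in values if low <= v <= high]

def cap_at_percentile(values, pct=90):
    """Option 2: replace values above the pct-th percentile with that percentile."""
    # quantiles(n=100) returns 99 cut points; index pct - 1 is the pct-th percentile.
    cap = quantiles(values, n=100, method="inclusive")[pct - 1]
    return [min(v, cap) for v in values]

# Option 1: ages outside an assumed valid range of 0..120 are dropped.
ages = [21, 22, 22, 23, 23, 24, 24, 25, 500]
valid_ages = drop_out_of_range(ages, 0, 120)
print(valid_ages)

# Option 2: the extreme value is pulled down to the 90th percentile.
amounts = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1000]
capped = cap_at_percentile(amounts)
print(capped)
```

Capping keeps the record in the dataset, which helps when the other features of that row are still informative, while removal is the simpler choice when the value is clearly impossible.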
