Ways to detect outliers that every data scientist should know
An outlier is a data point that differs significantly from other observations. Outliers can distort feature distributions and significantly affect machine learning results, so we need to detect them and form a strategy for dealing with them.
How does an outlier pop up?
The appearance of such observations can be caused by:
Depending on the nature of the outliers, you may keep or exclude them. For example, outliers caused by experimental errors should usually be removed.
What are the types of outliers?
There are 3 types of outliers:
2. Conditional: observations that are considered anomalous only in a given context. For example, the economic performance of a country falls sharply due to the global economic crisis, and for some time lower rates become the norm.
3. Collective: a set of observations that are close to each other and have similarly anomalous values. A subset of points is considered anomalous if its values, taken together, deviate significantly from the rest of the data set, even though the individual data points are not anomalous in either a contextual or global sense.
Why is it important to identify outliers?
Machine learning algorithms are sensitive to the range and distribution of values. Outliers can mislead ML models, leading to longer training times, lower accuracy, and ultimately worse results. However, not all algorithms are affected by outliers; for some you can safely ignore them.
Business-wise, you should aim to understand why an outlier is there and then decide whether to keep it or remove it. For example, suppose you have a feature that represents a person's height, and one of the observations contains a string like "abc cm" instead of a number. Since a height cannot take such a value, it is safe to drop it.
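The height example can be sketched in a few lines of Python. This is only an illustration: the helper name `parse_height_cm` and the sample values are made up, and it assumes heights arrive as strings with an optional "cm" suffix.

```python
import math

def parse_height_cm(raw):
    """Return the height as a float, or None if the value is not a number."""
    # Hypothetical cleaning rule: strip an optional "cm" unit, then parse.
    text = str(raw).lower().replace("cm", "").strip()
    try:
        value = float(text)
    except ValueError:
        return None
    # Reject "nan" and "inf", which float() would otherwise accept.
    return value if math.isfinite(value) else None

raw_heights = ["172 cm", "165", "abc cm", "181.5 cm"]
parsed = [parse_height_cm(h) for h in raw_heights]
clean_heights = [h for h in parsed if h is not None]
print(clean_heights)  # prints [172.0, 165.0, 181.5]
```

Dropping the value is reasonable here because it is clearly a data entry error. An extreme but valid height would call for a closer look first.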
How to detect an outlier?
You can easily spot outliers by utilizing different types of visuals:
1. Boxplot
Here’s what the boxplot shows:
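Based on the above, values are typically flagged as outliers if they fall below “25th percentile minus 1.5 x IQR” or above “75th percentile plus 1.5 x IQR”, where the IQR (interquartile range) is the 75th percentile minus the 25th percentile, as shown in the picture above.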
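As a rough sketch, the 1.5 x IQR rule can be written with Python's standard `statistics` module. The function name `iqr_outliers`, the parameter `k`, and the sample data are illustrative assumptions. Different tools compute quartiles slightly differently, so the fences may not match a given plotting library exactly.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values that fall outside the IQR fences."""
    # method="inclusive" treats the sample minimum and maximum as the
    # 0th and 100th percentiles when interpolating the quartiles.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - k * iqr  # anything below this fence is flagged
    upper = q3 + k * iqr  # anything above this fence is flagged
    return [v for v in values if v < lower or v > upper]

sample = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
print(iqr_outliers(sample))  # prints [102]
```

For this sample the quartiles are 12 and 13.75, so the fences sit at 9.375 and 16.375 and only 102 falls outside them.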
2. Histogram
Histograms aggregate numerical data into evenly spaced groups called bins and display how often values occur in each bin. A histogram is created from a numeric field or a percentage/ratio field. Histograms help answer questions such as: what is the distribution of values, and how often do they appear in the data set?
By increasing or decreasing the number of bins, you can influence how your data is analysed. Although the data itself does not change, its appearance may. Choosing the right number of bins is important for interpreting patterns in the data correctly. Too few bins can hide some patterns, and too many can exaggerate the importance of small, normal variations. A well-chosen number of bins will reveal patterns that are invisible when the bins are, for example, too large.
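To illustrate the effect of the bin count, here is a minimal sketch that bins the same made-up values into 2, 5, and 20 equal-width bins. The helper `bin_counts` is a hypothetical name; plotting libraries provide the same binning along with the actual chart.

```python
def bin_counts(values, bins):
    """Count how many values fall into each of `bins` equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so everything lands in the first bin.
        return [len(values)] + [0] * (bins - 1)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value belongs to the last bin, not a new one.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts

values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]
for bins in (2, 5, 20):
    print(bins, bin_counts(values, bins))
```

With 2 bins, the value 20 sits alone in the second bin but the shape of the rest is lost. With 20 bins, both the main cluster and the isolated value far to the right are visible.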
3. Scatterplot
A scatterplot shows how the observations are distributed across two variables. The values of the independent variable are plotted along the X axis, and the values of the dependent variable along the Y axis.
The patterns displayed on a scatterplot allow you to see different types of correlation. Points that lie far from the general cluster or correlation line are called outliers.
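One way to turn "far from the correlation line" into a check is to fit a least-squares line and flag points with unusually large residuals. The sketch below is an assumption about how this could be done, not a method from the article. In particular, the cutoff of `k` times the median absolute residual is an arbitrary illustrative choice, and the data are made up.

```python
import statistics

def residual_outliers(xs, ys, k=3.0):
    """Return indices of points that lie far from the least-squares line."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    # Ordinary least-squares slope and intercept for y = intercept + slope * x.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Absolute vertical distance of each point from the fitted line.
    residuals = [abs(y - (intercept + slope * x)) for x, y in zip(xs, ys)]
    # Assumed cutoff: k times the median absolute residual.
    cutoff = k * statistics.median(residuals)
    return [i for i, r in enumerate(residuals) if r > cutoff]

xs = list(range(1, 11))
ys = [2 * x + 1 for x in xs]
ys[4] = 40  # one point pushed far above the y = 2x + 1 trend
print(residual_outliers(xs, ys))  # prints [4]
```

The outlier itself pulls the fitted line towards it, which is why the cutoff uses the median of the residuals rather than their mean.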
4. Z-score
The z-score, also called the standard score, describes where a value sits in the distribution relative to the mean. It indicates how many standard deviations a value lies below or above the population mean: z = (x - mean) / standard deviation.
The value of z can be read off the bell curve. For normally distributed data, almost all z-scores fall between -3 standard deviations (the far left of the normal distribution curve) and +3 standard deviations (the far right). In most cases, values with a z-score above +3 or below -3 are identified as outliers.
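A minimal sketch of the z-score rule, using the population standard deviation from Python's `statistics` module. The function name `zscore_outliers`, the default threshold of 3, and the sample data are illustrative assumptions.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        # All values are identical, so nothing can be an outlier.
        return []
    return [v for v in values if abs((v - mean) / std) > threshold]

# 30 values between 50 and 54, plus one extreme value.
sample = [50 + (i % 5) for i in range(30)] + [120]
print(zscore_outliers(sample))  # prints [120]
```

Note that the rule needs a reasonably large sample. With only 10 points, for example, the largest possible absolute z-score is about 2.85, so nothing would ever cross the threshold of 3.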
How do I deal with outliers?
Once you have detected the outliers in your dataset, you can take one of the following 3 actions: