Ways to detect outliers that every data scientist should know

An outlier is a data point that differs significantly from other observations. Outliers distort the distribution of a feature and can significantly degrade the performance of ML models, so we need to detect them and form a strategy for dealing with them.

How does an outlier pop up?

Such observations can be caused by:

  • Differences in measurement methods, for example, a change in the sensitivity of a sensor;
  • Experimental errors, where the outlier results from a mistake made while running the experiment;
  • The introduction of new processes;
  • Errors during data collection, entry, or handling;
  • Genuine variability in the observations.

Depending on the nature of the outliers, you may keep or exclude them. For example, outliers caused by experimental errors should usually be removed.

What are the types of outliers?

There are 3 types of outliers:

  1. Global: Also known as a point outlier. This observation lies far outside the range of the entire dataset. For example, in a class where all students are about the same age, there is a record of a student aged 500 years.

  2. Conditional: Also known as a contextual outlier. The observation is considered anomalous only given its context. For example, a country's economic performance falls sharply during a global economic crisis, and for some time the lower figures become the norm: values that would be anomalous in ordinary times are normal in that context.

  3. Collective: A set of observations that are close to each other and have similar anomalous values. A subset of points is considered anomalous if its values, taken as an aggregate, deviate significantly from the entire dataset, even though the individual data points are not anomalous in either a contextual or a global sense.

Why is it important to identify outliers?

Machine learning algorithms are sensitive to the range and distribution of values. Outliers can mislead ML models, leading to longer training times, lower accuracy, and ultimately worse results. However, not every algorithm is affected by outliers; for some you can safely ignore them.

  • Outlier-sensitive algorithms: Linear Regression, Logistic Regression, Support Vector Machine
  • Outlier-robust algorithms: tree-based algorithms such as decision trees and random forests

Business-wise, you should aim to understand why an outlier exists before deciding whether to remove it. For example, if a feature represents a person's height and one observation contains a string such as “abc cm” instead of a number, it is safe to drop it, since a height cannot take such a value.
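The height example can be sketched in a few lines of Python. This is only an illustration: the function name `parse_height_cm` and the sample records are hypothetical, and it assumes heights arrive as strings that may carry a "cm" suffix.

```python
def parse_height_cm(raw):
    """Return the height as a float, or None if the text is not a number."""
    # Assumption: the only unit suffix we expect is "cm".
    text = str(raw).lower().replace("cm", "").strip()
    try:
        return float(text)
    except ValueError:
        return None  # e.g. "abc cm" cannot be a height

# Hypothetical raw records, including the invalid value from the text.
raw_heights = ["172", "180 cm", "abc cm", "165.5"]

parsed = [parse_height_cm(h) for h in raw_heights]
valid_heights = [h for h in parsed if h is not None]
print(valid_heights)
```

Records that fail to parse are dropped here; in a real project you may prefer to log them first so you can find out where they came from.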

How to detect an outlier?

You can spot outliers with different types of visuals and simple statistics:

  1. Boxplot

Here’s what the boxplot shows:

  • The median is the value at the centre of the ranked series. Note that the median is less influenced by outliers than the arithmetic mean, which is why the boxplot shows the median at its centre.
  • The upper quartile (Q3, the 75th percentile) is the value above which only 25% of the values lie. The lower quartile (Q1, the 25th percentile) is the value below which only 25% of the values lie.
  • The interquartile range (IQR) is the difference between the upper and lower quartiles (Q3 minus Q1). 50% of the values lie within this range. For example, with survey scores, a narrow range means the members of the group largely agree in their assessments; a wide range means there is no homogeneous opinion.

Based on the above, you can typically flag as outliers the values that fall below “25th percentile minus 1.5 x IQR” or above “75th percentile plus 1.5 x IQR”, as shown in the picture above.
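The 1.5 x IQR rule can be sketched with Python's standard library. The function name `iqr_outliers` and the sample ages are made up for illustration, and the quartiles use the "inclusive" method of `statistics.quantiles`; other quartile conventions can give slightly different fences.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] and the fences."""
    # quantiles(n=4) returns the three quartile cut points: Q1, median, Q3.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return outliers, (lower, upper)

# Hypothetical student ages with one impossible record.
ages = [21, 22, 22, 23, 23, 24, 24, 25, 500]
outliers, (lower, upper) = iqr_outliers(ages)
print(outliers)      # [500]
print(lower, upper)  # 19.0 27.0
```

The multiplier 1.5 is the conventional default; a larger value such as 3 flags only the most extreme points.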

2. Histogram

Histograms aggregate numerical data into evenly spaced groups called bins and display the frequency of occurrence of values in each of the bins. A bar chart is created using a number field or a percentage/ratio field. Histograms help answer questions such as: What is the distribution of values and how often do they appear in the data set?

By increasing or decreasing the number of bins, you can influence how your data is analysed. Although the data itself does not change, its appearance may. Choosing the right number of bins is important for interpreting patterns in the data correctly. Too few bins can hide some patterns, and too many can exaggerate the importance of small, acceptable variations. A suitable number of bins reveals patterns that stay invisible when, for example, the bins are too large.
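To see how the bin count changes what a histogram reveals, here is a minimal sketch that counts values in equal-width bins without any plotting library. The `histogram` helper and the height data are hypothetical.

```python
def histogram(values, bins):
    """Count how many values fall into each of `bins` equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: everything lands in the first bin.
        return [len(values)] + [0] * (bins - 1)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # Clamp so that the maximum value falls into the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts

# Hypothetical heights in cm with one suspicious record.
heights_cm = [158, 160, 162, 165, 165, 167, 168, 170, 171, 173, 175, 178, 250]

for bins in (2, 5, 10):
    print(bins, histogram(heights_cm, bins))
```

With 2 bins the shape of the bulk of the data is lost, while with 10 bins the suspicious value sits alone in the last bin, separated from the rest by a run of empty bins.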

3. Scatterplot

A scatterplot shows how observations are distributed across two variables. The values of the independent variable are plotted along the X axis, and the values of the dependent variable along the Y axis.

The patterns displayed on scatterplots allow you to see different types of correlation. Points that lie far from the general cluster or correlation line are outliers.
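What the eye does on a scatterplot can be roughly approximated in code: fit a straight line and flag the points that sit far from it. The sketch below is one possible heuristic, not a standard rule. The function names, the data, and the cutoff of 3 times the median absolute residual are all assumptions chosen for illustration.

```python
import statistics

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def far_from_line(xs, ys, factor=3.0):
    """Flag points whose distance from the fitted line is unusually large."""
    slope, intercept = fit_line(xs, ys)
    residuals = [abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)]
    # Assumed cutoff: `factor` times the median absolute residual.
    cutoff = factor * statistics.median(residuals)
    return [(x, y) for x, y, r in zip(xs, ys, residuals) if r > cutoff]

# Hypothetical data: y is roughly 2 * x, except for one point.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2.1, 3.9, 6.2, 8.0, 60.0, 12.1, 13.8, 16.2, 18.1, 19.9]

flagged = far_from_line(xs, ys)
print(flagged)
```

Note that the outlier itself pulls the fitted line towards it, so this simple approach works best when anomalous points are rare.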

4. Z-score

The z-score, also called the standard score, describes where a value lies in the distribution relative to the mean. It indicates how many standard deviations below or above the population mean a given value is.

The value of z can be read off the bell curve, where z-scores typically range from -3 standard deviations (the far left of the normal distribution curve) to +3 standard deviations (the far right). In most cases, values with a z-score greater than +3 or less than -3 are identified as outliers.
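A minimal sketch of the z-score check, using only the standard library. The function names and the sample readings are hypothetical, and the population standard deviation (`statistics.pstdev`) is an assumption; using the sample standard deviation would give slightly smaller scores.

```python
import statistics

def z_scores(values):
    """z = (value - mean) / standard deviation, for every value."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return [0.0] * len(values)  # all values identical: no spread
    return [(v - mean) / std for v in values]

def z_score_outliers(values, threshold=3.0):
    """Return values whose z-score is above +threshold or below -threshold."""
    return [v for v, z in zip(values, z_scores(values)) if abs(z) > threshold]

# Hypothetical sensor readings: 20 ordinary values and one extreme one.
readings = [48, 49, 50, 51, 52] * 4 + [120]
print(z_score_outliers(readings))  # [120]
```

Keep in mind that the mean and the standard deviation are themselves affected by outliers. In very small samples a single extreme value inflates the standard deviation so much that no z-score can reach 3, so this check is more useful on larger datasets.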

How do I deal with outliers?

Once you have detected the outliers in your dataset, you can take one of the following 3 actions:

  1. Remove the outliers. Typically it is fine to drop an outlier if you have a really good sense of the range the data should fall in. With people's ages, for example, you can safely drop values outside the plausible range.
  2. Change the value of the outlier (e.g. replace it with the mean value or with a cap value such as the 90th percentile).
  3. Keep it. You shouldn't aim to drop outliers in every case. If, for example, 20%-40% of your data are flagged as outliers, they should not necessarily be treated as outliers; instead, you should look further into them.
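The first two actions can be sketched as follows. The helper names, the 0 to 120 age range, and the sample data are assumptions for illustration; the 90th percentile is computed with the "inclusive" method of `statistics.quantiles`.

```python
import statistics

def drop_out_of_range(values, low, high):
    """Action 1: remove values outside a range you know to be valid."""
    return [v for v in values if low <= v <= high]

def cap_at_90th_percentile(values):
    """Action 2: replace values above the 90th percentile with that cap."""
    # quantiles(n=10) returns 9 cut points; the last one is the 90th percentile.
    p90 = statistics.quantiles(values, n=10, method="inclusive")[-1]
    return [min(v, p90) for v in values]

# Hypothetical ages with one impossible record.
ages = [18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 500]

cleaned = drop_out_of_range(ages, 0, 120)  # assumed plausible age range
capped = cap_at_90th_percentile(ages)
print(cleaned)
print(capped)
```

Dropping shrinks the dataset, while capping keeps every row but changes the extreme value. Which one fits depends on whether the record is an error or a genuine extreme.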
