Dear Data Scientists, Stop removing your outliers!
Deena Gergis
AI & Data Science Expert @ McKinsey | Improving lives, one AI product at a time
When I stumbled upon Abhishek's post on LinkedIn comparing academic boxplots with real-life boxplots, I couldn't help but raise the notorious question:
“Is that a feature or a bug?”
Yes, academics love normal distributions that produce those neat box plots. And this is why we study them so diligently, only to step into the real world afterwards and discover that real life is not mathematically neat, and that a significant portion of your data points are, indeed, heavy-tailed.
Reason? There is a very good reason, explained by the economist Vilfredo Pareto and commonly known as the 80/20 rule or the Pareto principle. This principle states that "for many outcomes, roughly 80% of consequences come from 20% of causes (the 'vital few')".
Example? Lots of commercial organisations find that 80% of their sales are produced by only 20% of their customers. And this is why most companies have separate "key account management" divisions, focused solely on those top few lucrative customers.
But if we deal with this 80/20 phenomenon using the standard data-cleaning approaches, those top customers, who are essential for your organisation, will be marked as outliers and simply discarded. That means your data outliers might be the reason your business exists in the first place.
And this leaves you, my very dear Gaussian-acquainted reader, with a very interesting and borderline confusing question:
Are your "outliers" actually a bug, or a feature?*
***
So what should you do?
Practically speaking, how should you deal with those extreme values to best model your data? Here are three approaches you might want to consider:
I. Filter
There is a reason why this is considered standard practice: extreme values will, in most cases, decrease the robustness of your model. So removing them may well improve the model's performance in a purely technical sense. This is why many data scientists use this method by default.
However, if your extreme values are in fact valid observations and not just noise, you will be discarding a subset that is very important for the business. The even trickier problem is that this will not be reflected in your performance metrics during development, because you have thrown the data away in the first place. The issues will only appear in production (alas!).
So, unless you have a solid reason to believe that those extreme values are indeed anomalies (e.g. data from a broken sensor), it is in your best interest to consider the following two methods.
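To see why filtering by default is risky, here is a minimal sketch of the standard 1.5 × IQR boxplot rule applied to simulated heavy-tailed "customer revenue" data. The lognormal distribution and all numbers are illustrative assumptions, not figures from the article; the point is only that the flagged "outliers" can hold a disproportionate share of total value.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated heavy-tailed customer revenue (lognormal is a common
# Pareto-like shape; parameters are arbitrary for illustration)
revenue = rng.lognormal(mean=3.0, sigma=1.0, size=10_000)

# The standard boxplot rule: flag anything above Q3 + 1.5 * IQR
q1, q3 = np.percentile(revenue, [25, 75])
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = revenue[revenue > upper_fence]

# A small fraction of customers is flagged, but they carry a
# disproportionate share of total revenue
share_of_customers = outliers.size / revenue.size
share_of_revenue = outliers.sum() / revenue.sum()
print(f"{share_of_customers:.1%} of customers flagged as outliers,")
print(f"holding {share_of_revenue:.1%} of total revenue")
```

Dropping the flagged rows here would delete a single-digit percentage of customers but a far larger slice of the revenue the model is supposed to explain.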
**
II. Plateau
Surprisingly, taking the log of heavy-tailed values is one of the simplest and most effective solutions. However, there are two points you need to be very careful with here. The first is the prior normalisation, which depends on the scale of your data. The second is remembering to invert the predicted values back to the original scale using the exponential function.
Bonus tip: you can also use sklearn's target transformers to handle the forward and inverse transforms in a seamless pipeline.
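A minimal sketch of that bonus tip, using scikit-learn's `TransformedTargetRegressor`: the model is fitted on the log of the target and predictions are automatically mapped back to the original scale, so you cannot forget the inverse transform. The synthetic data and the choice of `LinearRegression` are illustrative assumptions.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Toy heavy-tailed target: y grows multiplicatively with the feature
X = rng.uniform(0, 5, size=(500, 1))
y = np.exp(1.0 + 0.8 * X[:, 0] + rng.normal(0, 0.3, 500))

# Fit on log1p(y), predict on the original scale automatically.
# log1p / expm1 are a safe pair when y may contain zeros.
model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)
model.fit(X, y)

preds = model.predict(X[:5])  # already back-transformed to the original scale
```

The design choice here is that the transform lives inside the estimator, so cross-validation, grid search, and production inference all apply it consistently.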
**
III. Split
This method works very well in settings that follow the 80/20 rule, and it maps naturally onto how most organisations already separate their processes by impact, e.g. key account management units for top customers, separate sales channels for top products, and so on.
Technically, however, it can be complicated. To use multiple models, you also need an initial classification model that decides which model to apply to each data point. Setting up and maintaining such a pipeline can be complex, and you need enough data for the "extreme" model, which is often not the case.
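The split setup described above can be sketched as a small wrapper: a classifier "gate" decides whether a row belongs to the bulk or the extreme segment, and a separate regressor is fitted per segment. Everything here is a hypothetical illustration; the threshold, the two-regime synthetic data, and the choice of `GradientBoostingClassifier` and `LinearRegression` are assumptions, not a prescription.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LinearRegression

class SplitRegressor:
    """Hypothetical two-segment model: a classifier routes each row
    to a 'bulk' or an 'extreme' regressor (the Split approach)."""

    def __init__(self, threshold):
        self.threshold = threshold            # assumed business cut-off on y
        self.gate = GradientBoostingClassifier()
        self.bulk = LinearRegression()
        self.extreme = LinearRegression()

    def fit(self, X, y):
        is_extreme = y > self.threshold
        self.gate.fit(X, is_extreme)          # learn which segment a row belongs to
        self.bulk.fit(X[~is_extreme], y[~is_extreme])
        self.extreme.fit(X[is_extreme], y[is_extreme])
        return self

    def predict(self, X):
        route = self.gate.predict(X).astype(bool)
        preds = np.empty(len(X))
        if (~route).any():
            preds[~route] = self.bulk.predict(X[~route])
        if route.any():
            preds[route] = self.extreme.predict(X[route])
        return preds

# Demo on synthetic data with two regimes (bulk: y = 2x; extreme: y = 50 + 5x)
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(400, 1))
y = np.where(X[:, 0] > 8, 50 + 5 * X[:, 0], 2 * X[:, 0])
preds = SplitRegressor(threshold=25).fit(X, y).predict(X)
```

Note how the two caveats from the text show up directly in the code: the gate is an extra model to maintain, and `self.extreme` only ever sees the (few) rows above the threshold.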
****
In a nutshell: if you simply discard your extreme values, think again. You might be better off using log values or a hierarchical model.
****
* Comic credits: https://www.facebook.com/sandserifcomics/