Dear Data Scientists, Stop removing your outliers!
Deena Gergis
AI & Data Science Expert @ McKinsey | Improving lives, one AI product at a time
When I stumbled upon Abhishek's post on LinkedIn comparing academic boxplots with real-life boxplots, I couldn't help but raise the notorious question:
“Is that a feature or a bug?”
Yes, academics love normal distributions that produce those neat box plots. And this is why we study them so diligently, only to step into the real world afterwards and discover that real life is not mathematically neat, and that a significant portion of your data points are, indeed, heavy-tailed.
Reason? There is a very good reason, explained by the economist Vilfredo Pareto and commonly known as the 80/20 rule or the Pareto principle. This principle states that "for many outcomes, roughly 80% of consequences come from 20% of causes (the 'vital few')".
Example? Lots of commercial organisations find that 80% of their sales are produced by only 20% of their customers. And this is why most companies have separate "key account management" divisions, focused solely on those top few lucrative customers.
But if we deal with this 80/20 phenomenon using the standard data-cleaning approaches, those top customers, who are essential for your organisation, will be marked as outliers and simply discarded. That means your data outliers might be the reason your business exists in the first place.
And this leaves you, my very dear Gaussian-acquainted reader, with a very interesting and borderline confusing question:
Are your "outliers" actually a bug, or a feature?*
***
So what should you do?
Practically speaking, how should you deal with those extreme values to best model your data? Here are three approaches you might want to consider:
I. Filter
There is a reason why this is considered standard practice: extreme values will, in most cases, decrease the robustness of your model. So removing them may well improve the model's performance in a purely technical sense. This is why many data scientists use this method by default.
However, if your extreme values are in fact valid observations and not just noise, you will be discarding a subset that is very important for the business. The even trickier problem is that this will not be reflected in your performance metrics during development, because you have thrown the data away in the first place. The issues will only appear in production (alas!).
So, unless you have a solid reason to believe that those extreme values are indeed anomalies (e.g. data from a broken sensor), it is in your best interest to consider the following two methods.
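To see why filtering by default is risky, here is a minimal sketch of the standard 1.5 × IQR boxplot rule applied to simulated heavy-tailed "customer revenue" data. The lognormal distribution and all numbers are illustrative assumptions, not figures from the article; the point is only that the flagged "outliers" can hold a disproportionate share of total value.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated heavy-tailed customer revenue (lognormal is a common
# Pareto-like shape; parameters are arbitrary for illustration)
revenue = rng.lognormal(mean=3.0, sigma=1.0, size=10_000)

# The standard boxplot rule: flag anything above Q3 + 1.5 * IQR
q1, q3 = np.percentile(revenue, [25, 75])
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = revenue[revenue > upper_fence]

# A small fraction of customers is flagged, but they carry a
# disproportionate share of total revenue
share_of_customers = outliers.size / revenue.size
share_of_revenue = outliers.sum() / revenue.sum()
print(f"{share_of_customers:.1%} of customers flagged as outliers,")
print(f"holding {share_of_revenue:.1%} of total revenue")
```

Dropping the flagged rows here would delete a single-digit percentage of customers but a far larger slice of the revenue the model is supposed to explain.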
**
II. Plateau
Surprisingly, taking the log of heavy-tailed values is one of the simplest and most effective solutions. However, there are two points you need to be very careful with here. The first is the prior normalisation, which depends on the scale of your data. The second is remembering to invert the predicted values back to the original scale using the exponential function.
Bonus tip: you can also use sklearn's target transformers to handle the forward and inverse transforms in a seamless pipeline.
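A minimal sketch of that bonus tip, using scikit-learn's `TransformedTargetRegressor`: the model is fitted on the log of the target and predictions are automatically mapped back to the original scale, so you cannot forget the inverse transform. The synthetic data and the choice of `LinearRegression` are illustrative assumptions.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Toy heavy-tailed target: y grows multiplicatively with the feature
X = rng.uniform(0, 5, size=(500, 1))
y = np.exp(1.0 + 0.8 * X[:, 0] + rng.normal(0, 0.3, 500))

# Fit on log1p(y), predict on the original scale automatically.
# log1p / expm1 are a safe pair when y may contain zeros.
model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)
model.fit(X, y)

preds = model.predict(X[:5])  # already back-transformed to the original scale
```

The design choice here is that the transform lives inside the estimator, so cross-validation, grid search, and production inference all apply it consistently.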
**
III. Split
This method works very well in settings that follow the 80/20 rule, and it maps naturally onto how most organisations already separate their processes by impact, e.g. key account management units for top customers, separate sales channels for top products, and so on.
Technically, however, it can be complicated. To use multiple models, you also need an initial classification model that decides which model to apply to each data point. Setting up and maintaining such a pipeline can be complex, and you need enough data for the "extreme" model, which is often not the case.
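The split setup described above can be sketched as a small wrapper: a classifier "gate" decides whether a row belongs to the bulk or the extreme segment, and a separate regressor is fitted per segment. Everything here is a hypothetical illustration; the threshold, the two-regime synthetic data, and the choice of `GradientBoostingClassifier` and `LinearRegression` are assumptions, not a prescription.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LinearRegression

class SplitRegressor:
    """Hypothetical two-segment model: a classifier routes each row
    to a 'bulk' or an 'extreme' regressor (the Split approach)."""

    def __init__(self, threshold):
        self.threshold = threshold            # assumed business cut-off on y
        self.gate = GradientBoostingClassifier()
        self.bulk = LinearRegression()
        self.extreme = LinearRegression()

    def fit(self, X, y):
        is_extreme = y > self.threshold
        self.gate.fit(X, is_extreme)          # learn which segment a row belongs to
        self.bulk.fit(X[~is_extreme], y[~is_extreme])
        self.extreme.fit(X[is_extreme], y[is_extreme])
        return self

    def predict(self, X):
        route = self.gate.predict(X).astype(bool)
        preds = np.empty(len(X))
        if (~route).any():
            preds[~route] = self.bulk.predict(X[~route])
        if route.any():
            preds[route] = self.extreme.predict(X[route])
        return preds

# Demo on synthetic data with two regimes (bulk: y = 2x; extreme: y = 50 + 5x)
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(400, 1))
y = np.where(X[:, 0] > 8, 50 + 5 * X[:, 0], 2 * X[:, 0])
preds = SplitRegressor(threshold=25).fit(X, y).predict(X)
```

Note how the two caveats from the text show up directly in the code: the gate is an extra model to maintain, and `self.extreme` only ever sees the (few) rows above the threshold.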
****
In a nutshell: if you simply discard your extreme values, think again. You might be better off using log values or a hierarchical model.
****
* Comic credits: https://www.facebook.com/sandserifcomics/