Dear Data Scientists, Stop removing your outliers!

Stumbling upon Abhishek's post on LinkedIn comparing academic boxplots with real-life boxplots, I couldn't help but raise the notorious question:

“Is that a feature or a bug?”

Yes, academics love normal distributions that produce those neat box plots, and this is why we study them so vigorously. Only to step into the real world afterwards and discover that real life is not mathematically neat, and that a significant portion of real-world data is, indeed, heavy-tailed.

The reason? There is a very good one, explained by the economist Vilfredo Pareto and commonly known as the 80/20 rule, or the Pareto principle. It states that “for many outcomes, roughly 80% of consequences come from 20% of causes (the 'vital few')”.

An example? Many commercial organisations find that 80% of their sales come from only 20% of their customers. This is why most companies have separate “key account management” divisions focused exclusively on those few lucrative customers.

But if we deal with this 80/20 phenomenon using the standard data-cleaning approaches, those top customers who are essential for your organisation will be marked as outliers and simply discarded. In other words: your data outliers might be the reason your business exists in the first place.

And this leaves you, my very dear Gaussian-acquainted reader, with a very interesting and borderline confusing question:

Are your “outliers” actually a bug, or a feature?*

***

So what should you do?

Practically speaking, how should you deal with those extreme values to best model your data? Here are three approaches you might want to consider:

I. Filter

  • Method: Treat your extreme values as outliers and filter them out.
  • Business translation: The most lucrative and important customers aren't that important and can be discarded.

There is a reason why this is considered standard practice: extreme values will, in most cases, decrease the robustness of your model. So removing them might improve the model's performance in a purely technical sense. This is why many data scientists use this method by default.

However, if your extreme values are valid observations and not just noise, you will be discarding a subset that is very important for the business. The even trickier problem is that this will not be reflected in your performance metrics during development, since you have thrown the data away in the first place. The issues will only appear in production. (Alas!)

So, unless you have a solid reason to believe that those extreme values are indeed anomalies (e.g. data from a broken sensor), it will be in your best interest to consider the following two methods.
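To see how aggressively the standard rule behaves on 80/20-style data, here is a minimal sketch on synthetic, Pareto-distributed "revenues" (the numbers and the tail parameter are illustrative, not from the article), using the usual 1.5 × IQR boxplot fence:

```python
import numpy as np

# Hypothetical customer revenues drawn from a heavy-tailed Pareto
# distribution, as the 80/20 principle suggests.
rng = np.random.default_rng(42)
revenues = (rng.pareto(a=1.16, size=10_000) + 1) * 1_000

# Standard boxplot rule: flag anything above Q3 + 1.5 * IQR.
q1, q3 = np.percentile(revenues, [25, 75])
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = revenues > upper_fence

# Share of total revenue carried by the points the rule would discard.
discarded_share = revenues[outliers].sum() / revenues.sum()
print(f"{outliers.mean():.1%} of customers flagged as outliers")
print(f"{discarded_share:.1%} of total revenue would be thrown away")
```

On data like this, a small minority of flagged "outliers" carries the majority of total revenue, which is exactly the business-critical subset the filter throws away.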

**

II. Plateau

  • Method: “Decrease the extremeness” of those values by modelling their log values instead of their absolute values.
  • Business translation: The most lucrative and important customers behave similarly to the rest of the customers, just in a systematically more extreme manner.

Surprisingly, taking the log of heavy-tailed values is one of the simplest and most effective solutions. However, there are two points you need to be very careful with here. The first is the prior normalisation, which depends on the scale of your data. The second is inverting the model's output back using the exponential function.

Bonus tip: You can also use sklearn's target transformers to achieve this in a seamless pipeline.
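A minimal sketch of that bonus tip with scikit-learn's TransformedTargetRegressor, which applies the forward transform to the target before fitting and the inverse to the predictions automatically (the synthetic data here is illustrative):

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Hypothetical data: one feature driving a heavy-tailed (log-normal) target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1))
y = np.exp(1.0 + 2.0 * X[:, 0] + rng.normal(scale=0.1, size=500))

# log1p/expm1 are exact inverses and stay valid at zero,
# so the inversion back to the original scale is handled for you.
model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)
model.fit(X, y)
preds = model.predict(X)  # already back on the original scale
```

The regressor only ever sees the tamed, log-scale target, while the pipeline's inputs and outputs stay on the original business scale.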

**

III. Split

  • Method: Build two separate models, one for the extreme values and another for the regular ones.
  • Business translation: We have two separate types of customers, and they are independent, each behaving differently.

This method works very well in settings that follow the 80/20 principle, and it maps naturally onto organisations that already separate their processes by impact, e.g. key account management units for top customers, or separate sales channels for top products.

Technically, however, it can get complicated. To use multiple models, you also need an initial classification model that decides which model to apply afterwards, so setting up and maintaining such a pipeline takes effort. You also need to make sure you have enough data for the “extreme” model, which is often not the case.
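The gate-then-route idea can be sketched as follows. This is a minimal, hypothetical implementation: the class name, the fit-time label threshold, and the choice of a random forest as the gate are all assumptions for illustration, not a prescription from the article:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression


class SplitRegressor:
    """Route each sample to a 'regular' or 'extreme' model via a gate classifier."""

    def __init__(self, threshold):
        self.threshold = threshold  # label boundary, used only at fit time
        self.gate = RandomForestClassifier(n_estimators=50, random_state=0)
        self.regular = LinearRegression()
        self.extreme = LinearRegression()

    def fit(self, X, y):
        is_extreme = y > self.threshold
        self.gate.fit(X, is_extreme)  # learns to predict the segment from X alone
        self.regular.fit(X[~is_extreme], y[~is_extreme])
        self.extreme.fit(X[is_extreme], y[is_extreme])
        return self

    def predict(self, X):
        route = self.gate.predict(X).astype(bool)
        preds = np.empty(len(X))
        if (~route).any():
            preds[~route] = self.regular.predict(X[~route])
        if route.any():
            preds[route] = self.extreme.predict(X[route])
        return preds


# Hypothetical demo: two regimes, regular (y = x) and extreme (y = 100 + 10x).
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(400, 1))
y = np.where(X[:, 0] > 7, 100.0 + 10.0 * X[:, 0], X[:, 0])
preds = SplitRegressor(threshold=50.0).fit(X, y).predict(X)
```

Note that the gate misclassifying a sample near the boundary sends it to the wrong regression model, which is exactly the maintenance and data-volume risk described above.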

****

In a nutshell: if you simply discard your extreme values, think again. You might be better off using log values or a hierarchical model.

****

* Comic credits: https://www.facebook.com/sandserifcomics/
