Dear Data Scientists, Stop removing your outliers!

Stumbling upon Abhishek's post on LinkedIn comparing academic boxplots with real-life boxplots, I couldn't help but raise the notorious question:

“Is that a feature or a bug?”

Yes, academics love normal distributions that produce those neat box plots, and this is why we study them so vigorously. Only to step into the real world afterwards and discover that real life is not mathematically neat, and that a significant portion of real-world data is, indeed, heavy-tailed.

The reason? There is a very good one, explained by the economist Vilfredo Pareto and commonly known as the 80/20 rule, or the Pareto principle. It states that “for many outcomes, roughly 80% of consequences come from 20% of causes (the 'vital few')”.

An example? Many commercial organisations find that 80% of their sales come from only 20% of their customers. This is why most companies have separate “key account management” divisions focused exclusively on those few lucrative customers.

But if we deal with this 80/20 phenomenon using the standard data-cleaning approaches, those top customers who are essential for your organisation will be marked as outliers and simply discarded. In other words: your data outliers might be the reason your business exists in the first place.

And this leaves you, my very dear Gaussian-acquainted reader, with a very interesting and borderline confusing question:

Are your “outliers” actually a bug, or a feature?*

***

So what should you do?

Practically speaking, how should you deal with those extreme values to best model your data? Here are three approaches you might want to consider:

I. Filter

  • Method: Treat your extreme values as outliers and filter them out.
  • Business translation: The most lucrative and important customers aren't that important and can be discarded.

There is a reason why this is considered standard practice: extreme values will, in most cases, decrease the robustness of your model. So removing them might improve the model's performance in a purely technical sense. This is why many data scientists use this method by default.

However, if your extreme values are valid observations and not just noise, you will be discarding a subset that is very important for the business. The even trickier problem is that this will not be reflected in your performance metrics during development, since you have thrown the data away in the first place. The issues will only appear in production. (Alas!)

So, unless you have a solid reason to believe that those extreme values are indeed anomalies (e.g. data from a broken sensor), it will be in your best interest to consider the following two methods.
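To see how aggressively the standard rule behaves on 80/20-style data, here is a minimal sketch on synthetic, Pareto-distributed "revenues" (the numbers and the tail parameter are illustrative, not from the article), using the usual 1.5 × IQR boxplot fence:

```python
import numpy as np

# Hypothetical customer revenues drawn from a heavy-tailed Pareto
# distribution, as the 80/20 principle suggests.
rng = np.random.default_rng(42)
revenues = (rng.pareto(a=1.16, size=10_000) + 1) * 1_000

# Standard boxplot rule: flag anything above Q3 + 1.5 * IQR.
q1, q3 = np.percentile(revenues, [25, 75])
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = revenues > upper_fence

# Share of total revenue carried by the points the rule would discard.
discarded_share = revenues[outliers].sum() / revenues.sum()
print(f"{outliers.mean():.1%} of customers flagged as outliers")
print(f"{discarded_share:.1%} of total revenue would be thrown away")
```

On data like this, a small minority of flagged "outliers" carries the majority of total revenue, which is exactly the business-critical subset the filter throws away.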

**

II. Plateau

  • Method: “Decrease the extremeness” of those values by modelling their log values instead of their absolute values.
  • Business translation: The most lucrative and important customers behave similarly to the rest of the customers, just in a systematically more extreme manner.

Surprisingly, taking the log of heavy-tailed values is one of the simplest and most effective solutions. However, there are two points you need to be very careful with here. The first is the prior normalisation, which depends on the scale of your data. The second is inverting the model's output back using the exponential function.

Bonus tip: You can also use sklearn's target transformers to achieve this in a seamless pipeline.
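A minimal sketch of that bonus tip with scikit-learn's TransformedTargetRegressor, which applies the forward transform to the target before fitting and the inverse to the predictions automatically (the synthetic data here is illustrative):

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Hypothetical data: one feature driving a heavy-tailed (log-normal) target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1))
y = np.exp(1.0 + 2.0 * X[:, 0] + rng.normal(scale=0.1, size=500))

# log1p/expm1 are exact inverses and stay valid at zero,
# so the inversion back to the original scale is handled for you.
model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)
model.fit(X, y)
preds = model.predict(X)  # already back on the original scale
```

The regressor only ever sees the tamed, log-scale target, while the pipeline's inputs and outputs stay on the original business scale.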

**

III. Split

  • Method: Build two separate models, one for the extreme values and another for the regular ones.
  • Business translation: We have two separate types of customers, and they are independent, each behaving differently.

This method works very well in settings that follow the 80/20 principle, and it maps naturally onto organisations that already separate their processes by impact, e.g. key account management units for top customers, or separate sales channels for top products.

Technically, however, it can get complicated. To use multiple models, you also need an initial classification model that decides which model to apply afterwards, so setting up and maintaining such a pipeline takes effort. You also need to make sure you have enough data for the “extreme” model, which is often not the case.
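The gate-then-route idea can be sketched as follows. This is a minimal, hypothetical implementation: the class name, the fit-time label threshold, and the choice of a random forest as the gate are all assumptions for illustration, not a prescription from the article:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression


class SplitRegressor:
    """Route each sample to a 'regular' or 'extreme' model via a gate classifier."""

    def __init__(self, threshold):
        self.threshold = threshold  # label boundary, used only at fit time
        self.gate = RandomForestClassifier(n_estimators=50, random_state=0)
        self.regular = LinearRegression()
        self.extreme = LinearRegression()

    def fit(self, X, y):
        is_extreme = y > self.threshold
        self.gate.fit(X, is_extreme)  # learns to predict the segment from X alone
        self.regular.fit(X[~is_extreme], y[~is_extreme])
        self.extreme.fit(X[is_extreme], y[is_extreme])
        return self

    def predict(self, X):
        route = self.gate.predict(X).astype(bool)
        preds = np.empty(len(X))
        if (~route).any():
            preds[~route] = self.regular.predict(X[~route])
        if route.any():
            preds[route] = self.extreme.predict(X[route])
        return preds


# Hypothetical demo: two regimes, regular (y = x) and extreme (y = 100 + 10x).
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(400, 1))
y = np.where(X[:, 0] > 7, 100.0 + 10.0 * X[:, 0], X[:, 0])
preds = SplitRegressor(threshold=50.0).fit(X, y).predict(X)
```

Note that the gate misclassifying a sample near the boundary sends it to the wrong regression model, which is exactly the maintenance and data-volume risk described above.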

****

In a nutshell: if you simply discard your extreme values, think again. You might be better off using log values or a hierarchical model.

****

* Comic credits: https://www.facebook.com/sandserifcomics/
