Your intuitive guide to interpret SHAP's beeswarm plot
Deena Gergis
AI & Data Science Expert @ McKinsey | Improving lives, one AI product at a time
The SHAP beeswarm plot is a powerful tool for interpreting machine learning models, but it can be a bit intimidating at first glance. The insights that you can derive from this plot are worth cracking its complexity, as it has become one of the standard steps after modeling.
So, why bother?
A SHAP beeswarm plot will help you understand which input features matter most to your model, and how their values push each prediction up or down.
So, dear reader, let's crack this plot's complexity together.
Cracking the SHAP beeswarm plot in 3 simple steps
To understand the SHAP beeswarm plot, follow these 3 simple steps:
1. Review your model's target variable
At the end of the day, this plot represents the effect of the input features on the final prediction of the target variable.
In our case, we trained a model on Stack Overflow's 2023 survey data, using the languages and technologies that respondents work with to predict whether the role is a Data Scientist or a Data Engineer.
As Data Scientist and Data Engineer are mutually exclusive in this dataset (i.e. you can be one or the other), I decided to use a single target variable for both. And for some reason - AKA unconscious bias - I chose to model the target variable as Data Scientist.
What does that mean? The target is 1 when the respondent is a Data Scientist and 0 when they are a Data Engineer. Consequently, a prediction pushed towards 1 means Data Scientist, and a prediction pushed towards 0 means Data Engineer. A minimal sketch of this setup follows.
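To make this concrete, here is a minimal sketch of how such a setup could look. The file name, the technology flag columns, and the choice of model are illustrative assumptions; the exact preprocessing lives in the Kaggle notebook linked below. (`DevType` is the survey's role column.)

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical pre-processed survey data: one row per respondent,
# one binary column per technology, plus the respondent's role.
df = pd.read_csv("survey_2023.csv")  # assumed file name
df = df[df["DevType"].isin(["Data Scientist", "Data Engineer"])]

# Target: 1 = Data Scientist, 0 = Data Engineer
y = (df["DevType"] == "Data Scientist").astype(int)
X = df.drop(columns=["DevType"])  # binary "uses this technology" flags

model = RandomForestClassifier(random_state=42).fit(X, y)
```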
2. Understand the plot's main axes
As with most plots of this world, the SHAP beeswarm plot also has an X and a Y axis.
I. X-axis: SHAP values
Let's start with how those values are calculated: for every single prediction, each input feature is assigned a SHAP value that quantifies its contribution to that prediction.
Intuitively, a SHAP value represents how a given input feature impacted the final prediction. It has two components: the magnitude (how strong the impact is) and the direction (whether it pushes the prediction up or down).
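As a hedged sketch, continuing the hypothetical `model` and `X` from above, this is how SHAP values are typically computed with the `shap` library:

```python
import shap

# Build an explainer; shap selects a suitable algorithm for the model
# (a tree-based explainer for the random forest above).
explainer = shap.Explainer(model, X)
shap_values = explainer(X)

# For a binary classifier the explanation may carry one slice per class;
# if so, keep the slice for the positive class (Data Scientist = 1).
if shap_values.values.ndim == 3:
    shap_values = shap_values[:, :, 1]

# One signed contribution per feature per prediction:
# the sign is the direction (towards Data Scientist or Data Engineer),
# the absolute value is the magnitude of the impact.
print(shap_values.values.shape)  # (n_samples, n_features)
```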
II. Y-axis: Input features
That's a simple one. Your input features are now listed on the y-axis, sorted by their importance (by default, the mean absolute SHAP value across all samples).
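Under the same assumptions as above, drawing the plot and reproducing its feature ordering looks like this:

```python
import numpy as np

# Draw the beeswarm plot itself.
shap.plots.beeswarm(shap_values)

# The y-axis ordering corresponds to each feature's mean absolute
# SHAP value, computed across all samples:
importance = np.abs(shap_values.values).mean(axis=0)
for name, score in sorted(zip(X.columns, importance),
                          key=lambda pair: -pair[1])[:5]:
    print(f"{name}: {score:.3f}")
```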
3. Crack a single feature's beeswarm sub-plot
For each of the input features, you will find a sub-plot, called a beeswarm, that is color-coded by the input feature's value. To understand this better, let's split it into two parts: the color tells you the feature's value for each sample (red = high, blue = low; for our binary flags, red = technology used), and the horizontal position tells you that sample's SHAP value.
Let's put the pieces together to understand what's happening with our top feature, Scikit-learn:
You see that the red points (i.e. when Sklearn is used) have positive SHAP values (i.e. the prediction tends towards Data Scientist).
Another example? Let's inspect Apache Kafka:
You see that the red points (i.e. when Kafka is used) have negative SHAP values (i.e. the prediction tends towards Data Engineer).
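If you want to verify this reading numerically, here is a hedged sketch, assuming the feature columns are named "Scikit-learn" and "Apache Kafka" as in the plot labels:

```python
# Compare the average SHAP value when a technology is used vs. not used.
# Column names are assumed to match the plot labels.
for feature in ["Scikit-learn", "Apache Kafka"]:
    idx = list(X.columns).index(feature)
    used = (X[feature] == 1).values
    print(feature,
          "| used:", shap_values.values[used, idx].mean().round(3),
          "| not used:", shap_values.values[~used, idx].mean().round(3))
```

With the reading above, the "used" average should come out positive for Scikit-learn and negative for Apache Kafka.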
Yes, my dear reader, it is as simple as that :-)
Want to learn more?
Play around: A stand-alone notebook that I have developed for this illustration is now publicly available on Kaggle: https://www.kaggle.com/code/deenagergis/20231022-shap-beeswarm-illustration. Clone and play around!
Read more: For a more comprehensive explanation of SHAP values, refer to Christoph Molnar's publicly available book: https://christophm.github.io/interpretable-ml-book/shap.html