SHAP is not all you need (or why you should always use permutation feature importance)

Repost from Christoph Molnar

A most annoying misconception in the world of machine learning interpretability

This post is

  • 30% rant
  • 50% comparison of SHAP and permutation feature importance
  • 20% good news (announcement of the release date of the conformal prediction book)

I just got a paper rejection.

The paper itself fills a theoretical and conceptual gap: While ML interpretation techniques such as partial dependence plots and permutation feature importance primarily describe the model, many (data) scientists use them to study the underlying data and phenomenon. Our paper discusses what’s needed to actually achieve the jump from model to data.

But that’s not what’s important today. Maybe I’ll explain the paper in another post.

Today I want to talk about a part of the criticism we received for the paper. Here are two quotes:

  • “SHAP graphs also contain all the information that PDPs contain”
  • “PFI is less informative than SHAP”

The reviewer draws the conclusion that ‘I do not see much value in the analysis of somewhat "inferior" feature-analysis methods [like PDP and PFI]’.

If you take this statement to its full conclusion, everyone would have to stop working on PDP and PFI. And while we are at it, why not drop ALE plots, ICE plots, and counterfactual explanations and write yet another SHAP extension paper?

The reviewer’s critique is wrong on at least two levels:

  • With this attitude, academia would be condemned to always study the hyped and shiny. It discourages thoroughness and diminishes the chance that “bets” on other lines of research are tested out.
  • In the case of SHAP, the reviewer is plain wrong. PDP and PFI are not a subset of SHAP. They are different techniques with different goals. And while, for example, PFI and SHAP can both produce importance plots, they are not the same.

If this were the first time someone said that SHAP is all you need, it wouldn’t be worth a post. But especially in “peer” review, the critique “You should be working on Shapley values / SHAP / LIME” was surprisingly common. And elsewhere, too, I have often seen people with the attitude of “SHAP is all you need”.

It’s wrong and I’ll show why.

Short primer on SHAP and PFI

If you are already familiar with SHAP and PFI, just skip this section.

Let’s start with permutation feature importance (PFI), because it’s one of the simplest interpretability methods to explain. It’s a model interpretation technique that assigns an importance value to each feature. The importance is computed as how much the model’s performance would drop if we shuffled that feature. The more the performance drops (aka the loss increases), the more important the feature was for correct predictions.

Compute loss. Permute feature. Compute loss again. Compute difference. Simple.
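
To make that recipe concrete, here is a minimal sketch of the loop in Python. It is not the implementation of any particular library; the model is assumed to be an already fitted scikit-learn-style regressor, and mean squared error stands in for whatever loss you actually use.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def permutation_importance_sketch(model, X, y, n_repeats=10, seed=0):
    """Importance of each feature = average increase in loss after shuffling it."""
    rng = np.random.default_rng(seed)
    baseline = mean_squared_error(y, model.predict(X))                    # compute loss
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])                  # permute feature j
            permuted_loss = mean_squared_error(y, model.predict(X_perm))  # compute loss again
            increases.append(permuted_loss - baseline)                    # compute difference
        importances[j] = np.mean(increases)                               # average over repeats
    return importances
```

In practice you would normally reach for scikit-learn’s sklearn.inspection.permutation_importance, which implements essentially this procedure.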

SHAP is a method to compute Shapley values for machine learning predictions. It’s a so-called attribution method that fairly attributes the predicted value among the features. The computation is more complicated than for PFI and also the interpretation is somewhere between difficult and unclear.

SHAP produces many types of interpretation outputs: SHAP can be used to explain individual predictions (aka attributions). But if you compute Shapley values for all the instances in your data, you can also aggregate them. Then you get good-looking plots that show you some notion of feature dependence, some notion of feature importance, and some notion of feature interactions. All these notions are of course tied to the not-so-easy interpretation of Shapley values. For an overview of the plots, you can check out my SHAP Plots For Tabular Data Cheat Sheet.
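
For tree-based models, a minimal sketch of that workflow with the shap package might look like the following; the xgboost model and the simulated data are placeholders for illustration, not the data from any example below.

```python
import numpy as np
import shap
import xgboost

# Placeholder data and model, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)
model = xgboost.XGBRegressor(n_estimators=100).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)    # one attribution per instance and feature

print(shap_values[0])                     # local: attributions for a single prediction
print(np.abs(shap_values).mean(axis=0))   # global: "SHAP importance" per feature

shap.summary_plot(shap_values, X)         # aggregated plot over all instances
```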

SHAP Is Not All You Need

Believing that SHAP is all you need is a typical pitfall: assuming that one method is the best for all interpretation contexts.

Let’s walk through my favorite example for showing how SHAP importance can be inadequate.

An xgboost regression model was trained on simulated data. But all 20 features were simulated to have no relation to the target. In other words, any relationship that the model picks up is the result of overfitting. And for this experiment, we overfit the model on purpose, because in this case PFI and SHAP diverge quite drastically.
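
I am not reproducing the paper’s exact code here, but a rough sketch of that setup could look like this; the sample sizes, the hyperparameters, and the choice to evaluate PFI on freshly drawn data are my assumptions.

```python
import numpy as np
import shap
import xgboost
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(42)

# 20 features and a target that are all independent noise: nothing real to learn.
X_train, y_train = rng.normal(size=(200, 20)), rng.normal(size=200)
X_test, y_test = rng.normal(size=(200, 20)), rng.normal(size=200)

# Deep trees and many boosting rounds to deliberately overfit the noise.
model = xgboost.XGBRegressor(n_estimators=500, max_depth=8, learning_rate=0.3)
model.fit(X_train, y_train)

# PFI on unseen data: permuting any feature barely changes the (already bad) loss.
pfi = permutation_importance(model, X_test, y_test,
                             scoring="neg_mean_squared_error",
                             n_repeats=10, random_state=0)
print(pfi.importances_mean)

# SHAP importance: the overfit model does attribute its predictions to some
# features, so their mean |SHAP value| is clearly non-zero.
shap_values = shap.TreeExplainer(model).shap_values(X_train)
print(np.abs(shap_values).mean(axis=0))
```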

The example is from our paper on ML interpretability pitfalls:

[Bar plot from the paper: PFI vs. SHAP importance for the 20 simulated features]

Clearly, SHAP and PFI deviate in the bar plot above. PFI more or less shows that all 20 features are unimportant. But SHAP importance clearly shows that some of the features are important.

Which interpretation is the correct one?

Given the simulation setup where none of the features has a relation to the target, one could say that PFI results are correct and SHAP is wrong. But this answer is too simplistic. The choice of interpretation method really depends on what you use the importance values for. What is the question that you want to answer?

Because Shapley values are “correct” in the sense that they do what they are supposed to do: Attribute the prediction to the features. And in this case, changing the “important” features truly changes the model prediction. So if your goal tends towards understanding how the model “behaves”, SHAP might be the right choice.

But if you want to find out how relevant a feature was for the CORRECT prediction, SHAP is not a good option. Here PFI is the better choice since it links importance to model performance.

In a way, it boils down to the question of audit versus insight: SHAP importance is more about auditing how the model behaves. As in the simulated example, it’s useful to see how model predictions are affected by features X4, X6, and so on. For that SHAP importance is meaningful. But if your goal was to study the underlying data, then it’s completely misleading. Here PFI gives you a better idea of what’s really going on. Also, both importance plots work on different scales: SHAP may be interpreted on the scale of the prediction because SHAP importance is the average absolute change in prediction that was attributed to a feature. PFI is the average increase in loss when the feature information is destroyed (aka feature is permuted). Therefore PFI importance is on the scale of the loss.
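
Written out (the notation below is mine, not the paper’s): let phi_j^(i) be the SHAP value of feature j for instance i, L the loss, f-hat the model, and x-tilde_j^(i) instance i with feature j permuted. Then the two importance scores are:

```latex
% SHAP importance: average absolute attribution, on the scale of the prediction.
I_j^{\text{SHAP}} = \frac{1}{n} \sum_{i=1}^{n} \bigl| \phi_j^{(i)} \bigr|

% PFI: average increase in loss when feature j is permuted, on the scale of the loss.
I_j^{\text{PFI}} = \frac{1}{n} \sum_{i=1}^{n} L\bigl(y^{(i)}, \hat{f}(\tilde{x}_j^{(i)})\bigr)
                 - \frac{1}{n} \sum_{i=1}^{n} L\bigl(y^{(i)}, \hat{f}(x^{(i)})\bigr)
```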

A fallacy of the reviewer was to equate these different ideas of feature importance.

Unfortunately, this points towards a much larger issue in research on interpretability. The field is more method-driven than question-driven. We first develop methods, and then ask “what question do the methods really answer?”.

For SHAP, it’s not so easy to answer how the Shapley values are supposed to be interpreted.

Shapley values are also expensive to compute, especially if your model is not tree-based.

So there are many reasons not to use SHAP but instead an “inferior” (as the reviewer put it) interpretation method.

For another critique of Shapley values I recommend this post by Giles Hooker.

Prof. Dr. Holger von Jouanne-Diedrich

Follow me, I am a professor (private account/all views my own)

1 yr
David Young

CEO @ Goal Aligned Media | Digital Marketing, Advanced Analytics

1 yr

Alastair, I liked your short primer on Shapley values and PFI. I've read about them, but have not used either one in practice. I had prior colleagues who liked Shapley values in the context of Marketing Mix Models, but at that time I chose not to use them in MMM. Both Shapley values and PFI, as you've described them, seem like good techniques to describe the importance of variables, as defined by changes in the prediction. On the other hand, a good MMM estimate of media impacts should consider many statistical concepts, not just the data fit. In your example you bring up "overfitting" and show how it either makes the Shapley values wrong, or at least limits how they should be interpreted. Overfitting is a broad term that could represent several different underlying problems, and to add other reasons why a good-fitting model could still be representing untrue relationships, we could throw in endogeneity, serial correlation, or differences in predictors' granularity, to name a few. IMO, Shapley values offer some insights, but rely on the fidelity of the prediction to assign the importance values. Therefore any underlying problems the model had will impact the Shapley values also.

Alastair Muir, PhD, BSc, BEd, MBB

Data Science Consultant | @alastairmuir.bsky.social | Risk Analysis and Optimization | Causal Inference

1 yr

https://mindfulmodeler.substack.com/p/shap-is-not-all-you-need. Sauce. Substack was down when I reposted

Alastair Muir, PhD, BSc, BEd, MBB

Data Science Consultant | @alastairmuir.bsky.social | Risk Analysis and Optimization | Causal Inference

1 yr

Christoph Molnar has published extensively on this and much more on ML and analytic techniques.

Andreas Wagenmann

Freelance Senior Data Scientist & ML / Data / Search Engineer | Speaker | Coach

1 yr

Why not reference the author's article directly rather than creating a complete repost as a separate article, which lists the one who copied it as the author without any contribution?
