SHAP is not all you need (or why you should always use permutation feature importance)
Alastair Muir, PhD, BSc, BEd, MBB
Data Science Consultant | @alastairmuir.bsky.social | Risk Analysis and Optimization | Causal Inference
Repost from Christoph Molnar
A most annoying misconception in the world of machine learning interpretability
This post is about a paper rejection I just got.
The paper itself fills a theoretical and conceptual gap: While ML interpretation techniques such as partial dependence plots and permutation feature importance primarily describe the model, many (data) scientists use them to study the underlying data and phenomenon. Our paper discusses what’s needed to actually achieve the jump from model to data.
But that’s not what’s important today. Maybe I’ll explain the paper in another post.
Today I want to talk about part of the criticism we received for the paper. The reviewer draws the conclusion that 'I do not see much value in the analysis of somewhat "inferior" feature-analysis methods [like PDP and PFI]'.
If you take this statement to its logical conclusion, everyone would have to stop working on PDP and PFI. And while we're at it, why not drop ALE plots, ICE plots, and counterfactual explanations and write yet another SHAP extension paper?
The reviewer's criticism is wrong on at least two levels.
If this were the first time someone said that SHAP is all you need, it wouldn't be worth a post. But especially in "peer" review, the critique "you should be working on Shapley values / SHAP / LIME" has been surprisingly common. And elsewhere, too, I often see the attitude that "SHAP is all you need".
It’s wrong and I’ll show why.
Short primer on SHAP and PFI
If you are already familiar with SHAP and PFI, just skip this section.
Let’s start with permutation feature importance, because this is one of the simplest interpretability methods to explain. It’s a model interpretation technique that assigns an importance value for each feature. The importance is computed as how much the model performance would drop if we shuffle a feature. The more the performance drops (aka loss increases), the more important the feature was for correct predictions.
Compute loss. Permute feature. Compute loss again. Compute difference. Simple.
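To make that concrete, here is a minimal sketch in Python. The model, data, and loss function are placeholders of my own, not code from the paper.

```python
import numpy as np

def permutation_importance(model, X, y, loss_fn, n_repeats=5, seed=0):
    """PFI: how much the loss increases when a feature's values are shuffled."""
    rng = np.random.default_rng(seed)
    baseline = loss_fn(y, model.predict(X))                # loss on intact data
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):                            # one feature at a time
        increases = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])   # destroy feature j's information
            increases.append(loss_fn(y, model.predict(X_perm)) - baseline)
        importances[j] = np.mean(increases)                # average loss increase = importance
    return importances
```

In practice you would usually reach for the ready-made sklearn.inspection.permutation_importance rather than rolling your own, but the logic is exactly the four steps above.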
SHAP is a method to compute Shapley values for machine learning predictions. It’s a so-called attribution method that fairly attributes the predicted value among the features. The computation is more complicated than for PFI and also the interpretation is somewhere between difficult and unclear.
SHAP produces many types of interpretation outputs: SHAP can be used to explain individual predictions (aka attributions). But if you compute Shapley values for all the instances in your data, you can also aggregate them. Then you get good-looking plots that show you some notion of feature dependence, some notion of feature importance, and some notion of feature interactions. All these notions are of course tied to the not-so-easy interpretation of Shapley values. For an overview of the plots, you can check out my SHAP Plots For Tabular Data Cheat Sheet.
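For orientation, this is roughly how the shap package is used on a tree-based model. The model and X here are placeholders, and the exact plotting calls can differ between shap versions.

```python
import shap

# model: a fitted tree-based regressor (e.g. xgboost), X: the feature matrix / DataFrame
explainer = shap.TreeExplainer(model)      # fast, exact Shapley values for tree ensembles
shap_values = explainer(X)                 # one attribution per feature per instance

shap.plots.waterfall(shap_values[0])       # explain a single prediction (attributions)
shap.plots.bar(shap_values)                # aggregate: mean |SHAP value| per feature
shap.plots.beeswarm(shap_values)           # aggregate: importance + feature dependence
```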
SHAP Is Not All You Need
Believing that SHAP is all you need is a typical pitfall: assuming that one method is best for all interpretation contexts.
Let’s walk through my favorite example for showing how SHAP importance can be inadequate.
An xgboost regression model was trained on simulated data, but all 20 features were simulated to have no relation to the target. In other words, any relationship the model picks up is the result of overfitting. For this experiment we overfit the model on purpose, because this is a case where PFI and SHAP diverge quite drastically.
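I don't have the paper's exact setup at hand, but a minimal sketch along these lines reproduces the effect. The sample size, hyperparameters, and seed are my own assumptions, chosen to force overfitting.

```python
import numpy as np
import xgboost as xgb
import shap
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(42)
n, p = 300, 20
X_train = rng.normal(size=(n, p))     # 20 features, independent of the target
y_train = rng.normal(size=n)
X_test = rng.normal(size=(n, p))
y_test = rng.normal(size=n)

# Deep, barely regularized trees so the model overfits the noise on purpose
model = xgb.XGBRegressor(n_estimators=500, max_depth=10, learning_rate=0.3)
model.fit(X_train, y_train)

# PFI on held-out data: tied to predictive performance, so roughly zero for every feature
pfi = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
print(pfi.importances_mean.round(3))

# SHAP importance: mean |attribution| to the prediction, clearly non-zero for several features
shap_values = shap.TreeExplainer(model)(X_test)    # newer shap API, returns an Explanation
print(np.abs(shap_values.values).mean(axis=0).round(3))
```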
The example is from our paper on ML interpretability pitfalls. The resulting bar plots show SHAP and PFI deviating clearly: PFI more or less shows that all 20 features are unimportant, but SHAP importance clearly shows that some of the features are important.
Which interpretation is the correct one?
Given the simulation setup where none of the features has a relation to the target, one could say that PFI results are correct and SHAP is wrong. But this answer is too simplistic. The choice of interpretation method really depends on what you use the importance values for. What is the question that you want to answer?
Because Shapley values are “correct” in the sense that they do what they are supposed to do: Attribute the prediction to the features. And in this case, changing the “important” features truly changes the model prediction. So if your goal tends towards understanding how the model “behaves”, SHAP might be the right choice.
But if you want to find out how relevant a feature was for the CORRECT prediction, SHAP is not a good option. Here PFI is the better choice since it links importance to model performance.
In a way, it boils down to the question of audit versus insight. SHAP importance is more about auditing how the model behaves: as in the simulated example, it's useful to see how model predictions are affected by features X4, X6, and so on, and for that SHAP importance is meaningful. But if your goal is to study the underlying data, it's completely misleading. Here PFI gives you a better idea of what's really going on.

The two importance measures also work on different scales. SHAP importance may be interpreted on the scale of the prediction, because it is the average absolute change in prediction attributed to a feature. PFI is the average increase in loss when the feature's information is destroyed (i.e. the feature is permuted), so PFI importance is on the scale of the loss.
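Written out (these are the standard definitions, not formulas copied from the paper), with $\phi_j^{(i)}$ the SHAP value of feature $j$ for instance $i$, $\hat f$ the model, $L$ the loss, and $x^{(i)}_{\pi(j)}$ instance $i$ with feature $j$ permuted:

$$
I_j^{\mathrm{SHAP}} \;=\; \frac{1}{n}\sum_{i=1}^{n}\bigl|\phi_j^{(i)}\bigr|,
\qquad
I_j^{\mathrm{PFI}} \;=\; \frac{1}{n}\sum_{i=1}^{n} L\bigl(y^{(i)},\, \hat f\bigl(x^{(i)}_{\pi(j)}\bigr)\bigr)
\;-\; \frac{1}{n}\sum_{i=1}^{n} L\bigl(y^{(i)},\, \hat f\bigl(x^{(i)}\bigr)\bigr)
$$

The first quantity lives on the scale of the prediction, the second on the scale of the loss, which is exactly why the two importance measures cannot be compared one-to-one.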
A fallacy of the reviewer was to equate these different ideas of feature importance.
Unfortunately, this points towards a much larger issue in research on interpretability. The field is more method-driven than question-driven. We first develop methods, and then ask “what question do the methods really answer?”.
For SHAP, it’s not so easy to answer how the Shapley values are supposed to be interpreted.
Shapley values are also expensive to compute, especially if your model is not tree-based.
So there are many reasons not to use SHAP but to use an "inferior" (as the reviewer put it) interpretation method instead.
For another critique of Shapley values I recommend this post by Giles Hooker.
Follow me, I am a professor (private account/all views my own)
1y I still like LIME a lot; see my post: https://blog.ephorie.de/explainable-ai-xai-explained-or-how-to-whiten-any-black-box-with-lime
CEO @ Goal Aligned Media | Digital Marketing, Advanced Analytics
1y Alastair, I liked your short primer on Shapley values and PFI. I've read about them, but have not used either one in practice. I had prior colleagues who liked Shapley values in the context of Marketing Mix Models, but at that time I chose not to use them in MMM. Both Shapley values and PFI, as you've described them, seem like good techniques to describe the importance of variables, as defined by changes in the prediction. On the other hand, a good MMM estimate of media impacts should consider many statistical concepts, not just the data fit. In your example you bring up "overfitting" and show how it either makes the Shapley values wrong or, at least, limits how they should be interpreted. Overfitting is a broad term that could represent several different underlying problems, and to add other reasons why a good-fitting model could still be representing untrue relationships, we could throw in endogeneity, serial correlation, or differences in predictors' granularity, to name a few. IMO, Shapley values offer some insights, but rely on the fidelity of the prediction to assign the importance values. Therefore any underlying problems the model had will impact the Shapley values also.
Data Science Consultant | @alastairmuir.bsky.social | Risk Analysis and Optimization | Causal Inference
1y Source: https://mindfulmodeler.substack.com/p/shap-is-not-all-you-need. Substack was down when I reposted.
Data Science Consultant | @alastairmuir.bsky.social | Risk Analysis and Optimization | Causal Inference
1y Christoph Molnar has published extensively on this and much more on ML and analytic techniques.
Freelance Senior Data Scientist & ML / Data / Search Engineer | Speaker | Coach
1y Why not reference the author's article directly, rather than creating a complete repost as a separate article that lists the person who copied it as the author, without any contribution?