Beyond SHAP Values and Crystal Balls
In a previous post at EconometricSense, Big Data: Don't throw the baby out with the bathwater, I discussed how correlations in big data can be useful:
"?...correlations or 'flags' from big data might not 'identify' causal effects, but they are useful for prediction and might point us in directions where we can more rigorously investigate causal relationships"
I also discussed some important points made by Tim Harford:
"But a theory-free analysis of mere correlations is inevitably fragile. If you have no idea what is behind a correlation, you have no idea what might cause that correlation to break down...The challenge now is to solve new problems and gain new answers – without making the same old statistical mistakes on a grander scale than ever."
In an issue of Corn and Soybean Digest a few years back, Dan Frieberg actually provided a great example that illustrates what can go wrong with a theory free analysis of correlations. It is a literal example where we don't want to move fast and break things without the guidance of subject matter expertise. In the example, correlations indicated that faster planting speed was associated with higher yields. But something else is going on:
"Sometimes, a data layer is actually a “surrogate” for another layer that you may not have captured. Planting speed was a surrogate for the condition of the planting bed.?High soil pH as a surrogate for cyst nematode. Correlation to slope could be a surrogate for an eroded area within a soil type or the best part of the field because excess water escaped in a wet year." (i.e. better soil conditions were confounded with planting speed as illustrated in the directed acyclic graph (DAG) below creating a non-causal correlation indicated by the broken green line between planting speed and yield)
(DAG created with https://www.dagitty.net/)
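To make the surrogate problem concrete, here is a minimal simulation (hypothetical numbers, not Frieberg's data) in which planting speed has zero causal effect on yield, yet the two are strongly correlated because planting-bed condition drives both:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Hypothetical data-generating process: soil condition drives BOTH
# planting speed and yield; speed itself has zero causal effect on yield.
soil_condition = rng.normal(size=n)                 # unobserved confounder
planting_speed = 2.0 * soil_condition + rng.normal(size=n)
yield_bu = 150 + 10.0 * soil_condition + rng.normal(size=n)  # no speed term

# The raw correlation is strongly positive (about 0.89) even though
# speeding up the planter would do nothing -- or worse -- for yield.
print(np.corrcoef(planting_speed, yield_bu)[0, 1])
```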
Causality isn't just theory. Non-causal correlations can get us into trouble, and in the case of corn planting, acting on these kinds of correlations could be yield robbing!
These kinds of confounding and surrogate relationships are exactly what Scott Lundberg illustrates in his more recent article on SHAP values, Be Careful When Interpreting Predictive Models in Search of Causal Insights (Lundberg is the author of the SHAP library in Python).
He provides a use case where a predictive model is used to predict product renewals based on a number of features. It turns out that by looking at the SHAP values we see that an increase in bugs reported is associated with increased renewals! Similar to Frieberg's example with planting speed, bugs reported are a poor proxy for actual bugs faced in a software product (not in the data), and actual bugs faced are a function of monthly usage, which turns out to be a surrogate for product need (also not in the data). Sharing just part of Scott's DAG:
(DAG created with https://www.dagitty.net/)
He concludes: "Because we can't directly measure product need, the correlation between bugs reported and renewal combines a small negative effect of bugs faced and a large positive confounding effect from product need...the predictive model captures an overall positive effect of bugs reported on retention (as shown with SHAP), even though the causal effect of reporting a bug is zero and the effect of encountering a bug is negative"
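Here is a minimal sketch of how this pattern surfaces in practice (simulated data loosely patterned on Scott's example; the feature names and coefficients are hypothetical, and it assumes the standard xgboost and shap APIs):

```python
import numpy as np
import shap
import xgboost

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical data-generating process: unobserved product need drives
# usage, renewal, and (through bugs faced) bug reports. Reporting a bug
# has ZERO causal effect on renewal; facing one has a small negative effect.
product_need = rng.normal(size=n)                         # not in the data
monthly_usage = product_need + 0.5 * rng.normal(size=n)   # not in the data
bugs_faced = np.maximum(0.0, monthly_usage + 0.5 * rng.normal(size=n))
bugs_reported = bugs_faced + 0.2 * rng.normal(size=n)
ad_spend = rng.normal(size=n)                             # irrelevant noise
renew = (2.0 * product_need - 0.3 * bugs_faced + rng.normal(size=n) > 0)

# The model only sees the observable features.
X = np.column_stack([bugs_reported, ad_spend])
model = xgboost.XGBClassifier(n_estimators=200, max_depth=3)
model.fit(X, renew.astype(int))

# SHAP attributes an overall POSITIVE contribution to bugs_reported,
# because it proxies product need -- the confounded correlation above.
shap_values = shap.TreeExplainer(model).shap_values(X)
print(np.corrcoef(X[:, 0], shap_values[:, 0])[0, 1])  # positive
```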
Scott also points out that regularized machine learning models like XGBoost tend to build parsimonious models that predict well with the fewest features necessary (often exactly what we strive for). This property often leads them to select features that are surrogates for multiple causal drivers, which is "very useful for generating robust predictions...but not good for understanding which features we should manipulate to increase retention."
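A toy illustration of that tendency, using scikit-learn's Lasso as the regularized learner (made-up coefficients; any regularized model that prunes features would behave similarly):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n = 5_000

# Two true causal drivers, each observed only with measurement noise...
driver_a = rng.normal(size=n)
driver_b = rng.normal(size=n)
noisy_a = driver_a + 0.7 * rng.normal(size=n)
noisy_b = driver_b + 0.7 * rng.normal(size=n)
# ...plus one cheap surrogate that bundles both drivers exactly.
surrogate = driver_a + driver_b

y = driver_a + driver_b + 0.5 * rng.normal(size=n)

X = np.column_stack([surrogate, noisy_a, noisy_b])
coefs = Lasso(alpha=0.05).fit(X, y).coef_
print(dict(zip(["surrogate", "noisy_a", "noisy_b"], coefs.round(2))))
# The penalized fit loads almost everything on the surrogate: great for
# prediction, useless for deciding which driver to actually manipulate.
```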
Sometimes we just need a prediction. That's all. And sometimes we want to know at least what is driving those predictions at a high level as a gut check before operationalizing. SHAP values are great for that! But there are cases where we want not just to predict and explain which variables are moving the needle most in a correlational sense, but also to know what to do about it. What can we change to improve a metric or outcome? That is when we need a causal framework. Because as Judea Pearl says in The Book of Why:
"Causal Analysis is emphatically not just about data; in causal analysis we must incorporate some understanding of the process that produces the data and then we get something that was not in the data to begin with."
This is a point also made by Laura Balzer and Maya Petersen in their article Machine Learning in Causal Inference—How Do I Love Thee? Let Me Count the Ways:
"...ML algorithms must be carefully integrated within a formal framework for causal and statistical inference...Background knowledge remains the foundation of causal identification and ML cannot uncover cause-and-effect if this foundation is weak"
In that article they refer to the Causal Roadmap (Petersen & van der Laan, 2014) for incorporating the necessary foundation and framework to guide causal interpretations of machine learning algorithms. The first two steps involve understanding the business question and specifying a causal model or DAG. These steps integrate easily into typical machine learning lifecycle processes like CRISP-DM's business and data understanding phases. Lundberg's brief analysis gives a preview of what that could look like, combining the DAG he creates with double machine learning.
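For instance, here is a sketch of that last step with econml's LinearDML (hypothetical variable names mirroring the renewal DAG above; it assumes the confounder identified by the DAG is actually observed):

```python
import numpy as np
from econml.dml import LinearDML

rng = np.random.default_rng(2)
n = 10_000

# Hypothetical confounded setup from the DAG: monthly usage (standing in
# for product need) drives both the treatment and the outcome.
monthly_usage = rng.normal(size=n)                      # observed confounder
bugs_faced = monthly_usage + 0.5 * rng.normal(size=n)   # "treatment"
renewal = 2.0 * monthly_usage - 0.3 * bugs_faced + rng.normal(size=n)

# A naive regression of renewal on bugs_faced is badly positive (~ +1.3)...
print(np.polyfit(bugs_faced, renewal, 1)[0])

# ...while double ML, adjusting for the confounder named in the DAG,
# recovers the true causal effect of roughly -0.3.
est = LinearDML(random_state=0)
est.fit(Y=renewal, T=bugs_faced, X=None, W=monthly_usage.reshape(-1, 1))
print(est.ate())
```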
I really like the advice Dan Frieberg gave in the Corn and Soybean Digest article:
"big data analytics is not the crystal ball that removes local context. Rather, the power of big data analytics is handing the crystal ball to advisors that have local context"
At the end of the day, data science professionals aren't alchemists who can turn data into gold, and they can't build models that serve as crystal balls. But to Frieberg's point, they can combine their insights with the knowledge and expertise of stakeholders and provide them with a framework for making better decisions under uncertainty. That is the future of data science and AI. As economist Tyler Cowen says, the ability to interface well with technology and use it to augment human expertise and judgement is the key to success in the new digital age of big data and automation.
References:
Frieberg, D. (2014, April 15). Data Decisions: Meaningful data analysis involves agronomic common sense, local expertise. Corn and Soybean Digest.
Balzer, L. B., & Petersen, M. L. (2021). Invited Commentary: Machine Learning in Causal Inference—How Do I Love Thee? Let Me Count the Ways. American Journal of Epidemiology, 190(8), 1483–1487. https://doi.org/10.1093/aje/kwab048
Petersen, M. L., & van der Laan, M. J. (2014). Causal models and learning from data: integrating causal modeling and statistical estimation.?Epidemiology (Cambridge, Mass.),?25(3), 418–426. https://doi.org/10.1097/EDE.0000000000000078