Demystifying Causal AI : The why of Why
Introduction
Correlation is not Causation
The above statement has been drilled into most of us, certainly if we took Statistics 101 in school. However, little has been done to rigorously define causation. No statistics textbook talks about causes, only associations. It is almost as if we have religiously avoided the topic in our scientific discourse until recently, and "the world is unknowable" is taken almost as canon. That stance should be revised to:
Given certain assumptions and a causal model, one can reason about the world and take decisions with a quantifiable degree of confidence.
This article attempts to demystify the causal revolution and make causal concepts accessible to everyone. I will be exploring "The why of Why": why do we need to know the Why of the world? It is not just a play on words. It is about investigating cause-and-effect relationships rigorously (mathematically); in other words, exploring "the calculus of causation".
Overview
Humans have had an innate ability to discern causation since Homo sapiens evolved. In fact, this ability to figure out cause and effect has helped us become the apex predator at the top of the food chain. To illustrate the point: we know that a rooster's crow signals the start of a new day, as witnessed by the sunrise. Yet even a child can decipher the cause-and-effect relationship implied here.
The rooster's crowing doesn't cause the sunrise. They are just highly correlated.
How can we test whether this is true? Silence the rooster and see whether the sun still rises!
Sunrise --> Rooster Crowing, NOT Rooster Crowing --> Sunrise.
So there is a directionality implied here; from cause to effect.
To all of us this is no big revelation. You might be wondering why I am belabouring an obvious fact. It is because we all share the same causal model of rooster crowing and sunrise; that is why we all agree on the statement above. If our causal models differ, we will not agree. Consider, for example, the claim "Taking Vitamin C in large doses not only cures Covid, but prevents it." When our causal models (opinions?) differ, we turn to data and believe the answer lies in statistically mining it. This, I believe, is flawed: from data alone we can only determine correlations (associations), not causation.
Data doesn't contain Causes and Effects. However, once you have defined a causal model, data can be used to affirm or repudiate it rigorously.
This means we start with a causal model and then go to the data, not the other way around. This insight is the foundation of Causal AI. Further, a causal model allows us to intervene on certain input variables and estimate the effect on the outcome. Interventions are of the form "What if I do...". Moreover, we can ask counterfactual questions of the model, which are of the form "What if I had...". Traditional machine learning, and even Generative AI, does not allow us to do this.
Now that we have set the stage, the rest of the article takes you through what is known as "The Ladder of Causation": Associations --> Interventions --> Counterfactuals.
Before we embark on the causal journey we need to get some causal concepts out of the way. These are the building blocks that are required to navigate the causal landscape.
Causal Concepts
There are five main causal concepts : Causal Graph, Causal Discovery, Causal Inference, Counterfactual Reasoning and Mediation Analysis. Let us examine each one in turn.
Causal Graph
Let us first formally define: What is a Cause?
A variable X is said to be a cause of variable Y if Y changes in response to changes in X.
Alternatively, as Judea Pearl (Turing Award winner and author of The Book of Why) puts it:
X is a cause of Y, if Y listens to X.
A causal graph is a Directed Acyclic Graph (DAG) where the edges between the nodes represent causal dependencies.
We encode X as a causal driver of Y using a structural equation. We say Y is some function of X and a noise term Uy (noise terms are also called exogenous variables). Similarly, we say Z is some function of X, Y and a noise term Uz. One thing to note is that these assignments are not invertible: we cannot rewrite X as a function of Y. Causality is strictly directed.
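As a concrete (and purely illustrative) sketch, here is how those structural equations might look in code; the linear forms, coefficients and variable names are assumptions I am making for demonstration, not part of any particular model:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Exogenous noise terms (Ux, Uy, Uz): the inputs the model does not explain.
U_x = rng.normal(size=n)
U_y = rng.normal(size=n)
U_z = rng.normal(size=n)

# Structural equations: each variable is *assigned* from its causes plus its own noise.
X = U_x                       # X := f_X(Ux)
Y = 2.0 * X + U_y             # Y := f_Y(X, Uy)   -- "Y listens to X"
Z = 0.5 * X - 1.5 * Y + U_z   # Z := f_Z(X, Y, Uz)
```

The assignments flow strictly one way; because Uy is never observed, there is no way to invert the second line and recover X from Y, which is the code-level counterpart of "causality is strictly directed".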
Recall that Causal Graphs are DAGs: Directed because they contain only directed edges; Acyclic because there are no directed cycles, i.e. an effect cannot feed back and become a cause of its own cause; Graph because it is a graphical model.
There are three possible junctions in a Causal Graph : Confounder, Mediator and Collider.
The type of junction between variables matters when doing interventions or asking counterfactual queries. Let us examine an example of each junction to get a better causal grasp of it.
Confounder : Barometer Reading <-- Atmospheric Pressure --> Possibility of Wind & Rain.
Here the confounder is atmospheric pressure. It is the common cause of both the barometer reading and the rain/wind. As a result, the barometer reading becomes highly correlated with the possibility of wind and rain: a falling reading indicates a possibility of bad weather. However, there is no direct causal link between the two. The effect is confounded by pressure!
The impact here is that X and Z will be highly correlated, and when we control for Y this spurious correlation disappears.
Mediator : Fire --> Smoke --> Alarm
We have smoke alarms that alert us in the presence of smoke (I don't know why they are called 'fire alarms' :) ). The causal 'chain' is: Fire causes Smoke, which in turn triggers the Alarm. Fire per se does not cause the alarm to ring if no smoke is generated; in effect, the Smoke transmits the effect of the Fire to the Alarm. If the fire emitted only heat and no smoke, the alarm would not go off.
Here too X and Z will be correlated, but when we control for Y, the correlation disappears.
Collider : Managerial Expertise --> Start Up's Profitability <-- Market Competition
Expertise and Competition are both direct causes of a start-up's profitability. They are not correlated in the general population. However, if an investor wants to understand the effect of these two causes on profitability and conditions on it, this conditioning introduces a spurious correlation between expertise and competition. This is also called collider bias.
The collider is special because it behaves in the opposite way to the previous two junctions: X and Z are not correlated, but when we control for Y, a correlation between X and Z appears.
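Here is a small simulation of the three junctions (made-up linear models and coefficients, purely for illustration). "Controlling for Y" is approximated by partial correlation, i.e. correlating the residuals after regressing Y out of both X and Z:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

def partial_corr(x, z, y):
    # Correlation between x and z after linearly regressing y out of both ("control for Y").
    rx = x - np.polyval(np.polyfit(y, x, 1), y)
    rz = z - np.polyval(np.polyfit(y, z, 1), y)
    return corr(rx, rz)

# Confounder: X <-- Y --> Z
Y = rng.normal(size=n)
X = Y + rng.normal(size=n)
Z = Y + rng.normal(size=n)
print("confounder:", corr(X, Z), partial_corr(X, Z, Y))  # correlated; vanishes given Y

# Mediator: X --> Y --> Z
X = rng.normal(size=n)
Y = X + rng.normal(size=n)
Z = Y + rng.normal(size=n)
print("mediator:  ", corr(X, Z), partial_corr(X, Z, Y))  # correlated; vanishes given Y

# Collider: X --> Y <-- Z
X = rng.normal(size=n)
Z = rng.normal(size=n)
Y = X + Z + rng.normal(size=n)
print("collider:  ", corr(X, Z), partial_corr(X, Z, Y))  # uncorrelated; appears given Y
```

With a large sample you should see the confounder and mediator rows go from a clearly non-zero correlation to roughly zero once Y is controlled for, while the collider row does the exact opposite.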
Causal Discovery
One way to create a causal graph is to have domain experts create it. While this is possible for simple causal graphs like the ones used in this article, it quickly becomes intractable when there are many variables.
An alternative is to discover a causal graph from observational and experimental data using causal discovery algorithms. However, we might not get a fully specified graph; we might still need domain experts to fill in the gaps or point out discrepancies from a domain perspective. At least the domain experts don't need to start from scratch.
There are many causal discovery algorithms in the literature, and the area is a subject of ongoing research. Here we will go through some of the main ones.
Constraint Based Methods : We use conditional independence (CI) tests with p-values to discover the causal structure. These methods are well known and can, in some cases, even detect latent confounding variables. Peter-Clark (PC) and Fast Causal Inference (FCI) are examples of this approach.
Score Based Methods : We evaluate candidate causal graphs using a score, such as the Bayesian Information Criterion (BIC). Essentially, we score each candidate graph (candidates may come from CI tests) and keep the best one; generally, the lower the BIC, the better the causal graph. Greedy Equivalence Search (GES) and A*-based search are examples of this approach.
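As a sketch of what running a constraint-based algorithm looks like in practice, here is the PC algorithm from the open-source causal-learn package applied to synthetic data for the Fire --> Smoke --> Alarm chain. I am assuming causal-learn is installed (pip install causal-learn), and the exact API may differ between versions:

```python
import numpy as np
from causallearn.search.ConstraintBased.PC import pc

rng = np.random.default_rng(7)
n = 5_000

# Synthetic, continuous stand-ins for the chain Fire -> Smoke -> Alarm (illustrative only).
fire  = rng.normal(size=n)
smoke = 1.5 * fire + rng.normal(size=n)
alarm = 2.0 * smoke + rng.normal(size=n)
data = np.column_stack([fire, smoke, alarm])

# PC with Fisher-z conditional-independence tests at significance level 0.05.
cg = pc(data, alpha=0.05, indep_test="fisherz")
print(cg.G)  # recovered structure, only up to a Markov equivalence class
```

Note that for a pure chain, PC can only recover the skeleton (Fire - Smoke - Alarm): there is no collider to orient the edges. This is exactly the kind of gap a domain expert fills in.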
Causal Inference
Causal Inference is about inferring interventional distributions. It is the second rung of the Ladder of Causation.
It is similar to doing an A/B test without actually running the test! A/B testing is quite expensive and in some cases downright unethical. For example, suppose we need to find the efficacy of a treatment (T) with a drug (D) on the lifespan (L) of a patient; we might not know beforehand whether treatment T causes an adverse reaction. Even when there is no threat to life, testing can be very expensive. For example, suppose we want to measure the increase in revenue (R) if we increase the marketing budget (B) to, say, 20 million; if the A/B test fails to increase revenue, that 20 million is down the drain. Wouldn't it be great if we could answer these questions using just a causal graph, interventions, and observational and experimental data? Causal inference helps you do just that!
Before we embark on this, we need to know about the "do operator" and conditional probability. I will try to explain this in a lightweight manner, though it may still be heavy going if you are not mathematically inclined.
First let me introduce the do operator and contrast it with conditional probability.
Let us use the previous setup of a treatment (T) used to increase the Lifespan (L) which is the outcome we want to infer.
Mathematically we can use the following notation to represent the above:
Conditioning : P( L | T=1 ) => What is the probability of Lifespan (L) for patients observed to have received the treatment (T=1)?
Intervening : P( L | do(T=1) ) => What would the probability of Lifespan (L) be if we set (do) the treatment to 1 for everyone?
From the above we can calculate the Average Treatment Effect (ATE): the average outcome under one intervention minus the average outcome under another, i.e. ATE = E[ L | do(T=1) ] - E[ L | do(T=0) ].
It answers the causal question : " What is the effect of doing a treatment compared to not doing a treatment?"
Causal Inference allows us to infer results of interventions from observational & experimental data without actually doing it.
In this article we won't go into the do-calculus referenced above; suffice it to say that there is a strong mathematical basis for arriving at such causal inferences. Do-calculus uses the back-door and front-door criteria, by way of adjustment sets, to arrive at the causal estimate; the input is the causal graph defined above. Essentially, we adjust for (de-confound) the confounding variables to determine the causal effect.
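To make the workflow concrete, here is a minimal sketch using the open-source DoWhy library (assumed installed via pip install dowhy). The Age confounder, the coefficients and the graph are illustrative assumptions, and the graph-string format can vary between DoWhy versions:

```python
import numpy as np
import pandas as pd
from dowhy import CausalModel

rng = np.random.default_rng(1)
n = 5_000

# Illustrative synthetic data: Age confounds both who gets the Treatment and the Lifespan.
age = rng.normal(50, 10, size=n)
treatment = (rng.random(n) < 1 / (1 + np.exp(-(age - 50) / 10))).astype(int)
lifespan = 70 + 5 * treatment - 0.2 * (age - 50) + rng.normal(0, 2, size=n)
df = pd.DataFrame({"T": treatment, "L": lifespan, "Age": age})

# Causal graph: Age -> T, Age -> L, T -> L.
model = CausalModel(data=df, treatment="T", outcome="L",
                    graph="digraph { Age -> T; Age -> L; T -> L; }")

estimand = model.identify_effect()                 # applies the back-door criterion
estimate = model.estimate_effect(estimand,
                                 method_name="backdoor.linear_regression")
print(estimate.value)  # ATE = E[L | do(T=1)] - E[L | do(T=0)], expected to be close to 5
```

In this toy setup a naive comparison of treated versus untreated patients would be biased (the treated are older on average), but adjusting for the back-door path through Age recovers the true effect of roughly 5 extra years.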
In a future article I will go in depth on how to do interventions.
Counterfactual Reasoning
This is the third rung of causation. First let us define what a counterfactual is:
Thinking about what did not happen but could have happened, or relating to this kind of thinking - Cambridge Dictionary
Contrary to fact - Merriam-Webster Dictionary
So asking a counterfactual question is about imagining a hypothetical world and estimating "What would have happened if ...?" More formally, we ask:
If X had not occurred, would Y not have occurred, given that in the real world we observed that both X and Y did occur?
To put it in a motivating example: suppose we observed that taking Aspirin (X) removed my headache (Y). A counterfactual question would be: "If I had not taken Aspirin (X), would my headache still be there?"
As you can see, these kinds of queries belong to the third rung of causation, where we have to imagine a hypothetical world and derive answers from what actually occurred in the real world.
This is in contrast to interventions, because here the cause we are asking about did not actually occur, so we cannot simply estimate its effect by doing it. The good news is that we can still estimate counterfactuals if we have a parametric Structural Causal Model (SCM). Note that we need an SCM only to answer counterfactuals at the unit (individual) level; we can answer counterfactuals at the population level without this strong parametric assumption.
Now that we have a descriptive definition of a counterfactual, let us also put some math behind it, both for completeness' sake and to drive home the point that counterfactuals can be estimated.
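Using Pearl's standard subscript notation (a reasonable rendering of the math being referred to here), the probability that Y would have been y' had X been x', given that we actually observed X = x and Y = y, is written:

```latex
P\big(Y_{x'} = y' \mid X = x,\ Y = y\big)
```

The subscripted Y_{x'} means "the value Y takes in the world where X is set to x'". Note that this goes beyond the do operator of the previous section: we simultaneously condition on what actually happened (X = x, Y = y) and reason about a world in which X was set to something else.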
Now that we have framed counterfactuals in both descriptive and mathematical form, let us see how we can derive counterfactuals from a structural causal model.
This three-step recipe is from The Book of Why by Judea Pearl, chapter "Counterfactuals: Mining Worlds That Could Have Been", page 278.
Step 1 (Abduction): Use an observation for an individual, X=x and Y=y, to determine the idiosyncratic factor U (the exogenous variable) that makes that individual unique.
Step 2 (Action): Use the U from Step 1 to obtain the individualised SCM, then use the do-operator to modify the model so that it reflects the counterfactual assumption being made, in this case do(X=x').
Step 3 (Prediction): Use the value of U from Step 1 and the modified model (SCM) from Step 2 to compute Y(x').
Note : We only need a parametric SCM if we want to estimate unit (individual) level counterfactuals. If we generalise to population-level counterfactuals, this strong assumption can be relaxed.
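Before the full worked example arrives, here is a deliberately tiny sketch of the three steps on an assumed toy SCM (Y := 2X + U); every number is made up for illustration:

```python
# Toy SCM (illustrative assumption): Y := 2*X + U, where U is the individual's exogenous factor.

def f_Y(x, u):
    return 2 * x + u

# Factual observation for one individual: X = 1, Y = 3.
x_obs, y_obs = 1, 3

# Step 1 (Abduction): infer this individual's U from the factual data.
u = y_obs - 2 * x_obs        # U = 1

# Step 2 (Action): modify the model with do(X = x') for the counterfactual value of X.
x_cf = 0                     # "What if X had been 0 instead of 1?"

# Step 3 (Prediction): push the same U through the modified model.
y_cf = f_Y(x_cf, u)
print(y_cf)                  # 1 -> in the counterfactual world, Y would have been 1, not 3.
```

The key move is that the same U extracted in Step 1 is reused in Step 3; that is what ties the hypothetical world to the specific individual we actually observed.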
I will work through a complete example in a future article.
Mediation Analysis
If you have lasted till here, my hearty congratulations. We are now in a position to explain the title of this article: the why of Why.
As per Judea Pearl, there are two versions of "Why?". The first is quite straightforward: you see an effect and you want to know the cause. Why did this effect happen?
The second one is more involved. We want to better understand the connection between cause and effect: why does this known cause have the observed effect? In effect (no pun intended), we want to know the "mechanism" through which the cause transmits its effect to the outcome. This is not a trivial problem. It requires Mediation Analysis and hinges on the Mediator junction we saw earlier.
Let us take an example from Judea Pearl's The Book of Why. I will paraphrase his words into a succinct form; this example, along with others, is described in detail in the book.
Scurvy was the bane of sailors until around 1800. James Lind's 1747 study of scurvy conclusively established Citrus Fruits --> Scurvy (i.e., treatment with citrus fruits prevented scurvy). This causal relationship, though important, fell out of use after some time because people did not know why citrus fruits prevented scurvy: the mediator was missing. To cut a long story short, the true mediator, Vitamin C, was identified around 1930 by Albert Szent-Gyorgyi.
Citrus Fruits --> Vitamin C --> Scurvy
Once we know the right mediator, we can, for example, use Vitamin C in other forms when citrus fruits are not available. This is straightforward mediation analysis: the direct effect of citrus fruit was minimal to none, and virtually the whole effect flowed through the indirect path via the mediator, Vitamin C. Estimating the total, direct and indirect effects is what mediation analysis does.
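As a sketch of what that estimation looks like, here is an illustrative linear SCM for the citrus story (all coefficients invented). In a linear, unconfounded model the direct effect is the coefficient on the cause once the mediator is held fixed, and the indirect effect is simply the remainder of the total effect:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000

# Illustrative linear SCM: Citrus -> VitaminC -> ScurvyRisk, with a (deliberately zero) direct path.
citrus    = rng.normal(size=n)
vitamin_c = 1.0 * citrus + rng.normal(scale=0.1, size=n)
scurvy    = -2.0 * vitamin_c + rng.normal(scale=0.1, size=n)

# Total effect: slope of scurvy risk on citrus (all causal paths combined).
total_effect = np.polyfit(citrus, scurvy, 1)[0]                 # ~ -2.0

# Direct effect: coefficient on citrus when the mediator is also included in the regression.
X = np.column_stack([np.ones(n), citrus, vitamin_c])
direct_effect = np.linalg.lstsq(X, scurvy, rcond=None)[0][1]    # ~ 0.0

indirect_effect = total_effect - direct_effect                  # everything transmitted via Vitamin C
print(total_effect, direct_effect, indirect_effect)
```

This simple difference method is only valid here because the toy model is linear and unconfounded; general mediation analysis defines natural direct and indirect effects counterfactually, which is part of why it deserves its own article.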
In a future article I will explore this in much more detail. There is a lot more to mediation analysis than what I have shared above. But it is sufficient to get the point about why mediation analysis deserves its own time in the spotlight.
Why can't current Machine Learning answer causal questions?
In order to appreciate why traditional machine learning is not only unhelpful but dangerous when trying to answer causal questions, let us examine an obvious causal relationship.
A barometer (B) is a scientific instrument used to measure atmospheric pressure (P). Atmospheric pressure, in turn, is an indicator of the weather (W): changes in P drive changes in W. Meteorologists use barometers to predict short-term changes in weather; a rapid drop in P, as indicated by a drop in B, usually means it will be cloudy, rainy or windy. We all know (a shared causal model) that P --> B and not the other way around, i.e.
B := kP + n
where k is the coefficient of P and n is a bias term.
Now, for a moment, let us forget the above. We have data about barometer readings (B) and the observed weather (W) (0 for windy, cloudy or rainy; 1 for good weather). Let us say we are in the business of providing short-term weather forecasts, and we have hired an in-house data scientist to help us.
Our in-house data scientist comes back with a strong correlation between B and W and confidently declares "B causes W", instead of saying B is highly correlated with W. The correlation is spurious: it arises from the confounding variable pressure (P), which is a cause of both B and W.
No amount of machine learning will help us with causality here; we can only determine correlations and associations. In the setup above, if B changes for any reason other than responding to changes in P, your machine learning model will predict a W that doesn't correspond to reality. Data scientists know about this. Instead of accepting the limitation, they have given it names: concept drift, data drift, covariate shift, regime change, and so on. They hide behind these terms instead of acknowledging a fundamental issue: their models are causality-free.
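Here is a small simulation of the barometer story (coefficients invented for illustration): a model fitted on observational data looks excellent, right up until B changes for a reason other than a change in P.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 20_000

# True (illustrative) causal structure: Pressure P causes both the Barometer reading B and the Weather W.
P = rng.normal(size=n)
B = 0.9 * P + 0.1 * rng.normal(size=n)   # B := kP + n, with k = 0.9
W = (P > -0.5).astype(int)               # 1 = good weather, 0 = windy/cloudy/rainy (as above)

# "Machine learning": fit W directly on B from observational data.
slope, intercept = np.polyfit(B, W, 1)

def predict_weather(b):
    return slope * b + intercept         # a score near 0 means bad weather, near 1 means good

print(predict_weather(-1.0))  # a naturally low reading gives a low score: looks like a great model

# Intervention: we physically tamper with the barometer, do(B = -1.0), while P is unchanged.
# The fitted model still predicts bad weather for the tampered reading, but the real weather
# does not change at all, because there is no causal arrow from B into W.
```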
Let us see how Causal AI fixes this issue. First, we determine or discover a causal graph. Then, using do-calculus and structural equations, we can truly ask causal questions.
As you can see from the above, even in a univariate case with a simple regression model we are unable to discern the causes. Imagine a multivariate case with confounders, mediators and colliders in the causal graph; a machine learning model will only see 'features', with no causal knowledge. Even the current xAI (explainable AI) trend goes only as far as 'explaining' correlations rather than causality. Believe me, I have done enough xAI to know that we get only correlational explanations, not causal ones.
Coming to Generative AI: at best it can be used as a data generator for Randomised Controlled Trials (RCTs), which are inputs to causal analysis and inference. GenAI also suffers from a lack of explainability. It cannot infer causes unless we provide the foundational LLM with cause-and-effect examples (one-shot or few-shot) during training or in a prompt (with very creative and elaborate prompt engineering). It definitely cannot answer counterfactual queries with any degree of certainty or fidelity, which is imperative for enterprise-level use cases. Here too, I have done enough GenAI work to know that it is, at best, System 1 thinking; we have to push it hard through clever prompt engineering and pre-training for it to approximate System 2 thinking. Definitely not causal!
All is not lost though. We still need classical ML, deep learning and GenAI; they will be inputs to Causal AI and also consumers of its output. There are already papers exploring the use of LLMs in Causal AI and the reverse, using Causal AI's output to enhance LLMs. My favourite, though, is the advent of Causal Agentic AI, which amplifies and enhances both disciplines.
Real World Use Cases
Let us now see where we can apply Causal AI to answer interventional and counterfactual queries in the real world. This is obviously not an exhaustive list, just some use cases that most of us can relate to and understand.
Root Cause Analysis
Root Cause Analysis (RCA) is the process of identifying underlying causes of issues, rather than just treating symptoms. The goal of RCA is to avoid future problems or enable repeatable success. One can find use of RCA in large manufacturing industries with many processes and components that act in concert. RCA is also useful for IT and security teams to address issues and prevent future occurrences.
Correlation-based approaches fail to accurately identify root causes for several reasons. First, they can't account for scenarios where even minor changes in one variable significantly impact another. Second, they fail to recognise confounding relationships between a root cause and an outcome, which can lead to incorrect estimates of the root cause's impact on an outlier event.
These issues can be resolved by using a causality-based approach to Root Cause Analysis (RCA). This approach focuses on understanding and identifying true causal relationships, enabling it to capture significant effects and account for confounding factors more effectively.
Marketing Spend Optimization
In today's world, where marketing spend is skyrocketing, it is very important to know what drives revenue. With numerous factors influencing revenue (timing of sales and promotional events, inventory levels, discounts given, marketing campaigns, profit margins, and so on), it can be challenging to identify the key drivers.
We can utilise causal discovery, a structural causal model and domain experts to uncover the cause-and-effect relationships between these factors and revenue. By understanding these underlying dynamics, one can make informed decisions and adjust strategies for future periods, staying ahead of the competition.
Member Reward Program Effectiveness
To determine if a membership rewards program is effective, we need to evaluate its impact on total sales. The key causal question is:
What is the impact of offering the membership rewards program on total sales?
The equivalent counterfactual question is:
If the current members had not signed up for the program, how much less would they have spent on the website?
In formal terms, we are interested in the Average Treatment Effect (ATE). The ATE measures the difference in outcomes between those who received the treatment (the membership rewards program) and those who did not, averaged over the population (see above for how it is calculated).
Hotel Booking Cancellation
There can be various reasons for booking cancellations. A customer might cancel because their request was not fulfilled (e.g., no wi-fi), they discovered later that the hotel did not meet their requirements, or they canceled their entire trip. Some factors, like wi-fi, are actionable by the hotel, while others, like trip cancellation, are beyond the hotel's control. In any case, it's important to understand which of these factors contribute to booking cancellations to improve customer satisfaction and reduce cancellation rates.
Preventing Customer Churn
In most subscription based businesses, customer churn is a major problem. We are frequently asked to come up with answers to questions like "If we reduce subscription cost by 10% what is the likelihood of retaining customers?" or a counterfactual question like "If we had not increased the price would we have retained this customer who unsubscribed?"
Supply Chain Optimisation
Supply chains are ideal for Causal AI. We can generate a causal graph quite easily, since a supply chain is already a network of nodes; we just have to turn it into a DAG with causal arrows and add structural equations to make it a Structural Causal Model (SCM). There is also no dearth of data flowing through these nodes. This opens up the entire power of Causal AI, especially for answering counterfactuals that are hard to see or expensive to gather in the data. Here are some counterfactual and interventional queries we can potentially answer:
What if we reduced the lead time for our primary suppliers by 50%?
What if we doubled our safety stock levels?
What if we sourced raw materials from multiple suppliers instead of relying on a single supplier?
What if we implemented a just-in-time (JIT) inventory system?
What if we switched to a different transportation mode for our deliveries (e.g., from trucking to rail)?
What if we centralised our distribution centers instead of having multiple regional centers?
What if we adopted a circular supply chain model focusing on recycling and reusing materials?
Causal Factor Investing
Institutional investors trade using models specified by a set of factors (technical and other indicators). However, the factors in these models are chosen based on associations rather than causality. This gives rise to model misspecification, which can lead to large losses when the model is used to inform trades. Statisticians, data scientists and quants frequently mistake noise for signal and fall back on ad-hoc, ex-post rationales that are unfalsifiable. The industry needs to adopt the constraints of the scientific method to overcome the inherent flaws in how its predictions are generated, and causal discovery and causal inference should be an integral part of factor investing. Here are some counterfactual and interventional queries we can potentially answer:
What if we increased our exposure to momentum factors by 50%?
What if we eliminated all low-volatility stocks from our portfolio?
What if we focused exclusively on value factors and ignored growth factors?
What if we doubled our allocation to small-cap stocks?
What if we reduced our exposure to dividend-yielding stocks by 75%?
See the paper by Dr Marcos Lopez de Prado in the references for more details.
Conclusion
For almost a decade, domain experts have been shut out of the decision-making process by big data, data science, deep learning and, more recently, generative AI. Causal AI marks the return of the domain expert and of causal explainability. A CXO's job is to ask, and figure out answers to, interventional and counterfactual questions; for a while now they have been relying on intuition and trying to divine those answers from data. With Causal AI reaching a level of maturity and acceptance in both theory and tooling, it is imperative that we embrace it.
I conclude by pointing to the very insightful fireside chat with Judea Pearl at the Causal AI Conference 2024, which I was fortunate to attend in person.
References
The Book of Why: The New Science of Cause and Effect - Judea Pearl & Dana Mackenzie
Causation, Prediction and Search - Peter Spirtes, Clark Glymour & Richard Scheines
An Anytime Algorithm for Causal Inference - Peter Spirtes
Optimal Structure Identification With Greedy Search - David Maxwell Chickering
Learning Optimal Bayesian Networks: A Shortest Path Perspective - Changhe Yuan & Brandon Malone
The Case for Causal Factor Investing - Dr Marcos Lopez de Prado
Declaration: This blog has been written by a human author, not generated by artificial intelligence. All content, including analysis, insights, diagrams and opinions, are the result of the author's own research, reading and expertise. Sources have been credited where appropriate. (Only the banner image was generated by AI)