The quest for the "why" that matters: an overview of causal inference in modern statistics
Today, being data-driven is mandatory, but sometimes even a simple data-driven analysis can be completely misleading. Why? Because analysts look for correlations, while the human cognitive process looks for cause-effect relationships; this is summed up in the mantra "Correlation is not causation". Let's look at a couple of toy examples:
So how can we actually shift our attention from correlation to causation, to correctly interpret the real causes behind the business phenomena that matter most in each industry? That is where causal inference comes into play.
Every day in Enel Global Infrastructure & Networks, we collect an outstanding amount of data from our operations, and we use the most advanced Artificial Intelligence and Machine Learning technologies to analyze it: deep neural networks to identify anomalies in images of assets, and predictive analytics to estimate energy flows or the mean time to failure of any line segment or grid component. In numbers, every single day we receive more than 60,000 contacts from our customers, we collect 10 billion measurements from sensors, our grid records more than 20,000 events, and we gather 8 terabytes of data from visual inspections.
Beyond Artificial Intelligence and Machine Learning, our business questions can sometimes be boiled down to a single word: "why?". Why does a given event happen, and what causes it? Unfortunately, the answer is not so straightforward...
Why Mr. Anderson?
Among the five Ws of journalism (i.e., Who, What, When, Where and Why), "why" is generally the most controversial one, because it is not directly observable: looking at a recorded video, one can straightforwardly identify who did what, at a given time and place, but it is not easy to identify "why". Even in a movie saga as nerdy as The Matrix, this concept is highlighted in the dialogue with the Architect and by Agent Smith insistently asking: "Why, Mr. Anderson? Why, why? Why do you do it?"
Considering the nature and the power of such a question, what about trying to formalize it? Is it possible to set aside the ambiguity and complexity of natural language in favor of a formal one? What about creating a mathematical language for causation?
Most of modern statistics has focused on pure data observation, but sometimes data can be misleading, resulting in actual Data-Driven Paradoxes...
Data-Driven Paradox
At the beginning of the Covid-19 outbreak, everybody was analyzing data to assess the case fatality rate and to compare it across different countries.
A recently published IEEE Transactions paper (arXiv:2005.07180) analyzed in depth the comparison between the case fatality rates in Italy and China.
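And here is where the statistical magic happens: looking at the overall population, the case fatality rate is higher in Italy...
... conversely, within each age range, it is higher in China.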
Consequently, you are completely puzzled! How can Covid-19 be deadlier in Italy when you look at the overall population and, at the same time, deadlier in China within each age range? Welcome to Simpson's paradox!
The reason is that the two countries' age distributions are different, and this strongly affects the comparison of the mortality rates: the paper goes through a mathematical description, highlighting age as a confounding element. But why is such an analysis worth an IEEE Transactions paper in 2021? Because it is not about statistics, it is about causal analysis, an approach that was neglected in mathematics for more than a century. Let's go through that story...
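To make the paradox concrete, here is a minimal sketch in Python. The counts are hypothetical and were chosen only to reproduce the pattern described above; they are not the figures from the paper, and the names `country_a` and `country_b` are placeholders rather than real countries.

```python
# Hypothetical counts, chosen only to illustrate the paradox (NOT real Covid-19 data).
# Each entry maps an age group to (confirmed cases, deaths).
country_a = {"young": (8000, 80), "old": (2000, 400)}    # cases are mostly young
country_b = {"young": (2000, 10), "old": (8000, 1200)}   # cases are mostly old


def overall_cfr(data):
    """Case fatality rate over the whole population: total deaths / total cases."""
    cases = sum(c for c, _ in data.values())
    deaths = sum(d for _, d in data.values())
    return deaths / cases


def cfr_by_age(data):
    """Case fatality rate computed separately inside each age group."""
    return {age: deaths / cases for age, (cases, deaths) in data.items()}


print("overall A:", overall_cfr(country_a), " overall B:", overall_cfr(country_b))
for age in country_a:
    print(age, "A:", cfr_by_age(country_a)[age], " B:", cfr_by_age(country_b)[age])
```

With these made-up numbers, country B has the higher overall rate, yet country A has the higher rate inside every single age group. The reversal comes entirely from the different age mix of the cases, which is exactly the role age plays in the comparison above.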
From regression back to causation
At the dawn of modern statistics in 1885, Sir Francis Galton studied the physical characteristics of human beings, describing how sons of tall men tend to be taller than average - but not as tall as their fathers - and sons of short men tend to be shorter than average - but not as short as their fathers. Galton first called this phenomenon "reversion" and later "regression toward mediocrity": he described it with a straight line showing the relationship between fathers' heights and sons' heights, creating the very first regression line in statistics.
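The pattern Galton observed can be reproduced with a small simulation. The sketch below is only an illustration under a deliberately simple assumption: each height is a population mean plus a component shared by father and son plus an independent individual component. The constants (`MEAN_HEIGHT`, `SD_SHARED`, `SD_OWN`) are hypothetical and are not Galton's data.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

MEAN_HEIGHT = 175.0  # cm, hypothetical population mean
SD_SHARED = 5.0      # spread of the component shared by father and son (assumption)
SD_OWN = 5.0         # spread of each individual's own component (assumption)


def simulate_pairs(n):
    """Generate n (father, son) height pairs from the toy model."""
    pairs = []
    for _ in range(n):
        shared = random.gauss(0, SD_SHARED)
        father = MEAN_HEIGHT + shared + random.gauss(0, SD_OWN)
        son = MEAN_HEIGHT + shared + random.gauss(0, SD_OWN)
        pairs.append((father, son))
    return pairs


def slope(pairs):
    """Least-squares slope of sons' heights regressed on fathers' heights."""
    n = len(pairs)
    mean_f = sum(f for f, _ in pairs) / n
    mean_s = sum(s for _, s in pairs) / n
    cov = sum((f - mean_f) * (s - mean_s) for f, s in pairs)
    var = sum((f - mean_f) ** 2 for f, _ in pairs)
    return cov / var


pairs = simulate_pairs(20_000)

# Look only at clearly tall fathers (more than 10 cm above the mean).
tall = [(f, s) for f, s in pairs if f > MEAN_HEIGHT + 10]
avg_tall_father = sum(f for f, _ in tall) / len(tall)
avg_their_sons = sum(s for _, s in tall) / len(tall)

print(f"regression slope: {slope(pairs):.2f}")
print(f"tall fathers average {avg_tall_father:.1f} cm, their sons {avg_their_sons:.1f} cm")
```

In this toy model the regression slope is below 1 (about 0.5 with these settings), so the sons of tall fathers are taller than average but shorter than their fathers, even though both generations have the same distribution of heights and nothing in the model pulls heights toward the mean.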
Galton conjectured that regression toward the mean was a physical process, nature's way of ensuring that the distribution of height remained the same from generation to generation. What Galton had actually proven, however, was only that one phenomenon - regression to the mean - did not require any particular causal explanation. Later on, Pearson extended this concept, stating that causation is only a matter of repetition and, in the deterministic sense, can never be proven. To make a long story short, Pearson, the father of modern statistics, completely removed causation from science.
(Pearson belonged to a philosophical school called positivism, which holds that the universe is a product of human thought and that science is only a description of those thoughts. Thus causation, construed as an objective process that happens in the world outside the human brain, could not have any scientific meaning.)
Later, in the 1920s, Sewall Wright (the father of quantitative genetics) was the first to develop a mathematical method for answering causal questions from data, known as path diagrams, which became the cornerstone of modern causal inference. Wright focused on population genetics, and he applied path diagrams to the coat colors of guinea pigs, producing one of the most famous diagrams in the history of causal inference.
Unfortunately, the world of statistics was still strongly influenced by Pearson's positivism, and the path from regression back to causal analysis took nearly a century: it required the research activity of Judea Pearl (2011 Turing Award)...
"Lucky is he who has been able to understand the cause of things" (Virgil)
"Felix qui potuit rerum cognoscere causas".
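The whole of causal analysis is based on three sequential levels of analysis, identified as "The Ladder of Causation" by Judea Pearl: Level 1 is association (seeing), Level 2 is intervention (doing) and Level 3 is counterfactuals (imagining).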
What does the mathematics of Level 2 and Level 3 (i.e., the new ones!) consist of? The point is to translate causal questions into statistical quantities, e.g., to translate questions like "Why do more people die in Italy than in China?" into probability calculus over the available data.
From a practical point of view, how do you perform such a translation without being affected by weird effects like Simpson's paradox? Briefly:
All these steps are referred to as do-calculus and, today, it is implemented by many open-source libraries without requiring any particular statistical know-how.
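As an illustration of what such a translation looks like, one well-known special case is the back-door adjustment: when an observed variable Z blocks every back-door path between X and Y, the interventional quantity P(Y | do(X = x)) equals the sum over z of P(Y | X = x, Z = z) P(Z = z). The sketch below uses plain Python rather than any specific library, and the data-generating model (a binary confounder `z`, a binary treatment `x`, a binary outcome `y`, and all the probabilities) is a hypothetical toy chosen so that the true effect of `x` on `y` is known to be 0.2.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible


def simulate(n):
    """Draw n samples (z, x, y) from a toy model in which Z influences both X and Y."""
    rows = []
    for _ in range(n):
        z = int(random.random() < 0.5)
        x = int(random.random() < (0.8 if z else 0.2))      # Z makes X more likely
        y = int(random.random() < 0.1 + 0.2 * x + 0.5 * z)  # true effect of X is +0.2
        rows.append((z, x, y))
    return rows


def mean_y(rows, x, z=None):
    """Observed frequency of Y=1 among rows with the given x (and z, if provided)."""
    selected = [r[2] for r in rows if r[1] == x and (z is None or r[0] == z)]
    return sum(selected) / len(selected)


def naive_effect(rows):
    """Level 1 (association): P(Y=1 | X=1) - P(Y=1 | X=0), ignoring Z."""
    return mean_y(rows, 1) - mean_y(rows, 0)


def adjusted_effect(rows):
    """Level 2 (intervention) via back-door adjustment: average over strata of Z."""
    n = len(rows)
    effect = 0.0
    for z in (0, 1):
        p_z = sum(1 for r in rows if r[0] == z) / n
        effect += p_z * (mean_y(rows, 1, z) - mean_y(rows, 0, z))
    return effect


rows = simulate(100_000)
print(f"naive difference:    {naive_effect(rows):.3f}")
print(f"adjusted difference: {adjusted_effect(rows):.3f}")
```

In this toy model the naive comparison comes out near 0.5, because it mixes the effect of X with the effect of Z, while the adjusted estimate comes out near the true value of 0.2. The adjustment is only valid under the stated assumption about Z; choosing which variables to adjust for is exactly the part that requires a causal diagram rather than the data alone.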
For more than a century, statistics has avoided one of the fundamental cognitive processes of the human mind: the analysis of cause and effect. Today, for the very first time in the history of mathematics, we have the mathematical tools to analyze data and answer causal questions: it is time to unleash the power of the ladder of causation!