The quest for the "why"? that matters: an overview of causal inference in modern statistics

The quest for the "why" that matters: an overview of causal inference in modern statistics

Today, being data-driven is mandatory, but sometimes even a simple data-driven analysis may turn out to be completely misleading. Why? Because analysts look for correlations, whereas the human cognitive process looks for cause-effect relationships; this is wrapped up in the mantra "Correlation is not causation". Let's look at a couple of toy examples:

  • Every morning, the rooster crows at sunrise; consequently, the two events are strongly correlated, but everybody would accept that roosters do not cause the sun to rise;
  • During the summer season, there is a higher consumption of ice cream and a higher number of sunburns, resulting in a strong correlation between ice-cream consumption and sunburns; again, ice cream does not cause sunburns, but both are caused by higher solar radiation during summer (the small simulation below reproduces exactly this effect).
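
A minimal simulation sketch (purely illustrative, with made-up numbers) of the ice-cream example: both variables are driven by solar radiation, so they end up strongly correlated even though neither causes the other.

import numpy as np

rng = np.random.default_rng(42)
solar_radiation = rng.normal(loc=5.0, scale=2.0, size=10_000)        # the common cause
ice_cream_sales = 10 * solar_radiation + rng.normal(0, 5, 10_000)    # driven by radiation
sunburns = 3 * solar_radiation + rng.normal(0, 5, 10_000)            # driven by radiation

# Strong positive correlation (about 0.75) despite no causal link between the two
print(np.corrcoef(ice_cream_sales, sunburns)[0, 1])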

Hence, how can we actually shift the attention from correlation towards causation, to correctly interpret the real causes behind the business phenomena that matter the most in each industry? That's where causal inference comes into play.

Every day in Enel Global Infrastructure & Networks, we collect an outstanding amount of data from our operations, and we use the most advanced Artificial Intelligence and Machine Learning technologies to analyze it: deep neural networks to identify anomalies in assets' images, predictive analytics to estimate energy flows or the mean time to failure of any line segment or grid component. In numbers, every single day we receive more than 60,000 contacts from our customers, we collect 10 billion measurements from sensors, our grid records more than 20,000 events, and we gather 8 Terabytes of data from visual inspections.

Beyond Artificial Intelligence and Machine Learning, sometimes our business questions can be summed up in a single question: "why?" - why does a given event happen and what is it caused by? Unfortunately, the answer is not so straightforward...

Why, Mr. Anderson?

Among the 5 Ws of journalism (i.e., Who, What, When, Where and Why), the question "why" is generally the most controversial one, not being directly observable: looking at a recorded video, one can straightforwardly identify who did what, at a given time and place, but it is not easy to identify "why". Even in one of the nerdiest movies of all, The Matrix, such a concept is highlighted in the dialogue with the Architect or by Agent Smith insistently asking: "Why, Mr. Anderson? Why, why? Why do you do it?"

Considering the nature and the power of such a question, what about trying to formalize it? Is it possible to set aside the ambiguity and complexity of natural language in favor of a formal one? What about creating a mathematical language for causation?

Most of modern statistics has focused on pure data observation, but sometimes data may be misleading, resulting in actual data-driven paradoxes...

Data-Driven Paradox

At the beginning of the Covid-19 outbreak, everybody was analyzing data to assess the case fatality rate and to compare it across different countries.


A recently published IEEE Transactions paper (arXiv:2005.07180) analyzed in depth the comparison between the fatality rates in Italy and China.

And here is where the statistical magic happens:

  • looking at the overall fatality rate (rightmost column), Italy has roughly double the rate of China

... conversely ...

  • looking at the fatality rate split by age range, it is always lower in Italy than in China

Consequently, you are completely puzzled! How can Covid-19 be deadlier in Italy when you look at the overall population and, concurrently, deadlier in China within each age range? Welcome to Simpson's paradox!

The reason is that the two countries' age distributions are different and strongly affect the comparison of the mortality rates: the paper goes through a mathematical description, highlighting age as a confounding factor. But why is such an analysis worth an IEEE Transactions paper in 2021? Because it is not about statistics, it is about causal analysis, an approach that has been neglected for more than a century in mathematics. Let's go through that story...
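
A minimal numeric sketch (with made-up counts, not the paper's actual data) shows how the paradox can arise: within each age group the fatality rate is lower in "Italy", yet its overall rate is higher, simply because its cases skew older.

# Toy figures only, chosen to reproduce the paradox
cases = {
    # country: {age_group: (confirmed_cases, deaths)}
    "Italy": {"<60": (1000, 1), "60+": (1000, 100)},
    "China": {"<60": (1800, 4), "60+": (200, 24)},
}

for country, groups in cases.items():
    by_age = {g: f"{d / c:.2%}" for g, (c, d) in groups.items()}
    overall = sum(d for _, d in groups.values()) / sum(c for c, _ in groups.values())
    print(country, "by age:", by_age, "| overall:", f"{overall:.2%}")

# Italy by age: {'<60': '0.10%', '60+': '10.00%'} | overall: 5.05%
# China by age: {'<60': '0.22%', '60+': '12.00%'} | overall: 1.40%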

From regression back to causation

At the dawn of modern statistics, in 1885, Sir Francis Galton studied physical characteristics of human beings, describing how sons of tall men tend to be taller than average - but not as tall as their fathers - while sons of short men tend to be shorter than average - but not as short as their fathers. Galton first called this phenomenon "reversion" and later "regression toward mediocrity": he explained it through a straight line showing the relationship between fathers' heights and sons' heights, creating the very first regression line in statistics.

Galton conjectured that regression toward the mean was a physical process, nature's way of ensuring that the distribution of height remained the same from generation to generation. In fact, Galton had proven only that one phenomenon - regression to the mean - did not require any particular causal explanation. Later on, Pearson extended such concepts, stating that causation is only a matter of repetition and, in the deterministic sense, can never be proven. Making a long story short, Pearson, the father of modern statistics, completely removed causation from science.

(Pearson belonged to a philosophical school called positivism, which holds that the universe is a product of human thought and that science is only a description of those thoughts. Thus causation, construed as an objective process that happens in the world outside the human brain, could not have any scientific meaning.)


Later, in the 1920s, Sewall Wright (the father of quantitative genetics) was the first to develop a mathematical method for answering causal questions from data, known as path diagrams, which became the cornerstone of modern causal inference. Wright focused on population genetics, and his path diagrams described the inheritance of coat colors in guinea pigs: one of them is among the most famous diagrams in the history of causal inference.

Unfortunately, the world of statistics was still strongly influenced by Pearson's positivism, and the path from regression back to causal analysis took nearly a century: it required the research work of Judea Pearl (2011 Turing Award)...

"Lucky is he who has been able to understand the cause of things" (Virgil)

"Felix qui potuit rerum cognoscere causas".


The whole causal analysis is based on three sequential levels of analysis, identified as "The Ladder of Causation" by Judea Pearl:

  • Level 1 - Seeing or observing. Data are analyzed in order to detect regular patterns; from an evolutionary point of view, many animals are able to detect these kinds of regular patterns in data (i.e., the classical Pearsonian statistical approach)
  • Level 2 - Doing. Predicting the effect of deliberate actions on the surrounding environment, as well as choosing a given action in order to produce a desired outcome; from an evolutionary point of view, few species are endowed with such a skill, and it is generally typical of species able to use tools
  • Level 3 - Imagining and Understanding. The most advanced level of analysis, where one can imagine parallel worlds in which a given action had or had not been performed; this is a capability of human beings. In his book Sapiens, historian Yuval Harari stated that our ancestors' capacity to imagine nonexistent things was the key to everything.

What does the mathematics of Level 2 and Level 3 (i.e., the new ones!) consist of? The point is to translate causal questions into statistical quantities, e.g., to translate questions like "Why do people die more in Italy than in China?" into probability calculus over the available data.

From a practical point of view, how do you perform such a translation without being affected by weird effects like Simpson's paradox? Briefly:

  1. Construct a causal graph showing the hypothetical relationships among variables (like the ones introduced by Sewall Wright): it represents the hypothesis on the causal relationships among variables that has to be validated on data;
  2. Identify on the graph the variables acting as confounding factors (e.g., age as a confounder in the Covid-19 example);
  3. Compute the statistics while adjusting for the confounding factors (the adjustment formula right after this list makes step 3 concrete).
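
In Pearl's notation, step 3 corresponds to the backdoor adjustment formula; written for a treatment X (e.g., country), an outcome Y (e.g., death) and a single confounder Z (e.g., age), it reads:

  P(Y | do(X = x)) = Σz P(Y | X = x, Z = z) · P(Z = z)

In words: instead of the plain conditional probability P(Y | X = x), one averages the age-specific rates over a common age distribution, which is exactly what dissolves Simpson's paradox in the Covid-19 comparison.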

All these steps are referred to as do-calculus and, today, they are implemented by several open-source libraries without requiring any particular statistical know-how.
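
As an example, here is a minimal sketch using the open-source DoWhy library on a synthetic dataset (the column names and the data-generating process are invented purely for illustration):

import numpy as np
import pandas as pd
from dowhy import CausalModel

# Synthetic data: age confounds the relationship between "country" and "death"
rng = np.random.default_rng(0)
n = 5000
age = rng.uniform(20, 90, n)
country = (rng.uniform(size=n) < age / 120).astype(int)   # age influences the country of the case
death = (rng.uniform(size=n) < 0.001 * age).astype(int)   # age influences the outcome
df = pd.DataFrame({"country": country, "death": death, "age": age})

# Step 1: declare the hypothesised causal graph (age -> country, age -> death, country -> death)
model = CausalModel(data=df, treatment="country", outcome="death", common_causes=["age"])

# Step 2: identify the confounders and the corresponding estimand
estimand = model.identify_effect(proceed_when_unidentifiable=True)

# Step 3: estimate the effect of "country" on "death" with backdoor adjustment on age
estimate = model.estimate_effect(estimand, method_name="backdoor.linear_regression")
print(estimate.value)   # close to zero: once age is adjusted for, "country" has no effect here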

For more than a century, statistics has avoided one of the fundamental cognitive processes of the human mind: the analysis of cause and effect. Today, for the very first time in the history of mathematics, we have the mathematical tools to analyze data and answer causal questions: it is time to unleash the power of the ladder of causation!

