
Does Correlation Really Prove Causation?

Mala Deep U.

Published: March 28, 2020

TL;DR: Correlation does not necessarily mean causation! See for yourself in this infographic.

“Correlation does not prove causation”: I came across this statement during the Udacity-Bertelsmann Technology Scholarship Data Track course in 2019, and it struck me. I had been doing exploratory data analysis (EDA) and had summed up my results based on correlation alone, treating causation as established. [Yes, I was wrong!]

That very line from the Bertelsmann Data Track course made me realize that I was steering towards a wrong analysis, so I started to dig deeper to understand the thin line between correlation and causation.

Understanding the phrase “Correlation does not prove causation” and applying the concept in your next data science project will make you doubly confident.

What’s Inside:

Understanding the correlation.
Calculating correlation.
Understanding the causation.
Establishing causation.
The key differences between correlation and causation

Before getting to being doubly confident, let’s understand what each concept means.

What is Correlation?

Correlation is a statistical measure (expressed as a number) that describes the size and direction of a relationship between two or more variables. Put simply, it is a relationship between things. A common objective of an analysis is to identify the extent to which one variable relates to another, i.e., to see how a target variable depends on an independent variable.

A correlation between variables, however, does not automatically mean that the change in one variable is the cause of the difference in the values of the other variable.

How is correlation measured?

Pearson r correlation: Pearson's r is the most widely used correlation statistic for measuring the degree of the relationship between linearly related variables. Its value ranges from -1 to +1. There are three possible results of a correlational study:

Positive correlation: One variable increases; the other variable increases.
Negative correlation: One variable increases; the other variable decreases.
No correlation: There is no apparent relationship between the two variables.
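To make the three outcomes concrete, here is a minimal sketch that computes Pearson's r by hand using only the Python standard library. The helper name pearson_r and the small datasets (hours, score, errors, unrelated) are made up for illustration and are not from any real study.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (numerator of r).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (denominator of r).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data, invented for this example.
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 70]   # rises as hours rises
errors = [9, 8, 6, 5, 3]       # falls as hours rises
unrelated = [2, 1, 0, 1, 2]    # no linear trend against hours

r_pos = pearson_r(hours, score)       # close to +1: positive correlation
r_neg = pearson_r(hours, errors)      # close to -1: negative correlation
r_none = pearson_r(hours, unrelated)  # about 0: no linear correlation

print(round(r_pos, 3), round(r_neg, 3), round(r_none, 3))
```

Note that the last series has a clear V-shaped pattern yet r is about zero, because Pearson's r only captures linear relationships.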

If you are familiar with pandas, DataFrame.corr() computes the pairwise correlation of all columns in a DataFrame. To make the result easier to read and interpret, you can import the Seaborn library and plot the matrix of Pearson correlation coefficients as a heatmap. To know more about it, read my post on EDA.
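As a rough illustration of the pandas approach, the sketch below builds a small DataFrame and calls DataFrame.corr(). The column names and values are invented for the example, and pandas is assumed to be installed.

```python
import pandas as pd

# Hypothetical toy data, invented for this example.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "exam_score": [50, 54, 61, 63, 70, 74],
    "errors_made": [9, 8, 6, 6, 4, 2],
})

# Pairwise Pearson correlation of all numeric columns.
# The result is a square DataFrame with 1.0 on the diagonal.
corr = df.corr(method="pearson")
print(corr.round(2))
```

If Seaborn is also installed, passing this matrix to seaborn.heatmap(corr, annot=True) should draw it as an annotated heatmap, which is usually easier to scan than the raw numbers.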

The correlation coefficient should not be used to say anything about the cause and effect relationship. By examining the value of ‘r’, we may conclude that two variables are related, but that ‘r’ value does not tell us if one variable was the cause of the change in the other.

This is where an understanding of causation is needed.

What is Causation?

Also known as causality or cause and effect, causation indicates that one event is the result of the occurrence of the other event, i.e., there is a causal relationship between the two events.

It tries to answer the question: does one variable impact the other?

How can causation be established?

When data shows a correlation, there may be an underlying causal relationship, but we cannot confidently say that there is a cause-and-effect relation. To establish causation, we can take two further steps after finding a correlation:

Controlled study
Non-spuriousness

Controlled study

A controlled study is the most effective way of establishing causality between variables. In a controlled study, the sample is split into two groups that are comparable in almost every way. The two groups then receive different treatments (the independent variable), and the outcome of interest (the dependent variable) is assessed for each group.

For more on how to perform a controlled study, see the article below.
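The idea can be sketched with a small simulation. Everything here is an assumption made for illustration: the subjects, the baseline distribution, the noise levels, and the TRUE_EFFECT value are synthetic, not data from a real experiment. The point is only that random assignment makes the groups comparable, so the difference in mean outcomes estimates the effect of the treatment.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

TRUE_EFFECT = 5.0  # hypothetical effect of the treatment on the outcome

# Hypothetical subjects, each with their own baseline outcome level.
baselines = [random.gauss(50, 10) for _ in range(2000)]

# Random assignment: shuffle, then split into two comparable groups.
random.shuffle(baselines)
treatment_group = baselines[:1000]
control_group = baselines[1000:]

# Only the treatment group receives the treatment; both get measurement noise.
treated_outcomes = [b + TRUE_EFFECT + random.gauss(0, 2) for b in treatment_group]
control_outcomes = [b + random.gauss(0, 2) for b in control_group]

# Difference in group means estimates the causal effect of the treatment.
estimated_effect = statistics.mean(treated_outcomes) - statistics.mean(control_outcomes)
print(f"estimated effect: {estimated_effect:.2f} (true effect: {TRUE_EFFECT})")
```

Because the groups differ only in the treatment (apart from chance), the estimate should land near the true effect, and the gap shrinks as the groups get larger.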

Non-spuriousness

A spurious, or false, relationship exists when what appears to be an association between two variables is actually caused by a third, extraneous variable, i.e., A and B are correlated, but both are caused by C.

So, non-spuriousness requires that alternative explanations for the observed relationship between two variables be ruled out, i.e., the analyst must take on the challenge of ruling out spurious relationships before treating the relationship as genuine.

Find more about spuriousness and causation in the article below.

Now that we have covered the basics of correlation and causation, we can look at the difference between them.
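One way to see a spurious relationship is to simulate it. In the sketch below, A and B never influence each other; both are driven by a third variable C. The data is synthetic, and the check used here, a first-order partial correlation of A and B given C, is just one common technique for controlling for a known third variable. It is my own choice for illustration and is not prescribed by this article.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is repeatable

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

n = 5000
c = [random.gauss(0, 1) for _ in range(n)]   # the extraneous variable C
a = [ci + random.gauss(0, 0.5) for ci in c]  # A depends only on C
b = [ci + random.gauss(0, 0.5) for ci in c]  # B depends only on C

r_ab = pearson_r(a, b)  # looks like a strong A-B relationship
r_ac = pearson_r(a, c)
r_bc = pearson_r(b, c)

# Partial correlation of A and B after controlling for C.
partial_ab_given_c = (r_ab - r_ac * r_bc) / math.sqrt(
    (1 - r_ac ** 2) * (1 - r_bc ** 2)
)

print(f"r(A, B)     = {r_ab:.3f}")
print(f"r(A, B | C) = {partial_ab_given_c:.3f}")
```

A and B are strongly correlated on the surface, but once C is controlled for, the association all but disappears, which is the signature of a spurious relationship. In real data the hard part is knowing which C to look for.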

So, What’s the difference between correlation and causation?

Correlation and causation are often confused because the human mind likes to find patterns even when they do not exist. A stable association between two variables does not let us assume that one causes the other, and even a strong correlation does not justify jumping to causation without at least a randomized controlled experiment.

E.g., smoking is correlated with alcoholism, but it does not cause alcoholism.

This example shows that there is a correlation, but it is not causation.

In practice, however, it remains difficult to establish causation, compared with establishing correlation.

Conclusion

Understanding causation is a difficult problem. Looking at a correlation and jumping to bold claims without checking causation is the wrong approach. Unless and until causation can be clearly identified, we should assume that we are only seeing a correlation. The more confident you become at identifying true correlations and causation within your dataset, the better you will be in the data science domain.

If you have any questions or thoughts on the article, feel free to reach out in the comments below, or through direct message on LinkedIn, or through my website.

This article first appeared on the blog of Analytics Vidhya at Medium.

PS: If you would like to get more data science-related content delivered directly to your email, check out the link below.
