Correlation Does Not Imply Causation
Note: A data analyst should not jump to conclusions. Let the data and further experimentation speak for themselves.
Modern scientists are carefully trained to avoid conflating causation and correlation when describing research results. A correlation between two variables may reflect the causal effect of one variable on the other, or the causal effect of a third variable on both. Yet when study participants were given a statement associating two variables, with minimal context, they tended to infer a causal relation in which the first variable causes the second.
Correlation: A relationship or connection between two variables in which, whenever one changes, the other is likely to change as well.
Causation: A relationship in which a change in one variable causes the other variable to change.
A causal relationship requires valid experimentation and analysis to verify. In correlated data, a pair of variables are related in that one is likely to change when the other does. This relationship might lead us to assume that a change in one variable causes the change in the other, but that does not follow. Bias may lead us to conclude that one event must cause another if both changed in the same way at the same time. There are many forms of cognitive bias, or irrational thinking patterns, that lead to faulty conclusions and poor economic decisions. We often can’t admit or accept that we’re wrong about something, even if that attitude eventually causes harm and loss. Relying too heavily on your own personal beliefs, being overconfident, and trusting unproven sources of information often produces an illusion of causality. The desire to make money can also cloud your logic. As a result, you might end up spending more on marketing and other business expenses than they earn back, hurting your return on investment (ROI).
Our brains are wired for cause-and-effect cognitive bias. We need to make sense of large amounts of incoming data, so the brain simplifies it using mental shortcuts called heuristics. These are often useful and accurate, but not always. Heuristics go wrong whenever you believe that correlation implies causation. Correlation is what we fall back on when we cannot see under the covers: the less information we have, the more we are forced to rely on observed correlations. Similarly, the more information we have, the more transparent things become and the better we can see the actual causal relationships.
There is also such a thing as a spurious correlation: a mathematical relationship in which two or more events or variables are associated but not causally related, due either to coincidence or to the presence of a third, unseen factor.
Researchers concluded that children between 4 and 6 years old who took music lessons showed evidence of boosted brain development in areas related to memory and attention. But there are other variables to consider. Taking music lessons is also an indicator of family wealth, so these children probably had access to other resources known to boost brain development, such as good nutrition. Yes, there is clearly a correlation, but there is no actual evidence of causation. We need more data to get a true causal explanation.
Research also reveals that both the number of cancer cases and the number of mobile phones have gone up in the last 20 years. The brain, with its cause-and-effect bias, processes this as “mobile phones cause cancer.” There is no proof of that other than the fact that both numbers happened to increase. A lot of other things have also increased in the past 20 years, and they cannot all cause cancer or be caused by mobile phone use. It is not analytically sound to conclude from this trend alone that mobile phone usage increases cancer risk.
Do higher earnings cause higher education? Does higher education cause higher earning potential? We do not know. However, we can make predictions. We can use the correlation to predict an individual’s earning potential from their education, and we can also predict their education from their earnings. In the absence of experimental evidence, it is very difficult to know whether the higher earnings observed for better-educated workers are caused by their education, or whether individuals with greater earning capacity have chosen to acquire more schooling.
A confounding variable affects both variables and makes them seem causally related when they are not. For example, ice cream sales and violent crime rates are closely correlated, but they are not causally linked with each other. Instead, hot weather, a third variable, affects both separately. A directionality problem occurs when two variables correlate and might actually have a causal relationship, but it is impossible to conclude which variable causes changes in the other. Without controlled experiments, it is hard to say whether it was the variable you are interested in that caused the changes in another variable.
One reason correlation may not equal causation is that some third variable affects both X and Y at the same time, making X and Y move together. The technical term for this missing, usually unnoticed variable is “omitted variable.” For example, a study of the sex-income relationship was summarized as “People Who Have More Sex Make the Most Money,” but the underlying factor could be health: sexual activity is associated with good health, endurance, mental well-being, mental capacities, and dietary habits, so it could be acting as a health indicator, and health in turn might influence returns to labor market activity.
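The confounding effect described above, where temperature drives both ice cream sales and crime, can be sketched with a small simulation. This is only an illustration under stated assumptions: the data are synthetic, and the coefficients, noise levels, and the helper names pearson and partial_corr are invented for the example. It compares the raw correlation with the partial correlation after controlling for temperature.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def partial_corr(xs, ys, zs):
    """Correlation of xs and ys after removing the linear effect of zs."""
    r_xy = pearson(xs, ys)
    r_xz = pearson(xs, zs)
    r_yz = pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic (made-up) daily data: temperature drives both series,
# and neither series has any effect on the other.
temperature = [rng.uniform(0, 35) for _ in range(5000)]
ice_cream = [2.0 * t + rng.gauss(0, 10) for t in temperature]
crime = [0.5 * t + rng.gauss(0, 3) for t in temperature]

raw = pearson(ice_cream, crime)
adjusted = partial_corr(ice_cream, crime, temperature)

print(f"raw correlation:             {raw:.2f}")
print(f"controlling for temperature: {adjusted:.2f}")
```

In this setup the raw correlation is strongly positive, while the correlation that remains after accounting for temperature is close to zero, which is what we would expect when a third variable is the only link between the two.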
Another reason that X and Y moving together may not imply that X causes Y is that Y might be causing X instead. The technical term for this is “reverse causality.”
A further reason is that the sample we are looking at may not be representative of the population of interest. The technical term for this is “sample selection.” In the study of the sex-income relationship, for instance, it could be that a large group of unemployed people (who hence earn nothing) are having lots of sex but never appear in the data because their wages are zero. If these people did appear in the data, the relationship might look very different.
Finally, the outcomes we are interested in are often difficult to measure and can only be imperfectly observed. When people with certain traits are more likely to misreport the variable of interest, we can be led to infer incorrect relationships. As one reader commented, it could also be that people who lie about their income tend to also lie about how often they have sex.
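The sample selection problem can also be illustrated with synthetic data. The sketch below is not a model of the sex-income study: the two traits, the cutoff of 1.0, and the selection rule are all assumptions made up for the example. It draws two independent traits and then keeps only the units that pass the selection rule, to show how a relationship can appear in the observed sample that does not exist in the full population.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical population: two traits drawn independently of each other,
# so there is no causal link and no real association between them.
population = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(20000)]

# Hypothetical selection rule: a unit only shows up in the data
# when the sum of its two traits exceeds a cutoff.
sample = [(a, b) for a, b in population if a + b > 1.0]

r_population = pearson([a for a, _ in population], [b for _, b in population])
r_sample = pearson([a for a, _ in sample], [b for _, b in sample])

print(f"full population: r = {r_population:.2f} (n = {len(population)})")
print(f"selected sample: r = {r_sample:.2f} (n = {len(sample)})")
```

The population correlation is near zero, but the selected sample shows a clearly negative one. Nothing causal changed; only who made it into the data.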
Here are two facts that are entirely true: kids eat more ice cream in the summer than in other months, and kids are more likely to drown in the summer. You could look at these facts and say: wow, ice cream must cause drownings! In fact, ice cream sales go up in the summer and pool usage goes up in the summer. The two are not causally related, but if you look only at the data you might assume they are.
The correlation-causation fallacy is the assumption of a cause-and-effect relationship simply from correlation. It is prevalent in most societies, not least because everyone working in marketing would like you to believe that buying their product makes your life better, without taking the time to run a rigorous scientific experiment to test that. In 2001, a scientific paper noted that people who eat lots of vegetables and olive oil have less wrinkled skin, and that is a valid observation. A nutritionist pounced on this result and began claiming that eating olive oil causes you to get fewer wrinkles. This ignores several confounding factors. Olive oil is expensive compared to other cooking oils, so if you can afford it, you are more likely to have an indoor job with less sun exposure and less likely to smoke, not to mention the hundreds of other lifestyle differences that the study did not consider.
We cannot simply assume causation even if we see two events happening, seemingly together, before our eyes. Why? First, our observations are purely anecdotal. Second, there are several other possible explanations for an association.
For instance, maybe the opposite is true: B actually causes A.
The two are correlated, but there is more to it: A and B are both actually caused by C.
There’s another variable involved: A does cause B, but only as long as D happens.
There is a chain reaction: A causes E, which in turn causes B.
It might be tempting to label two variables as “cause and effect.” But doing so without confirming causality in a robust analysis can lead to a false positive, where a causal relationship seems to exist but is not actually there. A false positive can occur if you do not extensively test the relationship between a dependent and an independent variable.
False positives are problematic for product insights because you might incorrectly think you understand the link between important outcomes and user behaviors. After finding a correlation, do not draw conclusions too quickly. Correlation is just the first step: take time to find the hidden factors, verify them, and only then conclude. While correlation is easily observable, determining causation is much more complicated and requires an appropriate experimental design.
Just because people in the UK tend to spend more in the shops when it is cold and less when it is hot does not mean cold weather causes frenzied high-street spending. A more plausible explanation is that cold weather tends to coincide with Christmas and the new year sales.
Consider underlying factors before reaching a conclusion. In some cases there are hidden factors that connect the variables at some level. In the example of ice cream sales and homicide rates, weather is the hidden factor behind both. In summer, people go out, enjoy the sunny days, and cool off with ice cream. So when it is sunny, more people are outside and there are more potential victims for predators.
Setting coincidence aside, there is no correlation without causation. If neither A nor B causes the other, and the two are correlated, there must be some common cause between them. It may not be a direct cause of each, but it is there somewhere. This has a powerful implication: you need to control for common causes if you are trying to estimate the causal effect of A on B.
In the same way that correlation does not imply causation, a lack of correlation does not imply a lack of causation. It has to work both ways. This might seem like a strange point to make. Many people assume that correlation is the minimum requirement, after which other forms of analysis have to be applied. But it is important to remember that truth is independent of measurement: something does not become less true because of our inability to measure it. If a cause produced an effect, then it did so whether or not we have a way of measuring the causation. Just because you cannot see a correlation does not mean that no causation was at play. It is vital not to forget this. It is not easy to measure and establish causation, and there is no set path that guarantees an easy test. It all depends on the situation at hand and the kind of causal relationship that needs to be tested.
Of course, you cannot just assume that correlation implies causation either. Nature wired humans to see patterns, and our ability to properly process that urge seems to short-circuit the longer we spend gambling. We can rationally accept that independent events like coin flips keep the same odds no matter how many times you perform them. But we also view those events, less rationally, as streaks, making false mental correlations between random events. Viewing the past as prelude, we keep thinking that after a run of heads the next flip ought to be tails.
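As a small illustration of the earlier point that a lack of correlation does not imply a lack of causation, consider a made-up relationship in which y is completely determined by x, but not in a straight line. The relationship y = x² and the helper name pearson are assumptions chosen for the sketch. The Pearson coefficient only measures linear association, so it can miss a dependence like this entirely.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical system: x fully determines y, with no noise at all.
x = list(range(-10, 11))   # values symmetric around zero
y = [v * v for v in x]     # y = x squared

r = pearson(x, y)

# The positive and negative halves cancel, so the linear correlation
# is essentially zero even though y depends entirely on x.
print(f"|r| = {abs(r):.6f}")
```

Here changing x certainly changes y, yet a correlation check alone would suggest the two are unrelated. A scatter plot or a non-linear measure would be needed to see the dependence.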
Correlation tests for a relationship between two variables. Many experienced executives would assume that more working hours cause more sales and start making their sales team work nonstop. While it is possible that working more hours causes more sales, a high correlation coefficient is not hard evidence for that. Another possibility is reverse causation: because of the increase in sales, there is more paperwork, and therefore a need to stay longer at the office to complete it. In this scenario, working more hours does not cause more sales. There may also be a third factor responsible for the association between the two variables. For example, experienced salespeople may work longer hours and also do a better job of selling. In that case, the real cause is having employees with a lot of sales experience, and the recommendation should be to hire more experienced sales professionals. Seeing two variables move together does not necessarily tell us whether one causes the other. This is why we commonly say “correlation does not imply causation.”
Conclusion: Correlation is about analyzing static historical data sets and considering the relationships that might exist between observations and outcomes. The fact remains that predictions do not change a system. The goal is not simply to make better predictions through better causal understanding; rather, we need to know the precise limits of the techniques we use to make predictions and what each method can do for us. Whenever we see a relationship between two variables, it is wise to be conservative and assume that the relationship is correlational rather than causal. The correlation-causation fallacy has been widely studied and is one of the biggest pitfalls you can fall into early in your data upskilling journey. No self-respecting statistician would contest the notion that correlation does not imply causation. When you find a correlation, treat it as a prompt to examine the situation further and determine whether causation can be established between the variables.