Exploratory Analytics and COVID-19: A Case Study in Real Time (Part 2)
Many discussions, explanations, and examples of spurious correlations can't help but include actor Nicolas Cage (more on that topic in a moment), but my own favorite example is the so-called "Super Bowl Indicator" relative to the stock market. As the legend goes, if the Super Bowl winner is either:
1) a team from the National Football Conference, or
2) a team that was originally in the National Football League before the 1970 merger, but then moved over to the American Football Conference,
then the "stock market" would be higher by the end of the year. A couple of quick notes before proceeding:
1) "The stock market" could mean the Dow Jones Industrial Average (DJIA), the S&P 500 Index, or the New York Stock Exchange (NYSE) Index, for purposes of this particular - ahem - "leading indicator." In some years, the indicator proved true for, say, the DJIA but not one or both of the other two indices; likewise for the NYSE Index but not the other two; and so on. In most of these "split decision" situations, the end-of-year returns hovered within a few percentage points, either way, of break-even on all of the indices.
2) Re: the NFL team alignment on either side of the indicator, the three teams that were originally NFL teams but went to the AFC as part of the 1970 merger - and thus would indicate an up year for stocks if one of them were the Super Bowl winner - were my hometown Pittsburgh Steelers (six times!); the original Cleveland Browns; and the Baltimore (later Indianapolis) Colts.
Starting with Super Bowl I in 1967 and continuing for thirty more years through Super Bowl XXXI in 1997, the Super Bowl indicator was correct in all but 3 years using the DJIA, and in all but 4 years according to the S&P 500 or the NYSE index!
I can still remember Monday, January 21, 1985 - the morning after the San Francisco 49ers won Super Bowl XIX - when the stock market set a new all-time record. In fact, the Page 1 story in the Pittsburgh Post-Gazette the following morning even cited the Super Bowl Indicator as a possible cause for the stock market's blast-off performance that Monday: "Under a whimsical theory that has enjoyed much publicity in recent years, a victory in the pro football championship by a National Football Conference team such as the San Francisco 49ers, who defeated the Miami Dolphins on Sunday, is supposed to be a favorable portent for stock prices." (I looked that story up on Google News for this article, FYI.)
Now that's a spurious correlation! In other words: just a coincidence, folks. True, for 31 years, from 1967 through 1997, that was one heck of a coincidence, being correct all but 3 (or 4) years...but still only a coincidence!
Seriously...what force in the universe could possibly link the Super Bowl outcome in mid-January (at least in those days) with the fate of the stock market over the next 11 1/2 months? If you can find that mystical force, then call Rod Serling!
(FWIW, since 1997, the Super Bowl Indicator has come back to earth. In fact, between 2000 and 2019, the indicator failed 11 times and was correct only 9 times.)
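For the numerically inclined, here is a minimal Python sketch of what those two track records amount to. It uses only the DJIA counts quoted above. The function name binomial_tail and the 50/50 coin-flip baseline are illustrative assumptions, not anything claimed by the indicator's fans.

```python
from math import comb


def binomial_tail(n: int, k: int, p: float = 0.5) -> float:
    """Probability of at least k successes in n independent trials,
    each succeeding with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Counts taken from the article: 31 years (1967-1997), 3 misses using the DJIA.
early_years, early_hits = 31, 31 - 3
# 2000-2019: correct 9 times, wrong 11 times.
late_years, late_hits = 20, 9

print(f"1967-1997 hit rate: {early_hits / early_years:.0%}")
print(f"2000-2019 hit rate: {late_hits / late_years:.0%}")

# ASSUMPTION: each year's call is a fair coin flip (p = 0.5).
chance = binomial_tail(early_years, early_hits)
print(f"P(at least {early_hits} of {early_years} right by coin flip): {chance:.1e}")
```

A fair coin is probably too harsh a baseline: stocks rise in more years than they fall, and NFC or original-NFL teams won most of the Super Bowls in that stretch, so agreement by chance was more likely than a coin flip suggests. Raising p in binomial_tail shows how quickly the apparent miracle shrinks.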
And that takes us to actor Nicolas Cage.
"Nick Cage Movies Vs. Drownings, and More Strange (but Spurious) Correlations" reads the headline in a National Geographic (yes, that National Geographic!) article published online on September 11, 2015. Essentially, among other examples of spurious correlations - i.e., just coincidences, rather than true causal relationships - we find a correlation between the number of Nicolas Cage movies released and the number of people who drowned by falling into a swimming pool.
Then there's this article from August 2014, "Bizarre correlations that will leave you wishing Nicolas Cage would retire," which cites the same Nicolas Cage statistic.
But just as with the Super Bowl/Stock Market Indicator, take a step back and rejoin reality. Could there possibly be any actual causal link between the number of Nicolas Cage movies released in a given year and anything...other than Mr. Cage's salary and bank account balances?
Here's the linkage between exploratory analytics (see Part I of this three-part series) and spurious correlations...and it's not a good one!
Not that long ago, I was discussing the "analytics continuum" with someone, and attempting to draw a distinction between predictive and exploratory analytics. Whereas the discipline of predictive analytics tends to be constrained - i.e., we build models for a specific business process, or a specific function within a business process, with the mission of "tell me what is likely to happen" - exploratory analytics are more open-ended, with a mission that can be described as "tell me something interesting and important from all of this data."
The other person immediately countered my explanation with "yeah, but then you'll just wind up with all kinds of spurious correlations."
Maybe...or maybe not.
The idea of exploratory analytics is not to simply produce as many "interesting" correlations as can possibly be found "in the data." In fact, the discipline of analytics in general helps us detect and filter out spurious correlations.
Analytics are every bit as much an art as a science, though, and sometimes it's difficult to definitively determine whether a given finding - some particular correlation - is spurious or, conversely, might actually be meaningful. I would argue that if we do a thorough job at turning powerful analytics loose on mountains of data, as we do with broad-based exploratory analytics versus its more constrained predictive analytics cousin, we are actually fated to produce some "interesting and important" findings that, when further analyzed, do turn out to be spurious rather than significant and meaningful.
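To see why a wide-open search is fated to turn up coincidences, consider a small simulation, sketched here in Python under stated assumptions. The data are pure random noise, and the names pearson and count_strong_correlations, the 1,000 candidate series, the 10 yearly observations, and the 0.8 cutoff are all hypothetical choices made for illustration.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def count_strong_correlations(n_series=1000, n_points=10, threshold=0.8, seed=42):
    """Count random candidate series whose |r| against one random
    target series exceeds the threshold. All series are unrelated noise."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_points)]
    hits = 0
    for _ in range(n_series):
        candidate = [rng.gauss(0, 1) for _ in range(n_points)]
        if abs(pearson(target, candidate)) > threshold:
            hits += 1
    return hits


print("Noise series that look strongly correlated:", count_strong_correlations())
```

Every series here is noise by construction, so any candidate that clears the cutoff is spurious. With only 10 data points per series, a small fraction of unrelated candidates can be expected to clear even a high bar, which is why findings from a broad search deserve a second look before anyone acts on them.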
Let's go back to the finding that I referenced in Part I: the possible relationship between a person's blood type and susceptibility to the COVID-19 virus.
It's entirely possible that as further analysis is done, this relationship doesn't hold up. Perhaps the seemingly heightened susceptibility of people with Type A blood turns out to be purely a coincidence. Or perhaps there's some other "hidden" variable that is the root cause of this susceptibility rather than the blood type itself.
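The "hidden variable" scenario can be sketched with synthetic data. In this minimal Python example, everything is made up for illustration: a hidden yes/no factor raises both the chance of belonging to a group and the chance of an outcome, while group membership itself has no effect at all. The names simulate, outcome_rate, and gap, and every probability used, are hypothetical and have nothing to do with real blood-type or COVID-19 data.

```python
import random


def simulate(n=50_000, seed=0):
    """Generate (hidden, in_group, outcome) records. The outcome depends
    only on the hidden factor, never on group membership."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        hidden = rng.random() < 0.5
        in_group = rng.random() < (0.6 if hidden else 0.2)
        outcome = rng.random() < (0.3 if hidden else 0.1)
        rows.append((hidden, in_group, outcome))
    return rows


def outcome_rate(rows, group_value, hidden_value=None):
    """Share of records with the outcome, optionally within one hidden stratum."""
    selected = [
        outcome
        for hidden, in_group, outcome in rows
        if in_group == group_value and (hidden_value is None or hidden == hidden_value)
    ]
    return sum(selected) / len(selected)


def gap(rows, hidden_value=None):
    """Outcome rate inside the group minus outcome rate outside it."""
    return outcome_rate(rows, True, hidden_value) - outcome_rate(rows, False, hidden_value)


rows = simulate()
print(f"Gap, all records pooled:     {gap(rows):+.3f}")
print(f"Gap, hidden factor present:  {gap(rows, True):+.3f}")
print(f"Gap, hidden factor absent:   {gap(rows, False):+.3f}")
```

In the pooled data, the group appears more prone to the outcome; once the records are split by the hidden factor, the gap should shrink to roughly zero in each half. That is the signature of a confounder, and checking for it is only possible if the hidden factor was recorded in the first place.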
How can we know?
Let's slightly restate what the referenced article is telling us: we have a hypothesis that there is a causal relationship between one's blood type and COVID-19 susceptibility.
What do we do with a hypothesis such as this one? We either prove or disprove it...and in the meantime, we can proceed to take certain preliminary actions based on that hypothesis. In Part III, I'll show how we use a well-defined analytical workflow to actually embrace the idea of analytics-produced hypotheses rather than shy away from them for fear of being suckered in by spurious correlations.
_____________
Alan Simon is the Managing Principal of Thinking Helmet, Inc., a boutique management and technology strategy consultancy specializing in analytical business process management, business intelligence/analytics, and enterprise-scale data management.
Alan is the author or co-author of 31 business and technology books, dating back to 1985. He is also the author of five LinkedIn Learning/Lynda.com courses, the most recent being EDGE ANALYTICS: IoT AND DATA SCIENCE.