Everybody Lies: The Curse of Dimensionality
The wife of one of my students read this book recently and agreed to lend it to me as an airplane book for a recent trip. I totally enjoyed it! The book is primarily about how, in the age of Big Data, we can conduct sociological experiments based not on what people tell an interviewer, but on what they do on the Internet when they are not knowingly the subject of observation. (Of course, when we are on the Internet we are ALWAYS the subject of observation!)
The book covers all the now-traditional Internet-age Big Data topics. Like most sociologists (and most book editors who steer authors toward What Sells), the author spends quite a bit of time on our fascination with porn. But the book also takes up Facebook and A/B testing, and how A/B testing has swept Silicon Valley: when you have millions of users, show a small subset Thing A while the others see Thing B, and measure which leads to deeper interaction and "stickiness" on your website. Which is why QAnon is a thing and President Trump is President. It turns out the things that fascinate us are not always the things that are good for us.
The book is full of amazing stories where Big Data made incredible predictions. One of my favorites was the story of Jeff Seder, a Wall Street quant who walked away from the city to apply his science to picking the ultimate race horse. One of his clients ends up buying the horse that will become American Pharoah, the first horse in 37 years to win the Triple Crown. I won't spoil the story by telling how, but suffice it to say that Seder had studied hundreds of variables in thousands of horses, compared them to their race results, and found one magical variable that convinced him that American Pharoah was going to be an invincible race horse.
But then, as if to counter his own mystique about Big Data, the author unveils what to me was the most important observation in the book: "The Curse of Dimensionality." It explains a phenomenon that has been widely discussed in medical research: the results of many, and possibly MOST, medical studies published in the literature CANNOT BE REPLICATED. See for example the 2017 BBC report "Most scientists cannot replicate studies by their peers," covering The Reproducibility Project led by Tim Errington at the University of Virginia, which showed that the findings of many cancer studies cannot be confirmed in follow-up studies. Two years earlier, the same lab had found a similar pattern across 100 published psychology experiments, as reported in The Atlantic's story "How Reliable Are Psychology Studies?" In that study, while 97% of the 100 original studies reported "statistically significant results," only 36% of the replication studies did so. The trend has been observed for some time, as documented in John Ioannidis' 2005 paper "Why Most Published Research Findings Are False." In that paper, Dr. Ioannidis describes a whole-genome association study testing whether any of 100,000 gene polymorphisms are associated with a susceptibility to schizophrenia, and he comments that "commercially available data mining packages actually are proud of their ability to yield statistically significant results through 'data dredging.'"
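The arithmetic behind that "data dredging" worry is easy to sketch. Here is a minimal back-of-the-envelope calculation, assuming the conventional p < 0.05 threshold and, purely for illustration, that none of the 100,000 polymorphisms has any real association:

```python
# Multiple-testing arithmetic: assume the conventional p < 0.05
# threshold and (for illustration) that NO tested polymorphism has
# any real association with the disease.
n_tests = 100_000   # gene polymorphisms screened in the study
alpha = 0.05        # conventional significance threshold

# Expected number of "statistically significant" hits by chance alone:
expected_false_positives = n_tests * alpha
print(expected_false_positives)  # 5000.0

# Probability of at least one false positive somewhere in the screen:
p_at_least_one = 1 - (1 - alpha) ** n_tests
print(p_at_least_one)  # 1.0
```

Five thousand spurious "discoveries" before a single real effect is on the table, which is exactly why corrections for multiple comparisons (or replication) matter.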
Ioannidis lists six corollaries in his paper:
- The smaller the studies conducted in a scientific field, the less likely the research findings are to be true.
- The smaller the effect sizes in a scientific field, the less likely the research findings are to be true.
- The greater the number and the lesser the selection of tested relationships in a scientific field, the less likely the research findings are to be true.
- The greater the flexibility in designs, definitions, outcomes, and analytic modes in a scientific field, the less likely the research findings are to be true.
- The greater the financial and other interests and prejudices in a scientific field, the less likely the research findings are to be true.
- The hotter a scientific field (with more scientific teams involved), the less likely the research findings are to be true.
Unfortunately, Big Data science is fraught with all of these, especially corollaries 3, 4, 5, and 6! Is there any hotter topic right now than personalized medicine? And wasn't the Human Genome Project supposed to unlock all of the secrets to eternal life? Venture capital firms bet billions each year on being able to find The Magic Gene, or its equivalent in many other fields.
That's why I was so fascinated with "The Curse of Dimensionality" and the clear way Seth Stephens-Davidowitz explains it in "Everybody Lies." He was first exposed to the concept while running experiments to see if his Big Data work could lead to an unbeatable hedge fund.
Here's an excerpt from that passage:
"Suppose your strategy for predicting the stock market is to find a lucky coin -- but one that will be found through careful testing. Here's your methodology: You label one thousand coins - 1 to 1,000. Every morning, for two years, you flip each coin, record whether it came up heads or tails, and then note whether the Standard & Poor's Index went up or down that day. You pore through all your data. And voila! You've found something. It turns out that 70.3 percent of the time when Coin 391 came up heads the S&P Index rose. The relationship is statistically significant! Highly so! You have found your lucky coin!
Just flip Coin 391 every morning and buy stocks whenever it comes up heads. Your days of Target T-shirts and ramen noodle dinners are over. Coin 391 is your ticket to the good life!
Or not."
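The coin experiment in the excerpt is easy to simulate. Here is a quick sketch, with both the coins and the "market" generated at random so that any edge the best coin shows is pure chance (the coin count and the roughly-two-years of trading days are taken from the excerpt; everything else is my own illustrative setup):

```python
import random

random.seed(391)  # any seed will do; the effect appears regardless

N_COINS = 1000
N_DAYS = 500  # roughly two years of trading days

# A market with no signal at all: each day is an independent coin flip.
market_up = [random.random() < 0.5 for _ in range(N_DAYS)]

best_coin, best_rate = None, 0.0
for coin in range(1, N_COINS + 1):
    flips = [random.random() < 0.5 for _ in range(N_DAYS)]
    # Whether the market rose on each day this coin came up heads.
    rose_on_heads = [up for heads, up in zip(flips, market_up) if heads]
    if not rose_on_heads:
        continue
    rate = sum(rose_on_heads) / len(rose_on_heads)
    if rate > best_rate:
        best_coin, best_rate = coin, rate

print(f"Coin {best_coin} 'predicted' an up-market "
      f"{best_rate:.1%} of the time it came up heads")
```

With 1,000 candidate coins, the best one reliably lands several percentage points above the market's actual up-rate, despite containing no information whatsoever.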
Seth goes on to explain that when you test enough things, just by chance, you are likely to find at least one that appears statistically significant. He gives the example of a 1998 study by Robert Plomin, who compared the DNA of geniuses (those with an IQ of 160 or higher) to that of "normal people" and found that a gene named IGF2r on chromosome 6 was twice as common in the geniuses. The New York Times heralded the finding, but Dr. Plomin himself retracted it when a follow-up experiment found no such correlation.
Seth Stephens-Davidowitz concludes that section with the recommendation "Don't fall in love with your results," and with a call for the humility to do "out-of-sample" testing: ask a colleague with a DIFFERENT data set to confirm your results before rushing to the journals. Unfortunately, if we all followed this advice, we would likely see bankrupt medical journals and a dearth of funding, since the pressure to prove significance is the key to future funding in the scientific community.
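The out-of-sample advice can be demonstrated with the same lucky-coin setup: "discover" the best coin on the first half of the data, then score it on the second half, which it has never seen. A sketch, again with purely random coins and market, so any in-sample edge is an artifact of selection:

```python
import random

random.seed(7)

N_COINS, N_DAYS = 1000, 500
market_up = [random.random() < 0.5 for _ in range(N_DAYS)]
coins = [[random.random() < 0.5 for _ in range(N_DAYS)]
         for _ in range(N_COINS)]

def hit_rate(flips, ups):
    """Fraction of heads-days on which the market rose."""
    heads_days = [u for h, u in zip(flips, ups) if h]
    return sum(heads_days) / len(heads_days) if heads_days else 0.5

split = N_DAYS // 2

# "Discover" the lucky coin on the first half of the data...
best = max(range(N_COINS),
           key=lambda c: hit_rate(coins[c][:split], market_up[:split]))
in_sample = hit_rate(coins[best][:split], market_up[:split])

# ...then test it on the half it has never seen.
out_of_sample = hit_rate(coins[best][split:], market_up[split:])

print(f"in-sample: {in_sample:.1%}, out-of-sample: {out_of_sample:.1%}")
```

The in-sample score is inflated by the selection itself; out of sample, the coin's "edge" collapses back toward the base rate, which is precisely what out-of-sample testing is designed to expose.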
That was MY favorite story from "Everybody Lies" - but please do pick up your own copy of the book. Delightful read and I'm sure you'll find something that you'll want to share with your friends and colleagues as well. (Another favorite chapter of mine was "Mo Data, Mo Problems" but I'll leave that for you to enjoy on your own.)