The Reproducibility Problem—Can Science be Trusted? (Part 1 in a series of 3)
Dr. Chris Stout
LinkedIn Top Voice | Best Selling Author | Adventurer | Startup Whisperer | (Accidental) Humanitarian | APA's "Rockstar" Psychologist | éminence Grise
I grew up as an undergrad math (nascent computer science) major, took a detour to the school of engineering and technology, and wound up back in the school of science, but this time with psychology as my major, and a sheepskin noting a bachelor's degree in that aforementioned science. My conversion to psychology was via a 101 course in Psychology as a Social Science. Intellectually smitten, I next took 102 Psychology as a Biological Science. You may see a theme emerging.
One of my most enjoyed courses was a 200-level methodology class. Therein we learned of the impact of one's biases and how to guard against their intrusion, what makes for a good (and bad) study design, sampling approaches and sizes, and never proving anything but rather disproving the null hypothesis. I enjoyed this even more than methodology's sibling, statistics, with its controlling for contaminating variables and sussing out the probability that what we thought we had found was in fact untrue.
The moral of the scientific method's story was that when everything is done properly, you get science. From science, you get trust. From trust, you can make judgements about proper actions to take. From taking those actions, you then get the expected, predicted, recurrent result. Or so we thought.
But something’s not right….
In a study published in Nature in 2016, seventy percent of the 1,500 scientists surveyed could not reproduce the outcome of at least one other scientist's experiment. Furthermore, half could not even replicate their own findings when they ran the experiments again.
What?
Perhaps one of the most recognizable researchers in the relatively new field of metascience (the scientific investigation of scientific investigation itself) is John Ioannidis, a Stanford statistician and professor of medicine. He found that the majority of health-related studies published in the previous 10 years could not be replicated, and that around 17% of the studies examined were subsequently contradicted by replication studies.
This does not seem to be limited to healthcare, psychological and medical studies. Indeed, metascientific investigations have found evidence for reproducibility problems in sports science, marketing, economics, and even hydrology.
My first specialty, clinical psychology, has taken this hit particularly hard. Unfortunately, other psychological disciplines have found problems as well, including social psychology, developmental psychology, and educational research.
This just got real
So, if you cannot trust studies on pollutants, how can you craft environmental policy? If you cannot trust studies on teaching approaches, how can you better design educational curricula? If you cannot trust clinical trials on a psychotherapeutic approach or promising medication, how do you accurately instruct graduate or medical students, or properly treat patients, construct clinical guidelines, moderate doses, and protect against untoward side-effects?
If the evidence base is off, then the treatment guidelines predicated on it will likewise be suspect, and perhaps even more concerning—iatrogenic.
So how did this happen?
Cargo Cult Science
This was the title of Richard Feynman's commencement address to the Caltech class of 1974. Therein he describes a psycho-sociological phenomenon from World War II: a community of Pacific Islanders observed runways, towers, and crewmen and took them to be the causes of the food staples and other beneficial cargo that floated down from the heavens under parachute canopies and was shared with them. Once the war was over and the troops departed, the indigenous people built their own runway and towers and mimicked the headset equipment, expecting the same cargo to come raining down.
Now, we can sit back and tut-tut such a silly expectation, but Feynman points out that many of us may first fall in love with our theories and models. We may personally, professionally, and economically invest in their (and our own) being right. We look for any and every particle of evidence to show the world it is so, and by proxy, how smart and cool we are. In other words, we build really, really nice runways and towers.
It’s hard to admit bias. Heck, by definition, a blind-spot is something unable to be seen. So isn’t it hubris to think such things don’t insinuate themselves into our scientific processes as well?
Sometimes it’s dishonesty
It’s one thing to be blinded by one’s own desire to be right about something important. It’s another to just make stuff up. Yes, folks do that, and they get published.
While I'll not psychoanalyze theories as to why some do this, let's look at a study. [Quick pause. I understand there is a great irony in my citing any research finding in an article on not being able to trust research findings. Quite meta, but be that as it may….] Fanelli found that in anonymous surveys, roughly 1–2% of researchers admitted to falsifying or fabricating data. Similarly, Ioannidis found few cases of misconduct in studies looking at the reproducibility of findings.
Sometimes it’s being dumb
There is a term for this—QRPs, or questionable research practices. Echoing Fanelli's and Ioannidis's opinions, such practices are not considered to be especially widespread or impactful. Phew.
Sometimes it’s p-hacking or playing in Anscombe’s Quartet
Head, Holman, Lanfear, Kahn, and Jennions define “p-hacking” or “selective reporting” in their paper as occurring “…when researchers collect or select data or statistical analyses until nonsignificant results become significant.” They found that while they believe that “…p-hacking is widespread throughout science…its effect seems to be weak relative to the real effect sizes being measured. This result suggests that p-hacking probably does not drastically alter scientific consensuses drawn from meta-analyses.”
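To make that concrete, here is a tiny simulation of my own (an illustration of one common flavor of p-hacking, optional stopping, not anything taken from their paper): both groups are drawn from the very same distribution, but we peek at the p-value as the data accumulate and stop the moment it dips below .05. The rate of "discoveries" where there is truly nothing to discover ends up well above the nominal 5%.

```python
# A minimal sketch of p-hacking via optional stopping (illustrative only).
# Two groups come from the SAME distribution, yet peeking at the p-value as
# observations accumulate, and stopping at the first p < .05, yields
# "significant" results far more often than 5% of the time.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def peeking_experiment(max_n=100, step=10, alpha=0.05):
    """Return True if an optional-stopping researcher finds 'significance'."""
    a = rng.normal(size=max_n)  # no true difference between groups
    b = rng.normal(size=max_n)
    for n in range(step, max_n + 1, step):
        if stats.ttest_ind(a[:n], b[:n]).pvalue < alpha:
            return True         # stop early and report the "effect"
    return False

trials = 2_000
hits = sum(peeking_experiment() for _ in range(trials))
print(f"False-positive rate with peeking: {hits / trials:.1%}")  # well above 5%
```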
In a reference I have bookmarked in my web browser, because I need the reminder and it's so cool, Matejka and Fitzmaurice published a paper (and a quite fun video) based on Anscombe's Quartet. "They note this 'quartet' is a group of four data sets, created by the statistician F.J. Anscombe in 1973, that have the same 'summary statistics,' or mean, standard deviation, and Pearson's correlation. Yet they each produce wildly different graphs." For example:
The dataset called the Datasaurus, “like Anscombe’s Quartet, serves as a reminder to the importance of visualizing your data, since, although the dataset produces ‘normal’ summary statistics, the resulting plot is a picture of a dinosaur. In this example (they) use the datasaurus as the initial dataset, and create other datasets with the same summary statistics.”
Schwab notes that “Matejka and Fitzmaurice made 200,000 incremental changes to the ‘Datasaurus’ data set, slightly shifting points so that the summary statistics stayed within one-hundredth of the originals. GIFs that show the slowly shifting points next to the summary statistics hammer their point home.”
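If you want to see the quartet's trick for yourself, here is a quick sketch using the standard Anscombe (1973) values: all four datasets print essentially the same mean, standard deviation, and correlation, even though their scatter plots look nothing alike.

```python
# Verify that Anscombe's four datasets share near-identical summary statistics.
import numpy as np

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

for name, (x, y) in quartet.items():
    x, y = np.asarray(x), np.asarray(y)
    r = np.corrcoef(x, y)[0, 1]
    print(f"{name:>3}: mean_x={x.mean():.2f}  mean_y={y.mean():.2f}  "
          f"sd_y={y.std(ddof=1):.2f}  r={r:.3f}")
# Every row prints roughly 9.00, 7.50, 2.03, 0.816 -- yet plotted, the four
# datasets look completely different.
```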
Sometimes it’s the half-life of facts
As I have opined in the past, in the popular media we all read about how this or that food or activity is alternately good or bad for us. In an article I wrote, How to Protect Yourself from Fad Science, on how to figure out whether to tweak our behavior in light of the latest healthcare headline, I stated "…like most things, the answer is the ultimate in unsatisfying—it depends. We like to have things clear, specific and definitive. Easy to understand and readily meme-able. So, when something we like to eat that was formerly a sin is now reported to be a blessing, then we feel we have the proof for what we knew all along and we feel vindicated with the seal of science."
But as we learned in frosh biology (or was it genetics?) "…not all treatments work the same in all people; if they did, the world would not need NSAIDs in addition to aspirin. Ditto that for exercise, diet, learning, and most anything else, except for maybe math. We're complex beings to which one size fits few. It's more amazing to me to see something that generalizes to an N greater than one.
Ideas evolve and legitimate studies can be in conflict with other studies, without any shenanigans or ethical issues afoot.”
It reminds me of the humorous but true introduction to new medical school students…
50% of what we teach you over the next five years will be wrong or inaccurate. Sadly, we don't know which 50%.
Samuel Arbesman, a Harvard mathematician, coined the term "half-life of facts" in reference to the predictable rate at which scientific findings become obsolete. "What we think we know changes over time. Things once accepted as true are shown to be plain wrong. As most scientific theories of the past have since been disproven, it is arguable that much of today's orthodoxy will also turn out, in due course, to be flawed." In medical science, it can be pretty quick—by some estimates a 45-year half-life. Mathematics does a better job, as most proofs stay proofs.
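Treating that 45-year figure as a rough illustration rather than a precise constant, the arithmetic of a half-life works just as it does in physics: the share of findings still standing after t years is about 0.5 raised to the power of t/45.

```python
# Back-of-the-envelope decay of the medical "evidence base", assuming an
# illustrative 45-year half-life (not a precise, measured constant).
HALF_LIFE = 45  # years

for t in (10, 45, 90):
    remaining = 0.5 ** (t / HALF_LIFE)
    print(f"after {t:>2} years: ~{remaining:.0%} of findings still hold")
# after 10 years: ~86%; after 45 years: ~50%; after 90 years: ~25%
```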
And sometimes it may just be Simpson’s Paradox
Simpson’s Paradox as defined in Wikipedia is a “phenomenon in probability and statistics, in which a trend appears in several different groups of data but disappears or reverses when these groups are combined.” Matejka gives the example wherein “one set of data appears to show crime increasing…yet when that data is broken down by location, there is a strong downward trend in crime in each area–another example of how data that has the same summary statistics can look vastly different when it’s graphed.” No fraud, no foul, but a good cautionary tale vis-à-vis interpretation and conclusion.
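Here is a small, entirely made-up illustration in the spirit of that crime example: the crime rate falls in each area, yet the combined rate rises, simply because the population shifts toward the higher-crime area.

```python
# Simpson's Paradox with invented numbers: per-area crime rates fall,
# but the aggregate rate rises because population moves between areas.

areas = {
    #                 (population, crimes)
    "Area A": {"year1": (50_000, 100), "year2": (10_000, 18)},   # 2.0 -> 1.8 per 1,000
    "Area B": {"year1": (10_000, 100), "year2": (50_000, 450)},  # 10.0 -> 9.0 per 1,000
}

def rate(pop, crimes):
    """Crimes per 1,000 residents."""
    return 1_000 * crimes / pop

for name, years in areas.items():
    r1, r2 = rate(*years["year1"]), rate(*years["year2"])
    print(f"{name}: {r1:.1f} -> {r2:.1f} per 1,000 (falling)")

for year in ("year1", "year2"):
    pop = sum(v[year][0] for v in areas.values())
    crimes = sum(v[year][1] for v in areas.values())
    print(f"Combined {year}: {rate(pop, crimes):.1f} per 1,000")
# Each area improves, yet the combined rate climbs from about 3.3 to 7.8.
```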
In Part 2 of this series we’ll take a look at thoughts as to how we got into this mess.
# # #
If you'd like to learn more or connect, please do at https://DrChrisStout.com. You can follow me on LinkedIn, or find my Tweets as well. Tools and my podcast are available via https://ALifeInFull.org.
If you liked this article, you may also like:
Can AI Really Make Healthcare More Human—and not be creepy?
How to Protect Yourself from Fad Science
Technology Trends in Healthcare and Medicine: Will 2019 Be Different?
Commoditization, Retailization and Something (Much) Worse in Medicine and Healthcare
Fits and Starts: Predicting the (Very) Near Future of Technology and Behavioral Healthcare
Why I think 2018 will (Finally) be the Tipping Point for Medicine and Technology
Healthcare Innovation: Are there really Medical Unicorns?
Can (or Should) We Guarantee Medical Outcomes?
A Cure for What Ails Healthcare's Benchmarking Ills?
Can A Blockchain Approach Cure Healthcare Security's Ills?
Why Medicine is Poised for a (Big) Change
Is This the Future of Medicine? (Part 5)
Bringing Evidence into Practice, In a Big Way (Part 4)
Can Big Data Make Medicine Better? (Part 3)
Building Better Healthcare (Part 2)