The Replication Crisis

The views expressed here are personal and are not endorsed or authorised by my employer

Readers may already have heard something about the “replication crisis” in psychology. The issue first arose about ten years ago, when several high-profile experiments “proving” certain hypotheses were retested, only for the results to turn out radically less substantial the second time around. Psychological research was badly hit, with considerable debate about the implications and a general loss of confidence in the field. More positively, there has recently been a gradual process of remedial action and signs of improvement.

This article explains the crisis and the potential causes, with a later article examining how psychology began to recover from the crisis and what those of us working on evaluation can learn from a close examination of this issue.

Proving psychic effects?

Virtually all social scientists would scoff at the idea that psychic phenomena are real. Thoroughly debunked phenomena such as telepathy, telekinesis, clairvoyance and other forms of extra-sensory perception are more likely to be met with a giggle than serious study. As a result, it was rather surprising when a leading psychology journal published a 2011 paper by Professor Daryl Bem of Cornell University providing evidence that precognition may be a real phenomenon.


Bem’s main experiment tested how well people could recall a list of words, with participants given the chance to rehearse a subset of those words only after the recall test had taken place. Astonishingly, the main finding was that recall was better for the words that participants went on to rehearse after the test.

So, positive evidence that precognition may actually exist? Being sensible, Bem was by no means convinced and suggested in his paper that other psychologists repeat the experiment to see what they could find out. Professor Chris French at Goldsmiths took up the challenge, following the exact approach outlined by Bem and working with other institutions to conduct three separate tests. The results of the replication were what you would expect: there was no evidence that rehearsing the words after the test made any difference to the number of words recalled. The scientific community could sleep soundly again, with parapsychology firmly locked away in the box of bogus pseudo-science.

Replicating other studies

Bem’s study was shortly followed by a tongue-in-cheek paper by Simmons, Nelson and Simonsohn “proving” that listening to When I’m Sixty-Four by The Beatles actually made listeners 18 months younger. All the authors needed to do to create this impossible result was adopt four specific unethical approaches [1]. By deliberately highlighting the large-scale implications of taking up this small number of “researcher degrees of freedom”, the issue of replication began to gain traction. Still, while Bem and Simmons et al had illustrated that replication was potentially a problem, it was still not clear whether this was a genuinely widespread issue in psychology.
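To get a feel for why these small freedoms matter, here is a minimal simulation of my own (not taken from the Simmons et al paper). Even when there is no true effect at all, letting the analyst choose between two outcome measures and collect more data when the first result is not significant pushes the false-positive rate well above the nominal 5%. The group sizes, number of outcomes and stopping rule below are illustrative assumptions, not figures from the paper.

```python
# Illustrative sketch only: how a couple of "researcher degrees of freedom"
# inflate the false-positive rate when there is no true effect.
# Assumptions (mine, for illustration): two outcome measures per participant,
# one interim peek with optional extra data collection, nominal alpha = 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def p_values(a, b):
    # Welch t-test on each of the two outcome measures.
    return [stats.ttest_ind(a[:, k], b[:, k], equal_var=False).pvalue
            for k in range(2)]

def one_study(flexible):
    a = rng.normal(size=(20, 2))   # "treatment" group, 2 outcome measures
    b = rng.normal(size=(20, 2))   # "control" group (no true difference)
    ps = p_values(a, b)
    if not flexible:
        return ps[0] < 0.05        # honest analysis: one pre-specified outcome, fixed n
    if min(ps) < 0.05:             # report whichever outcome "worked"
        return True
    a = np.vstack([a, rng.normal(size=(10, 2))])   # not significant yet? collect
    b = np.vstack([b, rng.normal(size=(10, 2))])   # 10 more per group and re-test
    return min(p_values(a, b)) < 0.05

for flexible in (False, True):
    rate = np.mean([one_study(flexible) for _ in range(5000)])
    print(f"flexible analysis = {flexible}: false-positive rate ~ {rate:.1%}")
```

In runs of this sketch the honest arm hovers around 5%, while the flexible arm typically lands well above 10%, even though each individual choice looks defensible in isolation.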


Since then, various large-scale efforts have been made to re-run published studies in order to assess how often they fail to replicate. In 2015, attempts were made to replicate 100 previously published studies: while 97% of the original studies reported significant results, only 36% of the replications did. Fewer than half of the replications that did reach significance had an effect size in line with the original study. A 2018 study of 21 experiments originally published in Nature and Science found that only 13 replicated. The large-scale ManyLabs2 project, which assessed 28 studies across 36 countries and territories, found similar results, with around half replicating and the overall effect size declining from 0.60 in the original studies to 0.15 in the replications. What might have been a minor issue began to be branded as a “crisis”.

Individual studies

Attempts to replicate papers showed that the concern was not just the sheer number of papers that failed to replicate, but that some of the best-known studies and theories were affected.

The marshmallow test was based on whether children could resist eating a single marshmallow if they were promised two marshmallows later. The original studies suggested that those who could resist the initial temptation grew up to be more intelligent and better behaved than those who couldn’t: they showed better self-control later in life, resisted impulses, planned better, and had lower BMI and lower drug use. Unfortunately, replication showed the correlations to be far smaller than the original findings suggested. For example, one replication reported that associations with behavioural outcomes at age 15 were “much smaller and rarely statistically significant”. You can now watch with pride as your children gorge on sweets without worrying…


Particularly hard hit has been social priming, an idea popularised by the likes of Malcolm Gladwell and Daniel Kahneman, which suggests that behaviour can be unconsciously affected by exposing people to subtle cues. Bargh’s paper showing that subtly priming people with words about old age makes them walk more slowly failed to replicate, leading to angry responses from the Yale professor. His paper showing that holding a cup of hot coffee makes you feel “warmer” towards other people also failed to replicate. Other social priming studies have met with similarly discouraging results, including attempts to prime intelligence by presenting subjects with stereotypes of professors or hooligans.

As a result, social priming is no longer seen as having the type of widespread impact originally assumed, with meta-analyses now suggesting it is likely to have only small effects, and only when the prime directly relates to something the participant cares about.

Causes

There are various possible reasons why papers don’t replicate.

The first main possibility is that there isn’t actually a real crisis at all. Rather, replications fail because reproducing a study exactly is difficult, so the replication often isn’t really testing the original experiment at all. Kahneman and Bargh both initially suggested that this could be a factor in the failure of social priming studies to replicate – that the “conduct of subtle experiments has much in common with the direction of a theatre performance” and that replications often fail to reconstruct the vital minor manipulations in priming studies.


The size of the ManyLabs2 study allowed some of these “hidden moderators” to be examined, and the results suggest it is “unwise” to dismiss “failure to replicate as a consequence of such moderators”. The authors found occasional evidence for cultural effects but little evidence for different procedures having different impacts. The sample used and the setting of the experiments were similarly unlikely to have meaningfully contributed to any failure to replicate: studies that failed or succeeded in one country tended to show the same pattern in other countries.

Moreover, as the authors point out, when a study fails to replicate it is easy to find a false-positive moderator: one among many possible candidates that appears to make statistical and logical sense in explaining “why the main effect ‘failed’”. This is especially the case because many studies don’t have sufficient statistical power to identify hidden moderators with reasonable certainty.
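As a rough illustration of that last point (my own sketch, not taken from ManyLabs2 or any cited study): suppose a “hidden moderator” wipes out a medium-sized effect (d = 0.4) in one context but not another. With an assumed 50 participants per cell, a simple simulation suggests the study would detect that moderator only a minority of the time.

```python
# Rough power sketch (my own illustrative numbers, not from any cited study):
# how often does a study detect a "hidden moderator" that removes a d = 0.4
# treatment effect in one context but leaves it intact in another?
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 50            # assumed participants per cell
sims = 2000
detections = 0
for _ in range(sims):
    # Context A: real treatment effect (d = 0.4); Context B: no effect.
    a_treat, a_ctrl = rng.normal(0.4, 1, n), rng.normal(0.0, 1, n)
    b_treat, b_ctrl = rng.normal(0.0, 1, n), rng.normal(0.0, 1, n)
    # Test the interaction: is the treatment effect different across contexts?
    # (Comparing per-pair differences gives essentially the same test as the
    # interaction contrast in a 2x2 between-subjects design.)
    diff_a = a_treat - a_ctrl
    diff_b = b_treat - b_ctrl
    if stats.ttest_ind(diff_a, diff_b).pvalue < 0.05:
        detections += 1
print(f"power to detect the moderator ~ {detections / sims:.0%}")
```

In runs of this sketch, power comes out at roughly 30%: low enough that a genuine moderator would usually be missed, and low enough that post hoc moderator explanations can rarely be tested convincingly.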

The second possibility is that the crisis has been partly caused by deliberate attempts by the original researchers to get papers published. While there are a small number of cases where deliberate fraud is claimed to have taken place, such as those involving Brian Wansink or Diederik Stapel, there is no evidence that this is at all common in psychology, let alone in the papers that fail to replicate.


Another suggestion has been that the quality of the research team may have an effect. Perhaps the original studies were undertaken by particularly well-trained staff, while those doing the replication had lower levels of expertise? Again, this doesn’t seem to be the case: the large-scale study of 100 experiments reported that replication success was “more consistently related to the original strength of evidence… than to characteristics of the team and implementation of the replication”.

So, if it isn’t “hidden moderators”, deliberate fraud or the quality of the team working on the study, what exactly was causing the crisis? We’ll look at this in an upcoming article and assess some other key questions, to understand in more detail what those of us working on evaluations can learn to improve our practice.

[1] Namely: choosing among dependent variables; choosing sample size; using covariates; and reporting subsets of experimental conditions.
