p-hacking

p-hacking isn't a single technique for manipulating data; rather, it's a collection of practices that aim, one way or another, to achieve statistical significance. These often involve running multiple tests, adjusting variables, or selectively reporting results until a p-value below 0.05 is obtained.

Let's try to understand what constitutes p-hacking with an example. Say a researcher is conducting a study to investigate whether listening to classical music improves cognitive performance. The primary hypothesis is that participants who listen to classical music will perform better on a memory test than those who do not listen to any music.

Original Analysis

The researcher administers a memory test to two groups: one listens to classical music before the test, and the other listens to no music. The results show no significant difference between the two groups (p > 0.05).
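This setup can be sketched as an independent-samples t-test. The sketch below uses simulated scores drawn from the same distribution, so there is no true effect by construction; the group sizes, mean, and spread are assumptions made purely for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical memory scores (items recalled) for 50 participants per group.
# Both groups come from the SAME distribution, i.e. there is no true effect.
music = rng.normal(loc=12, scale=3, size=50)     # classical-music group
no_music = rng.normal(loc=12, scale=3, size=50)  # control group

t_stat, p_value = stats.ttest_ind(music, no_music)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")  # most runs: p > 0.05
```

Because there is no real effect here, about 95% of such simulated studies will, correctly, come back non-significant, which is exactly the position the researcher in the example starts from.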

p-hacking techniques:

1. Post Hoc Subgroup Analysis

  • The researcher decides to divide participants into subgroups based on age, gender, and education level. They find that among participants aged 20-30, those who listened to classical music performed significantly better on the memory test (p < 0.05).
  • They report this subgroup result as if it were a primary finding, without disclosing that it was discovered only after multiple subgroup analyses.

2. Selective Reporting

  • The researcher initially measured several cognitive outcomes: memory, attention, and problem-solving. The primary outcome, memory, showed no significant difference, but by chance the attention test showed a significant improvement (p < 0.05) in the music group.
  • They choose to report only the attention test results, ignoring the primary outcome (memory) and the other non-significant results.

3. Re-defining Variables

The researcher redefines the success criterion for the memory test. Initially, the measure was the number of correctly recalled items. They change it to the percentage improvement from pre-test to post-test, and find that this redefined measure shows a significant improvement (p < 0.05).

4. Data Exclusion

After examining the data, the researcher notices that some participants had very low scores, which could be considered outliers. They decide to exclude these participants from the analysis, and after this exclusion the results show a significant effect (p < 0.05).

5. Stopping Data Collection

The researcher collects data in phases and checks the results periodically. At one point they observe a significant result (p < 0.05) and decide to stop collecting data and report the findings, without mentioning the interim analyses.
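All five techniques exploit the same loophole: run enough comparisons under the null hypothesis and some will cross p < 0.05 purely by chance. Here is a minimal Monte Carlo sketch of post hoc subgroup fishing; the subgroup count, sample sizes, and distributions are assumptions chosen only for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_experiments = 2000
n_subgroups = 6  # e.g. 3 age bands x 2 genders (illustrative)
false_positive_studies = 0

for _ in range(n_experiments):
    # Every subgroup comparison is drawn under the null: no real effect anywhere.
    found_significant = False
    for _ in range(n_subgroups):
        a = rng.normal(0, 1, 20)
        b = rng.normal(0, 1, 20)
        _, p = stats.ttest_ind(a, b)
        if p < 0.05:
            found_significant = True
    false_positive_studies += found_significant

rate = false_positive_studies / n_experiments
print(f"Studies with >= 1 'significant' subgroup: {rate:.1%}")
```

With six subgroup tests per study, roughly 1 - 0.95^6 ≈ 26% of studies with no real effect at all still yield at least one "significant" subgroup to report.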

Outcome:

The final published study claims that listening to classical music significantly improves cognitive performance, specifically in attention and among younger adults. These findings are a result of p-hacking, not a true effect.

How and why does p-hacking work?

The probability of obtaining a significant result increases with multiple testing. When you conduct multiple independent tests, the chance of encountering at least one significant result by random chance alone grows, even if none of the tests individually reflects a true effect.

Mathematical Explanation:

Suppose you are testing a hypothesis at the 0.05 significance level. This means there is a 5% chance (0.05 probability) of obtaining a significant result purely by chance for any single test, assuming the null hypothesis is true.

For a single test, the probability of not finding a significant result is 1 - 0.05 = 0.95.

If you conduct n independent tests, the probability that none of them will be significant is:

(1 - 0.05)^n = 0.95^n

The probability of finding at least one significant result among these n tests is:

1 - 0.95^n

As n increases, 0.95^n decreases, and thus 1 - 0.95^n increases. This demonstrates that the probability of obtaining at least one significant result by chance increases with the number of tests.

Example Calculation:

Let's calculate the probability of finding at least one significant result for different numbers of tests:

  • For n = 1: 1 - 0.95^1 = 0.05 (5%)
  • For n = 5: 1 - 0.95^5 ≈ 0.226 (22.6%)
  • For n = 10: 1 - 0.95^10 ≈ 0.401 (40.1%)
  • For n = 20: 1 - 0.95^20 ≈ 0.642 (64.2%)

As seen from the calculations, the probability of obtaining at least one significant result by chance increases substantially as the number of tests increases.
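The numbers above follow directly from the formula; a quick sketch to reproduce them:

```python
# Probability of at least one false positive among n independent tests,
# each run at the alpha = 0.05 significance level.
for n in (1, 5, 10, 20):
    p_at_least_one = 1 - 0.95 ** n
    print(f"n = {n:2d}: {p_at_least_one:.3f}")
# prints 0.050, 0.226, 0.401, 0.642
```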

Why do researchers do it?

  1. There's a "publish or perish" culture in academia that incentivizes producing significant, novel results. Researchers may feel pressured to find and report significant findings to secure funding, tenure, or career advancement.
  2. Journals often favor publishing significant results over non-significant ones (publication bias), so significant findings are easier to publish.
