The Cost of False Positive A/B Tests

There has been an ongoing debate in the software industry, with some claiming that we should increase the alpha threshold for accepting stat-sig results from 0.05, or run one-tailed tests, because 0.05 is too stringent.

While there are certainly cases where experiments are run for short-term decisions (e.g., headline optimizations), for most experiments the real cost of false positive results falls on the roadmap: steering the ship in the wrong direction because of some amazing discovery that is, in fact, a false positive.

I just listened to a wonderful talk by Ulrich Schimmack (https://videos.files.wordpress.com/9liB1ZFm/princeton.zcurve.22.10.11-1.mp4), in which he describes the sad state of affairs in Psychology: replication rates are about 37%, and much lower in Social Psychology, where between-subject designs replicate at a rate of about 4%.

One of the examples he shares is the theory of Ego Depletion (https://en.wikipedia.org/wiki/Ego_depletion), the 1998 claim that we have a limited pool of mental resources that we use up and then lose self-control. The initial finding held up for over 15 years with widespread confidence in the robustness of the effect, including a 2010 meta-analysis of 198 independent tests (talk about replication) that showed an average effect size of d=0.6. In hindsight, this showed how much bias there is in accepted publications, where non-significant results are often rejected (the file drawer problem).

In 2016, a major multi-lab replication study failed to find any evidence for the theory. A subsequent study involving 36 labs also failed to find the effect, which, if it exists at all, is now estimated at d=0.06, an order of magnitude smaller than in the initial meta-analysis. Uli claims that the original author, Baumeister, finally relented in 2022 and that the theory is now dead, after 24 years. What a waste of resources!

In A/B testing, replication is cheap and easy: the code already exists. If the p-value is between 0.01 and 0.10 (yes, above 0.05, to reduce false negatives), my recommendation is to do a replication run, ideally at higher power than the original study (e.g., if you ran an A/B/C/D test, now evaluate just the winning version at 50%/50%). Use meta-analysis to determine the combined p-value, and set the alpha threshold at 0.01 (equivalent to an improvement-tail p-value of 0.005).
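
As a minimal sketch of that last step, here is one way to combine the original run and the replication run with Stouffer's weighted-z meta-analysis; the p-values, sample sizes, and helper function are illustrative assumptions, not part of the post.

```python
# Sketch: combine improvement-tail p-values from an original run and a
# higher-powered replication using Stouffer's weighted-z method.
# All numbers below are hypothetical, for illustration only.
from math import sqrt
from scipy.stats import norm

def stouffer_one_sided(p_values, weights):
    """Combine one-sided (improvement-tail) p-values into a single p-value."""
    z_scores = [norm.isf(p) for p in p_values]  # each p-value -> z-score
    combined_z = sum(w * z for w, z in zip(weights, z_scores)) / sqrt(sum(w * w for w in weights))
    return norm.sf(combined_z)                  # combined z -> one-sided p-value

p_values = [0.03, 0.02]                    # original A/B/C/D run, then the 50%/50% replication
weights = [sqrt(100_000), sqrt(400_000)]   # weight each run by sqrt(sample size)

combined_p = stouffer_one_sided(p_values, weights)
print(f"combined one-sided p = {combined_p:.4f}")
print("stat-sig at alpha=0.01 (one-sided 0.005):", combined_p < 0.005)
```

Weighting by the square root of sample size lets the higher-powered replication carry more of the combined evidence; Fisher's method is a common alternative.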

Ron Kohavi

Vice President and Technical Fellow | Data Science, Engineering | AI, Machine Learning, Controlled Experiments | Ex-Airbnb, Ex-Microsoft, Ex-Amazon

1y

By the way, the False-Positive Risk is not the same as alpha. Here is a slide from my class (https://bit.ly/ABClassRKLI) that clarifies this important point.

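For readers without the slide handy, here is a minimal sketch of the distinction, assuming an illustrative prior success rate and statistical power (these numbers are not taken from the slide or the class).

```python
# Sketch: alpha is P(stat-sig | no real effect), while the False-Positive Risk
# is P(no real effect | stat-sig). The prior success rate and power below are
# illustrative assumptions, not figures from the linked class material.
def false_positive_risk(alpha, power, prior_true):
    """Share of stat-sig results that are false positives, via Bayes' rule."""
    false_pos = alpha * (1 - prior_true)  # true nulls that cross the threshold
    true_pos = power * prior_true         # real effects that cross the threshold
    return false_pos / (false_pos + true_pos)

# Example: alpha = 0.05, 80% power, and only 1 in 10 ideas truly moves the metric.
print(f"FPR = {false_positive_risk(alpha=0.05, power=0.80, prior_true=0.10):.2f}")
# -> about 0.36, far higher than the 5% alpha.
```
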
Jon Crowder

Head of Digital Experience | Innovating through Experimentation & Data-Driven Strategies | Elevating Brand Growth & Customer Journeys

1y

Oh no, I've landed my container ship on the poop island. I wanted to visit the gold island.

Aleksandr Kazimirov

Lead Data Analyst, ex-Tinkoff Head of Analytics Unit, Digital Nomad

1y

Thanks for the post! It is not always possible, but replicating the A/B test for each successful outcome is good practice. Especially if the uplift is significantly higher than expected, there is a high chance that something went wrong.
