Why 5% should be the upper bound of your MDE in A/B tests

In https://bit.ly/CH2022Kohavi , I suggested that in online controlled experiments (A/B tests), where one is optimizing for conversions (or revenue or task success), you need about 200,000 users in an experiment to generate trustworthy results.

The inputs to the power formula were alpha=0.05 (industry standard), power=80% (industry standard), a conversion rate of 5% (domain dependent, but most conversion rates are 2%-5%, and higher rates require fewer users), and a relative MDE of 5%.
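
As a rough illustration of where a number like 200,000 comes from, here is a minimal sketch of the textbook normal-approximation power calculation for comparing two proportions. This is a generic formula, not necessarily the exact calculation behind the linked article; different approximations shift the total by a few tens of thousands of users.

```python
from scipy.stats import norm

def users_per_variant(p, relative_mde, alpha=0.05, power=0.80):
    """Approximate users needed per variant to detect a relative lift of
    `relative_mde` on a baseline conversion rate `p` (two-sided test)."""
    delta = p * relative_mde              # absolute effect: 5% of 5% = 0.25 points
    z_alpha = norm.ppf(1 - alpha / 2)     # ~1.96 for alpha = 0.05
    z_beta = norm.ppf(power)              # ~0.84 for 80% power
    variance = p * (1 - p)                # Bernoulli variance at the baseline rate
    return 2 * (z_alpha + z_beta) ** 2 * variance / delta ** 2

n = users_per_variant(p=0.05, relative_mde=0.05)
print(f"~{n:,.0f} per variant, ~{2 * n:,.0f} total")   # low hundreds of thousands
```

With a 5% baseline and a 5% relative MDE this lands in the low hundreds of thousands of users, the same ballpark as the 200,000 figure above.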

Tyler Buffington in https://bit.ly/TylerBWhySmallMDE asked if smaller companies with less optimized products could use different parameters, with the key one being the MDE.

The motivation is clear: the MDE enters the power formula squared in the denominator, so doubling the MDE from 5% to 10% cuts the required number of users to a quarter, or 50,000 users instead of 200,000, a massive difference. If you’re going for home runs with 50% effects instead of 5%, the MDE is ten times larger, so you need 1/10² = 1/100th of the users, or 2,000 users (a number used in Covid vaccine clinical trials, where such effects are needed).
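
To make the inverse-square relationship concrete, here is a tiny sketch that scales the ~200,000-users-at-5%-MDE figure from above; the 1/4 and 1/100 ratios are exact consequences of the formula, while the absolute totals are approximate.

```python
# Required users scale as 1/MDE^2, since the MDE is squared in the denominator
# of the power formula. Using the article's ~200,000 users at a 5% relative MDE:
users_at_5pct_mde = 200_000
for mde in (0.05, 0.10, 0.50):
    scale = (0.05 / mde) ** 2          # 1, 1/4, 1/100
    print(f"relative MDE {mde:.0%}: ~{users_at_5pct_mde * scale:,.0f} users")
```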

Experimenters want to iterate quickly, so there is a bias toward plugging in a larger MDE, but I believe that would be a mistake, leading to exaggerated results and effectively low power (see https://bit.ly/CJamTrustworthyABCausalityAndPitfalls for examples of low power and the exaggeration issues it causes).

Organizations that run experiments and replicate them for trustworthiness are often humbled by the small effects they see, so what data can we share about effects in practice? Here are two examples that I’ve been involved in personally:

  1. Airbnb allowed me to state that in 1.5 years of leading search relevance, we improved booking conversion by 6% by running 250 experiments, of which 20 were successful. The average successful experiment improved conversion by 0.3%.
  2. At Bing, hundreds of people worked to improve the cumulative OEC of Bing’s relevance by 2% every year. In https://eduardomazevedo.github.io/papers/azevedo-et-al-ab.pdf , the success-rate metric (a component of the OEC) varied from -0.22% to 0.28% (Table 1).

In these two examples, the mean effect and the max effect were both about 0.3%. Since you want the MDE to be lower than even the average so you can detect typical positive effects, let’s assume half of the mean, or 0.15%, as a good MDE for these large companies with highly optimized products.
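
Purely as an illustration of what such an MDE implies for traffic (hypothetically reusing the ~200,000-users-at-5%-relative-MDE baseline from the start of this article), scaling by (5/0.15)² shows why only very large sites can detect effects this small:

```python
# Hypothetical scaling only: reuse the ~200,000 users needed at a 5% relative MDE
# and apply the 1/MDE^2 relationship to a 0.15% relative MDE.
users_at_5pct_mde = 200_000
users_at_015pct_mde = users_at_5pct_mde * (5 / 0.15) ** 2
print(f"~{users_at_015pct_mde:,.0f} users")   # roughly 222 million
```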

If you’re working with less optimized products, you absolutely should aim for bigger effects, but when I suggested 5%, I had already factored in effects that are 30+ times larger (5%/0.15% > 33)!

If you’re fixing a serious bug, where your customers are calling your support lines because they can’t check out, then you can deploy the fix and watch the revenue go up from zero. Joe Berkson, a statistician at the Mayo Clinic, called this the IOT Test, or Inter Ocular Trauma Test, where the graph hits you between the eyes [Savage 2012].

For product improvement experiments, 5% seems like a reasonable upper bound for the MDE. If you don’t have the traffic recommended by the power formula to run trustworthy A/B tests, use industry best practices and replicate designs from larger companies that do run A/B tests, but don’t kid yourself and set the MDE at 10%, implying that you’re so brilliant that your ideas will improve the metric by at least 10%.


Suneil Shrivastav

In a world full of machine learning, be a learning machine

1y

Don Shaher Jamal Zabihi - 5% upper bound on the MDE!

Robson Tigre, Ph.D.

Data Scientist | Ph.D. in Economics | Research economist | Causal inference | Experimentation

1y

Josh Attenberg

Technology, Data and ML Leadership

1y

If you're at a smaller company with much less traffic, what's the way to adapt to a statistically valid experimental process? Raise the alpha?

Rasoul Jabari

Co-Founder @ DELIS || Business Advisor || Marketplace Growth Consultant

1y

A fantastic read as always! During my time leading CRO experiments at an eCommerce company with 10 million users, we utilized a p-value threshold of 0.02 in our Product Detail Page (PDP) and shopping cart experiments, with MDE=1% in most cases. Whenever we observed a conversion rate increase of more than 2%, we would become extremely excited and celebrate. While we did achieve some notable victories, such as a 7% increase, the majority of our successes fell within the 1-2% range.

Jakub Linowski

Chief Editor of GoodUI - Conversion Focused UI Designer

1y

Thanks for sharing. It's great to hear that you advocate for adjusting the MDE on a per experiment / domain basis, as opposed to keeping it fixed at a 5% absolute. This flexibility of combining historical data with some subjective adjustments is somewhat in line with what Philip Tetlock found when studying the prediction rates of the most successful estimators in The Art Of Superforecasting. Similarly, I ran a recent poll on what people use for their MDE estimations, and a "mix" seems to have topped the results: https://www.dhirubhai.net/posts/jlinowski_experimentation-optimization-estimation-activity-7098373545512075264-_eFa Based on data from https://goodui.org/insights/ , here is some extra variability in the effect estimates:
- Our most generic a/b test currently has a 3.8% (median) estimate
- Checkout page experiments are even more difficult to move, with a 0.2% estimate
- Leap experiments, with multiple positive probability patterns grouped together, have the highest median effect of 13.5%. (This is a scenario where I would estimate using an even larger MDE, often on unoptimized sites, early on.)
Of course, each company should probably adjust these values further with their own data as well, like you're suggesting.
