Mistakes that make your A/B test results invalid

A well-planned A/B test can make a huge difference in the effectiveness of your product and marketing efforts. But the true impact of an A/B test comes from understanding its results and what they mean. A wrong interpretation can lead to wrong decisions, even if the hypothesis was strong and the result was solid.

When it comes to testing, many companies worry about what they are testing, but not enough about how they execute their experiments. In this article, I want to highlight one of the common mistakes companies tend to make when running an A/B test.

Your Test Has Too Many Variations

“If you don't know where you are going, any road can take you there.” - Lewis Carroll, Alice in Wonderland.

The more variations we test, the more insights we will get. Well, not exactly. Having too many variations not only slows down the test, but more importantly, it impacts the integrity of the results. The more variations we test, the larger the sample size required to be able to detect a significant lift at a specific confidence level. The larger the sample size needed, the longer the required test duration to get results that we can trust. This is simple math.
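
To make the "simple math" concrete, here is a minimal sketch in Python of how the required sample size per group grows once the overall significance level is split across variations (a Bonferroni-style split, explained later in this article). The 10% baseline conversion rate, the 1-point lift, the 80% power, and the two-proportion z-test approximation are all illustrative assumptions of mine, not figures from the article.

```python
# Rough sample size per group for a two-proportion z-test (normal approximation).
# All numbers below are hypothetical, chosen only to illustrate the trend.
from scipy.stats import norm

def sample_size_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate users needed per group to detect a lift from p1 to p2."""
    z_alpha = norm.ppf(1 - alpha / 2)            # two-sided test
    z_beta = norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2

baseline, lifted = 0.10, 0.11                    # 10% -> 11% conversion
for m in (1, 2, 3, 5):                           # number of variations tested
    n = sample_size_per_group(baseline, lifted, alpha=0.05 / m)
    print(f"{m} variation(s): ~{int(round(n)):,} users per group")
```

Each added variation pushes the per-group requirement up, and the total traffic requirement grows even faster because every variation needs its own group.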

The risks of running a test for a long period of time are well known. First, when one test is running, all other tests are waiting in the pipeline to be executed. Having a test run for a long period of time means that we invest our time and resources just for that one opportunity, and we are not moving fast enough to execute tests and exploit other opportunities. Second, tests that are running for more than 4 weeks are risky, as sample pollution increases dramatically. Some users may have deleted their cookies. Some cookies may have expired. Users may have been exposed to a different version than the one they were originally assigned, which might eventually skew the test results.

Multiple Comparison Problem

When testing multiple variations, the true confidence level of the test goes down as the number of variations increases. Surprisingly, when testing three different variations against the default version at a 5% significance level each, there is roughly a 14% chance that at least one of those variants will appear significant purely by chance. Normally, a 5% significance level means that if there is no real difference, there is only a 5% chance we will conclude there is one (a false positive), as explained in my previous article. But the more variations we test, the higher the chance of at least one false positive. This is called the Multiple Comparison Problem.
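
To see the effect rather than just state it, here is a small simulation sketch (my own illustration with made-up numbers, not data from any real test): one control and three variants that share exactly the same true conversion rate, compared repeatedly at a 5% significance level.

```python
# Simulating A/A/A/A experiments: no variant is truly better than control,
# yet a "significant winner" still shows up in a sizeable share of runs.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(7)
n_users, true_rate, n_experiments = 10_000, 0.10, 2_000

runs_with_spurious_winner = 0
for _ in range(n_experiments):
    control = rng.binomial(n_users, true_rate)
    variants = rng.binomial(n_users, true_rate, size=3)
    # Test each variant against the control at the usual 5% level.
    p_values = [
        proportions_ztest([v, control], [n_users, n_users])[1] for v in variants
    ]
    if min(p_values) < 0.05:                     # at least one looks "significant"
        runs_with_spurious_winner += 1

print(f"Experiments declaring a winner by chance: "
      f"{runs_with_spurious_winner / n_experiments:.1%}")
# Typically prints around 13-14%, far above the 5% we planned for.
```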

Correcting the Problem

We can calculate the true probability of at least one false positive using the following formula: 1 - (1 - a)^m, where m is the total number of variations tested and a is the significance level of each individual comparison.
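
For example, applying the formula to the 3-variant case above with a = 0.05 (a sketch of the arithmetic only):

```python
# 1 - (1 - a)^m: probability of at least one false positive across m comparisons.
a, m = 0.05, 3
family_wise_error = 1 - (1 - a) ** m
print(f"{family_wise_error:.2%}")   # 14.26% - almost triple the intended 5%
```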

The Bonferroni correction is one of several methods used to counteract the problem of multiple comparisons. When testing m variations at an overall significance level a, the Bonferroni correction suggests testing each individual hypothesis at a significance level of a/m, so that the individual levels add up to the overall significance level a.

Assuming we are testing 3 variants against the default version at a 5% significance level, the Bonferroni correction suggests testing each hypothesis individually at a significance level of a/m, which is 0.05/3 = 0.0167. In other words, we need a 1.67% significance level (a 98.33% confidence level) for each of the 3 individual hypotheses (5% for the whole test).

But rather than testing each hypothesis at the a/m significance level, the hypotheses may be tested at any other combination of levels that adds up to a, provided that the level of each test is determined before looking at the data. For 3 hypotheses, an overall a of 0.05 could be maintained by testing two hypotheses at 0.02 and the remaining one at 0.01 (0.02 + 0.02 + 0.01 = 0.05).
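
In practice, most statistics libraries can apply this correction for you. Here is a minimal sketch using statsmodels' multipletests; the three p-values are made-up examples for illustration, not results from a real test.

```python
# Bonferroni correction applied to a set of per-variant p-values.
from statsmodels.stats.multitest import multipletests

p_values = [0.012, 0.030, 0.047]                 # hypothetical raw p-values
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")

for p_raw, p_adj, significant in zip(p_values, p_adjusted, reject):
    print(f"raw p={p_raw:.3f}  adjusted p={p_adj:.3f}  significant at 5%: {significant}")
# Only 0.012 survives, since 0.012 < 0.05/3 = 0.0167; 0.030 and 0.047 do not.
```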

Quantifying Confidence

The more variations there are, the higher the confidence level required for each individual hypothesis. The following table illustrates how the true probability of at least one false positive grows with the number of hypotheses, assuming each is tested at a 95% confidence level:

Number of hypotheses (m) | Probability of at least one false positive, 1 - 0.95^m
1 | 5.00%
2 | 9.75%
3 | 14.26%
4 | 18.55%
5 | 22.62%
10 | 40.13%

To counteract this, the Bonferroni correction suggests the following adjustments to the per-hypothesis significance and confidence levels in order to maintain an overall 95% confidence level for the whole test:

Number of hypotheses (m) | Adjusted significance level (0.05/m) | Adjusted confidence level
1 | 5.00% | 95.00%
2 | 2.50% | 97.50%
3 | 1.67% | 98.33%
4 | 1.25% | 98.75%
5 | 1.00% | 99.00%
10 | 0.50% | 99.50%

Wrapping It All Together

Having too many variations not only slows down the test but also impacts the integrity of its results. The more variations there are, the larger the required sample size to detect a significant lift at a specific confidence level. The larger the sample size, the longer it will take for a test to converge.

When a test is running, all other tests are waiting in the pipeline, and the company is not moving fast enough to execute tests and exploit other optimization opportunities. Moreover, tests that run for a long period of time are risky, as sample pollution increases dramatically due to cookie deletion or expiration. As a result, users may be exposed to a different version than the one they were originally assigned, which can skew the results.

When testing multiple variations, the true confidence level goes down as the number of variations increases: the more variations we test, the higher the chance of a false positive. The Bonferroni correction is one of several methods used to counteract the problem of multiple comparisons. When testing m hypotheses at an overall significance level a, each individual hypothesis should be tested at a significance level of a/m, so that the overall test level equals a. Alternatively, the hypotheses may be tested at any other combination of levels that adds up to a, as long as those levels are chosen before looking at the data.
