Mistakes that make your A/B test results invalid
A well-planned A/B test can make a huge difference in the effectiveness of your product and marketing efforts. But the true impact of an A/B test comes from understanding its results and what they mean. A wrong interpretation can lead to wrong decisions, even when the hypothesis was strong and the result was solid.
When it comes to testing, many companies worry about what they are testing, but not enough about how to properly execute their experiments. I want to highlight one of the most common mistakes companies make when running an A/B test.
Commit to your test settings
'Unless commitment is made, there are only promises and hopes... but no plans'. - Peter Drucker
When launching a test, we must fully commit to it. Commitment means that we don’t change its settings, its goals, its design variations, and, of course, its traffic split. Changing a variant’s settings (such as the traffic split) during the test period can undermine the reliability of the results because of a phenomenon known as Simpson’s Paradox.
Simpson’s Paradox
Simpson’s Paradox is a statistical phenomenon in which a trend that appears in separate groups of data disappears or reverses when those groups are combined. Imagine we want to compare the success rates of two candidate COVID-19 vaccines. The table below shows the vaccine success rates for 700 evenly divided subjects, where vaccine A is an RNA-based vaccine and vaccine B is an inactivated virus vaccine.
According to these aggregated results, one might conclude that the inactivated virus vaccine is more successful than the RNA-based vaccine. However, when we drill down and segment the dataset by age group, the results look completely different:
Simply by segmenting the dataset by age group, we reach the paradoxical conclusion that the RNA-based vaccine is more effective than the inactivated virus vaccine for every group. This is a completely different conclusion from the one we reached before. How could our conclusion change so much just by adding another level of detail?
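To make the reversal concrete, here is a minimal sketch in Python. Since the original table is not reproduced here, the counts below are illustrative stand-ins in the spirit of the classic 700-subject example, not the article’s exact data; A is the RNA-based vaccine and B the inactivated virus vaccine.

```python
# Illustrative counts only (the article's original table is not shown here).
# Per age group: (successes, total) for vaccine A (RNA) and vaccine B (inactivated).
groups = {
    "young": {"A": (81, 87),   "B": (234, 270)},
    "old":   {"A": (192, 263), "B": (55, 80)},
}

def rate(successes, total):
    return successes / total

# Within every age group, vaccine A has the higher success rate...
for name, g in groups.items():
    print(name, {v: f"{rate(*g[v]):.1%}" for v in ("A", "B")})

# ...but pooling the groups reverses the conclusion, because the
# age groups are very unevenly split between the two vaccines.
for v in ("A", "B"):
    s = sum(groups[name][v][0] for name in groups)
    t = sum(groups[name][v][1] for name in groups)
    print("pooled", v, f"{rate(s, t):.1%}")
```

With these illustrative numbers, A wins within each age group, yet B looks better once the groups are pooled, which is exactly the reversal described above.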
Success and proportions
When groups are unevenly split and each has a different success rate (or conversion rate), the aggregate can skew our interpretation and lead us to the wrong conclusions. In the example above, the younger the subjects, the more successful the vaccine, regardless of the vaccine technology. A larger share of younger subjects in the inactivated virus group inflated its overall success rate, while a larger share of older subjects in the RNA-based group deflated its overall success rate.
When subpopulations within an overall population vary (as we saw in the example above), it can be advantageous to sample each subpopulation independently. Dividing the population into subpopulations (strata) and sampling each one separately is called stratified sampling.
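As a sketch of what this could look like in practice, the snippet below (a hypothetical helper, not a specific tool’s API) assigns users to variants within each age-group stratum, so both variants end up with the same age mix.

```python
# A hypothetical stratified assignment helper: split each stratum (e.g. age group)
# evenly between variants so both variants share the same composition.
import random
from collections import Counter

def stratified_assign(users, strata_key, split=0.5, seed=42):
    rng = random.Random(seed)
    assignment = {}
    strata = {}
    for user in users:
        strata.setdefault(user[strata_key], []).append(user["id"])
    for ids in strata.values():
        rng.shuffle(ids)
        cut = int(len(ids) * split)
        for uid in ids[:cut]:
            assignment[uid] = "A"
        for uid in ids[cut:]:
            assignment[uid] = "B"
    return assignment

# Toy population: two age groups of different sizes.
users = [{"id": i, "age_group": "young" if i % 3 else "old"} for i in range(900)]
assignment = stratified_assign(users, "age_group")
print(Counter((u["age_group"], assignment[u["id"]]) for u in users))
# Each stratum is split evenly between A and B, so neither variant is
# skewed toward the age group with the higher base success rate.
```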
Overcoming the Paradox
We must understand how important it is to commit to the traffic allocation at every level. Not paying enough attention to it can lead to wrong conclusions and bad decisions.
Assume we start an A/B test by allocating 50% of the traffic to the test group (B) and the other 50% to the control group (A). After a while, we notice that variant B is not performing as well as we expected and yields a negative lift. Afraid of jeopardizing too much traffic, we decide to decrease variant B’s allocation to just 30%.
However, variant B users who entered our website before the change are still bucketed into the variant they were originally assigned to. Variant B will now have a larger proportion of returning visitors, who may be more likely to convert since it is not their first interaction with our website. By making the change, we might artificially inflate variant B’s conversions.
And what if we launch a test by allocating 20% of users to the test group and the remaining 80% to the control group? After a few days and a sanity check, we decide to split the traffic evenly, 50%-50%.
This time, variant A users who entered our website before the change are still bucketed into the variant they were originally assigned to. Variant A will now have a larger proportion of returning visitors, who may be more likely to convert since it is not their first interaction with our website. Here, we might artificially inflate variant A’s conversions.
In both cases, changing the traffic split (or any other setting) during the experiment effectively starts a new experiment for analytical purposes. When analyzing the results, we need to ignore the period before the change occurred.
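The sketch below illustrates the mechanics under a common "sticky bucketing" assumption: returning users keep their stored variant, while only new users are hashed against the updated allocation. The hashing scheme, user ids, and thresholds are illustrative assumptions, not a specific platform’s implementation.

```python
# A sketch of sticky bucketing with a mid-test split change.
import hashlib

stored_assignments = {}  # persisted variant per user ("sticky" bucketing)

def bucket(user_id, share_b):
    """Returning users keep their stored variant; new users are hashed
    into B with probability share_b."""
    if user_id in stored_assignments:
        return stored_assignments[user_id]
    h = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    variant = "B" if h < share_b * 100 else "A"
    stored_assignments[user_id] = variant
    return variant

# Week 1: 50/50 split over users 0-999.
for i in range(1000):
    bucket(f"user-{i}", share_b=0.5)

# Week 2: the split is changed to 70/30. Users 500-999 return (and keep
# their original variant); users 1000-1499 are new traffic.
for i in range(500, 1500):
    bucket(f"user-{i}", share_b=0.3)

returning = {f"user-{i}" for i in range(500, 1000)}
week2 = {f"user-{i}" for i in range(500, 1500)}
for variant in ("A", "B"):
    cohort = [u for u in week2 if stored_assignments[u] == variant]
    share = sum(u in returning for u in cohort) / len(cohort)
    print(variant, f"returning-visitor share: {share:.0%}")
# Variant B now carries a noticeably larger share of returning visitors
# than variant A, which can artificially inflate its conversion rate.
```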
Don’t break the rules
Commitment also applies to the rest of the test settings: the ‘do not change mid-test’ rule extends to your test goals and the variants’ design. We need to accept one simple fact: we are all biased. We want our experiments to win. Not because we’re cheaters, but because we want to make an impact on our business. We may also have preferences and favorite variants that we secretly hope will win any given test.
If we’re tracking multiple goals, we may be tempted to change the primary goal mid-experiment and give weight to the metrics that favor our preferences. Commitment means setting the top-line, measurable goal beforehand and sticking with it, even if it doesn’t support our initial hypothesis. All the other supporting goals should, of course, be examined alongside this top-line goal, but none of them should replace it just because its results don’t align with our hypothesis.
Wrapping it all together
When we launch a test, we must commit to its settings, goals, variation designs, and traffic allocation split. Changing the split between variants while the test is running could impact the reliability of the results because of a phenomenon known as Simpson’s Paradox.
Simpson’s Paradox surfaces when sampling is not uniform and the sample sizes of our segments differ. Stratified sampling, the process of dividing the population into homogeneous and mutually exclusive subgroups before sampling, is one method that can prevent this phenomenon.
Commit to the test settings and do not change designs, rules, or goals mid-test. Since we are all biased and may have preferences and favorite variants, we may be tempted to lean on the goals that favor our preferred variant. Commit beforehand to a goal metric that can be measured in the short term and that predicts the test’s success in the long term.