Simpson's paradox in A/B testing
Sudhir Buddhavarapu
Leader, Analytics, Business Intelligence, Data Science in SaaS B2B / DTC companies. Astrophotography enthusiast
Your CEO is pushing you to find ways to grow the business and hit 40% YoY growth. After brainstorming with your team, you come up with a few initiatives. You want to try the most promising one, but it's risky. So, like any prudent growth hacker, you roll it out in phases to the 1M daily visitors hitting your website. On day 1, you test on 1% of the traffic. Based on the day 1 results, you then test on 10% of the traffic on day 2, 25% on day 3, and 50% on day 4. Looks logical. Here are your observations.
Each day you increase the test size, encouraged by how the test conversion rate compared with the control conversion rate the previous day.
Choice is clear. Test alternative looks better consistently!
Right?
…. except ….
when you aggregate over 4 days, you get this .....
Notice that the conversion rates are flipped! Control performs better compared to test!!!
This is commonly referred to as Simpson's paradox. It typically shows up when sub-categories have unequal sample sizes, so that the sub-category itself (here, the day of the test) acts as a lurking variable.
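To make the flip concrete, here is a minimal sketch in Python. The visitor and conversion counts are hypothetical, invented purely for illustration; they are not the observations from this test. They only follow the 1% / 10% / 25% / 50% rollout of 1M daily visitors described above, and assume the baseline conversion rate happened to drift down over the four days.

```python
# Hypothetical numbers, invented for illustration only (not the article's data).
# Each row: (day, control_visitors, control_conversions, test_visitors, test_conversions)
days = [
    (1, 990_000, 49_500, 10_000, 550),
    (2, 900_000, 36_000, 100_000, 4_400),
    (3, 750_000, 22_500, 250_000, 8_250),
    (4, 500_000, 10_000, 500_000, 11_000),
]


def rate(conversions, visitors):
    """Conversion rate as a fraction of visitors."""
    return conversions / visitors


# Day-by-day view: test beats control on every single day.
for day, c_n, c_conv, t_n, t_conv in days:
    print(f"day {day}: control {rate(c_conv, c_n):.2%}  test {rate(t_conv, t_n):.2%}")

# Aggregated view: pool visitors and conversions across all four days.
control_total = rate(sum(d[2] for d in days), sum(d[1] for d in days))
test_total = rate(sum(d[4] for d in days), sum(d[3] for d in days))
print(f"overall: control {control_total:.2%}  test {test_total:.2%}")
```

In this made-up data, most of the test traffic lands on the low-converting days while most of the control traffic lands on the high-converting days. The pooled test rate is therefore pulled toward the later days, and the pooled comparison reverses even though test wins every day.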
Another example of this paradox is Covid-19 case fatality rates in China vs. Italy in 2020. While the overall case fatality rate in the early days was higher in Italy than in China, the case fatality rate within each age group was higher in China than in Italy (https://arxiv.org/abs/2005.07180).
So, what conclusions do you draw from these observations? Contextual knowledge helps you take meaningful action. In the former example, the aggregated view is the most conclusive, because you are planning a long-term implementation of the test alternative. In the latter example, on the other hand, a sub-category analysis will drive more accurate conclusions and a more strategic approach.
In general, it makes sense to characterize the dependent variable along each category (if such data is available), in addition to looking at it in aggregate. You want to be as sure as possible when interpreting the observations.
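One way to build this habit into an analysis is to compute both views side by side and flag when they disagree. The sketch below is an illustration under assumptions, not a standard library routine: the function name `compare_views`, the segment names, and the counts are all hypothetical.

```python
from typing import Dict, Tuple

# segment -> (control_visitors, control_conversions, test_visitors, test_conversions)
Counts = Tuple[int, int, int, int]


def compare_views(segments: Dict[str, Counts]) -> dict:
    """Report the winner in each segment and in the pooled data.

    Ties are counted as a win for control. 'reversal' is True only when
    every segment's winner differs from the pooled winner.
    """
    per_segment = {}
    c_n = c_conv = t_n = t_conv = 0
    for name, (cn, cc, tn, tc) in segments.items():
        per_segment[name] = "test" if tc / tn > cc / cn else "control"
        c_n += cn
        c_conv += cc
        t_n += tn
        t_conv += tc
    pooled = "test" if t_conv / t_n > c_conv / c_n else "control"
    reversal = all(winner != pooled for winner in per_segment.values())
    return {"per_segment": per_segment, "pooled": pooled, "reversal": reversal}


# Hypothetical segments and counts, for illustration only.
example = {
    "mobile": (2_000, 40, 8_000, 200),
    "desktop": (8_000, 800, 2_000, 220),
}
result = compare_views(example)
print(result)
```

A `reversal` flag is only a prompt to look closer. Which view to act on still depends on the context, as discussed above.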