Simpson's paradox in A/B testing

Your CEO is pushing you to find ways to grow the business and hit 40% YoY growth. After brainstorming with your team, you come up with a few initiatives. You want to try the most promising one, but it’s risky. So, like any prudent growth hacker, you roll it out in phases to the 1M daily visitors hitting your website. On day 1, you expose 1% of the traffic to the test. Based on the day 1 results, you then test 10% of the traffic on day 2, 25% on day 3, and 50% on day 4. Looks logical. Here are your observations.

Test data conversion rate outperforms control data conversion rate every single day

Each day, encouraged by how the test conversion rate compared with the control conversion rate the day before, you increase the share of traffic in the test.

The choice is clear. The test alternative consistently looks better!

Right?

…. except ….

when you aggregate over 4 days, you get this .....

An aggregated view tells a very different story


Notice that the conversion rates are flipped! Control performs better than test!


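This is commonly referred to as Simpson’s paradox: a trend that holds within every subgroup reverses when the subgroups are combined. It is typically encountered when the subgroups have unequal sample sizes, so that the subgroup itself acts as a lurking variable. Here the subgroup is the day. The aggregated rate is a traffic-weighted average of the daily rates, and the two arms weight the days very differently: test traffic is concentrated in the later days, control traffic in the earlier ones.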
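The reversal is easy to reproduce with made-up numbers. The sketch below does not use the article's actual figures (those were only in the images); the visitor and conversion counts are hypothetical, chosen to follow the 1% / 10% / 25% / 50% rollout on 1M daily visitors and to assume that the baseline conversion rate happens to drift downward over the four days.

```python
# Hypothetical counts, NOT the article's data. Each row is:
# (day, control_visitors, control_conversions, test_visitors, test_conversions)
daily = [
    (1, 990_000, 99_000, 10_000, 1_100),
    (2, 900_000, 72_000, 100_000, 9_000),
    (3, 750_000, 37_500, 250_000, 15_000),
    (4, 500_000, 10_000, 500_000, 15_000),
]


def rate(conversions, visitors):
    """Conversion rate as a fraction of visitors."""
    return conversions / visitors


# Per-day comparison: test beats control on every single day.
for day, c_n, c_conv, t_n, t_conv in daily:
    print(f"day {day}: control {rate(c_conv, c_n):.2%}  test {rate(t_conv, t_n):.2%}")

# Pooled comparison: sum conversions and visitors over all days first.
control_pooled = rate(sum(r[2] for r in daily), sum(r[1] for r in daily))
test_pooled = rate(sum(r[4] for r in daily), sum(r[3] for r in daily))
print(f"pooled: control {control_pooled:.2%}  test {test_pooled:.2%}")
```

With these assumed counts, test is one percentage point ahead of control on each day, yet the pooled control rate (about 7.0%) is higher than the pooled test rate (about 4.7%), because most of the control traffic falls on the high-converting early days.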

Another example of this paradox is Covid-19 case fatality rates in China vs. Italy in 2020. While the overall case fatality rate in the early days was higher in Italy than in China, the case fatality rate within each age group was higher in China than in Italy (https://arxiv.org/abs/2005.07180).

So, what conclusions do you draw from these observations? Contextual knowledge helps you take meaningful action. In the former example, the aggregated view is the most conclusive, because you are planning a long-term implementation of the test alternative. In the latter example, on the other hand, a subcategory-level analysis will drive more accurate conclusions and a more strategic approach.

In general, it makes sense to characterize the dependent variable within each category (if such data is available), in addition to looking at it in aggregate. You want to be as sure as possible when interpreting the observations.
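One way to make this a habit is to compute both views side by side and flag any disagreement. The following is a minimal sketch, not a prescribed method: the `Stratum` class, the `summarize` helper and the counts are all hypothetical. As one extra reference point it also reports a directly standardized lift, which weights each category's lift by that category's share of total traffic. That technique is brought in here for illustration and is not part of the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stratum:
    """Counts for one category (here: one day). All values are hypothetical."""
    name: str
    control_n: int
    control_conv: int
    test_n: int
    test_conv: int

    @property
    def lift(self) -> float:
        """Test rate minus control rate within this category."""
        return self.test_conv / self.test_n - self.control_conv / self.control_n


def summarize(strata):
    """Return per-category lifts, the pooled lift, a standardized lift,
    and a flag that is True when every category agrees on the sign of
    the lift but the pooled lift has the opposite sign."""
    per_stratum = {s.name: s.lift for s in strata}
    pooled_lift = (
        sum(s.test_conv for s in strata) / sum(s.test_n for s in strata)
        - sum(s.control_conv for s in strata) / sum(s.control_n for s in strata)
    )
    # Direct standardization: weight each category by its share of all traffic,
    # so both arms are compared on the same mix of categories.
    total = sum(s.control_n + s.test_n for s in strata)
    standardized_lift = sum((s.control_n + s.test_n) / total * s.lift for s in strata)
    signs = {lift > 0 for lift in per_stratum.values()}
    reversal = len(signs) == 1 and (pooled_lift > 0) not in signs
    return {
        "per_stratum": per_stratum,
        "pooled_lift": pooled_lift,
        "standardized_lift": standardized_lift,
        "reversal": reversal,
    }


strata = [
    Stratum("day 1", 990_000, 99_000, 10_000, 1_100),
    Stratum("day 2", 900_000, 72_000, 100_000, 9_000),
    Stratum("day 3", 750_000, 37_500, 250_000, 15_000),
    Stratum("day 4", 500_000, 10_000, 500_000, 15_000),
]
summary = summarize(strata)
for name, lift in summary["per_stratum"].items():
    print(f"{name}: lift {lift:+.2%}")
print(f"pooled lift: {summary['pooled_lift']:+.2%}")
print(f"standardized lift: {summary['standardized_lift']:+.2%}")
print(f"reversal detected: {summary['reversal']}")
```

A `reversal` flag of `True` does not say which view is right. It only signals that the per-category and aggregated readings disagree, and that contextual knowledge is needed before acting on either.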
