A/B testing for conversion rate, revisited
A quick refresher: Conversion Rate (CR) is the proportion of users who performed an action (typically buy/book) after landing on the site. Mathematically, CR is modeled with a Bernoulli variable, the simplest random variable: it takes only two values, which makes it a natural fit for an action/no-action situation. We typically test for a change in CR via an A/B test; in such a test we track the number of actions in each of two variants (conveniently called A and B) as well as the number of users exposed to each variant. An A/B test yields two CR estimates, one for variant A (typically the control group) and one for variant B (the group exposed to the change we are testing); the estimate is simply CR = #Actions / #Users. The question we then want to answer is how significant the difference between the two estimates is. Why? Because the difference could be due purely to chance (measurement noise, for example). The frequentist approach to the significance question is the so-called null-hypothesis technique which, very broadly, goes as follows: choose an underlying distribution for the data and label 'rare' events as significant. Both the chosen distribution and the way we define 'rare' are quite subjective.
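To make the setup concrete, here is a minimal R sketch of the estimation step; the user counts and the 'true' conversion rates are made-up illustrative values, and prop.test stands in here for the conventional frequentist test of the two proportions.

set.seed(42)
n_users <- 100000                        # users per variant (illustrative)
cr_a <- 0.030                            # assumed true CR in A (control)
cr_b <- 0.032                            # assumed true CR in B (treatment)
actions_a <- rbinom(1, n_users, cr_a)    # total actions in A (sum of Bernoulli outcomes)
actions_b <- rbinom(1, n_users, cr_b)    # total actions in B
cr_hat_a <- actions_a / n_users          # CR estimate = #Actions / #Users
cr_hat_b <- actions_b / n_users
prop.test(c(actions_a, actions_b), c(n_users, n_users))   # conventional two-proportion test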
Typically, conversion rate is tested using the tried and true two-proportion z-test; here we'll suggest a different approach. In the context of a conversion rate A/B test, assume that users are randomly but equally assigned to the two variants; in other words, we choose a user's variant by flipping a (virtual) fair coin. In that situation it is enough to just count actions in the two variants. Here is the key observation: given an action, the conditional probability that it belongs to variant A is CRA/(CRA+CRB), where CRA is the (true) CR in A and CRB is the (true) CR in B; this follows from Bayes' rule and a little bit of algebra. With this observation at hand, we next notice that if the number of actions in A is K, then the number of actions in B follows a negative binomial distribution with K successes and success probability p = CRA/(CRA+CRB). (There is an implicit assumption here which I'll discuss below.) We now have everything needed to design a null-hypothesis test: the distribution is the negative binomial, and its parameters follow because under the null hypothesis CRA = CRB, so the success probability is simply 0.5. The rare/significant values in this case (also called the decision boundaries) are L and H such that the probability of falling above H is less than some threshold (typically 2.5%, if 'rare' means less than 5%) and the probability of falling below L is less than some threshold (again, typically 2.5%).
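Before the worked example, here is a quick simulation sketch of that observation; the conversion rates below are made-up, and deliberately unequal so the conditional probability is visibly different from 0.5.

set.seed(7)
cr_a <- 0.03                           # assumed true CR in A (illustrative)
cr_b <- 0.05                           # assumed true CR in B (illustrative)
n <- 2e6                               # users, assigned to A/B by a fair coin flip
variant <- rbinom(n, 1, 0.5)           # 0 = A, 1 = B
converted <- rbinom(n, 1, ifelse(variant == 0, cr_a, cr_b))
mean(variant[converted == 1] == 0)     # empirical P(action belongs to A)
cr_a / (cr_a + cr_b)                   # the claimed value: 0.375

Under the null hypothesis the two rates are equal, so this probability collapses to 0.5, which is exactly the success probability the test below uses.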
Let's illustrate this with a quick example. Say we have 50000 actions in A and we set the significance threshold at 5%. The decision boundaries are simple to compute with the following R commands:
- H <- qnbinom(0.975, size = 50000, prob = 0.5)
- L <- qnbinom(0.025, size = 50000, prob = 0.5)
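Wrapped up as a small function, the whole decision rule looks roughly like this; the observed count of B actions passed in at the end is a made-up number, purely for illustration.

nb_ab_test <- function(a_actions, b_actions, alpha = 0.05) {
  # Under the null (CRA = CRB) the number of B actions, given a_actions
  # actions in A, is negative binomial with size = a_actions and prob = 0.5.
  L <- qnbinom(alpha / 2, size = a_actions, prob = 0.5)
  H <- qnbinom(1 - alpha / 2, size = a_actions, prob = 0.5)
  list(L = L, H = H, significant = b_actions < L || b_actions > H)
}
nb_ab_test(a_actions = 50000, b_actions = 50700)   # illustrative observed counts

As a rough sanity check, a normal approximation to this negative binomial puts the boundaries about 1.96 * sqrt(2 * 50000) ≈ 620 counts on either side of 50000, so an observed count of around 50700 in B would be flagged as significant.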
A few notes:
- The even split between the two variants is not essential; it can be removed by tweaking the math (see the sketch after these notes).
- This method is flexible enough to control for power or minimum detectable effect (MDE), but I'll work out the details in a subsequent post.
- The hidden assumption I mentioned above is that actions have non-overlapping time stamps (i.e., no concurrent actions). From this assumption it follows that there is a natural order on the actions, which matters because the negative binomial distribution actually models a sequence of trials. Nevertheless, I believe this is a reasonable assumption in many cases, especially if the frequency of actions is not too high and/or the system is not too distributed.
- Why is just counting actions useful? It is useful, for example, if you want to monitor your test online: users are typically allocated to the test variants in a distributed fashion (usually across different data centers), so with this technique we can avoid routing the allocation messages into a single repository (which would otherwise become a single point of failure).
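Here is the sketch promised in the first note. If a fraction q of users is routed to A (q = 0.5 recovers the even split above), then under the null hypothesis an action lands in A with probability q, so the boundaries are simply quantiles of a negative binomial with prob = q; the values of q and K below are illustrative.

q <- 0.7                                  # assumed share of users allocated to A
K <- 50000                                # actions observed in A
H <- qnbinom(0.975, size = K, prob = q)   # upper boundary for the count of B actions
L <- qnbinom(0.025, size = K, prob = q)   # lower boundary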
(Cross posted from my blog; see my profile for details.)
Comments:
- As someone else said, one approach would be a chi-square test of independence, since this is a test of proportions (the statistic is the sum of (O - E)^2 / E).
- Nice post! You may also be interested in the following app I developed for a small complementary discussion: https://p-value-convergence.herokuapp.com/ It talks about how easily one may get fooled into thinking that an A/B test is statistically significant due to the slow convergence of the "power" in hypothesis testing...
- Or a non-parametric test such as Mann-Whitney.
- Another point worth making is that ANOVA tests assume the distribution in each group is normal, which is often not the case. Sometimes a non-parametric test such as the chi-square test is all you need.