A/B testing for conversion rate, revisited
A quick refresher: Conversion Rate (CR) is the proportion of users who performed an action (typically buy/book) after landing on the site. Mathematically, CR is modeled with a Bernoulli variable, the simplest random variable: it takes only two values, which makes it a natural fit for an action/no-action situation. We typically test for a change in CR via an A/B test; in such a test we track the number of actions in each of two variants (conveniently called A and B) as well as the number of users exposed to each variant. An A/B test yields two CR estimates, one for variant A (typically the control group) and one for variant B (the group exposed to the change we are testing); the estimate is simply CR = #Actions / #Users. The question we then want to answer is how significant the difference between the two estimates is. Why? Because the difference could be due purely to chance (measurement noise, for example). The frequentist approach to the significance question is the so-called null-hypothesis technique which, very broadly, goes as follows: choose an underlying distribution for the data and label 'rare' events as significant. Both the chosen distribution and the way we define 'rare' are quite subjective.
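To make the setup concrete, here is a minimal R sketch of the estimation step; the user counts and the 'true' conversion rates are made-up illustrative values, and prop.test stands in here for the conventional frequentist test of the two proportions.

set.seed(42)
n_users <- 100000                        # users per variant (illustrative)
cr_a <- 0.030                            # assumed true CR in A (control)
cr_b <- 0.032                            # assumed true CR in B (treatment)
actions_a <- rbinom(1, n_users, cr_a)    # total actions in A (sum of Bernoulli outcomes)
actions_b <- rbinom(1, n_users, cr_b)    # total actions in B
cr_hat_a <- actions_a / n_users          # CR estimate = #Actions / #Users
cr_hat_b <- actions_b / n_users
prop.test(c(actions_a, actions_b), c(n_users, n_users))   # conventional two-proportion test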
Typically, conversion rate is tested using the tried and true two-proportion z-test; here we'll suggest a different approach. In the context of a conversion rate A/B test, assume that users are randomly but equally assigned to the two variants; in other words, we choose a user's variant by flipping a (virtual) fair coin. In that situation it is enough to just count actions in the two variants. Here is the key observation: given an action, the conditional probability that it belongs to variant A is CRA/(CRA+CRB), where CRA is the (true) CR in A and CRB is the (true) CR in B; this follows from Bayes' rule and a little bit of algebra. With this observation at hand, we next notice that if the number of actions in A is K, then the number of actions in B follows a negative binomial distribution with K successes and success probability p = CRA/(CRA+CRB). (There is an implicit assumption here which I'll discuss below.) We now have everything needed to design a null-hypothesis test: the distribution is the negative binomial, and its parameters follow because under the null hypothesis CRA = CRB, so the success probability is simply 0.5. The rare/significant values in this case (also called the decision boundaries) are L and H such that the probability of falling above H is less than some threshold (typically 2.5%, if 'rare' means less than 5%) and the probability of falling below L is less than some threshold (again, typically 2.5%).
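Before the worked example, here is a quick simulation sketch of that observation; the conversion rates below are made-up, and deliberately unequal so the conditional probability is visibly different from 0.5.

set.seed(7)
cr_a <- 0.03                           # assumed true CR in A (illustrative)
cr_b <- 0.05                           # assumed true CR in B (illustrative)
n <- 2e6                               # users, assigned to A/B by a fair coin flip
variant <- rbinom(n, 1, 0.5)           # 0 = A, 1 = B
converted <- rbinom(n, 1, ifelse(variant == 0, cr_a, cr_b))
mean(variant[converted == 1] == 0)     # empirical P(action belongs to A)
cr_a / (cr_a + cr_b)                   # the claimed value: 0.375

Under the null hypothesis the two rates are equal, so this probability collapses to 0.5, which is exactly the success probability the test below uses.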
Let's illustrate this with a quick example. Say we have 50000 actions in A and we set the significance threshold at 5%. The decision boundaries are simple to compute with the following R commands:
- H <- qnbinom(0.975, size = 50000, prob = 0.5)
- L <- qnbinom(0.025, size = 50000, prob = 0.5)
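Wrapped up as a small function, the whole decision rule looks roughly like this; the observed count of B actions passed in at the end is a made-up number, purely for illustration.

nb_ab_test <- function(a_actions, b_actions, alpha = 0.05) {
  # Under the null (CRA = CRB) the number of B actions, given a_actions
  # actions in A, is negative binomial with size = a_actions and prob = 0.5.
  L <- qnbinom(alpha / 2, size = a_actions, prob = 0.5)
  H <- qnbinom(1 - alpha / 2, size = a_actions, prob = 0.5)
  list(L = L, H = H, significant = b_actions < L || b_actions > H)
}
nb_ab_test(a_actions = 50000, b_actions = 50700)   # illustrative observed counts

As a rough sanity check, a normal approximation to this negative binomial puts the boundaries about 1.96 * sqrt(2 * 50000) ≈ 620 counts on either side of 50000, so an observed count of around 50700 in B would be flagged as significant.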
A few notes:
- The even split between the two variants is not essential; it can be removed by tweaking the math (see the sketch after these notes).
- This method is flexible enough to control for power or minimum detectable effect (MDE), but I'll work out the details in a subsequent post.
- The hidden assumption I mentioned above is that actions have non-overlapping time stamps (i.e., no concurrent actions). From this assumption it follows that there is a natural order on the actions, which matters because the negative binomial distribution actually models a sequence of trials. Nevertheless, I believe this is a reasonable assumption in many cases, especially if the frequency of actions is not too high and/or the system is not too distributed.
- Why is just counting actions useful? It is useful, for example, if you want to monitor your test online: users are typically allocated to the test variants in a distributed fashion (usually across different data centers), so with this technique we can avoid routing the allocation messages into a single repository (which would otherwise become a single point of failure).
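Here is the sketch promised in the first note. If a fraction q of users is routed to A (q = 0.5 recovers the even split above), then under the null hypothesis an action lands in A with probability q, so the boundaries are simply quantiles of a negative binomial with prob = q; the values of q and K below are illustrative.

q <- 0.7                                  # assumed share of users allocated to A
K <- 50000                                # actions observed in A
H <- qnbinom(0.975, size = K, prob = q)   # upper boundary for the count of B actions
L <- qnbinom(0.025, size = K, prob = q)   # lower boundary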
(Cross posted from my blog; see my profile for details.)
Comments:
- As someone else said, one approach would be a chi-square test of independence, since this is a test of proportions (the statistic is the sum of (O - E)^2 / E).
- Nice post! You may also be interested in the following app I developed for a small complementary discussion: https://p-value-convergence.herokuapp.com/ It talks about how easily one may get fooled into thinking that an A/B test is statistically significant due to the slow convergence of the "power" in hypothesis testing...
- Or a non-parametric test such as Mann-Whitney.
- Another point worth making is that ANOVA tests assume the distribution in each group is normal, which is often not the case. Sometimes a non-parametric test such as the chi-square test is all you need.