Real-life example of an A/B test sample size calculation
Lokesh Sharma
Director, Product Analytics | Build & Inspire High Performing Teams | Expertise in Product Analytics, Data Insights, Data Warehouse, Visualization | Ex-Google & Goldman Sachs
Availability bias is a well-known cognitive bias: the tendency to overestimate the importance of information that is easily available or memorable, while underestimating the importance of information that is harder to recall or obtain. We wanted to check whether availability bias can change user behavior in a 1:1 sales pitch. Specifically, we wanted to test whether users who receive an email about an account optimization are more likely to implement the optimization after a 1:1 phone call than users who did not receive the email and were only pitched the optimization on the 1:1 call.
Typically, users implement the optimization 30% of the time after a 1:1 sales pitch. We wanted to check whether users who received an email explaining the benefits of the optimization (e.g., how much additional revenue the optimization would generate) are more likely to implement it, due to availability bias from having read the email. In theory, users who receive the email will be more familiar with the optimization and hence more likely to implement it after the 1:1 call. To test this theory, we ran an experiment.
Experiment setup:
Control group: users who get only the 1:1 phone pitch for the account optimization (historical implementation rate of 30%).
Treatment group: users who first receive an email explaining the optimization and its benefits, followed by the same 1:1 phone pitch.
Null hypothesis: the email does not change the implementation rate. Alternate hypothesis: the email increases the implementation rate (we expected a lift from 30% to 40%).
Based on the above experiment setup, we can calculate the number of users needed in each group to reject the null hypothesis using the standard two-proportion sample size formula:
n = ((Z1-α + Z1-β)² * σ²) / Δ²
Let's calculate each parameter in the above equation.
σ² = p1*(1-p1) + p2*(1-p2)*r = (0.3 * 0.7) + (0.4 * 0.6)*1 = 0.45 (p1 is the current implementation probability, p2 is the expected probability with the email, and r = 1 is the ratio of the two group sizes)
α = 0.05 (5% significance level, i.e., 95% confidence level)
β = 0.2 (i.e., 80% statistical power, since power = 1 - β)
Z1-α = 1.645 (from the z statistics table, one-sided test)
Z1-β = 0.84 (from the z statistics table)
Δ = p2 - p1 = 0.40 - 0.30 = 0.1 (p2 is the expected conversion rate after the email and p1 is the current conversion rate)
n = ((1.645 + 0.84)² * 0.45) / (0.1²) ≈ 278
So we needed at least 278 users in each group (treatment users receiving the account optimization email followed by the 1:1 pitch, and control users receiving only the 1:1 pitch) to be able to reject, at a 95% confidence level with 80% power, the null hypothesis that the email does not increase the implementation rate.
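If you want to reproduce the number programmatically, here is a minimal Python sketch of the same calculation (illustrative, not part of the original analysis). It assumes the one-sided test and equal group sizes described above, pulls the z quantiles from scipy instead of a table, and uses a hypothetical function name:

```python
from math import ceil

from scipy.stats import norm


def sample_size_per_group(p1: float, p2: float, alpha: float = 0.05,
                          power: float = 0.80, r: float = 1.0) -> int:
    """Per-group sample size for a one-sided two-proportion test.

    p1: baseline conversion (control), p2: expected conversion (treatment),
    r: ratio of treatment to control group size (1.0 = equal groups).
    """
    z_alpha = norm.ppf(1 - alpha)   # one-sided critical value, ~1.645
    z_beta = norm.ppf(power)        # ~0.84 for 80% power
    sigma_sq = p1 * (1 - p1) + p2 * (1 - p2) * r
    delta = p2 - p1
    n = ((z_alpha + z_beta) ** 2 * sigma_sq) / delta ** 2
    return ceil(n)


if __name__ == "__main__":
    # Baseline 30% implementation rate, expecting 40% after the email.
    print(sample_size_per_group(0.30, 0.40))  # ~278-279 depending on z rounding
```

Using exact z quantiles instead of rounded table values gives roughly 278-279 users per group, in line with the hand calculation above.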
In the end, we were unable to reject the null hypothesis in favor of the alternate hypothesis, i.e., we did not find evidence that the email increases the implementation rate.
More details on power and significance level:
Statistical power: the probability of correctly detecting a difference between the control and treatment groups when one truly exists. A power of 0.80 means that, if the email really does lift the implementation rate, there is an 80% chance the test will detect it, and only a 20% chance of a false negative (missing a real effect).
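To make the 80% figure concrete, here is a rough Monte Carlo sketch (illustrative, not part of the original analysis). It simulates the experiment 20,000 times with 278 users per group and true implementation rates of 30% and 40%, runs a one-sided pooled two-proportion z-test at α = 0.05 each time, and counts how often the effect is detected; the empirical rate should come out near 0.80.

```python
import numpy as np
from scipy.stats import norm


def one_sided_z_test(x1, n1, x2, n2) -> float:
    """One-sided p-value for H1: p2 > p1, using a pooled two-proportion z-test."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p2_hat - p1_hat) / se
    return 1 - norm.cdf(z)


rng = np.random.default_rng(42)
n, p_control, p_treatment, alpha = 278, 0.30, 0.40, 0.05
n_sims = 20_000

rejections = 0
for _ in range(n_sims):
    x_control = rng.binomial(n, p_control)      # implementations without the email
    x_treatment = rng.binomial(n, p_treatment)  # implementations with the email
    if one_sided_z_test(x_control, n, x_treatment, n) < alpha:
        rejections += 1

print(f"Empirical power: {rejections / n_sims:.3f}")  # expect roughly 0.80
```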
Significance level: the probability of rejecting the null hypothesis (i.e., concluding there is a difference between the control and treatment groups) when it is actually true. At a basic level, the significance level α is the false positive rate: the fraction of experiments in which a conversion difference would be detected even though none actually exists. With α = 0.05, there is less than a 5% chance of declaring a difference between the control and the variant when no real difference exists, which is what running the test at a 95% confidence level means.
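The same simulation idea, appended to the sketch above (it reuses the one_sided_z_test helper and the other variables defined there), illustrates the significance level: when both groups share the same true 30% rate, the null hypothesis is actually true, so any "significant" result is a false positive and should occur only about 5% of the time.

```python
# False positive check: both arms share the same true rate, so every
# rejection below is a false positive. Reuses one_sided_z_test from above.
false_positives = 0
for _ in range(n_sims):
    x_a = rng.binomial(n, p_control)  # both arms at the 30% baseline
    x_b = rng.binomial(n, p_control)
    if one_sided_z_test(x_a, n, x_b, n) < alpha:
        false_positives += 1

print(f"Empirical false positive rate: {false_positives / n_sims:.3f}")  # ~0.05
```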
Product Data Analyst | Researcher
1y: A nice post, thanks! Just wondering, do you know formulas for hard cases when we can't use a z-test or t-test and don't even know the metric distribution? Usually I see that Monte Carlo simulation is used for the analysis, but nothing about size calculation?
Senior Recruiter at Heitmeyer Consulting
2y: Excellent details, thanks for sharing.
Product Leader | AI/ML | Strategy, Tactics, Leadership
2y: An interesting refresher, thanks! One gets used to yes/no findings from A/B testing; it helps to get a reminder of where a decision comes from. Do you have an equivalent explanation for multivariate testing?