A/B Testing Statistics: By a Beginner, for Beginners
Thinking of running an A/B test for the first time but finding it tough to wrap your head around the statistics?
I understand the feeling. The statistics get confusing in no time, and soon you feel like you're flying through a cloud of confusion.
A lot of videos, dozens of articles, a few papers, and some slides later, I was able to understand at least the common terms that you will find in any online sample size calculator for A/B testing.
In this post I’ve tried to explain them in an easy manner -- at least that’s what I think, so please let me know if anything below confuses you.
I’ll pick CXL’s calculator for reference. I feel it is the best one for a beginner. It offers both pre-test and post-test analysis and gives you more metrics than other calculators.
So, let’s get straight to what is what. The common terms you come across while using any sample size calculator are: p-value, confidence level, statistical power, statistical significance, minimum detectable effect (MDE), and confidence interval.
Let’s understand what these beasts are.
P-value, Confidence Level, and Power
You might be wondering why this guy is clubbing three terms into one section! It’s because they are interrelated, and I feel it makes more sense to explain them together.
But before we go there, let’s first meet the two devils you constantly fight in most statistical analyses: the type-1 error (a false positive, where you see an effect that isn’t really there) and the type-2 error (a false negative, where you miss an effect that really is there).
We use a p-value threshold to limit the chances of getting a false positive, or type-1 error. Statistical wisdom says to set this threshold to 0.05, which is nothing but a 95% confidence level.
So the p-value threshold and the confidence level are tomato and tomahto: two ways of saying the same thing. And both help us keep type-1 errors at bay.
Now, what does a p-value of 0.05 signify? It tells you how strong the evidence your test has gathered is for rejecting the null hypothesis. In other words, it suggests the results you got weren’t just a fluke or random chance.
If you are a beginner like me, just accept the explanation above. When you go deeper, you’ll find that my explanation wasn’t completely right, and also how tough it is to explain the p-value without throwing in some statistical terms and plotting the normal distribution curve.
For the time being, you will do fine without knowing the p-value in depth.
Now that we know about the p-value and confidence level, let’s understand the sneaky relationship between the p-value and type-2 errors.
The stricter your p-value threshold, the higher your risk of type-2 errors. Yup! It means when you set your p-value threshold to .05, you cut the risk of type-1 errors down to 5% (yay!) but push up the risk of type-2 errors in return (buzz killer).
That’s right: the two risks pull against each other.
This is where power comes into the picture.
Now, what is power? Put simply, it is the probability that your test detects an effect when one really exists -- in other words, the probability of keeping type-2 errors at bay.
With great power comes a great responsibility: avoiding type-2 errors. Again, wisdom from our ancestors who delved into the realm of statistics says it is fine to set the power to 80%.
If your test has a power of 80%, there is a good chance (an 80% chance, to be exact) that you will avoid a type-2 error.
So far, these are the default numbers: a p-value threshold (significance level) of 0.05, which is the same thing as a 95% confidence level, and a power of 80%.
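If you like seeing these numbers as code, here is a tiny Python sketch of how the defaults relate to each other. It is purely for illustration; the variable names are mine, not from any particular calculator:

```python
# How the default numbers relate to each other (illustrative only).
alpha = 0.05                  # p-value threshold, a.k.a. significance level
confidence_level = 1 - alpha  # 0.95 -> the familiar 95% confidence level
power = 0.80                  # probability of catching a real effect
beta = 1 - power              # 0.20 -> the type-2 error risk we accept

print(f"Confidence level: {confidence_level:.0%}, type-1 risk: {alpha:.0%}")
print(f"Power: {power:.0%}, type-2 risk: {beta:.0%}")
```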
If you ensure these three, you are rewarded with the gift of statistical significance.
Now, what is statistical significance? It means you have kept the risk of type-1 and type-2 errors acceptably low, so your testing results are likely to be close to reality.
If you don’t reach statistical significance, the results your test shows might be due to random chance, and not due to what you assumed (your hypothesis).
If your test can’t reach statistical significance, running it would be a waste of time. That’s why it is advised to calculate the sample size before you begin testing.
Minimum Detectable Effect (MDE)
So far we have covered the easy things; now let’s try to understand what the MDE, or minimum detectable effect, is.
Put simply, it is the amount of change in conversion rate, or lift, you expect from your variant. So if you believe your variant can lift your conversions from 120 to 140, you are assuming an MDE of (140 - 120) / 120 ≈ 16.7% (relative).
Now here is an important thing -- the higher the MDE, the smaller the required sample size, and vice versa. In a way, the MDE is inversely related to the sample size.
But what does this mean?
It means that, depending on your traffic and budget, you have to create a variant whose expected lift (MDE) makes it feasible to run a test in the first place.
Thus, if you have a website with low traffic and you are testing very small, barely apparent changes, there is a high chance you won’t be able to detect their effect.
The less apparent a change is, the less likely it is to influence your buyers, which translates to a low MDE and a huge required sample size.
That’s why the common advice is to combine multiple changes in a single test. It increases the chances of influencing your buyers, which translates to a higher MDE and a smaller required sample size.
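To see this inverse relationship in numbers, here is a rough sketch in Python using the statsmodels library. The 2% baseline conversion rate is a made-up example, and the figures will differ a bit from CXL’s calculator, since different tools use slightly different formulas:

```python
# Sketch: how the required sample size per variation shrinks as the MDE grows.
# Assumes a hypothetical 2% baseline conversion rate and the default
# alpha = 0.05 and power = 0.80. Numbers are illustrative, not from CXL.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.02
solver = NormalIndPower()

for mde in (0.05, 0.10, 0.20, 0.40):              # relative lifts of 5% ... 40%
    variant_rate = baseline_rate * (1 + mde)
    effect = proportion_effectsize(variant_rate, baseline_rate)  # Cohen's h
    n = solver.solve_power(effect_size=effect, alpha=0.05, power=0.80, ratio=1.0)
    print(f"MDE {mde:>4.0%} -> roughly {n:>9,.0f} visitors per variation")
```

Even with made-up numbers, the pattern matches the advice above: halving the MDE roughly quadruples the number of visitors you need.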
Off-topic aside: I found the MDE to be the most confusing and counterintuitive concept so far. I thought the p-value would :p me (lol), but it was the MDE that kept me on my toes over the last week.
In a way, you assume an MDE value to calculate your sample size. I may think a variant will lead to an 18% lift; my colleague may differ and say it would be 10%.
Maybe he is right, or maybe we are both wrong! So it becomes subjective.
Also, how can I estimate the right MDE for a variant? These were some of the questions I asked myself.
Later on, I found an explainer by Bhavik on calculating the MDE (thanks, Bhavik!), but I’m still a bit confused. The coolest thing was that Craig Sullivan also suggested reading his article.
So far, my solution is to check the statistical significance, or p-value, at the end of the test. Use the post-test analysis tool to find what additional sample size you may need to reach statistical significance, and run the test for a few more days.
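If you’d rather sanity-check the end-of-test p-value yourself, here is a small sketch of a post-test significance check in Python with statsmodels. The conversion counts and visitor numbers below are made up, and the post-test analysis tool mentioned above may use a somewhat different test under the hood:

```python
# Sketch of a post-test significance check (two-proportion z-test).
# All numbers below are hypothetical, for illustration only.
from statsmodels.stats.proportion import proportions_ztest

conversions = [200, 248]     # control, variant-1
visitors = [20000, 20000]    # visitors per variation

z_stat, p_value = proportions_ztest(conversions, visitors)
print(f"p-value: {p_value:.3f}")

if p_value < 0.05:
    print("Statistically significant at the 95% confidence level.")
else:
    print("Not significant yet -- you may need more traffic or more time.")
```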
Confidence Interval
It is not easy (at least for me) to explain what a confidence interval is without bringing in terms like sample mean, population mean, and sampling error.
At this juncture, it isn’t that important for you to understand exactly what a CI is, but let’s understand what it tells you and how to interpret it.
Below is a screenshot of a post-test analysis. The conversion rate of variant-1 was 1.12% and the confidence level was 95%.
You can see that the confidence interval for variant-1 lies between 1% and 1.3%. This is the range of my confidence interval.
So what does this information tell me?
It tells me that if I re-ran the same experiment many times on the same website, with the same sample size and a significance level of .05, and computed a 95% confidence interval each time, then about 95 out of 100 of those intervals would contain the true conversion rate of variant-1.
In a way, the confidence interval gives a range that is likely to contain the true value, with a certain level of confidence.
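For the curious, here is a minimal sketch of computing such an interval yourself in Python. The conversion and visitor counts are hypothetical, chosen only so the rate works out to roughly the 1.12% from the screenshot, and CXL may use a different interval formula:

```python
# Sketch: 95% confidence interval for a variant's conversion rate.
# Hypothetical counts chosen to give a rate of about 1.12%.
from statsmodels.stats.proportion import proportion_confint

conversions = 224
visitors = 20000

rate = conversions / visitors
low, high = proportion_confint(conversions, visitors, alpha=0.05, method="normal")

print(f"Conversion rate: {rate:.2%}")
print(f"95% confidence interval: {low:.2%} to {high:.2%}")
```

With these made-up numbers the interval comes out close to the 1% to 1.3% range described above.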
Wrong Way to Interpret a CI
A common mistake is to read the interval as “there is a 95% chance the true conversion rate of variant-1 lies between 1% and 1.3%.” The 95% refers to how often the method captures the true rate over many repeated experiments, not to the probability for this one particular interval.
Wrapping it Up
I hope these explanations help you kickstart your journey to understand A/B test statistics. Let me know what you think in the comments. And if you want to up your A/B testing statistics ante, I recommend Georgi Georgiev's course on A/B testing statistics.
I believe it is the only course, or at least among the very few, that covers the statistics of A/B testing for web experiments. It is tailored for CRO practitioners, which makes it unique.