A/B Testing: Variability, Sample Size, Confidence, and Everything in Between
A/B testing, or a split test, is a method to measure the impact of a change in a controlled environment. It can be used to test anything from whether a company should change its website layout to whether it should change the color of a specific button from blue to green. It starts by randomly assigning each user to a single variant, where A typically refers to the control group (or the default version) and B typically refers to the test group. Each user experiences version A or version B while everything else is kept constant. We then measure the output performance and use statistics to compare, analyze, and interpret the results.
Well-planned A/B testing can make a huge difference in the effectiveness of your product and marketing efforts. But before running a test, you should understand the concepts that allow you to conduct tests properly at your own organization.
Why Test?
Testing allows you to make careful changes. We first construct a hypothesis, collect data, and then learn how a certain element impacts major metrics. There are several methods to test a hypothesis, and A/B testing is the most common of them all. Let’s review the alternatives:
Assume we’d like to test the effect of changing a call-to-action (CTA) button from blue to green and measure its impact on major metrics. Our hypothesis is that this change will increase the button’s CTR (click-through rate) and eventually have a positive effect on the website’s revenue, measured by ARPU (average revenue per user).
Before-and-After Testing
In this type of testing, we run just one experience for all users at a certain time. After a period, we change the experience (green button instead of the blue button) and compare the outputs.
Because the two experiences don’t run simultaneously, we are exposed to external and internal factors that may impact the results in ways that aren’t necessarily related to the change we actually wanted to test (the button color). Holidays and special events, such as Black Friday, might affect performance, as might other external conditions, such as a pandemic or even a tweet by a social influencer related to our product. A special sale initiated by the marketing team might also shift the results in a way we can’t predict.
Market Testing
In market testing, we run two versions simultaneously, but instead of randomly allocating users, we determine user allocation by location. For example, users from the US will be exposed to the blue button while users from the UK will be exposed to the green button.
Market testing allows you to run a test on a controlled timeline “without” interference from internal factors, but it has other complications and is subject to market preferences. Each market can behave differently from the others: a Black Friday promotion might impact each market differently, national holidays can take place in one market but not in the other, and even extreme weather in one market may influence performance.
Both alternatives can be powerful tools, even better than A/B testing under specific circumstances. But what makes A/B testing so special is that it controls for so many factors, both internal and external. It allows us to isolate the effect of the one thing we really want to test in a controlled experiment, such that if something unexpected happens, it happens in both variants. Users are randomly assigned to variant A or variant B, so there shouldn’t be any bias between the experimental groups.
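To make the random assignment concrete, here is a minimal sketch in Python. It assumes users are identified by some ID and that hashing the ID together with an experiment name is an acceptable way to get a sticky, roughly 50/50 split; the function and experiment name are illustrative, not a reference to any particular tool.

```python
import hashlib

def assign_variant(user_id: str, experiment: str = "cta_color") -> str:
    """Deterministically assign a user to variant A (control) or B (test).

    Hashing the user ID together with the experiment name gives a sticky,
    roughly 50/50 split: the same user always sees the same variant, and the
    allocation is independent of geography, time, or user behavior.
    """
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # bucket in [0, 99]
    return "A" if bucket < 50 else "B"

# Example usage: the same ID always maps to the same variant
for uid in ["user-1", "user-2", "user-3"]:
    print(uid, "->", assign_variant(uid))
```

Hash-based bucketing, rather than a fresh coin flip on every visit, is one common way to keep the experience consistent for returning users.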
Statistics vs. Statistical Intuition
Before continuing, let’s talk about statistical intuition, not statistics per se. Statistical intuition is the intuition we implicitly use on a regular basis without thinking of it as statistics. Let’s review two examples:
I tend to check the weather forecast in the wintertime to decide whether I should ride my bicycle to work or drive my car. If there’s a 5% chance of rain, I will probably take my bicycle, but what if there’s an 80% chance of rain? If I prefer to ride my bicycle when there is a 5% chance of rain and drive my car when there is an 80% chance, it means that I understand probability.
I am a big fan of basketball, especially of the Hapoel Jerusalem Basketball Club. If I told you my team won 3 consecutive games, would you be impressed? How about 20 consecutive games? Obviously winning 20 consecutive games is much more impressive than 3. That’s implicitly understanding sample size and variability.
Statistics is a framework to quantify these factors, but to successfully run A/B testing, you don’t really need to know the formulas behind the statistics. All you need is to understand the concepts behind it: variability, sample size, and confidence.
How Variability Affects the Outcome
Imagine a world with no variability. Assuming a 3% CTR with a blue button on the website, having no variability would mean that every day we see exactly 3% CTR. Say we changed the button to green and plotted daily performance for the 30 days prior to the change and 5 days into the change. In a world with no variability, the chart would simply be a flat line at exactly 3% that jumps to a new flat line on the day of the change.
We went from a 3% CTR to 4% and generated a lift of 33% (4% over 3%).
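As a quick sanity check of that arithmetic, here is a minimal sketch of the lift calculation (the 3% and 4% CTRs come from the example above):

```python
ctr_blue = 0.03    # control (blue button) CTR
ctr_green = 0.04   # test (green button) CTR

# Relative lift: the change in CTR expressed as a fraction of the control CTR
lift = (ctr_green - ctr_blue) / ctr_blue
print(f"Lift: {lift:.0%}")  # Lift: 33%
```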
But the real world doesn’t really look like that. We may have a graph that is more volatile and bounces around 3%. In a world with variability, it’s no longer that trivial to measure the effect. Even if we made the calculation, how confident would we be that the green button performs better than the blue button?
Variability adds uncertainty, and it is embedded in a lot of the decisions we make. We are less certain when variability is high and more certain when variability is low. Standard deviation (STDV) is the most common measure of variability: it summarizes how far each point is from the average. The higher the standard deviation, the more variability there is in a data set.
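As a small illustration, here are two hypothetical series of daily CTRs that share the same 3% average but have very different standard deviations (the numbers are invented for the sketch):

```python
import statistics

# Two hypothetical series of daily CTRs, both averaging exactly 3%
low_variability = [0.030, 0.029, 0.031, 0.030, 0.030, 0.031, 0.029]
high_variability = [0.020, 0.041, 0.025, 0.038, 0.022, 0.040, 0.024]

for name, daily_ctr in [("low", low_variability), ("high", high_variability)]:
    mean = statistics.mean(daily_ctr)
    stdv = statistics.stdev(daily_ctr)  # sample standard deviation
    print(f"{name:>4} variability: mean = {mean:.2%}, STDV = {stdv:.2%}")
```

Both series report the same average CTR, but the second one leaves us far less certain that 3% is the true rate.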
How Confident We Are of the Outcome
In our previous example, when changing the button color to green, we gained a lift of 33% in CTR. Should we permanently change to green? With the blue button, the CTR was 3% with a 0.32% STDV, and with the green button, the CTR is 4% with a 0.28% STDV. But in our data set, there are only 5 data points for the green-button version, so we just don’t have enough confidence to conclude that its true CTR is 4%.
In the following graph, the black dots represent the daily CTR, and the orange line is the running average CTR. The running average is 2% after one day, 3.9% after two days, and 3.5% after three days.
The more data points we have, the less this running average bounces around, such that we no longer see big spikes after a certain number of observations like we did in the first few. It’s consistent with the intuition that averages get locked in after a while. Just knowing the average of a sample doesn’t mean it’s the true average. The more data we have, the closer this running average comes to the true average. Therefore, we should feel uncomfortable concluding that the green button performs better than the blue button after only 5 days.
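A small simulation illustrates this convergence. It assumes a “true” CTR of 4% and 1,000 users per day (both numbers are made up for the sketch); the running average is noisy in the first few days and settles down only as observations accumulate.

```python
import random

random.seed(7)
true_ctr = 0.04       # assumed "true" CTR of the green button
users_per_day = 1000  # assumed daily traffic, for illustration only
days = 60

daily_ctrs = []
for day in range(1, days + 1):
    # Simulate one day: each user clicks with probability true_ctr
    clicks = sum(random.random() < true_ctr for _ in range(users_per_day))
    daily_ctrs.append(clicks / users_per_day)
    running_avg = sum(daily_ctrs) / len(daily_ctrs)
    if day in (1, 2, 3, 5, 10, 30, 60):
        print(f"day {day:>2}: daily CTR = {daily_ctrs[-1]:.2%}, "
              f"running average = {running_avg:.2%}")
```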
Quantifying Confidence
Confidence has a specific meaning in statistics: it measures uncertainty. It specifies how confident we are that when we observe a difference between version A and version B, there really is a difference. A confidence level can take any number between 0% and 100%, with the most common being 95%. A 95% confidence level also means we are allowing a 5% chance of concluding that there is a difference when there really isn’t. This is called a false positive (or type I error) and is denoted by the Greek letter α.
The higher the confidence level, the less likely we are to get a false positive as it means that we are allowing less room for error. But the higher the confidence level, the harder it is to achieve statistically significant results as it requires a larger sample size and/or lower variability.
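One way to see this trade-off is with a standard two-proportion z-test, a common (though not the only) way to compare two CTRs. In the sketch below, the same observed 3% vs. 4% difference is not significant at a 95% confidence level with 1,000 users per variant, but it is with 10,000; the sample sizes are hypothetical.

```python
import math

def z_test_two_proportions(clicks_a, n_a, clicks_b, n_b):
    """Two-sided two-proportion z-test for a difference in CTR."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

alpha = 0.05  # 95% confidence level
scenarios = [
    (30, 1_000, 40, 1_000),      # 3% vs. 4% with 1,000 users per variant
    (300, 10_000, 400, 10_000),  # 3% vs. 4% with 10,000 users per variant
]
for clicks_a, n_a, clicks_b, n_b in scenarios:
    z, p = z_test_two_proportions(clicks_a, n_a, clicks_b, n_b)
    verdict = "significant" if p < alpha else "not significant"
    print(f"n = {n_a:>6} per variant: z = {z:.2f}, p-value = {p:.4f} -> {verdict}")
```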
Wrapping It All Together
Understanding variability, sample size, and confidence is important when running A/B tests. We are less certain when there is high variability. Conversely, we are more certain when we have low variability. Standard deviation is the commonly used measure of variability: the higher the STDV, the more variability there is.
We are less certain when we have less data. Conversely, we are more certain when we have more data. This certainty is what builds our confidence in a data set: we are more confident with lower variability and a larger sample size, and less confident with higher variability and a smaller sample size.
We measure uncertainty with a confidence level. A 95% confidence level implies there is 95% certainty that when we conclude there is a difference, there really is a difference. Conversely, it means there is a 5% chance that when we conclude there is a difference, there really isn’t (a false positive). The higher the confidence level, the less likely we are to get a false positive. But the higher the confidence level, the larger the sample size and the lower the variability we need in order to reach significant results.