A/B Testing: Planning and Executing
A/B testing, or split testing, is a method for measuring the impact of a change in a controlled environment. It can be used to test anything from whether a company should change its website layout to whether it should change the color of a specific button from blue to green. Well-planned A/B testing can make a huge difference in the effectiveness of your product and marketing efforts, but before running any tests, one should understand the essence of A/B testing and its results.
It all starts with deciding what to test. It may seem trivial, but one of the biggest flaws in A/B testing is not about the test itself; it is that people often aren't really testing the right thing. Once we figure that out, we select a specific target metric, along with the supporting metrics we wish to optimize, and set the expected test duration. We then prepare a test plan; at the end, we analyze and interpret the results to make data-driven decisions.
Testing the Right Thing
Setting up a proper A/B test takes a lot of time and effort, so we want to make sure the test is worth it. If we want to have the biggest impact, we need to make sure we always focus on solving the most important problems. Product managers often come up with test ideas for features that apply to only a fraction of users. I always say that just because we can fix something doesn't mean it's worth fixing, especially when we consider the opportunity cost of not using those resources to fix something of higher impact.
Focusing on impact means focusing on what can generate as much money as possible. It starts with understanding the top-level goal and digging into the performance metrics that make up that goal. Is there a high-opportunity segment that is currently underperforming? It could be any segment: a geography, a demographic, or new versus returning users.
Setting Test Duration
Estimating how long to run a test happens before we even start testing, because it helps scope and plan the test. It also helps with selecting a target metric, and it keeps us from biasing ourselves along the way.
Defining test duration is basically statistically quantifying the needed sample size by understanding the simple principles I talked about in my previous article: variability and confidence. I don't really want to get into the bits and bytes of the formulas that generate the required sample sizes, but there are many online calculators that can run them for you; Evan Miller's calculator is a good option when the selected target metric is a ratio (such as CTR or CR), and Statulator works for continuous variables (such as ARPU or CAC). Each calculator spits out the required sample size based on the desired lift, the metric's variability, and the confidence level.
Assume we'd like to run a test where the desired ARPU lift is 5% at a 95% confidence level. Based on the historical data and variability of our dataset, the calculator determined that detecting a 5% lift requires 10,000 users per variant, or 20,000 users in total. If our daily number of visiting users is approximately 4,000, then the test should run for 5 days (20,000 users divided by 4,000 daily users).
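As a rough illustration of that calculation, here is a minimal Python sketch that turns a desired lift into a sample size and a duration for a continuous metric such as ARPU. The baseline ARPU, its standard deviation, the 80% power level, and the daily traffic are my own assumptions, not numbers from this example; they are chosen only to land in the same ballpark as the 10,000-per-variant figure above.

```python
# A rough sketch, not the exact calculator referenced above.
# Baseline ARPU, its standard deviation, and 80% power are assumed values.
import math
from statsmodels.stats.power import TTestIndPower

baseline_arpu = 2.00      # hypothetical baseline ARPU (revenue per user)
std_arpu = 2.50           # hypothetical standard deviation of per-user revenue
relative_lift = 0.05      # we want to detect a 5% lift
alpha = 0.05              # 95% confidence level
power = 0.80              # assumed; the article does not specify a power level

# Cohen's d: absolute lift divided by the standard deviation
effect_size = (baseline_arpu * relative_lift) / std_arpu

n_per_variant = TTestIndPower().solve_power(
    effect_size=effect_size, alpha=alpha, power=power
)

daily_users = 4000
days = math.ceil(2 * n_per_variant / daily_users)
print(f"~{n_per_variant:,.0f} users per variant, roughly {days} days of traffic")
# About 9,800 users per variant with these assumptions, roughly 5 days of traffic
```

Even when the arithmetic suggests fewer days, the recommendation below about completing at least one full weekly cycle still applies.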
What if users’ behavior varies by the day of the week? What if on the weekend, users are more likely to convert? Or what if we have a weekly push notification every Monday? Will we get a surge of visiting users every Monday?
Some businesses have strong day-of-week effects, some don't. But because the day of the week could be a factor, and because we want enough data to trend over, I recommend running an experiment for a minimum of 1 week so it completes at least one full weekly cycle, even if, strictly in terms of the number of observations, the calculator suggests less.
Selecting a Target Metric
There are two classes of possible metrics to select:
- Means: such as ARPU (average revenue per user), CAC (customer acquisition cost), CPI (cost per install), etc. These are continuous variables that can take any value.
- Proportions: such as CR (conversion rate), CTR (click-through rate), etc. These are ratios that describe a percentage of a population; they can take any value between 0 and 1. (A quick sketch of the significance test that matches each class follows right after this list.)
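To make the distinction concrete, here is a minimal sketch of the test that usually pairs with each class: a two-proportion z-test for proportion metrics and Welch's t-test for mean metrics. The counts and the simulated revenue values below are hypothetical.

```python
# A minimal sketch; all counts and revenue values below are hypothetical.
import numpy as np
from scipy import stats
from statsmodels.stats.proportion import proportions_ztest

# Proportion metric (e.g., CR): conversions out of exposed users, per variant
conversions = np.array([480, 535])
users = np.array([10_000, 10_000])
_, p_cr = proportions_ztest(conversions, users)   # two-proportion z-test

# Mean metric (e.g., ARPU): one revenue value per user, per variant
rng = np.random.default_rng(1)
arpu_a = rng.exponential(scale=2.0, size=10_000)  # simulated control revenue
arpu_b = rng.exponential(scale=2.1, size=10_000)  # simulated treatment revenue
_, p_arpu = stats.ttest_ind(arpu_a, arpu_b, equal_var=False)  # Welch's t-test

print(f"CR:   p = {p_cr:.4f}")
print(f"ARPU: p = {p_arpu:.4f}")
```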
All these metrics are important and useful, and there is logic behind selecting each one of them; that's what makes experimentation tricky. Selecting a target metric ultimately depends on what we want to optimize.
There are some considerations that help us decide which metric to select. First, we need to understand what our top-line goal is and what we really care about. If our goal is improving revenue, then ARPU should probably be the selected target metric, as it normalizes revenue per user. Second, we need to ask ourselves which metric the experiment most directly impacts. By changing the button color, we most directly impact its CTR. But what if the metric we most care about is ARPU, because that's where the money is, while the metric we most directly impact is CTR? How should we choose one over the other? Well, it depends.
Generally, the metrics we most care about are at the end of the funnel. However, not all users progress down the funnel. Tests on end-of-funnel metrics will probably take longer to converge, not only because of the smaller sample size but also because of the high variability of continuous metrics. By selecting metrics at the beginning of the funnel, we decrease the time to converge by increasing the sample size (more users at the beginning of the funnel) and by decreasing variability (a user can either click or not).
But what if our test results showed that we improved CTR and worsened CR? Would that be such a successful test? Although CTR is the most directly impacted metric, and although there can be a significant lift in CTR, we must always verify that the test didn't have a negative effect on steps further down the funnel that will eventually hurt our top-line goal (such as ARPU). It is as simple as that: while more users are clicking, fewer users are converting, and that is not such a successful test after all, even though, statistically speaking, the test won significantly.
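One way to keep ourselves honest is to evaluate the target metric and its downstream guardrail metrics together before declaring a winner. The sketch below uses made-up counts for CTR and CR; it illustrates the decision logic, not a prescribed implementation.

```python
# A sketch of a guardrail check with hypothetical counts (control, treatment).
from statsmodels.stats.proportion import proportions_ztest

users = [10_000, 10_000]
clicks = [1_200, 1_380]        # CTR improved...
conversions = [310, 262]       # ...but CR regressed

_, p_ctr = proportions_ztest(clicks, users)
_, p_cr = proportions_ztest(conversions, users)

ctr_lift = clicks[1] / users[1] - clicks[0] / users[0]
cr_lift = conversions[1] / users[1] - conversions[0] / users[0]

print(f"CTR lift {ctr_lift:+.2%} (p = {p_ctr:.3f})")
print(f"CR  lift {cr_lift:+.2%} (p = {p_cr:.3f})")

if ctr_lift > 0 and p_ctr < 0.05 and cr_lift < 0 and p_cr < 0.05:
    print("Target metric won, but a downstream metric regressed -- not a real win.")
```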
Checking Results
Can we really make it through the entire test duration without peeking at the results? Some would say never peek, and there is a reason for that. Roughly speaking, at a 95% confidence level, every time we check the results there is a 5% chance of a false positive. If we keep peeking until we get the result we want, we increase the chance of a false positive and invalidate the conclusions one might draw from the experiment.
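A quick simulation makes the point: if the two variants are actually identical and we check for significance at several interim points, stopping as soon as any check looks significant, the false-positive rate climbs well above the nominal 5%. The simulation parameters below are arbitrary.

```python
# Simulating repeated peeking under the null (both variants identical).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_per_arm, n_peeks = 1_000, 5_000, 10
step = n_per_arm // n_peeks

false_positives = 0
for _ in range(n_sims):
    a = rng.normal(0.0, 1.0, n_per_arm)   # same distribution for both arms
    b = rng.normal(0.0, 1.0, n_per_arm)
    for k in range(step, n_per_arm + 1, step):
        if stats.ttest_ind(a[:k], b[:k]).pvalue < 0.05:
            false_positives += 1          # "winner" declared on pure noise
            break

print(f"False-positive rate with peeking: {false_positives / n_sims:.1%}")
# Typically lands well above the nominal 5%, closer to 15-20%
```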
But what if something is broken? What if it’s causing a terrible user experience or not really testing what we intended to test? What if the test is underperforming or going over its designated budget? Would we really want to wait to find that out?
I advocate for “educated peeking.” When we first launch the test, I advise starting by exposing only a small portion of the traffic, such as 10%-20% of users depending on how risk-sensitive we are, to verify that the test mechanics and settings are running as expected. After 1-2 days of this “dry run,” and if nothing unexpected occurred, we officially start the test with the designated split (normally 50-50). This is where the real test period starts, and we never include the dry-run period when analyzing test outcomes; otherwise, the change in split proportions may skew the results.
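For illustration only, here is one way such a ramp-up could be wired: deterministic, hash-based bucketing where the treatment share starts small for the dry run and then moves to 50-50. This is a sketch of the general technique, not the author's setup, and the experiment name is made up.

```python
# A minimal bucketing sketch; "checkout_button_color" is a hypothetical experiment.
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_share: float) -> str:
    """Hash user + experiment into [0, 1) and bucket by the current treatment share."""
    key = f"{experiment}:{user_id}".encode()
    bucket = (int(hashlib.md5(key).hexdigest(), 16) % 10_000) / 10_000
    return "B" if bucket < treatment_share else "A"

# Dry run: expose ~10% of users to the variant to validate the mechanics.
print(assign_variant("user_42", "checkout_button_color", treatment_share=0.10))

# Official test: move to a 50-50 split. Data from the dry-run period is
# excluded from the analysis so the change in proportions doesn't skew results.
print(assign_variant("user_42", "checkout_button_color", treatment_share=0.50))
```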
Educated peeking also means setting periodic checkpoints to ensure that we are not going over budget and that there are no sudden disasters. If there are, either raise a flag or end the experiment immediately.
Do not draw conclusions about the hypothesis during the dry run or the test period. Peeking is about validating that the mechanics are correct, not about checking results. If something is seriously amiss, end the experiment without drawing any conclusions, fix what needs to be fixed, and relaunch it.
Interpreting Results
There is more to A/B testing than plugging numbers into a calculator. Some people mistakenly think that A/B testing is a straightforward tool that spits out exactly what we should do next, as if we crunch some numbers and, at the end, it tells us what decision to make. That's not the essence of A/B testing.
So how do we make good decisions based on A/B results? First, we need to accept one simple fact: we are all biased. We want our experiments to win. Not because we're cheaters, but because we want to make an impact on our business. We may ignore trends if they muddle the results, or we may lean too heavily on trends if they support the hypothesis. These are natural impulses. Therefore, we should challenge our test results and not work too hard to prove what we want or disprove what we don't want.
Interpreting results to make a decision is more of an art than a science. It requires critical thinking; we shouldn't just act on a calculator readout. We need to think through all the information that we have: the test result output, daily trends, supporting metrics, and the long-term impact of the test on our business.
Wrapping It All Together
This is really the guts of A/B testing. It all starts with figuring out what to test next and where the biggest opportunities are. Research those opportunities, generate ideas for how to tackle each one, and prioritize the list. Always focus on solving the most important problems.
Quantifying the expected sample size helps us plan the test duration. Based on the desired lift of the selected target metric, its variability, and the confidence level, we transform the required sample size into a test duration. To have enough data to trend over and to dismiss day-of-the-week influences, I recommend running tests for a minimum of one full weekly cycle, even if the calculator suggests less.
When the expected test period is extremely long, such that it would not be effective to run the experiment, selecting metrics at the beginning of the funnel might be worth considering. It may be risky, and we should always verify that the experiment has no negative impact on the end-of-funnel metrics we really care about and that it doesn't skew long-term results. Otherwise, end-of-funnel metrics such as ARPU should be selected.
A 1- to 2-day dry-run period on a smaller portion of traffic is recommended to validate the test mechanics, so we are testing what we really intend to test. If no disaster happens, the test officially starts. While some advocate for a no-peeking rule during an experiment, I strongly encourage setting educated peeking rules, but do not draw any conclusions before the expected test duration is over.