How to: Best practices for applying A/B testing in businesses.
A/B testing shifts the culture in businesses from decisions by persuasion to decisions by experiment. You enable your most junior people to test their best ideas. Good ideas and a fresh perspective can come from everywhere.
Brief explanation
An A/B test is an experiment with two groups of subjects to establish which of two treatments, products, or procedures is superior. Often one of the two is the standard existing treatment, or no treatment at all; the group that receives it is called the control.
A proper A/B test has subjects who can be assigned one treatment or the other. The key is that subjects are randomly assigned to the treatments. This way we know that any difference between the treatment groups is due to one of two things:
1. The effect of the different treatments.
2. Luck of the draw.
Why have a control group?
Why not omit the control group, simply apply the treatment we are interested in to a single group, and compare the result with previous experience? Without a control group, there is no guarantee that “all other things are equal” and that any differences are due to the treatment (or chance).
6 steps when running an A/B test in a real business
1. Goals & Metrics
In a standard A/B experiment, we must decide on a metric in advance. Several behavioral metrics can be collected and may be of interest, but if the experiment is expected to lead to a decision between treatment A and treatment B, a single metric or test statistic needs to be established in advance. Selecting a test statistic after the experiment has run opens the door to researcher bias.
Questions to answer at this stage:
Run A/A tests. You should be running lots of A/B tests, constantly innovating. However, if you don’t have a constant stream of tests and there is testing downtime, you may as well run A/A tests. As you might imagine, an A/A test pits a control group against another control group. What is the value of that?
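One commonly cited use of an A/A test is to check the testing setup itself: both groups get the same experience, so a sound pipeline should report a "significant" difference only about as often as the chosen significance level. The sketch below simulates that idea. Everything in it is an assumption for illustration: the 10% conversion rate, the group sizes, the function names, and the choice of a pooled two-proportion z-test.

```python
import random
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:
        return 1.0  # nobody (or everybody) converted: no evidence of a difference
    z = (conv_a / n_a - conv_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_aa_false_positive_rate(rate=0.10, n_per_group=1000,
                                    n_tests=300, alpha=0.05, seed=42):
    """Simulate many A/A tests and return the share flagged as significant."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_tests):
        # Both groups are drawn from the same (hypothetical) conversion rate.
        conv_a = sum(rng.random() < rate for _ in range(n_per_group))
        conv_b = sum(rng.random() < rate for _ in range(n_per_group))
        p = two_proportion_p_value(conv_a, n_per_group, conv_b, n_per_group)
        if p < alpha:
            false_positives += 1
    return false_positives / n_tests


if __name__ == "__main__":
    # Should land in the neighbourhood of alpha; a much higher share in a real
    # A/A test would point at a problem with assignment or instrumentation.
    print(simulate_aa_false_positive_rate())
```

If real A/A tests on your site are flagged far more often than the significance level suggests, that hints at a bug in assignment or tracking rather than a real effect.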
2. Hypothesis
A hypothesis is a statement that gives us a starting point: a specific, testable claim that the evidence from the experiment will either support or fail to support. Without a hypothesis, our experiment would lack structure and it would be difficult to make sense of the results.
To ensure clarity and conciseness, it is essential to create a hypothesis that is straightforward and explicitly states what kind of metrics we expect to change. In the case of having multiple hypotheses, it is also advisable to list them all in order to preempt any follow-up questions that may arise once the experiment concludes.
Why not simply observe the outcome of the experiment and go with the treatment that works best?
The answer lies in the mind’s tendency to underestimate the extent of natural random behavior. One manifestation of this is the failure to anticipate extreme events. Another is the tendency to misinterpret random events as meaningful patterns. Hypothesis testing was devised as a means of protecting researchers from being fooled by random chance.
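To make this concrete, here is a minimal sketch of one kind of hypothesis test, a permutation test on the difference in means. It asks how often a purely random relabelling of the users would produce a gap at least as large as the one observed. The function name and the sample data are made up for illustration, and other tests (a t-test or a z-test, for example) may suit your metric better.

```python
import random


def permutation_test(group_a, group_b, n_permutations=5000, seed=0):
    """Estimate a two-sided p-value for the difference in group means.

    Shuffles the pooled observations many times and counts how often a random
    relabelling gives a difference at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        mean_a = sum(pooled[:n_a]) / n_a
        mean_b = sum(pooled[n_a:]) / (len(pooled) - n_a)
        if abs(mean_a - mean_b) >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_permutations


if __name__ == "__main__":
    # Made-up data: 1 = converted, 0 = did not convert.
    control = [1] * 20 + [0] * 180
    treatment = [1] * 40 + [0] * 160
    print(permutation_test(control, treatment))
```

A small p-value says that chance alone rarely produces a gap this large; it does not say how large or how valuable the effect is.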
Questions to answer at this stage:
You want your hypothesis to be logical. A/B testing is a method that is more valuable for optimizing known good workflows and designs. If we’re just throwing a bunch of bad ideas into an A/B test then it’s kind of a waste of everybody’s time.
3. Scenario Planning
4. Experiment Design
Think through the whole test before you run it.
In this stage, we’ll ensure that everything is okay to start the experiment. We’ll review some of the questions we have already answered. Although it’s a long list, as we run more and more tests, some of these answers will become standardized.
Make sure that you define before the experiment:
The first question to address is which visitors are eligible to be in the experiment at all. Some visitors may be excluded from the experiment completely.
The next question is to address how many of those should be sent to the treatment. Ideally, one splits the traffic 50/50, but that is not always the case. One common practice among novice experimenters is to run new variants for only a small percentage of users. This is bad practice because experiments will have to run longer. One should “ramp up” the experiment, increasing the proportion of traffic sent to the treatment over time to limit risk but ultimately reach 50% of traffic being diverted to the treatment.
There must be a reliable mechanism to assign visitors to the control or the treatment. That is, users must be assigned to the control (or treatment) both randomly and consistently. In terms of randomness, there should be no bias. Assuming a desired 50/50 split, each user should be equally likely to end up in either variant.
One approach is to use a good random-number generator, pre-assign users to a variant, and store which variant they belong to in some database table or perhaps a cookie. We require that the user is consistently assigned to the same variant on multiple visits to the site. For instance, one could apply a mod or suitable hash function to each customer’s ID. Having a user switch variants will be confusing for them and muddy the data and its analysis.
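The hash-based approach can be sketched as follows. This is one possible implementation under stated assumptions, not a prescribed one: the function name `assign_variant`, the experiment label `"exp-1"`, and the choice of SHA-256 are all illustrative.

```python
import hashlib


def assign_variant(user_id, experiment="exp-1", treatment_share=0.5):
    """Deterministically map a user to 'control' or 'treatment'.

    The same (experiment, user_id) pair always yields the same variant, so the
    assignment stays consistent across visits without a lookup table.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Read the first 8 hex digits as an integer and scale it to [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "treatment" if bucket < treatment_share else "control"


if __name__ == "__main__":
    print(assign_variant(12345))
    print(assign_variant(12345))  # same user, same variant
```

Including the experiment label in the hashed key means a user's variant in one experiment does not determine their variant in another. Because a user's bucket never changes, raising `treatment_share` only moves users from control into treatment, which suits a gradual ramp-up.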
What the required sample size is, and how to calculate it, is out of the scope of this article. However, keep in mind that if you try to get away with a smaller-than-necessary sample size, you are likely to get false results or fail to identify real treatment effects. There are different sample-size calculators for different situations. Use the tools available.
How long will the test run? If, say, it’s an A/B test for a website, you can divide the total sample size by the average daily traffic to estimate the number of days to run the experiment. If traffic turns out to be lower than average over those days, you must extend the experiment until the sample size is reached. We now have our sample sizes and time.
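For a rough feel of what such calculators do, here is a sketch of one common textbook approximation for comparing two conversion rates with a two-sided test. The function name, the default significance level (0.05) and power (0.80), and the example rates are assumptions; a dedicated calculator remains the better choice for real decisions.

```python
import math
from statistics import NormalDist


def sample_size_per_group(baseline_rate, expected_rate, alpha=0.05, power=0.80):
    """Approximate users needed in EACH group to detect the given change.

    Normal approximation for two proportions:
    n = (z_(1 - alpha/2) + z_power)^2 * (p1(1 - p1) + p2(1 - p2)) / (p1 - p2)^2
    """
    if baseline_rate == expected_rate:
        raise ValueError("the two rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = (baseline_rate * (1 - baseline_rate)
                + expected_rate * (1 - expected_rate))
    effect = expected_rate - baseline_rate
    return math.ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


if __name__ == "__main__":
    # Hypothetical goal: detect a lift from a 10% to a 12% conversion rate.
    print(sample_size_per_group(0.10, 0.12))
```

Note how quickly the requirement grows as the expected effect shrinks: halving the difference between the two rates roughly quadruples the sample size.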
Or do we? If you ran the experiment for four days, from Monday to Thursday, would you expect the same effect, the same demographics of visitors and online behavior as if you ran it from Friday to Monday? In many cases, no, they differ. There is a day-of-the-week effect. Thus, if the sample size calculator says four days, it is often advised to run it for seven days to capture a complete week. If the sample size calculator says 25 days, run it for four weeks, and so on.
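The duration estimate and the round-up to whole weeks can be sketched like this, assuming the estimate is the required total sample size divided by the average daily eligible traffic. The function name and the traffic figures are hypothetical.

```python
import math


def experiment_duration_days(total_sample_size, avg_daily_traffic):
    """Return (raw_days, days_rounded_up_to_full_weeks)."""
    if avg_daily_traffic <= 0:
        raise ValueError("avg_daily_traffic must be positive")
    raw_days = math.ceil(total_sample_size / avg_daily_traffic)
    # Round up to complete weeks so every day of the week is covered equally.
    full_weeks = math.ceil(raw_days / 7)
    return raw_days, full_weeks * 7


if __name__ == "__main__":
    print(experiment_duration_days(8000, 2000))   # 4 raw days -> run for 7
    print(experiment_duration_days(25000, 1000))  # 25 raw days -> run for 28
```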
Don’t test multiple hypotheses in a single variant. If one variant changes several things at once, you cannot tell which change caused the effect.
Questions to answer (or already answered) at this stage:
5. Running the experiment
Assuming that you have the treatments implemented and the site instrumented to collect the data that you need, the issues of assigning individuals and starting and stopping the test still remain.
When you start a test, you can flip the switch and divert 50% of your traffic to the treatment. The problem is that if there are any major software bugs, and you present customers with a mangled, broken experience, you are likely to drive those customers away, and you’ve exposed 50% of your site traffic to that experience. Instead, you can take a more risk-averse approach and do a slower ramp-up, monitoring the metrics carefully.
• 1% in treatment for 4 hours.
• 5% in treatment for 4 hours (i.e., switch an additional 4% from control to treatment).
• 20% in treatment for 4 hours.
• 50% in treatment for the remainder of the experiment.
Of course, if you do see an issue, there must be an abort button that can be hit.
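A ramp-up schedule with an abort switch might be encoded as in the sketch below. The stage values mirror the example schedule above; the names `RampStage`, `RAMP_SCHEDULE`, and `treatment_share_at` are hypothetical, and a real system would usually read the abort flag from a feature-flag or configuration service.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RampStage:
    treatment_share: float   # fraction of eligible traffic sent to treatment
    hours: Optional[float]   # None means "until the experiment ends"


RAMP_SCHEDULE = [
    RampStage(0.01, 4),
    RampStage(0.05, 4),
    RampStage(0.20, 4),
    RampStage(0.50, None),
]


def treatment_share_at(hours_elapsed, schedule=RAMP_SCHEDULE, aborted=False):
    """Return the treatment share that should be live after `hours_elapsed`."""
    if aborted:
        return 0.0  # abort button: send everyone back to control
    stage_start = 0.0
    for stage in schedule:
        if stage.hours is None or hours_elapsed < stage_start + stage.hours:
            return stage.treatment_share
        stage_start += stage.hours
    # Schedule exhausted without an open-ended stage: keep the last share.
    return schedule[-1].treatment_share


if __name__ == "__main__":
    for t in (0, 4, 8, 12, 48):
        print(t, treatment_share_at(t))
```

Keeping the schedule as data rather than code makes it easy to review before launch and to reuse across experiments.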
Once you have ramped up the experiment and you are confident that there are no egregious issues, the best advice is to set it and forget it.
6. Experiment review
The teams that struggle most with making decisions are the ones that are least clear about what they care about most.
Data can’t think long-term for you. It doesn’t make decisions. It is information that you need to inform your thinking. But if you react quickly without understanding what these numbers truly signify and how they align with your long-term goals for your product or users, you’ll end up making the wrong choices.
As we come to the end, I hope you found this article helpful. Your input is incredibly valuable, so please take a moment to vote and like it. Until next time, stay curious and keep exploring.