The benefits of structured A/B testing
I’ve seen firsthand how well-defined processes can transform teams into factories which efficiently deliver one effective experiment after another (I’ll define what I mean by “effective” later in the article).
The idea is to have documented processes for everything: from hypothesising new ideas, to prioritising experiments, to determining which metrics we need for specific experiments, and more.
Having these processes means everyone involved with experimentation knows what they’re responsible for. They also know how to undertake their respective tasks so that they are done in a consistent and predictable way.
Since many of my articles involve processes, I thought I’d give you a demonstration of their power with a simple and fun experiment. If you’re interested, or if you need to be persuaded about the usefulness of processes, read on.
The drawing experiment
During the lockdown, my 8-year-old son got into drawing. Not being super confident with his art skills, he started following YouTube video guides to help him create pictures of his favourite characters.
These guides present easy-to-follow processes that have viewers create great-looking drawings by following the step-by-step instructions. Take, for example, this drawing of Obi-Wan Kenobi:
The YouTuber in question talks through the creation of each line, communicating the purpose of each mark and describing how to draw the shapes necessary to recreate the character accurately. The entire process takes approximately 15 minutes.
I thought I’d use this video as an experiment for my son. He was thrilled to get involved!
The idea was to have him draw Obi-Wan twice: once by referring only to the prototype drawing above, then a second time by following the process presented in the video.
I’d then compare and contrast the two drawings to judge the effectiveness of the drawing process. Since it’s crucial to avoid subjectivity when reviewing the final works of art (because I’m going to love both drawings equally), I’m going to need some objective criteria to score each drawing.
Here is what I settled on:
- Number of details captured
- Number of lines captured
- Accuracy of shape and line
Note that the goal is not to create a great-looking Obi-Wan, but instead to draw one that matches the prototype as closely as possible. Freestyling was not allowed for this experiment.
Here’s how it all went.
Drawing 1: drawing WITHOUT a process
The following is what my son achieved after about 10 minutes or so of drawing. He had the prototype up on the screen and tried to recreate it as accurately as he could.
It’s a great drawing, but let’s evaluate it using those objective criteria we defined earlier — yeah, yeah, I know, I’m a fun dad:
When put side-by-side, we can tell straight away that details are missing (our first criterion). For example, among other things, Obi-Wan’s boots are missing, as are his eyebrows, and his all-important lightsaber!
Overall, there are fewer lines than in the prototype (our second criterion). Notice how there are fewer lines to define the hair, beard and folds in the clothing.
In terms of accuracy of line and shape (our third and final criterion), we can see one arm is bigger than the other, and the head shape is a little wonky.
So, can a process help achieve a closer match to the prototype?
Drawing 2: drawing WITH a process
Full disclosure: my son never made it to the end of the process. By minute 13, my son was keen to get back to watching Star Wars: The Clone Wars. This is why he missed out the lightsaber. You see, I was eating into his TV time with this experiment. But you’ll be relieved to know that I duly extended his TV time as a reward for entertaining my obsessive need to make a point.
Anyway, by following the video guide, he achieved this:
Already, you can tell that the two are supposed to look the same.
Reviewing it against the criteria, we can tell that the drawing has missed fewer details. For instance, Obi-Wan has eyebrows now. He also has boots. The lightsaber is still missing (as I already mentioned), but you can’t have everything.
In terms of the number of lines: there’s roughly the same number of lines as the prototype. There are lines in the beard and hair, as well as folds in the clothing.
In terms of accuracy of line and shape: the head shape looks closer, and the two arms are more comparable in size now. The legs are a little far apart for my liking, but overall the accuracy is pretty decent. Like I said before: you can tell that the two are meant to be the same drawing.
Overall, the win goes to the second drawing!
What all this means
Okay, so what if one drawing captures more detail than another? And so what if one drawing is closer to the prototype? Both drawings look perfectly fine, right? Furthermore, what does all this have to do with A/B testing?
Let’s tackle that last question first, and the rest will fall into place. You see, there are many nuances in running an effective test program.
First, let’s define what makes an individual experiment effective. We can say an experiment has been effective if:
- the experiment provides a read we can trust
- the experiment provides us with actionable insights/learnings
- the experiment either proves or disproves our hypothesis
This means if an experiment fails to meet the criteria above, then the experiment has failed to be effective.
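As a rough illustration, the three criteria above can be treated as a simple checklist where all three must hold. The class and field names below are hypothetical and exist only for this sketch; they are not part of any particular tool.

```python
from dataclasses import dataclass


@dataclass
class ExperimentReview:
    """Hypothetical checklist for one finished experiment."""
    trustworthy_read: bool      # can we trust the read?
    actionable_learnings: bool  # did we learn something we can act on?
    hypothesis_resolved: bool   # was the hypothesis proved or disproved?

    def is_effective(self) -> bool:
        # Effective only if every criterion is met.
        return all((self.trustworthy_read,
                    self.actionable_learnings,
                    self.hypothesis_resolved))


# Made-up review: the read is trusted and useful, but the hypothesis is unresolved.
review = ExperimentReview(trustworthy_read=True,
                          actionable_learnings=True,
                          hypothesis_resolved=False)
print(review.is_effective())  # False, because one criterion is unmet
```

Writing the criteria down this way makes the "all or nothing" nature of the definition explicit: missing any one of them means the experiment was not effective.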
Now, these individual experiments are cogs in a larger machine that we’ll refer to as the experiment program. In an experiment program, multiple experiments work together, each building on the others to create valuable learnings and improve our critical metrics (e.g. conversion rate).
Below is a list of things which make an experiment program effective. We can say a program is effective if:
- the program enables a high output of tests
- the program enables a steady flow of test results
- the program enables a test and learn culture
- the program focuses time and resources on experiments which have the highest return on investment
There’s more we can add, but these are a good starting point.
Failing to meet the criteria above means an experiment program will cost the business time and resources. It will also impact the volume of tests which are run. There may also be impacts on learning from the experiments in general — especially if you can’t rely on the test reads.
Overall, all of those criteria I’ve mentioned above constitute our Obi-Wan prototype. Not living up to this model means we have ineffectual experiments in a potentially failing program.
Details
Just like remembering to draw details such as Obi-Wan’s boots and eyebrows, a good experiment process means we avoid missing aspects of our build. These missing details can result in our test reads being invalid.
Imagine having skewed segments in our test groups. That would render the test read invalid. A process can help avoid that. Failing that, it could highlight the problem to us so we don’t make expensive decisions based on a potentially wrong outcome.
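One form of skew a process step could flag is a mismatch in group sizes: if we intended a 50/50 split but the groups differ by more than chance would explain, the read is suspect. The sketch below assumes that interpretation of "skewed" and uses a chi-square goodness-of-fit test with one degree of freedom. The function name, the counts, and the 0.01 threshold are all assumptions for illustration.

```python
import math


def sample_ratio_p_value(control_users: int, variant_users: int,
                         expected_control_share: float = 0.5) -> float:
    """Chi-square goodness-of-fit test (1 degree of freedom) on group sizes."""
    total = control_users + variant_users
    expected_control = total * expected_control_share
    expected_variant = total - expected_control
    chi_square = ((control_users - expected_control) ** 2 / expected_control
                  + (variant_users - expected_variant) ** 2 / expected_variant)
    # For 1 degree of freedom, the chi-square tail probability
    # equals erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi_square / 2))


# Hypothetical counts, for illustration only.
p_value = sample_ratio_p_value(control_users=5000, variant_users=5300)
if p_value < 0.01:  # threshold is an assumption; choose your own
    print("Possible skew between groups: investigate before trusting the read")
```

A check like this does not tell us why the groups are skewed, only that something in the setup deserves a closer look before we act on the result.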
The following are examples of some processes we could use:
- how to balance risk vs. testing big changes
- how to ensure the right metrics are added to an experiment
- how to prioritise our backlog of experiments
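To make the prioritisation example concrete, here is a minimal sketch of one possible scoring approach: rank ideas by estimated value divided by estimated effort. This is only one of many ways to prioritise and not necessarily the one I’d recommend for your team. The idea names and numbers are made up, and the `Idea` class is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Idea:
    name: str
    expected_value: float  # hypothetical estimate of upside, any consistent unit
    effort_days: float     # hypothetical estimate of build and run cost

    @property
    def score(self) -> float:
        # Simple return-on-effort ratio; higher is better.
        return self.expected_value / self.effort_days


# Made-up backlog, for illustration only.
backlog = [
    Idea("New checkout button copy", expected_value=3, effort_days=1),
    Idea("Redesigned search results", expected_value=8, effort_days=10),
    Idea("Simplified sign-up form", expected_value=6, effort_days=3),
]

ranked = sorted(backlog, key=lambda idea: idea.score, reverse=True)
for idea in ranked:
    print(f"{idea.name}: {idea.score:.2f}")
```

The value of writing the rule down is less about the formula and more about everyone scoring ideas the same way.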
Lines
Even if we’ve captured these details, missing intricacies is like missing the lines from the beard. For us, this can also result in failed experiments. Capturing these nuances helps us get closer to our idealised prototype model.
Examples: when deciding how to approach testing big changes or adding secondary metrics to your experiment, a process helps capture the nuances of specific scenarios, so you don’t miss those essential details. For prioritisation of experiments, a process ensures tightness and objectivity.
Overall accuracy
So then let’s look at accuracy. The drawing process ensured we drew a decent head shape and that the arms were the same size. Repetition and practice further improve skills in those areas. In the same way, having a process for experimentation means we enable greater accuracy by having everyone practise the same techniques. We also ensure consistency.
Examples: when designing big tests, setting up secondary metrics, and prioritising experiments, a process helps create guardrails that keep details consistent and recognisable. This in turn helps us debug and review experiments if issues arise. Processes also allow techniques and skills to be learned at a deeper level as they are practised.
Wrap up
When it comes to experiments, we either have generalists (people who are responsible for a broad selection of tasks) or specialists (people who are responsible for their own area of expertise). Processes help both.
If you’re a generalist, a process helps with quality and thoroughness. If you’re a specialist, you don’t necessarily have experience dealing with experiments; in that case, a process helps fill in those knowledge gaps and maintain consistency and (again) thoroughness.
Note: we must never underestimate the importance of consistency. It helps make reviewing and debugging easier, even allowing the creation of new processes to cover those aspects! All this is especially useful if dealing with multiple experiments from multiple teams.
You might have noticed that I’m a fan of processes. They’ve not only helped me roll out an experiment program across an organisation, but they’ve also enabled me to create multiple graphic novels, something I had long struggled with before I developed those processes.
No kids had their feelings hurt during the making of this article.
I’m Iqbal Ali. Former Head of Optimisation at Trainline. Now an Optimisation Consultant, helping companies achieve success with their experimentation programs. I’m also a graphic novelist in my spare time.