How to: Best practices of applying A/B testing in businesses.

A/B testing shifts the culture in businesses from decisions by persuasion to decisions by experiment. You enable your most junior people to test their best ideas. Good ideas and a fresh perspective can come from everywhere.

Inspiration:

  • Carl Anderson: Creating a Data-Driven Organization (Book).
  • Foster Provost: Data Science for Business (Book).
  • Peter Gedeck: Practical Statistics for Data Scientists (Book).

Brief explanation

A/B testing is an experiment with two groups of subjects, run to establish which of two treatments, products, or procedures is superior. If one of the treatments is the existing standard (or no treatment at all), it is called the control.

A proper A/B test has subjects who can be assigned to one treatment or the other. The key is that each subject is randomly assigned to a treatment. This way we know that any difference between the treatment groups must be due to one of two things:

1. The effect of the different treatments.

2. Luck of the draw.

Why have a control group?

Why not omit the control group and simply apply the treatment we are interested in to a single group, then compare the result with past experience? Without a control group, there is no guarantee that “all other things are equal” and that any difference is due to the treatment (or to chance).

6 steps when running an A/B test in a real business

1. Goals & Metrics

In a standard A/B experiment, we must decide on a metric in advance. Several behavioral metrics can be collected and may be of interest, but if the experiment is expected to lead to a decision between treatment A and treatment B, a single metric or test statistic needs to be established beforehand. Selecting a test statistic after the experiment has run opens the door to bias on the part of the researcher.

Questions to answer at this stage:

  • What is the goal of this test? Figure out which goals and metrics you are trying to optimize, and clearly define the success metrics before the test starts.
  • Look closely at counter-metrics. They help ensure that whatever you are optimizing for does not produce unintended behavior that goes against your goals.
  • Priority and impact. What strategic or business priority does this experiment serve?

Desirable:

Run A/A tests. You should be running lots of A/B tests, constantly innovating. However, if you don’t have a constant stream of ideas and there is testing downtime, you may as well run A/A tests. As you might imagine, an A/A test pits a control group against another control group. What is the value of that?

  • You can use it to keep tabs on your testing infrastructure and assignment processes.
  • If you see comparable sample sizes but very different performance metrics, that can indicate a problem in event tracking, analytics, or reporting. However, you should expect to see significant differences in A/A tests about 5% of the time, assuming you use the standard 5% significance level. What you need to track, across many A/A tests, is whether significant differences appear at a rate greater than your significance level (the simulation sketch after this list illustrates the expected rate).
  • Use the results of the test to estimate the variability of your metric in your control.
  • For those organizations, end users, and decision-makers new to A/B testing, it serves as a useful lesson.
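
To make the second point concrete, here is a minimal sketch (with hypothetical traffic numbers) that simulates many A/A tests and checks that “significant” differences appear at roughly the chosen significance level:

```python
# Minimal sketch: simulate many A/A tests on a single (hypothetical) conversion
# rate and count how often the difference comes out "significant".
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(42)
n_tests = 1_000        # number of simulated A/A tests
n_per_group = 5_000    # visitors per group (hypothetical)
true_rate = 0.10       # both groups share this conversion rate
alpha = 0.05           # standard 5% significance level

false_positives = 0
for _ in range(n_tests):
    a = rng.binomial(n_per_group, true_rate)   # conversions in "control"
    b = rng.binomial(n_per_group, true_rate)   # conversions in the other "control"
    _, p_value = proportions_ztest([a, b], [n_per_group, n_per_group])
    if p_value < alpha:
        false_positives += 1

print(f"Significant A/A results: {false_positives / n_tests:.1%} (expected ~{alpha:.0%})")
```

If, across many real A/A runs, significant differences appear far more often than your significance level, suspect the assignment, tracking, or reporting pipeline rather than bad luck.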

2. Hypothesis

A hypothesis is a statement that gives us a starting point and something concrete to test, so we can judge whether our idea holds up. Without a hypothesis, our experiment would lack structure and it would be difficult to make sense of the results.

To ensure clarity and conciseness, it is essential to create a hypothesis that is straightforward and explicitly states what kind of metrics we expect to change. In the case of having multiple hypotheses, it is also advisable to list them all in order to preempt any follow-up questions that may arise once the experiment concludes.

Why not simply observe the outcome of the experiment and go with the treatment that works best?

The answer lies in the mind’s tendency to underestimate the extent of natural random behavior. One manifestation of this circumstance is the inability to anticipate extreme events. Another manifestation is the tendency to misinterpret random events as having patterns with some meaning. Hypothesis testing was devised as a means of protecting researchers against random deception.

Questions to answer at this stage:

  • What is the hypothesis that you’re trying to test to optimize the goal you previously defined?
  • What makes sense?
  • What is the reason we think the experimental version will win?
  • Do we even know that any of the options we’re planning to A/B test is a good idea to begin with?

You want your hypothesis to be logical. A/B testing is most valuable for optimizing known-good workflows and designs. If we’re just throwing a bunch of bad ideas into an A/B test, it’s a waste of everybody’s time.

3. Scenario Planning

  • Draw a decision tree based on the scenarios that could happen after you run your experiments. There will be some different combinations of successes and failures along the way, and each combination will result in a different outcome. Usually, there are three outcomes: Ship, No ship, and Retest.

  • Make sure SUTVA holds. A full explanation of SUTVA is out of the scope of this article. In brief, the Stable Unit Treatment Value Assumption (SUTVA) means that the treatment assigned to one unit (such as a person or group) doesn’t affect the outcome of another unit. Imagine you’re testing two versions of a website, A and B. SUTVA ensures that if a person sees version A, it doesn’t impact the experience of someone else who sees version B. This is important because if SUTVA doesn’t hold, the results of your test could be skewed. For instance, if someone sees version A but somehow learns about version B, they might behave differently even though they didn’t see version B directly. That could distort your results and make it hard to trust whether version A or B is truly better.
  • Ease of implementation. How easy are the options to implement?
  • Isolation. What other experiments are we running at the same time? We want to limit any confounding variables that might inflate or artificially limit the test’s result.

4. Experiment Design

Think through the whole test before you run it.

In this stage, we’ll ensure that everything is okay to start the experiment. We’ll review some of the questions we have already answered. Although it’s a long list, as we run more and more tests, some of these answers will become standardized.

Make sure that you define before the experiment:

  • Assignment.

The first question to address is which visitors are eligible to be in the experiment at all. Some visitors may be excluded from the experiment completely.

The next question is how many of those visitors should be sent to the treatment. Ideally, one splits the traffic 50/50, but that is not always the case. One common practice among novice experimenters is to run new variants for only a small percentage of users. This is bad practice because experiments will have to run longer. Instead, one should “ramp up” the experiment, increasing the proportion of traffic sent to the treatment over time to limit risk but ultimately reach 50% of traffic being diverted to the treatment.

  • Randomized.

There must be a reliable mechanism to assign visitors to the control or the treatment. That is, users must be assigned to the control (or treatment) both randomly and consistently. Random means there is no bias: assuming a desired 50/50 split, each user should be equally likely to end up in either variant.

One approach is to use a good random-number generator, pre-assign users to a variant, and store which variant they belong to in some database table or perhaps a cookie. We require that the user is consistently assigned to the same variant on multiple visits to the site. For instance, one could apply a mod or suitable hash function to each customer’s ID. Having a user switch variants will be confusing for them and muddy the data and its analysis.
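
A minimal sketch of the hashing idea, assuming a stable customer ID; the experiment name and the ID below are hypothetical. Hashing the two together gives a deterministic, roughly uniform assignment that stays the same across visits:

```python
# Minimal sketch: deterministic assignment by hashing a stable user ID together
# with an experiment name, so the same user always sees the same variant.
# "checkout_button_v2" and the customer ID below are hypothetical.
import hashlib

def assign_variant(user_id: str, experiment: str = "checkout_button_v2",
                   treatment_share: float = 0.5) -> str:
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000          # 0..9999, effectively uniform
    return "treatment" if bucket < treatment_share * 10_000 else "control"

print(assign_variant("customer-12345"))        # same input -> same variant on every visit
```

Salting the hash with the experiment name keeps different experiments’ splits independent of one another.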

  • Sample size.

What it is and how to calculate it are out of the scope of this article. However, keep in mind that if you try to get away with a smaller-than-necessary sample size, you are likely to get false results or fail to identify real treatment effects. There are different sample-size calculators for different situations; use the tools available.
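
As an illustration only, here is one way to estimate the per-group sample size for a conversion-rate test using statsmodels’ power analysis; the baseline rate and minimum detectable effect below are assumptions, not recommendations:

```python
# Minimal sketch: per-group sample size for a two-proportion test using
# statsmodels' power analysis. Baseline rate and minimum detectable effect
# are assumptions for illustration only.
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

baseline = 0.10    # current conversion rate (assumed)
mde = 0.02         # smallest absolute lift worth detecting (assumed)

effect_size = proportion_effectsize(baseline + mde, baseline)
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.80, alternative="two-sided"
)
print(f"Roughly {int(round(n_per_group)):,} visitors per group")
```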

  • Time.

How long will the test run? If, say, it’s an A/B test for a website, you can divide the total sample size by the average daily traffic to get an estimate of the number of days to run the experiment. If you happen to have lower-than-average traffic over those days, you must keep the experiment running longer. We now have our sample sizes and time.

Or do we? If you ran the experiment for four days, from Monday to Thursday, would you expect the same effect, the same demographics of visitors and online behavior as if you ran it from Friday to Monday? In many cases, no, they differ. There is a day-of-the-week effect. Thus, if the sample size calculator says four days, it is often advised to run it for seven days to capture a complete week. If the sample size calculator says 25 days, run it for four weeks, and so on.
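
A back-of-the-envelope sketch of that calculation, using hypothetical traffic and sample-size numbers and rounding up to whole weeks:

```python
# Minimal sketch of the duration estimate described above, rounded up to whole
# weeks to cover day-of-the-week effects. Traffic and sample size are hypothetical.
import math

total_sample_size = 36_000   # both groups combined (assumed)
avg_daily_traffic = 9_000    # eligible visitors per day (assumed)

days_needed = math.ceil(total_sample_size / avg_daily_traffic)  # -> 4 days
full_weeks = max(1, math.ceil(days_needed / 7))                 # -> 1 full week
print(f"Calculator says {days_needed} days; plan to run for {full_weeks * 7} days.")
```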

  • Isolate variable.

Don’t test multiple hypotheses on a single variable. Change one thing at a time, so you know which change caused any difference you observe.

Questions to answer (or already answered) at this stage:

  • What is the goal of this test?
  • What metrics will be tracked?
  • Hypothesis testing or Bayesian testing? In the case of hypothesis testing: what are your null and alternative hypotheses? (A minimal frequentist sketch follows this list.)
  • Is more than one hypothesis being tested, or are there possible confounding variables?
  • What are the treatments and control?
  • What are your control and treatment groups? How will they be assigned?
  • When will the test start?
  • How long will the test run?
  • How was the sample size determined?
  • When will the analysis start and be completed?
  • What software will be used to complete the analysis?
  • How will the results be communicated?
  • How will the final decision be made?
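
For the hypothesis-testing route, the analysis itself can be as small as a two-proportion z-test. Here is a minimal sketch with hypothetical conversion counts, not a prescription for your stack:

```python
# Minimal sketch: two-proportion z-test on hypothetical conversion counts.
# H0: control and treatment share one conversion rate; H1: the rates differ.
from statsmodels.stats.proportion import proportions_ztest

conversions = [530, 584]        # control, treatment (hypothetical)
visitors = [10_000, 10_000]     # users exposed to each variant (hypothetical)

z_stat, p_value = proportions_ztest(count=conversions, nobs=visitors,
                                    alternative="two-sided")
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
print("Reject H0 at the 5% level" if p_value < 0.05 else "No significant difference detected")
```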

5. Running the experiment

Assuming that you have the treatments implemented and the site instrumented to collect the data that you need, the issues of assigning individuals and starting and stopping the test still remain.

  • Do a slower ramp-up.

When you start a test, you can flip the switch and divert 50% of your traffic to the treatment. The problem is that if there are any major software bugs, and you present customers with a mangled, broken experience, you are likely to drive those customers away, and you’ve exposed 50% of your site traffic to that experience. Instead, you can take a more risk-averse approach and do a slower ramp-up, monitoring the metrics carefully.

  • 1% in treatment for 4 hours.
  • 5% in treatment for 4 hours (i.e., switch an additional 4% from control to treatment).
  • 20% in treatment for 4 hours.
  • 50% in treatment for the remainder of the experiment.

Of course, if you do see an issue, there must be an abort button that can be hit.
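
One way to express such a ramp-up is as a small, config-driven schedule with a guardrail check standing in for the abort button. This is only a sketch: set_treatment_share and check_guardrails are placeholders for whatever traffic splitter and monitoring you actually use.

```python
# Minimal sketch: a config-driven ramp-up mirroring the schedule above, with a
# guardrail check standing in for the abort button. set_treatment_share() and
# check_guardrails() are placeholders, not a real traffic-splitting API.
import time

RAMP_SCHEDULE = [   # (treatment share, hold time in hours; None = rest of experiment)
    (0.01, 4),
    (0.05, 4),
    (0.20, 4),
    (0.50, None),
]

def set_treatment_share(share: float) -> None:
    print(f"Diverting {share:.0%} of traffic to the treatment")  # hook up your splitter here

def check_guardrails() -> bool:
    return True  # placeholder: query error rates and core metrics here

for share, hold_hours in RAMP_SCHEDULE:
    set_treatment_share(share)
    time.sleep((hold_hours or 0) * 3600)   # hold, then review before ramping further
    if not check_guardrails():
        set_treatment_share(0.0)           # abort: send all traffic back to control
        raise RuntimeError("Guardrail breached; experiment aborted")
```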

  • Run the experiment until the minimal sample size has been achieved, or longer.

Once you have ramped up the experiment and you are confident that there are no egregious issues, the best advice is to set it and forget it.

6. Experiment review

The teams that struggle most with making decisions are the ones that are least clear about what they care about most.

  • Review your scenario plan. The experimental procedure and analysis sound very clean, almost clinical and robotic: test A versus B; whichever wins, roll it out. If it were like this, it would be completely data-driven. However, the world is more complex than that. There are other factors at play. Results are not always clear-cut. There can be ambiguity. Maybe the treatment’s metric was consistently higher throughout the test but not significantly so. Maybe there was a trade-off between factors. Maybe during analysis you discovered a possible element of bias.
  • Balance the results with the long-term vision that you have for your product or your users.

Data can’t think long-term for you. It doesn’t make decisions. It is information that you need to inform your thinking. But if you react quickly without understanding what these numbers truly signify and how they align with your long-term goals for your product or users, you’ll end up making the wrong choices.

As we come to the end, I hope you found this article helpful. Your input is incredibly valuable, so please take a moment to vote and like it. Until next time, stay curious and keep exploring.

