A/B TEST OPTIMISATION: EARLIER DECISIONS WITH NEW READOUT METHODS
Guillaume de Bénazé
Founder @ Datama - Your Data mate - Ex-Bain, Ex-pedia, Ex-cellent!
This article is also available on Medium, with more references, here
1. Introduction and context
Many companies that have something to sell do it through an online platform. It is thus essential to have an accessible, easy-to-use, and attractive website to maximise their revenues. One way to achieve that is to continuously improve the e-commerce website using the well-known A/B testing method: change one feature at a time (or a few in the case of multi-testing), ranging from the colour of a button to the whole payment procedure, allocate half the traffic to each version, and conclude which is the best after a certain number of customer sessions or amount of time. However, since the expected difference in user behaviour between the two versions is often very small (a click-through rate of 1.50% on page A vs an expected 1.53% on page B, for example), the sample size required to reach a conclusion with a 95% confidence level is often considerable. This can, in turn, lead analytics teams to wait many weeks before deciding whether to implement the new feature, and hence waste time and money.
With this in mind, a question naturally pops up: how can we optimise this A/B testing method to our advantage?
One answer would be to end tests earlier than the estimated required sample size. It would then be possible to set up more tests and potentially improve the website more quickly. But one has to keep in mind that this sample size is there for a reason: to ensure that the test has a certain statistical power given a chosen confidence level and an estimated minimal difference in user behaviour (for example, a test with a power of 80%, a confidence level of 95% and a detectable difference of 0.03 percentage points between a click-through rate of 1.5% on page A and 1.53% on page B requires 2,602,422 visits on each version of the website). Stopping a test earlier reduces its power and hence its ability to return the correct rollout. Simply put, a shortened test will be wrong more often. In the end, what we want to know is whether the trade-off is in our favour: over time, will the company's online revenues increase if it decides to set up more shortened tests with less accuracy?
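To give a sense of where that figure comes from, here is a minimal sketch of the classical two-proportion sample-size formula (the function name and parameters are ours, not part of the original study; the calculation is the standard normal-approximation one and roughly reproduces the number quoted above):

```python
from scipy.stats import norm

def required_sample_size_per_arm(p_a, p_b, alpha=0.05, power=0.80):
    """Per-arm sample size from the classical two-proportion normal approximation."""
    z_alpha = norm.ppf(1 - alpha / 2)        # two-sided test at confidence 1 - alpha
    z_beta = norm.ppf(power)                 # desired power
    variance = p_a * (1 - p_a) + p_b * (1 - p_b)
    return (z_alpha + z_beta) ** 2 * variance / (p_a - p_b) ** 2

# roughly 2.6 million sessions per version for the example above
print(round(required_sample_size_per_arm(0.015, 0.0153)))
```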
All the research necessary for this article was done with the support of Pierre et Vacances Center Parcs (PVCP), who accompanied us throughout our work and gave us access to some of their real-life A/B tests so that we could put our theoretical results to the test.
We gave a talk about this research with Fanjuan Shi from Pierre et Vacances Center Parcs, building on great work by Adrien Salem here at DataMa. You can find the video here.
2. Approach and methodology
The statistical methods currently used by analytics teams are either frequentist or Bayesian with a 95% confidence level (CL). The required sample size is calculated from an estimated uplift between the two versions of the website, and the decision is only made once it has been reached. This process is applied even if a significant result appears early on, in order to avoid early false positives. But doing so prevents us from making early decisions and moving on to other tests, which can be in our interest in some cases that we will have to define.
The goal of this article is to challenge these approaches (i.e. methods, confidence levels and stopping times). To do so, we have considered 6 methodologies (including the two above) used in the web and pharma industries to make statistically backed decisions, as well as different CLs (confidence levels) ranging from 95% down to 50%. We have applied them both to 5 sets of 30 simulated tests and to 10 real-life tests run on PVCP websites of varying traffic (from 5,000 to 500,000 sessions/week). We have kept the two current approaches in our study and take their efficiency as the baseline against which other methods/CLs are compared.
From now on, we will stop a test as soon as we reach significance, regardless of the required sample size. As you will no doubt have guessed, this significance (and the decision it induces) changes with the method and CL, and we will therefore try to come up with the most accurate combination of method + CL to replace the traditional approaches described above. However, to avoid stopping a test with insufficient data and making too many errors, we always wait for at least 100 conversions on each version and for a whole week of testing, in order to account for the varying real-life traffic throughout the week. Finally, the testing duration is capped at 12 weeks (i.e. 3 months), which is a standard duration in A/B testing. A minimal sketch of this stopping rule follows.
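As an illustration, here is a small sketch of that stopping rule (the thresholds are the ones stated above; the function and variable names are ours, and the significance flag stands in for whichever of the readout methods below is used):

```python
MIN_CONVERSIONS = 100   # per version, before any decision is allowed
MIN_WEEKS = 1           # always observe at least one full week of traffic
MAX_WEEKS = 12          # hard cap on test duration (3 months)

def decide(week, conversions_a, conversions_b, is_significant):
    """Return 'continue', 'stop_and_decide' or 'stop_no_decision'."""
    if week < MIN_WEEKS or min(conversions_a, conversions_b) < MIN_CONVERSIONS:
        return "continue"
    if is_significant:          # significance as defined by the chosen readout method
        return "stop_and_decide"
    if week >= MAX_WEEKS:       # give up after 12 weeks without a verdict
        return "stop_no_decision"
    return "continue"
```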
Before we go any further, let me quickly describe the different methods that we will be discussing (a small code sketch of a few of them follows the list):
1/ Frequentist: Welch t-test, used either once the required sample size is reached or as soon as a significant result appears.
2/ Bayesian: Same as frequentist but with a Bayesian test.
3/ Chi-squared: the chi-squared independence test checks whether two categorical variables are associated. In our case, the categories are converted vs non-converted sessions and the variables are the two versions of the website. This method is rarely used in A/B testing.
4/ LD bounds: we end the test as soon as the Z-score (equivalent to the frequentist t-score) becomes either drastically too small (below the futility boundary) or too large (above the efficacy boundary). The test eventually terminates since the two boundaries meet at the end. This method is widely used in medical trials, where making a decision as early as possible is crucial.
5/ Permutation: we pool the two samples (a bunch of 0s and 1s, where a 0 corresponds to a session without conversion and a 1 to a conversion) and randomly reassign observations without replacement to form new samples. We compute the statistic (i.e. the difference of means, which is the difference in conversion rates in our case) on each permutation to build a distribution, and compare the original statistic to it. We then use the resulting p-value to conclude (as in the Welch t-test). This method is especially convenient for multi-testing (more than 2 versions of a website).
6/ Bootstrap: same as the permutation method but with replacement when forming new samples. This is another resampling procedure, mainly used to assess the reliability of an estimate.
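As promised, here is a minimal sketch of a few of these readouts on raw 0/1 conversion data (Welch t-test, chi-squared and permutation, using scipy; the variable names, sample sizes and number of permutations are our own choices, and the Bayesian, LD-bounds and bootstrap readouts are omitted for brevity):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated sessions: 1 = conversion, 0 = no conversion
a = rng.binomial(1, 0.0150, size=200_000)   # version A
b = rng.binomial(1, 0.0153, size=200_000)   # version B

# 1/ Frequentist: Welch t-test on the 0/1 outcomes
t_stat, p_welch = stats.ttest_ind(a, b, equal_var=False)

# 3/ Chi-squared independence test on the conversions / non-conversions table
table = [[a.sum(), len(a) - a.sum()],
         [b.sum(), len(b) - b.sum()]]
chi2, p_chi2, _, _ = stats.chi2_contingency(table)

# 5/ Permutation test on the difference in conversion rates
observed = b.mean() - a.mean()
pooled = np.concatenate([a, b])
diffs = []
for _ in range(2_000):                       # number of permutations (arbitrary)
    rng.shuffle(pooled)
    diffs.append(pooled[len(a):].mean() - pooled[:len(a)].mean())
p_perm = np.mean(np.abs(diffs) >= abs(observed))

print(f"Welch p={p_welch:.3f}  chi2 p={p_chi2:.3f}  permutation p={p_perm:.3f}")
```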
It is now time to put all these methods and CLs to the test! To do so, we first compared the different roll-outs on isolated tests. To be more precise, we then considered the ideal, assessed and real accumulated uplifts (IAC, AAC and RAC respectively), first in conversion rate and then in actual revenues, over a whole set of tests.
For each methodology and confidence level, we are able to compare the assessed, ideal and real uplifts resulting from our A/B tests.
Let me define those terms: for a given set of tests, the ideal accumulated uplift is the maximum uplift a company could get if its teams made all the right decisions; the assessed accumulated uplift is the uplift a given method thinks we end up with; and the real accumulated uplift is the one we actually end up with, based on the decisions made by that method.
In addition, we can add a duration factor by considering the time each method took before reaching the end of the set of tests. One way to do that is to calculate yearly ideal, assessed and real accumulated uplifts. These indicators capture both the accuracy and the efficiency of the different methods at varying CLs. Keep in mind that all of this is only possible if many A/B tests of good quality are ready to be put into production!
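To make these three quantities concrete, here is a minimal sketch of how they could be accumulated over a set of tests (the field names and the simple additive accumulation are our own simplifying assumptions, not the article's exact computation):

```python
from dataclasses import dataclass

@dataclass
class TestResult:
    real_uplift: float       # true uplift of variant B vs A
    assessed_uplift: float   # uplift estimated by the readout method
    rolled_out: bool         # did the method decide to ship B?

def accumulated_uplifts(results):
    ideal = sum(max(r.real_uplift, 0.0) for r in results)               # IAC: ship every true winner
    assessed = sum(r.assessed_uplift for r in results if r.rolled_out)  # AAC: what the method claims
    real = sum(r.real_uplift for r in results if r.rolled_out)          # RAC: what we actually get
    return ideal, assessed, real

tests = [TestResult(0.012, 0.020, True),
         TestResult(-0.006, 0.004, True),    # a wrong roll-out hurts the RAC
         TestResult(0.008, 0.000, False)]    # a missed winner keeps us below the IAC
print(accumulated_uplifts(tests))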
3. Analysis of the simulations
3.1 Simulated test example
Before diving more deeply into the analysis of the simulations, let us first look at one isolated test. This is test number 16 of the 5th set of 30 tests; it has a real uplift of -0.6% and we have chosen a confidence level of 90% (i.e. an alpha, the probability of a type-I error, of 0.1). Each arrow corresponds to the decision made after an additional week of data gathering (approximately 260,000 sessions per week in this example): grey and small means no decision, whereas coloured and large means that a decision has been made by a given method. Finally, the x-axis corresponds to the cumulative number of sessions as weeks go by and the y-axis to the assessed uplift.
We can clearly see that the roll-outs differ by method: after only 3 weeks, the permutation method wrongly concludes a positive roll-out; chi-squared is still indecisive after 12 weeks; the other methods are all correct but decide at different times. Obviously, since the real uplift is close to zero, there is a high chance of ending up with the wrong outcome, all the more so if the decision is made early on.
3.2. Accumulated uplifts for a set of 30 simulated tests
Let us now focus on how each method fares on a whole set of 30 simulated tests given a 90% confidence level. The following graph shows the ideal accumulated uplift, which is the same regardless of the method, as well as the real accumulated uplifts as we go through the 30 tests and the assessed accumulated uplifts for each method.
While the real accumulated uplift is approximately the same across methods, the time (i.e. number of sessions) taken by each method varies a lot. Furthermore, the assessed accumulated uplift is always higher than the real and ideal ones, especially for the bootstrap and Bayesian methods. This is something to remember: when an analytics team announces an estimated uplift, it will most probably be overestimated!
3.3. Real uplift vs Ideal uplift
Before moving on to the final and most conclusive graph, we will look at the impact of the confidence level on accuracy. Indeed, so far we have only considered a confidence level of 90%, which seems arbitrary. It seems natural that the accuracy, which we measure as the average of the real over the ideal accumulated uplift, tends to decrease as the confidence level decreases, since decisions are made with less data and hence less precision.
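Written out (our notation, one reading of the definition above, averaged over the N sets of tests):

```latex
\text{accuracy} \;=\; \frac{1}{N}\sum_{i=1}^{N}\frac{\mathrm{RAC}_i}{\mathrm{IAC}_i}
```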
However, for very high confidence levels, the accuracy can actually be lower (chi-squared, LD bounds). In these cases, the required sample size is often higher than the number of sessions reached within 12 weeks (the maximum duration of a test): we thus end many tests without any decision. Furthermore, the "full_freq" and "full_bayes" methods, which correspond to the methods currently used (i.e. a frequentist or Bayesian test run once the required sample size is reached, and not before), show similar behaviour and do not seem to be affected much by the choice of confidence level. The permutation and bootstrap methods, on the other hand, have better accuracy than the two former ones at high confidence levels.
3.4. Total gain in conversion relative to current methodology
Finally, it is time to look at a graph which takes all of this into account: both the accuracy and the time taken to make the decisions.
The y-axis represents the average gain in conversion compared to the current situation (i.e. the "full_bayes" method with a 95% confidence level). The figures on the curves show how many times more tests we would need to run if we chose a given method and confidence level. As expected, decreasing the confidence level radically decreases the duration of a test. Hence, we can simulate the additional uplift that could be captured if we ran more tests during the saved time. Overall, this largely compensates for the loss in accuracy: decreasing the confidence level by 5% allows us to run almost twice as many tests, and therefore, theoretically, to double the final uplift, as long as these new tests are just as good. However, in real life, implementing twice as many tests of the same high quality is not that easy. All we are saying here is that, from a statistical point of view, the more tests we have, the lower the confidence level can be and the higher the accumulated uplift becomes.
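As a back-of-the-envelope illustration of that trade-off (the numbers below are hypothetical, not the article's measurements): if lowering the CL lets us run k times more tests while the per-decision accuracy drops from a1 to a2, the yearly accumulated uplift scales roughly by k × a2/a1.

```python
def yearly_uplift_factor(k_more_tests, accuracy_before, accuracy_after):
    """Rough multiplier on yearly accumulated uplift under a simple proportional model."""
    return k_more_tests * accuracy_after / accuracy_before

# e.g. 1.9x more tests at the cost of accuracy dropping from 0.90 to 0.80 (hypothetical figures)
print(yearly_uplift_factor(1.9, 0.90, 0.80))   # ~1.69x
```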
3.5. Estimation is not reality
One last side note before the conclusion of this article: the assessed uplifts are not reality. Indeed, as mentioned earlier, the assessed uplifts are always higher than the real uplifts, and this gap grows as the confidence level decreases. Furthermore, while being much longer to run, the "full_bayes" methodology is also much more accurate in its assessment of the implemented uplift (the ratio is very close to 1). All the other methods, which stop tests earlier, produce less accurate assessed uplifts. However, what really matters is the real uplift and not its estimation, so even if a method is not very accurate, it is worth trying out as long as it allows us to run more tests with a satisfactory real uplift. In a nutshell, keep in mind that while assessed uplifts are always higher than real ones, this is not necessarily a drawback.
4. Recommendations and final conclusion
The first outcome of our simulation is that the bootstrap method seems to be the winner at high confidence levels (>85%). Switching from full Bayesian to bootstrap could give PVCP an uplift of ~2% just by making the right decisions more often. While this method is hard to compute in a simple Excel spreadsheet, we have built a tool (DataMa Impact) that makes it straightforward.
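For reference, here is a minimal sketch of a bootstrap readout on 0/1 conversion data (our own variable names, resample count and decision rule; this is not the DataMa Impact implementation, just the general idea): resample each version with replacement, rebuild the uplift distribution, and call the test significant when the confidence interval excludes zero.

```python
import numpy as np

def bootstrap_uplift_readout(a, b, confidence=0.90, n_boot=5_000, seed=0):
    """Bootstrap CI on the uplift; 'significant' if 0 lies outside the interval."""
    rng = np.random.default_rng(seed)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        diffs[i] = (rng.choice(b, size=len(b), replace=True).mean()
                    - rng.choice(a, size=len(a), replace=True).mean())
    lo, hi = np.quantile(diffs, [(1 - confidence) / 2, 1 - (1 - confidence) / 2])
    return diffs.mean(), (lo, hi), not (lo <= 0.0 <= hi)
```

In practice, resampling millions of individual sessions thousands of times is heavy; the same readout can be run on aggregated weekly conversion counts, which is equivalent for binary outcomes.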
The second outcome is that, surprisingly, based on simulated tests (150 tests), even at the current confidence level (95%), waiting for the full estimated duration could be avoided without much cost. If we respect the rule of having more than 100 conversions on each version of the website, we could already divide the duration of tests by ~2 and more than double our real uplift, while keeping the same accuracy in roll-out decisions, provided we can implement twice as many tests of the same quality.
Furthermore, decreasing the confidence level to 85% could allow us to divide the running time of A/B tests by another ~1.9 and, assuming that we use this saved time to run more A/B tests of the same quality, this could yield another x1.6 increase in yearly conversion improvement compared to what we have today.
Back-testing this recommended approach on real-life tests shows that we would have taken the same decision in 10 out of 10 of the considered tests, while decreasing their cumulative duration by ~50%, which is reassuring.
One important note: changing methodology comes with a cost in measurement accuracy. The estimated uplift that we report on a yearly basis with the current methodology is about +20% higher than the real uplift that we actually get on our site; this could go as high as +85% if we switch the methodology to bootstrap and decrease the confidence level to 90%. It is an important data point to keep in mind when measuring website progress and the financial benefits/ROI of the experimentation team.
One last key learning is that the actual uplift that we implement on our websites is about 85% of the one in an ‘ideal world’ where we would always make the right decision.
To repeat it one last time, all this increase in uplift can only happen if many A/B tests of good quality are ready to go: the limit is not set by statistical theory, but by the efficiency of a company's A/B testing teams!