The madness of CRO business case calculations

"We have had 20 experiments last year where our challenger significantly outperformed the control. Average uplift was 5% on transactions. We've implemented all these winners during the year, but our transactions per week only slightly went up."

This is the nightmare of a CRO team being reviewed by management: the effect after implementation does not seem as big as expected. My question in this case is always: "Do you know your false discovery rate?" In non-statistical words: do you know how many of your measured winners are real winners? In statistical words: what percentage of your measured positives are false positives?

Let's take the following scenario:

  • A CRO team decides that a P value of 0.1 or lower means a win (in other words: if the confidence level is 90% or higher, they declare the challenger a winner).

I don't want to dive into the discussion of whether this is wise or what the confidence level should be. P value calculations can be done here: https://abtestguide.com/calc/

  • This CRO team has completed 128 A/B experiments in a full year. If in reality none of these challengers had any impact, the CRO team would still have measured about 12.8 (rounded to 13) wins that year (in theory, because of the 90% confidence level). These are what we call false positives - they don't add anything to the bottom line.
  • This also means that if they measured 20 wins in total that year, they must have found some real winners, which we call true positives (and which do add to the bottom line). At least 7 (20 minus 13), but these 7 real positives also lower the number of false positives, because that leaves only 128 - 7 = 121 experiments without any effect (leading to about 12 positive outcomes that are false positives).
  • In this example we calculate with a Power level of 100% (meaning every real effect will be recognized), to keep the calculations simple. The number of true positives can then be estimated as: (Measured Wins - (100% - Confidence Level) * Experiments) / Confidence Level. This leads to 8 true positives, and with 20 measured wins it means we have found 12 false positives (128 experiments minus 8 true positives = 120 experiments with no effect, of which 10% are measured as wins = 12 false positives). A small sketch of this calculation follows below.
So in the above case the false discovery rate is 60%*

*(12 divided by 20).
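
For anyone who wants to check these numbers: below is a minimal Python sketch of the calculation above, assuming (like the example) 100% power, 128 experiments, a 90% confidence level and 20 measured wins. The function name is just illustrative, not part of any library.

```python
def estimate_true_and_false_positives(experiments, measured_wins, confidence_level):
    """Estimate true positives, false positives and the false discovery rate,
    assuming 100% power (every real effect is detected), as in the example."""
    alpha = 1 - confidence_level  # significance threshold, here 0.1

    # True positives = (measured wins - alpha * experiments) / confidence level
    true_positives = (measured_wins - alpha * experiments) / confidence_level
    false_positives = measured_wins - true_positives
    fdr = false_positives / measured_wins
    return true_positives, false_positives, fdr


# The article's numbers: 128 experiments, 20 measured wins, 90% confidence level
tp, fp, fdr = estimate_true_and_false_positives(128, 20, 0.9)
print(f"true positives:       {tp:.0f}")    # ~8
print(f"false positives:      {fp:.0f}")    # ~12
print(f"false discovery rate: {fdr:.0%}")   # ~60%
```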

Sidenote: Sadly, you don't know which of these 20 measured winners are the false positives and which are the true positives. You can apply statistical techniques to better estimate the share of false positives among your measured positives, and you could of course retest your winners to learn whether a measured positive has a higher chance of being a true positive (provided you have enough Power to detect the expected uplift, otherwise you will end up with false negatives in your retest). But that is a whole different story - this article is just meant to make you rethink business case calculations based on A/B-test results.

The main point of this example is that if you implement all measured winners, only 8 of them will add the mentioned 5% each, so the business case would be a 40% uplift (8 times 5%) and not 100% (20 times 5%), because of your false discovery rate of 60%! And of course 5% is the uplift measured at that moment, which may not hold after implementation, and the duration of the uplift is also unknown (and hopefully your new wins were tested against old wins and not against your original control).
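
Continuing the sketch above with the same hypothetical numbers, the FDR-corrected business case from this paragraph looks like this (using the article's simplification of summing the 5% uplifts):

```python
avg_measured_uplift = 0.05  # the 5% average uplift per measured winner
measured_wins = 20
true_positives = 8          # from the FDR calculation above

naive_uplift = measured_wins * avg_measured_uplift       # 20 x 5% = 100%
corrected_uplift = true_positives * avg_measured_uplift  # 8 x 5% = 40%

print(f"naive business case:         {naive_uplift:.0%}")
print(f"FDR-corrected business case: {corrected_uplift:.0%}")
```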

Also, the chance of finding a true positive with a measured effect that is higher than the real effect is bigger than the chance of finding a true positive with a measured effect that is lower than the real effect (Type-M error, as mentioned in the comments on this article). That is a whole topic of its own - this article focuses on the False Discovery Rate.

To sum up:

If you do a CRO business case calculation, either up front (considering the minimum detectable effect (Power calculations!) and the current state of the digital channel you are optimizing) or after a certain number of experiments, most of us already:

  • Assume that the implemented winner will retain only part of the measured effect, and only for a certain length of time (which also partly covers that Type-M error).

But you should also add:

  • Your false discovery rate - not every measured winner is a real winner, which makes the average impact of your winning treatments far smaller than expected (this correction can easily be more than 50%).

So if you are looking at a CRO business case built on experimentation that looks bright and shiny: ask whether the false discovery rate is part of the calculation.

Happy experimenting,

Ton

PS: Yes, we include the FDR and other error corrections in our business case calculations, and yes, we have learned that optimization and experimentation can still bring you lots of money and growth in both the short and long term. Make sure, though, that the business case calculation is rock solid!

Guus Vullings

Freelancer (available) & Co-owner Glas123.nl

6 years ago

In the light of this article: what would be the best CRO strategy then? Run many more A/B tests per year and focus on quantity, or run fewer tests per year but make sure each test is of higher quality / makes more sense? Next to that: could you eliminate or reduce the chance of finding a false positive by letting a copy of the control run as a variant in the A/B test?

Bart Schutz

Tourism Manager (for nature & rewilding positive tourism)

6 years ago

Assuming a decent power of 80% you'd expect ~98 true negatives, ~11 false positives, ~2 false negatives and ~9 true positives, and an FDR of ~54% (slightly less bad than the mentioned 60%). But most important: only 9.5% of your ideas were actually good ideas. Imagine you had implemented all your ideas, including those 90.5% useless and hurting ones, without validating the effects... #nowthatsabusinesscase

Lukas Vermeer

Product at Vista, Advisor at A/B Smartly, Keynote Speaker, Dad

6 years ago

I would also consider type-m error. Even if all results are real and tests are properly powered, we will on average still overestimate the size of the effect: https://lukasvermeer.nl/type-m

Lukas Vermeer

Product at Vista, Advisor at A/B Smartly, Keynote Speaker, Dad

6 years ago

The math doesn’t add up. > If they measured 20 wins in total in that year, it means that they found 8 real winners, which we call true positives (which do add to the bottom line), next to the already mentioned 12 false positives. The 12 is based on the assumption that the team ran 120 true null experiments, but by adding the 8 true wins we come to a total of 128. The logic also doesn’t add up, because it ignores the (very likely) possibility of real negative effects.
