Yes. We need to attribute revenue to experimentation. Here’s how: Part 2
David Mannheim
Made With Intent | 2x Founder | Author | Keynote speaker about "Personalisation"
TL;DR: Start to re-frame the conversation about experimentation. Talk about expected loss as much as expected gain, use OKRs to focus and align, talk in ranges, use degradation, and highlight your program quality metrics.
In the first part, we spoke mostly about communication (Part 1 is here, read that first). Being honest, I split this article up because it was just too damn long. This is also, really, about communication. It's all about communication. But, more specifically, it's about re-framing the conversation.
Whether you're talking about an individual experiment or a series of experiments (a program, if you will), how that is communicated will impact how it is received. In my experience, as someone who has dealt with a breadth of stakeholders - junior to senior, practitioner to C-suite, mature to immature - I feel there are certain strategies you can use to re-frame the discussion of experimentation away from revenue.
Inherently, that educates the client by prioritising metrics and attributes that aren't directly (perhaps indirectly) associated with the program itself. A cause and effect, if you will. Previously I spoke about defining your purpose of experimentation, setting standards, agreeing accuracy thresholds and telling stories. Today, I'm going to talk about:
- Expected loss
- OKRs for focus and alignment
- Talk in ranges
- Using Degradation
- Program quality metrics
Expected loss
What about open discussions on the alternatives to experimentation? What is the cost of not experimenting?
I admit, I think this is a hard sell. In my experience, to companies it feels like a hard sales pitch: associating negative connotations with not doing a very valuable practice, instead of displaying the positives of doing experimentation. And I know, there are the concepts of loss aversion and that golden ratio of 1.5 to 2.5, but from my experience, I’ve never seen it work very well.
But we can explore these alternatives. What about a website redesign? What about placing budget into acquisition channels like PPC? What about dumping loads of money into organic? Interestingly, what about their associated levels of revenue attribution?
I sometimes like to talk in measurement terms and gear the conversation towards efficiency, rather than 'loss'. An experiment validates a variable through measurement; when you don't experiment, you clearly don't measure. Asking teams how well that feature performed, how that deployment fared, or how well utilised that feature is sparks the conversation. Or compare against how measurement is discussed elsewhere in the business: if we measure and guide our acquisition decisions based on data, why don't we do the same with onsite behaviour?
“Experimentation is as much for protecting revenue as it is incrementing it. Failed experiments are effectively saving a business from productionising shitty experiences. Firstly, to stop poor UX being deployed to production that would have cost the business money. But also in the time saved for your R&D team - where, quite often, the effort to deploy an experiment is a fraction of the effort to build a new experience” Luke Frake
When it comes to individual experiments, we could re-frame discussions to talk about how an experiment 'saved us' from losing money by not releasing a change. ABtestguide.com calculates the business impact of a Bayesian result over 6 months, where you can balance revenue yield against chance. It's a nice little tool.
This conversation is geared as much towards protecting revenue as it is towards gaining it.
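To make that concrete, here's a minimal sketch of the kind of 'expected loss vs expected gain' calculation described above. It is not ABtestguide.com's implementation, and every number in it is hypothetical: sample from Beta posteriors for two variants, then weigh the risk of shipping the variant against the risk of keeping the control, translated into a 6-month revenue figure.

```python
# A rough sketch (not ABtestguide.com's implementation) of weighing expected gain
# against expected loss for a Bayesian A/B result. All numbers are hypothetical.
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical observed data: visitors and conversions per variant
visitors_a, conversions_a = 50_000, 1_500
visitors_b, conversions_b = 50_000, 1_590

# Beta(1, 1) prior -> Beta posterior samples for each conversion rate
post_a = rng.beta(1 + conversions_a, 1 + visitors_a - conversions_a, 100_000)
post_b = rng.beta(1 + conversions_b, 1 + visitors_b - conversions_b, 100_000)

prob_b_better = (post_b > post_a).mean()

# Expected loss of shipping B: the average shortfall in scenarios where B is worse
expected_loss_ship_b = np.maximum(post_a - post_b, 0).mean()
# Expected loss of keeping A: the uplift we forgo in scenarios where B is better
expected_loss_keep_a = np.maximum(post_b - post_a, 0).mean()

# Translate both risks into a hypothetical 6-month revenue figure
monthly_visitors, avg_order_value, horizon_months = 200_000, 80.0, 6
to_revenue = monthly_visitors * horizon_months * avg_order_value

print(f"P(B beats A): {prob_b_better:.1%}")
print(f"Risk of shipping B: ~£{expected_loss_ship_b * to_revenue:,.0f}")
print(f"Risk of keeping A:  ~£{expected_loss_keep_a * to_revenue:,.0f}")
```

Framed this way, the stakeholder sees two risks side by side rather than a single uplift promise, which is exactly the re-frame we're after.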
It can also be a conversation geared towards efficiency of development. One example that comes to mind, from working with Travis Perkins: "instead of embarking on a lengthy card-sorting process that would have taken us months to re-categorise and implement, we chose to redesign the navigation; how users interacted with it. We saved our team an estimated £95k on a solution that took just 3 days to build." That "£95k" was worked out by estimating the resource needed to plan, restructure and analyse the build requirements as a cost per day multiplied by a number of days. Rather arbitrary, I admit, but it gave a scale rather than an accuracy, and helped reframe the conversation.
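For what it's worth, the arithmetic behind that kind of 'cost avoided' figure is deliberately simple. The sketch below uses entirely made-up numbers (the real inputs behind the £95k weren't published); the point is the shape of the sum, not the precision.

```python
# A hypothetical back-of-envelope for a "cost avoided" estimate like the £95k above.
# The real inputs weren't published, so both figures here are illustrative only.
cost_per_day = 1_600   # assumed blended team cost per day (£) to plan, restructure, analyse
working_days = 60      # assumed number of days the card-sorting route would have taken

cost_avoided = cost_per_day * working_days
print(f"Estimated cost avoided: £{cost_avoided:,}")  # -> £96,000: a scale, not an accuracy
```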
OKRs for focus and alignment
Want to know one of the best decisions we ever made at FKA User Conversion (in my humble opinion)?
To say reading Measure What Matters was life-changing is an understatement (and I read a lot around the subject of objectives and key results, to be sure). Implementing OKRs for our conversion programmes allowed us to tell a story or narrative around where we’re heading that wasn’t necessarily rooted in revenue gain. Again, it helped reframe the conversation.
Don’t get me wrong, it was difficult (and still is) to implement OKRs well. Some businesses don’t have this methodology. Some don’t even understand their objectives. We were dealing largely with trade businesses, where everything is an objective or priority, usually in a fast-paced environment. Individual consultants have different styles, and OKRs are fairly process-driven. That being said, by calling out “focus and alignment” as the key to unlocking a re-prioritisation of outcomes, the method of doing that (i.e. an objective and key results) wasn’t as important as where the conversation was geared.
By framing success as the confidence of achieving specific objectives and specific key metrics, the focus is taken away from revenue attribution. Equally, it helps the wider team, including stakeholders, understand the direction of the optimisation program, as well as adding focus and providing alignment.
Talk in ranges
This seemed like a fairly obvious solution to the 'problem' when it originally reared its head. And, having spoken with a lot of practitioners, it's a well-used tactic.
"An experiment doesn’t mean you’ll make 3.9% revenue uplift, it means with 90% certainty you’ll make an uplift between a range - point A and point B. This is what confidence intervals are and dictate" Hazjier Pourkhalkhali
Hazjier recommends that you “always err towards the lower side of that range as a matter of best practice”.
Because we work with confidence values, presenting ranges back to your stakeholders makes perfect sense. It alleviates the issue to a degree, in that you’re showing it’s not an 11.96% uplift on the experiment but somewhere between 9.5% and 14.2%. It still needs the same degree of qualification, it’s just a more visible way of making the point. As a result, we generally provide forecasts between two values, rather than hanging our hat on a single value; again, it sets expectations.
We’re ultimately talking about presenting a range of values that the data hasn’t ruled out with a degree of certainty. Visualising this, too, will help stakeholders understand the risk associated with implementing the experiment.
“We don’t work in a world of exact numbers, we work in a world of ranges. So at best we can say to a degree of certainty that an experiment will have increased a metric by roughly our confidence intervals” Luke Frake
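As an illustration, here's a minimal sketch of turning two conversion rates into a relative-uplift range rather than a single figure, using a normal approximation on the difference in proportions. The traffic and conversion numbers are hypothetical, and dividing by the control rate is a rough way to express the interval as a relative uplift.

```python
# A minimal sketch of reporting a range instead of a point estimate, using a
# normal approximation on the difference of two proportions. Inputs are hypothetical.
from math import sqrt
from scipy.stats import norm

visitors_a, conversions_a = 40_000, 1_200   # control
visitors_b, conversions_b = 40_000, 1_345   # variant

p_a = conversions_a / visitors_a
p_b = conversions_b / visitors_b

# Standard error of the difference in conversion rates
se = sqrt(p_a * (1 - p_a) / visitors_a + p_b * (1 - p_b) / visitors_b)

confidence = 0.90
z = norm.ppf(1 - (1 - confidence) / 2)
diff_low, diff_high = (p_b - p_a) - z * se, (p_b - p_a) + z * se

# Express as a relative uplift range against the control rate
# (a rough approximation: it ignores the uncertainty in the control rate itself)
rel_low, rel_high = diff_low / p_a, diff_high / p_a

print(f"Point estimate:   {(p_b - p_a) / p_a:+.1%} relative uplift")
print(f"{confidence:.0%} interval:     {rel_low:+.1%} to {rel_high:+.1%}")
print(f"Figure to report: {rel_low:+.1%} (err towards the lower bound)")
```

The last line is the point of the exercise: the number you hang your hat on is the conservative end of the range, not the headline uplift.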
Using Degradation
We spoke in the previous article about not collating experiment attribution together because "you’re adding fuzz on top of fuzz”.
And in the previous section we talked about ranges, designed to showcase the conservative estimate. Why? Because it’s better to over-deliver than over-promise. For this reason, we always add degradation into experiments.
That rate should be contextualised to the business, however.
“I have not seen a better take yet than just having diminishing returns from 100% to 0% in 1 year. For one reason or another, a gold standard has emerged of including an impact of 6 months in business case calculations (we contributed to this ourselves - we often presented: 1 year of effect that expires linearly, which is equal to 6 months of impact at implementation immediately after the test)” Ton Wesseling
Why does the value of a test degrade over a number of months? The novelty may fade, or a competitor may copy; we regress towards the mean.
In particular, that point about novelty fading is an interesting one, and something that I tend to discuss within my talks around creativity.
- We are conditioned to variables, like how a product page should be structured (thank you, Amazon)
- ...when we are conditioned to these variables, we become desensitised to them
- ...meaning the impact of that stimulus is reduced.
I have no proof on this whatsoever, might I add, apart from a small sample size of individual experiments achieving the same thing.
How much does it degrade? I don’t know. The below is from a talk Ton did in Manchester (great city, one great football team, let's not talk about the Europa League Final please) about creating new baselines and standards based on cumulative experiment attribution.
Cumulative collective experiments could be contextualised in some form by degrading based on sales lifecycle and seasonality.
Taking the below example from a client at FKA User Conversion, we noted that the cumulative value of the experiment degraded over time, but loosely contextualised it to the business lifecycle. We knew there were peaks in sales in different seasons, new streams of traffic based on TV advertisements, etc.
Still, I agree, I don’t think there’s a ‘better’ way (at least that I have seen) than just diminishing the return of an experiment to null after a period of time - I’ve actually done 6-8 months in some cases; all in the interest of being conservative, in the hope of over-delivering rather than over-promising.
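If it helps, here's a small sketch of that linear degradation applied to a hypothetical monthly uplift figure. Note how a 12-month linear decay works out to roughly 6 months of full impact, as Ton describes.

```python
# A small sketch of linearly degrading a test's impact to zero over `decay_months`:
# full effect at implementation, tapering to nothing. The uplift figure is hypothetical.

def degraded_impact(monthly_uplift_revenue: float, decay_months: int) -> float:
    """Total impact while the effect decays linearly from 100% to 0%."""
    total = 0.0
    for month in range(decay_months):
        remaining = 1 - (month + 0.5) / decay_months  # mid-month value of the linear decay
        total += monthly_uplift_revenue * remaining
    return total

monthly_uplift_revenue = 20_000  # hypothetical £/month at full effect
for months in (6, 8, 12):
    total = degraded_impact(monthly_uplift_revenue, months)
    print(f"{months}-month linear decay: £{total:,.0f} "
          f"(~{total / monthly_uplift_revenue:.0f} months of full impact)")
```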
Program quality metrics
To remove the focus on revenue attribution, we must re-frame it towards something more attributable. Something that defines performance. Something that showcases we're moving in the right direction, at least.
“The most important KPI was the number of tests you are running because, naturally, the more tests you have, the more winners you’ll have. The second KPI is around the time to implement - if you have a lot of successful experiments they’ll take time to implement, both in terms of development and also in terms of time to go live. Finally, the third KPI is success rate. If your success rate is too low then the chance that you’ll implement a false positive gets bigger and bigger.” Annemarie Klaassen
In the case above, Annemarie Klaassen noted that she wanted to focus on the quality of the experiment program itself, by focusing on three main KPIs:
- Number of tests
- Time to implement
- Success rate
On that last point, FYI, plugging once more the great ABtestguide.com: they have a false discovery rate calculator for your experimentation program, which I would recommend.
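For illustration, the sketch below shows the kind of sum such a calculator performs (I'm not claiming it's ABtestguide.com's exact method): given your significance level, statistical power, and the share of tests that genuinely have an effect, it estimates what proportion of your 'winners' are false positives. All inputs are hypothetical.

```python
# A sketch of the kind of sum a false-discovery-rate calculator performs
# (not necessarily ABtestguide.com's exact method). All inputs are hypothetical.

def false_discovery_rate(alpha: float, power: float, true_effect_rate: float) -> float:
    """Share of declared winners that are actually false positives."""
    false_wins = alpha * (1 - true_effect_rate)  # null variants that still 'win' by chance
    true_wins = power * true_effect_rate         # real effects we manage to detect
    return false_wins / (false_wins + true_wins)

# Hypothetical program: 10% of ideas have a real effect, alpha = 5%, power = 80%
fdr = false_discovery_rate(alpha=0.05, power=0.80, true_effect_rate=0.10)
print(f"Estimated false discovery rate: {fdr:.0%}")  # ~36% of 'winners' are false positives
```

This is why Annemarie's third KPI matters: the lower your program's genuine success rate, the larger the share of 'winning' tests that are just noise.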
All the way back in 2015, Paul Rouke wrote an article for eConsultancy on vanity vs sanity metrics (it's a good read, give it a shot; even I remember it to this day, nearly 6 years later). Here, he spoke about the 'sanity metrics' for an effective CRO program being:
- The percentage of tests that deliver an uplift
- The average % uplift per test
- The percentage of successful tests that deliver over a 5% conversion rate increase
- The percentage reduction in cost-per-acquisition (CPA)
....and even... the percentage ROI per test. Let's not dig into that last one.
Indeed, Booking.com take a similar approach in a recently released article about their KPIs around ‘experiment quality’. And OK, we’re talking about Booking.com here - I would question (and I don’t know the answer) how important revenue attribution is to them on a per-experiment or collective basis. Their maturity is such that they require KPIs that measure good product decisions. They therefore focus on:
- Good design (things which happen before the start of an experiment)
- Good execution (mainly about the planned experiment duration and the adherence to that plan)
- Good shipping (validates that the decision is in line with the shipping criteria)
Summary
How should you attribute revenue to experimentation? In part 1 we spoke about 'communication'; how best to communicate:
- Align on and define your purpose of experimentation
- Set Standards and therefore expectations
- Define what accuracy is acceptable
- Tell Stories
In part 2, I would argue this is more about 're-framing the conversation' (but really #3 and #4 should be in the first part). I told you, I just split up my list of 1-9 because it was super long and no one would read this. Hell, I'm impressed you've read this far.
- Expected loss
- OKRs for focus and alignment
- Talk in ranges
- Using Degradation
- Program quality metrics
My final final summary about this "Experimentation Revenue Attribution Series"
I really hope this series has helped, at least, spark some debate and encourage thought. I, too, struggle with this, and so do many others. You're not alone. The "theory" often doesn't match the "practice", and I'm trying to balance the two; often slipping into practitioner and evangelical mode. Sorry about that.
I strongly believe that there is an over-indexed view of revenue attribution towards AB testing. And that has over-simplified the art of it. It has commercialised this methodology, sometimes for personal gain or greed. Falsified case studies, intentional or not. Breeding bad practice after bad practice. Creating misaligned thought and education.
Remember SEO in c. 2015? Stakeholders mostly cared about "getting to the top of Google". Which bred the same as above; falsified case studies, bad practice and poor education for years to come. But those who really cared about the craft spoke of presence and authority.
I feel we are at a similar inflection point within the CRO (specifically AB testing) industry in reference to our craft.
In most cases, it has diminished the importance of, and investment in, what I would consider to be one of the most valuable practices within business.
Forgive the cynical old man view, but it is the single reason why I embarked on this discovery nearly 3 years ago; because I care about the craft and really want others to see its true potential. By over-indexing on revenue attribution, I feel that we over-simplify and therefore won't reach that point.
Experiment revenue attribution, in an ideal world, should not be done. It's not the purpose of the methodology. But we don't live in an ideal world. Our world is contextual and complex. So, when we do attribute revenue, we must have guidelines that educate our stakeholders over time and ensure acceptable levels of accuracy, to help re-frame the conversation around the true purpose of AB testing.
My favourite quote in the whole series is all about humility and empathy with others. Just because we believe something to be true does not mean we should stomp our feet until we get it. Thank you, Matt.
“Ultimately, it comes down to approaching conversations with empathy. It's unreasonable to expect the FD or commercial team to be as knowledgeable about experiment-driven product development as us. We'll get the most buy-in when our story is simple, compelling and delivered with humility” Matt Lacey
Here is a list of the history, the why, and the how (parts 1 and 2). A huge thank you to all the contributors: those quoted, those who reviewed the articles, and those who sparked debate.
This is part of a series of articles about revenue attribution to experimentation:
- Why do we assume experimentation is about financial gain?
- No, you can’t accurately attribute, nor forecast, revenue to experimentation. Here’s why.
- Yes. We need to attribute revenue to experimentation. Here’s how: Part 1
- Yes. We need to attribute revenue to experimentation. Here’s how: Part 2
Contributions from Luke Frake, Matt Lacey, Hazjier Pourkhalkhali, Ton Wesseling and Annemarie Klaassen - thanks for the wise words!