How Experimentation Can Lead to Product Success
Suman Guha
India Tech Icon 2024 | Award winning CIO Digital Transformation | Technology at Tata CLiQ & Luxury | Serial CTO/CPO, ex-Reliance Retail, Tesco, Red Hat, Cisco | eCommerce, SaaS | AI Fellow
The most reliable way to gather data on a potential change to your product is to run a controlled experiment. With software products, as well as ads, websites, and marketing emails, A/B testing gives you this power. If you're the first PM for your product, or even at your company, to A/B test, getting set up and up to speed can be arduous, but it's well worth the effort. In this article, I will discuss how to use the power of A/B testing and how to get started. But first, why should you learn it?
Let me help you build the reasoning. Did you know that over 148,113 companies are using A/B testing tools? (Source: market-share data, Figure 1.0.)
Major corporations such as Amazon, Airbnb, Booking, Jio, Google, Meta, Intuit and many others use Experimentation and A/B Testing. For instance, Booking.com runs more than 1,000 rigorous tests simultaneously and, by my estimates, more than 25,000 tests a year. At any given time, quadrillions (millions of billions) of landing-page permutations are live, meaning two customers in the same location are unlikely to see the same version. All this experimentation has helped transform the company from a small Dutch start-up to the world’s largest online accommodation platform in less than two decades.
Booking.com isn’t the only firm to discover the power of online experiments. Digital giants such as Amazon, Facebook, Google, and Microsoft have found them to be a game changer when it comes to marketing and innovation. They’ve helped Microsoft’s Bing unit, for instance, make dozens of monthly improvements, which collectively have boosted revenue per search by 10% to 25% a year. (See “The Surprising Power of Online Experiments,” HBR, September–October 2017.) Firms without digital roots—including FedEx, State Farm, and H&M—have also embraced online testing, using it to identify the best digital touchpoints, design choices, discounts, and product recommendations.
Indian market leaders have equally embedded experimentation in their product development culture; study companies like Jio, Cred, Flipkart, Swiggy, and several others and you will find experimentation at the heart of how they build products.
There is a brilliant post on Experimentation at Airbnb by Jan Overgoor.
Now that you know the reasoning, let's dig into the key topics that will help you build a mental model and a roadmap for planning, running, and understanding your first experiment:
With an off-the-shelf solution, you can segment users, test out different versions of the product, and view the results.
Running A/B tests is an exercise in patience and prioritization. You’ll want to test many things; you can test only a few at a time. You’ll want answers quickly; most tests will take weeks to generate significant results. You should also be prepared for ambiguous results. Some changes will boost the target metric but tank a core product KPI. Many tests will have a positive but statistically insignificant impact that likely just reflects a novelty effect.
It's not all frustration, though: the excitement of finding a change that unambiguously knocks it out of the park is tremendous. And with A/B tested product improvements, it's easy to pinpoint the impact of your choices on product outcomes.
Why does A/B testing matter?
Despite their limitations in the types of changes they work well for, A/B tests are unparalleled in their ability to provide clean data. They allow you to try out and launch small but real product improvements. Optimizing many parts of your product in small ways can lead to a large aggregate impact on key metrics like DAUs, conversion rates, and revenues. The sorts of questions you can answer with A/B tests include: Which call-to-action color gets more users to click? Does a promotional discount yield more purchases and higher revenue? Where on the screen should a key button be placed?
What are A/B tests?
A/B tests are one of many user research tools you can use to evaluate potential changes to your product. Companies like them because, unlike in-lab studies, they produce statistically significant results with large samples. Most large apps and websites are constantly running dozens of A/B tests to help them optimize the details of their product.
An A/B test is a controlled experiment. You randomly sort users into buckets and give different buckets different versions of the product. (The “A” is generally the current version or control; the “B” incorporates a proposed change.) This allows you to compare how users respond to the variants and decide which ones you should incorporate into the product going forward. Despite the name, A/B tests are not limited to two product variants — it’s possible to try several options at once. The number of variants, split of traffic, and size of your user base determine how long you’ll need to run a test to get statistically significant results.
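To make the mechanics concrete, here is a minimal Python sketch of the bucketing step, assuming a hypothetical experiment name and traffic weights. Hashing the user ID together with the experiment name gives a stable pseudo-random assignment, so a given user always sees the same variant:

```python
import hashlib

def assign_bucket(user_id: str, experiment: str, weights: dict[str, float]) -> str:
    """Deterministically assign a user to a variant.

    Hashing user_id together with the experiment name gives a stable
    pseudo-random point in [0, 1], so the same user always lands in the
    same bucket. `weights` maps variant names to traffic fractions
    that sum to 1.0.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    point = int(digest[:8], 16) / 0xFFFFFFFF
    cumulative = 0.0
    for variant, weight in weights.items():
        cumulative += weight
        if point <= cumulative:
            return variant
    return list(weights)[-1]  # guard against floating-point rounding

# Example: 95% of traffic stays on the control "A", 5% sees variant "B".
print(assign_bucket("user-42", "cta_button_color", {"A": 0.95, "B": 0.05}))
```

A platform or off-the-shelf tool will do this for you, but the principle is the same: assignment must be random across users yet deterministic for any single user.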
Although diagrams of A/B tests often show an even split of traffic between “A” and “B” for illustrative purposes, you will usually send the majority of traffic (95%+) to the control product (let's say “A” is your current offering). You want to feel free to test variants you aren't sure will succeed, but sending 50% of your traffic to a version with a risky change could drastically hurt overall product performance.
One variable at a time
The number one rule of A/B tests (and pretty much any experiment) is to only manipulate one variable at a time.
Say you’re changing attributes of a green call-to-action button that you want more users to click. If your “B” version of the page changes the button so that it’s red and larger, your results will be inconclusive. If the “A” version does better than the “B” version, you still don’t know whether a larger green button (or a same-sized red button) would have outperformed your current design. If the “B” version does better than the “A” version, you don’t know which change contributed to the improvement: the color change, the size change, or both.
In this case, it would be appropriate to run a multivariate test with four buckets: a small green button, a large green button, a small red button, and a large red button. You can also use multivariate tests to try things like different combinations of page headers and images. The key is to have a test bucket for every possible combination, not just two buckets with multiple variables changed between them.
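As an illustration (the attributes and values here are hypothetical), enumerating every combination is simple, and it also makes the combinatorial cost of multivariate tests obvious:

```python
from itertools import product

# Hypothetical attributes under test; every combination needs its own bucket.
colors = ["green", "red"]
sizes = ["small", "large"]

buckets = [f"{size} {color} button" for color, size in product(colors, sizes)]
print(buckets)       # 4 buckets: one per color x size combination
print(len(buckets))  # 2 options x 2 options = 4

# Five two-option attributes would already need 2**5 = 32 buckets
# (see Mistake #1 later in this article).
```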
What questions can’t A/B tests answer?
Conversely, an A/B test won't answer bigger questions, such as whether to build an entirely new feature, which market to enter next, or why users behave the way they do. These are badly suited to A/B tests for a few reasons: sweeping changes involve too many variables to isolate, the outcomes you care about play out over a much longer horizon than a test can run, and "why" questions are qualitative, better answered with interviews and other user research.
Some companies have a policy of testing every change before launch. For instance, an advertising platform might insist that any changes run on 1% of traffic for a week before rolling out to the other 99% of traffic, to confirm that they don’t hurt revenue. Thus, they would “test” changes we don’t recommend using A/B testing for. These aren’t A/B tests in the sense that we mean: they’re safeguards rather than tests of specific hypotheses.
The amount of traffic your product gets also determines whether you can use A/B tests to answer questions. With a small user base, it will take a very long time to get statistically significant results. A/B testing is best suited for products that get thousands to millions of daily users.
How do you plan and run A/B tests?
What product change do you want to try, and what metric(s) do you expect to change as a result?
Don't just try random changes and see how your KPIs are affected; that approach will hand you false positives by chance. Instead, be specific about both sides of this question.
Examples of good hypotheses:
Changing our call-to-action button from blue to red will increase the percentage of site visitors who click on it
A 10% discount promotional offer will yield more purchases and (proportionally) higher revenues than our standard pricing with no promotion
Placing the “compose” button at the bottom of the screen instead of the top (its current position) will increase the percentage of users who start drafting a message
Your hypotheses should call for one-sided statistical tests. That is, you should hypothesize not just that a product change will affect a metric of interest, but that it will increase or decrease that metric.
Don’t forget to make sure your product is instrumented to gather the data you need! If you’re not tracking what percentage of users click a particular button, you need to fix that before you can experiment with whether variants will increase the click rate.
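As a minimal sketch (the event names and logging backend here are hypothetical, not a specific analytics product), instrumentation for a button-click test only needs to record who saw which variant and whether they clicked:

```python
import json
import time

def log_event(user_id: str, experiment: str, variant: str, event: str) -> None:
    """Record one analytics event; replace print with your real analytics pipeline."""
    record = {
        "ts": time.time(),
        "user_id": user_id,
        "experiment": experiment,  # e.g. "cta_button_color"
        "variant": variant,        # "A" or "B"
        "event": event,            # "button_viewed" or "button_clicked"
    }
    print(json.dumps(record))

# Click rate per variant = count of "button_clicked" / count of "button_viewed"
log_event("user-42", "cta_button_color", "B", "button_viewed")
log_event("user-42", "cta_button_color", "B", "button_clicked")
```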
The “A” group will be directed to your current product, but you need to build the “B” (and maybe “C” and beyond) version(s) of the product before you can send users to them. If you're simply changing a color or a string of text, this is easy. However, if you're changing pricing, that needs to be reflected on your payment screen and in the transaction back-end. If you're trying out different layouts or icons, you'll need to work with designers to generate those. Scope out this work and plan for it in your product roadmap.
If you have a statistics team to help you, call them in for this part!
You have three important questions to answer here:
What's your audience for this test? If your product has millions of users all around the world, do you hypothesize that this change will improve the product for all of them? Maybe it's geared towards European users only, or maybe it's a text change that you only want to test on English-language users. Perhaps you only want to try a pricing promotion on users who already have accounts and are logged in. Regardless, you need to get clear on your test audience (and its size) before you can answer questions (2) and (3).
What percentage of traffic should you send to the “B” (and “C”, “D”...) versions? The more traffic you send to the variants (up to an even split), the faster you can achieve statistically significant results. But there’s some chance that your variants are considerably worse than your current product. Are you willing to risk sending a large portion of your traffic to an experimental version of the product? Users also tend to be averse to change — if you use large “B” buckets in A/B tests, your users may be annoyed when they perceive constant product changes. Some companies institute standard policies requiring that, say, 90% of traffic be in the control group. Thus, you could send up to 10% of users to the “B” version in an A/B test, or up to 3.33% of users to each of the “B”, “C”, and “D” versions in a test with four buckets.
How long do you need to run your test (in order to achieve statistically significant results)? This will depend on how much traffic the relevant part of your product gets (your sample size), your answer to question (1), your definitions of acceptable statistical significance and power, and the minimum change you’d like to detect. The calculations are complicated; your A/B testing platform should have a tool to do them for you.
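Your platform's duration calculator is the right tool for this, but as a rough sketch of what it does, here is the standard two-proportion sample-size formula; the baseline rate and minimum detectable effect below are hypothetical inputs you would replace with your own:

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_bucket(baseline: float, mde: float,
                           alpha: float = 0.05, power: float = 0.80) -> int:
    """Users needed in EACH bucket to detect an absolute lift of `mde`
    over `baseline` with a one-sided test at the given alpha and power."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # one-sided, per the hypotheses above
    z_beta = NormalDist().inv_cdf(power)
    p1, p2 = baseline, baseline + mde
    pooled = (p1 + p2) / 2
    numerator = (z_alpha * (2 * pooled * (1 - pooled)) ** 0.5 +
                 z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return ceil(numerator / mde ** 2)

# Hypothetical inputs: 5% baseline click rate, want to detect a lift to 6%.
n = sample_size_per_bucket(0.05, 0.01)
print(n)  # roughly 6,400 users per bucket
# Days to run ~= n / (daily users entering the smallest bucket).
```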
You should check in on your test every once in a while to make sure users are seeing the variants and data on the important metrics is flowing.
You should not put any stock in the early results of the test. If you've calculated that your test needs to run for 18 days, run it for 18 days before drawing any conclusions. A common error is to check in after a few days, marvel at a 5% lift in the metric of choice, end the test early, and roll out the tested variant. This is a mistake. A/B test results often show an early jump in the variant's metric that fades back toward the control as the test continues.
The mere fact of changing something can create a temporary lift in your metrics. (This is called the "novelty effect.") If you stopped such an experiment early and launched "B", you wouldn't actually be improving your product, and you wouldn't see a permanent lift in your numbers.
Evaluating the results of A/B tests
So you ran your A/B test for the correct length of time and saw that the "B" version of the page did better on your chosen metric(s), by at least the minimum improvement you specified when determining how long the test needed to run. Good news: you, a statistician, or the platform you're using has already done the hard statistics work. If that was done right, these results are a promising indicator that you should launch the "B" version. But there are some statistics terms that will be thrown around in discussions of these results that you should understand, and there are some caveats on your launch decision that you should be aware of.
What if you didn’t see a lift in your chosen metric? Or you saw a lift but it didn’t reach your statistical significance threshold? Then your test did not give you evidence that the “B” version is better than the “A” version, and data doesn’t support launching the “B” version. Record the test and results, and consider testing other variants. If you have reason to doubt the results, you can also consider re-running the test — what you see are statistics on a random sample, not guaranteed truth.
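For intuition, here is a minimal sketch of the kind of test a platform runs behind the scenes: a one-sided two-proportion z-test. The click counts are made up for illustration; use your platform's reporting rather than rolling your own in production:

```python
from statistics import NormalDist

def one_sided_z_test(clicks_a: int, n_a: int, clicks_b: int, n_b: int) -> float:
    """p-value for the hypothesis that variant B's click rate is higher than A's."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    return 1 - NormalDist().cdf(z)  # one-sided p-value

# Hypothetical results: "A" saw 50,000 users and 2,500 clicks (5.0%);
# "B" saw 5,000 users and 290 clicks (5.8%).
p = one_sided_z_test(2500, 50_000, 290, 5_000)
print(f"p-value = {p:.3f}")  # compare against your chosen alpha (e.g. 0.05)
```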
False positives and false negatives
Experiments can yield the wrong results. They look at a sample from a broader population (your entire user base). Even if this sample is representative of the population any way you slice it (demographics, tenure using the product, time of day, etc.), every individual is different. Randomness could mean your experiment yields results that don't reflect the true preferences of the population.
The null hypothesis in an A/B test is that "B" is no better than "A" on your metric(s) of choice. You can come to two conclusions: either you reject the null hypothesis, or you fail to reject it. (You never "accept" or "prove" a null hypothesis.) There are two ways your results can be wrong: false positives and false negatives. A false positive means you reject the null hypothesis when you shouldn't; a false negative means you fail to reject the null hypothesis when you should. The possible outcomes are:
"B" truly is no better, and you fail to reject the null: correct result
"B" truly is no better, but you reject the null: false positive (Type I error)
"B" truly is better, and you reject the null: correct result
"B" truly is better, but you fail to reject the null: false negative (Type II error)
There’s no way to avoid errors entirely, but it’s important to understand the likelihood of getting an incorrect result and determine what frequency of errors you’re willing to accept.
Confidence and power
When planning tests, you (or your company as a whole) will define a required level of confidence and power. Confidence is one minus the probability of making a false positive error. Power is one minus the probability of making a false negative error. For a given sample size and effect size, there is a trade-off between confidence and power: decreasing the odds of one type of error will increase the odds of the other.
Some terminology: the required confidence level is denoted 1 - α; α itself is called the statistical significance level. When people talk about the p-value of a test (the probability of getting results at least as extreme as yours if the null hypothesis is correct), it is usually to compare it to α, the required significance level. The required level of power is denoted 1 - β, where β is the probability of making a false negative error. False positive errors are also called Type I errors; false negative errors are also called Type II errors.
It is standard to require a confidence level of 95% and a power level of 80% for A/B tests. However, different companies have different requirements; we've seen confidence levels as low as 80%. Your A/B testing platform may be set up with defaults. For instance, your dashboard may show you the 95% confidence intervals for the expected value of the metric in the "A" version versus the "B" version. If these intervals do not overlap, you have comfortably met your required confidence level (non-overlapping intervals are, if anything, a slightly stricter bar than significance at the α = .05 level).
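As a sketch (again with made-up counts), this is how those per-variant 95% confidence intervals for a conversion-rate metric are typically computed with the normal approximation:

```python
from statistics import NormalDist

def proportion_ci(successes: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Normal-approximation confidence interval for a conversion rate."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / n
    margin = z * (p * (1 - p) / n) ** 0.5
    return (p - margin, p + margin)

# Hypothetical dashboard numbers
print(proportion_ci(2500, 50_000))  # "A": roughly (0.048, 0.052)
print(proportion_ci(330, 5_000))    # "B": roughly (0.059, 0.073)
```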
You may also wish to vary your requirements by test, depending on how costly you'd expect each type of error to be for the changes in question. However, you should always define confidence and power requirements before beginning your test and stick to them in judging the results.
Guardrail KPIs
The impact of an A/B test on the metric(s) in your hypothesis isn't the only thing that matters. You should have 2-3 core product KPIs that you monitor for all tests. A negative impact on these metrics is a red flag.
Let's say you changed the text on a button with the goal of getting more people to click on that button, and it worked! But users in the group that saw the new button were less likely to actually complete a purchase, and purchase conversion rate is one of your product KPIs. Although your test was nominally successful, you shouldn't launch the new text.
Checking representation, sub-populations and test interactions
There are a few other things you should check before declaring a test successful and rolling the change out to your entire user base:
Was your sample truly representative? It's worth confirming that the traffic directed to the "B" version matched your overall traffic breakdown in terms of gender, country, time of day, new versus existing users, and any other categories you track (a quick way to sanity-check this is sketched after this list).
How did specific segments of users respond to the test?
It’s also worth looking at how specific categories of users responded to the test. For instance, if your user base is overwhelmingly male, you might see a strong positive test result for a change that performed well with male users but badly with female users. If one of your strategic goals for the product is appealing more to female users, rolling out that change might be a bad plan even though it would boost your metrics now.
Did the test interact with other concurrent tests?
We’ll discuss this soon and learn the pitfalls, but if you were running multiple tests at once, you should see how users responded to combinations of changes, rather than considering each test in isolation.
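On the first of these checks, here is a minimal sketch of the idea: compare the composition of the "B" bucket against the overall traffic mix with a chi-square goodness-of-fit test. The country shares and counts below are hypothetical, and the sketch assumes scipy is available:

```python
from scipy.stats import chisquare

# Hypothetical share of overall traffic by country, and observed counts in bucket "B".
expected_share = {"IN": 0.60, "US": 0.25, "GB": 0.15}
observed_b = {"IN": 2900, "US": 1350, "GB": 750}  # 5,000 users in "B"

total = sum(observed_b.values())
observed = [observed_b[c] for c in expected_share]
expected = [expected_share[c] * total for c in expected_share]

stat, p_value = chisquare(observed, f_exp=expected)
print(f"p-value = {p_value:.3f}")
# A very small p-value (say, < 0.01) suggests the bucket's mix differs from
# overall traffic, so investigate your bucketing before trusting the results.
```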
Pitfalls of A/B tests
When you first get A/B testing working for your product, it feels like magic. What a great way to make data-driven decisions! But the exhilaration of this new power can lead to some common errors.
Mistake #1: Trying to A/B test too many things
Your product only gets so much traffic. And the activities you care about (e.g., visits to a particular page or clicks on a particular button) only happen for a fraction of that traffic.
This means your sample size for experiments is limited! These limitations show up in three ways:
You can’t afford to test too many variants on one feature. You may be able to try an A/B test of a red versus green button, but not an A/B/C/D/E/F/G test of seven different button colors.
You can’t afford to manipulate too many variables in multivariate testing. Let’s say you want to test out two button colors and two button sizes. That’s four combinations — probably doable. But if you want to test out two colors, two sizes, two text options, two borders, and two placements, that’s thirty-two combinations. Unless your product has millions of daily users, you can’t support that.
You can’t afford to run too many tests at once. Layering tries to avoid each test’s biasing the results of the others, but it’s still important to examine interactions between experiments rather than considering each in isolation. Thus, the same problems as in multivariate testing can emerge if you run too many tests at once, even if they’re manipulating different parts of your product.
Trying to test too many things at once means the sample size for each test is small, and the test must run for a long time to get significant results. If you’re waiting months for the results of your tests, that slows down your launch cadence.
Like all aspects of product management, successful A/B testing is about ruthless prioritization. You may have ideas for hundreds of product variants you could test, but you must use your intuition and customer understanding to narrow that list down. Test the changes you suspect will yield the best results first.
Mistake #2: Ignoring interactions between tests
This is related to (3) above. If you’re running multiple tests at once, layering them correctly is necessary to get accurate results — but it’s not sufficient. You need to also look at which combinations of treatments yield the best results.
Take an example: suppose you’re changing both your page header and the text on a call-to-action (CTA) button, in hopes of increasing clicks on that button. Your page headers are A1 and B1; your CTA button texts are A2 and B2.
You might see results like the following (note: we're assuming 50/50 bucketing for ease of math). Considered independently, B1 beats A1 (7% CTR versus 5% CTR) and B2 beats A2 (6.75% CTR versus 5.25% CTR), so you would pick B1 and B2. But when you break the results out by combination, (B1, B2) is actually only your third-best option! The (B1, A2) combination is much better. The moral of the story: always look for interactions between the tests. If the data show they're not entirely independent, you need to treat them like a single multivariate test and pick the best combination.
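To see how this can happen, here is a sketch with hypothetical per-combination click-through rates chosen so that the per-test averages match the numbers above. The per-test view and the per-combination view point to different winners:

```python
# Hypothetical CTRs for each (header, CTA text) combination, with equal traffic per cell.
ctr = {
    ("A1", "A2"): 0.020,
    ("A1", "B2"): 0.080,
    ("B1", "A2"): 0.085,
    ("B1", "B2"): 0.055,
}

# Per-test (marginal) view: average over the other test's buckets.
for header in ["A1", "B1"]:
    avg = sum(v for (h, _), v in ctr.items() if h == header) / 2
    print(f"header {header}: {avg:.2%}")    # A1: 5.00%, B1: 7.00%
for text in ["A2", "B2"]:
    avg = sum(v for (_, t), v in ctr.items() if t == text) / 2
    print(f"CTA text {text}: {avg:.2%}")    # A2: 5.25%, B2: 6.75%

# Per-combination view: the real ranking.
best = max(ctr, key=ctr.get)
print("best combination:", best, f"at {ctr[best]:.2%}")  # (B1, A2) at 8.50%
# Picking each test's winner independently gives (B1, B2) at 5.50%,
# only the third-best of the four combinations.
```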
Mistake #3: Changing too many variables at once
If you want to experiment with multiple variables (e.g., the size and color of a button) or with multiple aspects of your product (e.g., various page headers and various button treatments), then you need to either run a multivariate test (in the first case) or several layered tests (in the second). If you try to do a simple head-to-head comparison of two very different versions of the product, your data will be garbage. You won’t know which changes made the product better, which made it worse, and which had no impact.
Mistake #4: Failing to validate the data before testing
If you want to know what impact your test will have on particular user actions or metrics, you'd better make sure you've instrumented that correctly. It's surprising how often A/B tests get built and launched, only for people to realize they aren't gathering the data they need to judge the results. Save yourself some time and test your instrumentation first!
Mistake #5: Ending tests early
We’ve discussed this elsewhere, but it bears repeating because it’s one of the most common errors people running A/B tests make. Do not end a test early simply because the initial results are promising (or unpromising)! To make decisions with good data, you need to stick to your established criteria for statistical significance, which means letting tests run for their full duration.
In addition, the effects you see early in an experiment may be driven by the novelty effect rather than permanent shifts in customer behavior. If you end tests early, you will frequently roll out features that showed promise in tests, only to see no impact on your numbers once the feature is fully released. (You’ll sometimes encounter this frustrating phenomenon even when you let tests run for their planned duration, but it’s a lot more likely to happen if you make this error.)
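A small simulation (made-up traffic numbers, no real data) shows why peeking is dangerous: even when "B" is identical to "A", checking daily and stopping at the first "significant" result produces far more than the nominal 5% false positive rate:

```python
import random
from statistics import NormalDist

random.seed(7)
Z_CRIT = NormalDist().inv_cdf(0.95)  # one-sided test at alpha = 0.05
DAYS, USERS_PER_DAY, TRUE_RATE = 20, 250, 0.05  # hypothetical traffic; A == B

def peeking_trial() -> bool:
    """Run one A/A test, peeking daily; return True if we ever 'detect' a lift."""
    a_clicks = b_clicks = n = 0
    for _ in range(DAYS):
        n += USERS_PER_DAY
        a_clicks += sum(random.random() < TRUE_RATE for _ in range(USERS_PER_DAY))
        b_clicks += sum(random.random() < TRUE_RATE for _ in range(USERS_PER_DAY))
        pooled = (a_clicks + b_clicks) / (2 * n)
        se = (2 * pooled * (1 - pooled) / n) ** 0.5
        if se > 0 and (b_clicks - a_clicks) / n / se > Z_CRIT:
            return True  # would have stopped early and "launched" B
    return False

trials = 500
false_positives = sum(peeking_trial() for _ in range(trials))
print(f"false positive rate with daily peeking: {false_positives / trials:.1%}")
# Typically well above the 5% you thought you were accepting.
```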
Mistake #6: Looking at too many numbers
If you have a dashboard of dozens of metrics, your tests will always move some in the direction you want and others in the opposite direction. This creates two big problems:
You never know whether to launch something; the impact is always “mixed”
If you look at enough metrics, every test will have a statistically significant impact on some metric, but that might be purely by chance
Your hypothesis should specify which 1-2 numbers you expect to see improved. You can also have another 2-3 product KPIs to monitor — these are numbers where you don’t necessarily expect to see an impact, but where if you did see a negative effect, that would be a reason not to launch.
Mistake #7: Not retesting
You’ll quickly have a long list of product variations to A/B test, so the concept of retesting something might seem silly — why bother? But there are two reasons this is important:
False positives (and false negatives) are inevitable. Even with a fairly stringent significance standard of p < .05, about one in twenty tests of changes that make no real difference will yield a false positive result! (People generally require less power than significance, so false negatives will be even more common.) A quick calculation of how fast these add up follows this list.
Your product changes. A year ago, you may have learned that a green button was better than a red or blue button. But in the past year, you may have changed the text and images on the page, as well as the surrounding color scheme. A different color button might work better now.
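As a back-of-the-envelope sketch of how false positives accumulate across many tests (assuming alpha = 0.05 and, pessimistically, that none of the tested changes has a real effect; the yearly test count is hypothetical):

```python
alpha = 0.05           # required significance level
tests_per_year = 100   # hypothetical testing cadence

expected_false_positives = alpha * tests_per_year
prob_at_least_one = 1 - (1 - alpha) ** tests_per_year

print(f"expected false positives: {expected_false_positives:.0f}")  # about 5
print(f"chance of at least one:   {prob_at_least_one:.1%}")         # about 99.4%
```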
Of course, you shouldn't run every test twice; that's a waste of precious testing space. But if your intuition or lab-based user research says an A/B test result doesn't seem quite right, don't blindly trust the numbers. Remember that you always have the option to run the test again.
Mistake #8: Replacing product management with growth hacking
A/B testing is part of your job as a PM. It’s not all of your job. Because the results are fast and visible, some people get addicted to the thrill of nudging KPIs up by running lots of tests. But your job is to deliver a superb user experience, in a way that helps your company succeed. Your KPIs capture some part of that experience and its impact on your company, but not all of it. Don’t lose sight of the broader mission. Don’t trust numbers to the point of ignoring people.
Conclusion
To win against the competition, businesses of all kinds and across every industry must embrace the digital-first mindset. The COVID-19 pandemic accelerated a transition already underway: customers looking to the digital experience as their primary point of contact with a business. Now, customers have elevated expectations for product experience. They expect companies to deliver high-value features at a rapid pace, and if the digital product or service fails to meet their needs, they will look elsewhere.
Data and product analytics are at the epicenter of how digital-first companies figure out customer needs and measure the impact of their products. The sooner we take advantage of this ecosystem, the quicker we will adapt, meet our customers' needs, and achieve mastery in the digital world. At Jio, we find it less risky to run a large number of experiments than a small number.
If you are curious to learn more about Experimentation and A/B Testing you may want to try my book Win The Digital Age with Data: How To Use Analytics To Build Products That Customers Love