How Experimentation Can Lead to Product Success

The most reliable way to gather data on a potential change to your product is to run a controlled experiment. With software products, as well as ads, websites, and marketing emails, A/B testing gives you this power. If you’re the first PM for your product, or even at your company, to A/B test, getting set up and up to speed can be arduous, but it’s well worth the effort. In this article, I will discuss how to use the power of A/B testing and how to get started. But first: why should you learn it at all?

Let me help you build the reasoning. Did you know?

Over 148,113 companies are using A/B testing tools (Figure 1.0: A/B testing customers by industry; source: Marketshare).

Major corporations such as Amazon, Airbnb, Booking, Jio, Google, Meta, Intuit and many others use Experimentation and A/B Testing. For instance, Booking.com runs more than 1,000 rigorous tests simultaneously and, by my estimates, more than 25,000 tests a year. At any given time, quadrillions (millions of billions) of landing-page permutations are live, meaning two customers in the same location are unlikely to see the same version. All this experimentation has helped transform the company from a small Dutch start-up to the world’s largest online accommodation platform in less than two decades.

Booking.com isn’t the only firm to discover the power of online experiments. Digital giants such as Amazon, Facebook, Google, and Microsoft have found them to be a game changer when it comes to marketing and innovation. They’ve helped Microsoft’s Bing unit, for instance, make dozens of monthly improvements, which collectively have boosted revenue per search by 10% to 25% a year. (See “The Surprising Power of Online Experiments,” HBR, September–October 2017.) Firms without digital roots—including FedEx, State Farm, and H&M—have also embraced online testing, using it to identify the best digital touchpoints, design choices, discounts, and product recommendations.

Indian market leaders are equally embracing experimentation in their product development culture; study companies like Jio, Cred, Flipkart, Swiggy, and several others, and you will find experimentation at the heart of how they build.

There is a brilliant post on Experimentation at Airbnb by Jan Overgoor.

Anyway, now that you know the reasoning, let’s dig into the key topics that will allow you to build a mental model and create a roadmap for planning, running, and understanding your first experiment:

  • When should you run A/B tests? A/B tests are good for answering some kinds of questions, and not so good for answering others.
  • How do you equip your product for A/B testing? Tests don’t yield good data unless done right. With your company’s in-house platform or an off-the-shelf solution, you can segment users, test out different versions of the product, and view the results.
  • How do you plan and run A/B tests? Once you’re set up on an A/B testing platform, what steps should you follow for each test you run?
  • How do you evaluate the results of A/B tests? This will help you figure out what your test results mean.
  • What are the pitfalls of A/B testing? A/B testing isn’t a panacea, and it’s easy to mess up the design, execution, or interpretation of your tests.
  • What should you expect when running A/B tests?

Running A/B tests is an exercise in patience and prioritization. You’ll want to test many things; you can test only a few at a time. You’ll want answers quickly; most tests will take weeks to generate significant results. You should also be prepared for ambiguous results. Some changes will boost the target metric but tank a core product KPI. Many tests will have a positive but statistically insignificant impact that likely just reflects a novelty effect.

It’s not all frustration, though: the excitement of finding a change that unambiguously knocks it out of the park is tremendous. And with A/B-tested product improvements, it’s easy to pinpoint the impact of your choices on product outcomes.

Why does A/B testing matter?

Despite their limitations in the types of changes they work well for, A/B tests are unparalleled in their ability to provide clean data. They allow you to try out and launch small but real product improvements. Optimizing many parts of your product in small ways can lead to a large aggregate impact on key metrics like DAUs, conversion rates, and revenues. The sorts of questions you can answer with A/B tests include:

  • If I change the placement of this button, will more people make a purchase?
  • Which marketing email headline gets the highest open rate?
  • Would changing the page image cause people to stay on the page longer?
  • Which button text and color combination garners the most clicks?
  • Which feature callout boosts feature discovery most?

What are A/B tests?

A/B tests are one of many user research tools you can use to evaluate potential changes to your product. Companies like them because, unlike in-lab studies, they produce statistically significant results with large samples. Most large apps and websites are constantly running dozens of A/B tests to help them optimize the details of their product.

An A/B test is a controlled experiment. You randomly sort users into buckets and give different buckets different versions of the product. (The “A” is generally the current version or control; the “B” incorporates a proposed change.) This allows you to compare how users respond to the variants and decide which ones you should incorporate into the product going forward. Despite the name, A/B tests are not limited to two product variants — it’s possible to try several options at once. The number of variants, split of traffic, and size of your user base determine how long you’ll need to run a test to get statistically significant results.

Although an A/B test is often illustrated with an even split of traffic between “A” and “B”, you will usually send the majority of traffic (95%+) to the control product (consider “A” to be your current offering). You want to feel free to test variants you aren’t sure will succeed, but sending 50% of your traffic to a version with a risky change could drastically hurt overall product performance.
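Most A/B testing platforms handle this assignment for you, but it helps to see the mechanics. Below is a minimal sketch, in Python with hypothetical function and variable names, of deterministic hash-based bucketing: the same user always lands in the same bucket, and the control keeps 95% of traffic.

```python
import hashlib

def assign_bucket(user_id: str, experiment: str, weights: dict) -> str:
    """Deterministically assign a user to a bucket for one experiment.

    Hashing the user ID together with the experiment name gives each user a
    stable position in [0, 1), so they always see the same variant, and
    different experiments get independent assignments.
    """
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    position = int(digest, 16) / 16 ** len(digest)  # uniform in [0, 1)

    cumulative = 0.0
    for bucket, weight in weights.items():
        cumulative += weight
        if position < cumulative:
            return bucket
    return list(weights)[-1]  # guard against floating-point rounding

# 95% of traffic stays on the control "A"; 5% sees the variant "B".
print(assign_bucket("user_42", "cta_button_color", {"A": 0.95, "B": 0.05}))
```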

One variable at a time

The number one rule of A/B tests (and pretty much any experiment) is to only manipulate one variable at a time.

Say you’re changing attributes of a green call-to-action button that you want more users to click. If your “B” version of the page changes the button so that it’s red and larger, your results will be inconclusive. If the “A” version does better than the “B” version, you still don’t know whether a larger green button (or a same-sized red button) would have outperformed your current design. If the “B” version does better than the “A” version, you don’t know which change contributed to the improvement: the color change, the size change, or both.

In this case, it would be appropriate to run a multivariate test with four buckets — a small, green button; a large, green button; a small, red button; and a large, red button. You can also use multivariate tests to try things like different combinations of page headers and images. The key is to have a test bucket for every possible combination, not just two buckets with multiple variables changed between them.
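As a quick illustration, here is a hypothetical Python sketch that enumerates the full cross-product of factor levels, one bucket per combination:

```python
from itertools import product

# One bucket per combination of factors, not just two buckets with several
# variables changed between them.
sizes = ["small", "large"]
colors = ["green", "red"]

buckets = [f"{size}-{color}" for size, color in product(sizes, colors)]
print(buckets)  # ['small-green', 'small-red', 'large-green', 'large-red']
```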

What questions can’t A/B tests answer?

Conversely, an A/B test won’t answer questions like:

  • What product features should I launch next?
  • How should I re-design the app?

These are badly suited to A/B tests for a few reasons:

  1. The variants are expensive to produce. You shouldn’t be building new features on a hunch and waiting for A/B tests to tell you whether they’re a good idea; you should instead be using qualitative research tactics along the way to validate your feature ideas before fully investing.
  2. They involve too many changes at once, so the A/B test slows you down or doesn’t tell you what’s working. Sometimes you have to do a complete overhaul of your product’s design. You can’t A/B test each incremental change to every design element — it would take you years to launch the new version! And you won’t learn much from A/B testing the entirely re-designed version against the old one; too much has changed.

Some companies have a policy of testing every change before launch. For instance, an advertising platform might insist that any changes run on 1% of traffic for a week before rolling out to the other 99% of traffic, to confirm that they don’t hurt revenue. Thus, they would “test” changes we don’t recommend using A/B testing for. These aren’t A/B tests in the sense that we mean: they’re safeguards rather than tests of specific hypotheses.

The amount of traffic your product gets also determines whether you can use A/B tests to answer questions. With a small user base, it will take a very long time to get statistically significant results. A/B testing is best suited for products that get thousands to millions of daily users.

How to plan and run A/B tests?

  1. Generate a hypothesis

What product change do you want to try, and what metric(s) do you expect to change as a result?

Don’t just randomly try changes and see how your KPIs are affected; that approach will occasionally hand you false positive results by pure chance. Instead, be specific about both sides of this question.

Examples of good hypotheses:

Changing our call-to-action button from blue to red will increase the percentage of site visitors who click on it

A 10% discount promotional offer will yield more purchases and (proportionally) higher revenues than our standard pricing with no promotion

Placing the “compose” button at the bottom of the screen instead of the top (its current position) will increase the percentage of users who start drafting a message

Your hypotheses should call for one-sided statistical tests. That is, you should hypothesize not just that a product change will affect a metric of interest, but that it will increase or decrease that metric.

Don’t forget to make sure your product is instrumented to gather the data you need! If you’re not tracking what percentage of users click a particular button, you need to fix that before you can experiment with whether variants will increase the click rate.
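When the results come in, evaluating a hypothesis like the ones above boils down to a one-sided comparison of two proportions. Here is a rough sketch using statsmodels with made-up counts; your A/B testing platform will do the equivalent for you.

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical results: clicks and visitors for control ("A") and variant ("B").
clicks = [450, 520]
visitors = [10_000, 10_000]

# One-sided test: the alternative hypothesis is that "B"'s click rate is
# higher than "A"'s. With the arrays ordered [A, B], "smaller" tests p_A < p_B.
z_stat, p_value = proportions_ztest(clicks, visitors, alternative="smaller")
print(f"z = {z_stat:.2f}, one-sided p = {p_value:.4f}")
```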

  2. Build the variants

The “A” group will be directed to your current product, but you need to build the “B” (and maybe “C” and beyond) version(s) of the product before you can send users to them. If you’re simply changing a color or a string of text, this is easy. However, if you’re changing pricing, that needs to be reflected on your payment screen and in the transaction back-end. If you’re trying out different layouts or icons, you’ll need to work with designers to generate those. Scope out this work and plan for it in your product roadmap.

  3. Determine the traffic split and length of test

If you have a statistics team to help you, call them in for this part!

You have three important questions to answer here:

What’s your audience for this test? If your product has millions of users all around the world, do you hypothesize that this change will improve the product for all of them? Maybe it’s geared towards European users only, or maybe it’s a text change that you only want to test on English-language users. Perhaps you only want to try a pricing promotion on users who already have accounts and are logged in. Regardless, you need to get clear on your test audience (and its size) before you can answer questions (2) and (3).

What percentage of traffic should you send to the “B” (and “C”, “D”...) versions? The more traffic you send to the variants (up to an even split), the faster you can achieve statistically significant results. But there’s some chance that your variants are considerably worse than your current product. Are you willing to risk sending a large portion of your traffic to an experimental version of the product? Users also tend to be averse to change — if you use large “B” buckets in A/B tests, your users may be annoyed when they perceive constant product changes. Some companies institute standard policies requiring that, say, 90% of traffic be in the control group. Thus, you could send up to 10% of users to the “B” version in an A/B test, or up to 3.33% of users to each of the “B”, “C”, and “D” versions in a test with four buckets.

How long do you need to run your test (in order to achieve statistically significant results)? This will depend on how much traffic the relevant part of your product gets (your sample size), your answer to question (1), your definitions of acceptable statistical significance and power, and the minimum change you’d like to detect. The calculations are complicated; your A/B testing platform should have a tool to do them for you.
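Under the hood, that duration tool is doing a power analysis. A rough planning sketch in Python, with every number below being a hypothetical input, looks like this:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Hypothetical inputs: a 4.5% baseline click rate, and we want to detect a
# lift to at least 5.0% with 95% confidence and 80% power.
baseline, minimum_target = 0.045, 0.050
effect_size = proportion_effectsize(minimum_target, baseline)

# Solve for the size of the "B" bucket, with the control 19x larger (a 95/5 split).
n_variant = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,            # required significance level
    power=0.80,            # required power
    ratio=19,              # control users per variant user
    alternative="larger",  # one-sided, as the hypothesis specifies
)

daily_eligible_users = 20_000  # hypothetical traffic reaching this page per day
days = n_variant / (daily_eligible_users * 0.05)
print(f"~{n_variant:,.0f} users in the 'B' bucket, roughly {days:.0f} days")
```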

  4. Set up the test in your A/B testing platform. Enough said.
  5. Monitor your test along the way

You should check in on your test every once in a while to make sure users are seeing the variants and data on the important metrics is flowing.

You should not put any stock in the early results of the test. If you’ve calculated that your test needs to run for 18 days, run it for 18 days before drawing any conclusions. A common error is to check in after a few days, marvel at a 5% lift in the metric of choice, end the test early, and roll out the tested variant. This is a mistake. Your A/B test results will often look something like this: an early lift that fades back toward the baseline as the test runs.

The mere fact of changing something can create a temporary lift in your metrics. (This is called the “novelty effect.”) If you stopped such an experiment early and launched “B”, you wouldn’t actually be improving your product, and you wouldn’t see a permanent lift in your numbers.
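One practical way to watch for a novelty effect, sketched below with pandas and a toy exposure log (all column names and numbers hypothetical), is to tabulate the metric per bucket per day and check whether “B”’s early lift shrinks as the test runs.

```python
import pandas as pd

# Toy exposure log: one row per exposed user, with the day of exposure,
# the bucket they saw, and whether they converted.
events = pd.DataFrame({
    "day":       [1, 1, 1, 1, 2, 2, 2, 2],
    "bucket":    ["A", "A", "B", "B", "A", "A", "B", "B"],
    "converted": [0, 1, 1, 1, 0, 1, 0, 1],
})

# Daily conversion rate per bucket: a lift that fades day by day is a hint
# that novelty, not a real improvement, is driving the early numbers.
daily = events.groupby(["day", "bucket"])["converted"].mean().unstack("bucket")
print(daily)
```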

Evaluating the results of A/B tests

So you ran your A/B test for the correct length of time and saw that the “B” version of the page did better on your chosen metric(s) (at least as much better as the minimum improvement you specified when determining how long the test needed to run). Good news: you, a statistician, or the platform you’re using has already done the hard statistics work. If that was done right, these results are a promising indicator that you should launch the “B” version. But there are some statistics terms that will be thrown around in discussions of these results that you should understand, and there are some caveats on your launch decision that you should be aware of.

What if you didn’t see a lift in your chosen metric? Or you saw a lift but it didn’t reach your statistical significance threshold? Then your test did not give you evidence that the “B” version is better than the “A” version, and data doesn’t support launching the “B” version. Record the test and results, and consider testing other variants. If you have reason to doubt the results, you can also consider re-running the test — what you see are statistics on a random sample, not guaranteed truth.

False positives and false negatives

Experiments can yield the wrong results. Experiments look at a sample from a broader population (your entire user base). Even if this sample is representative of the population any way you slice it (demographics, tenure using the product, time of day, etc.), every individual is different. Randomness could mean your experiment yields results that don’t reflect the true preferences of the population.

The null hypothesis in an A/B test is that “B” is no better than “A” on your metric(s) of choice. You can come to two conclusions: either you reject the null hypothesis, or you fail to reject it. (You never “accept” or “prove” a null hypothesis.) There are two ways your results can be wrong: false positives and false negatives. A false positive means you reject the null hypothesis when you shouldn’t; a false negative means you fail to reject the null hypothesis when you should. That gives four possible outcomes: correctly rejecting the null (“B” really is better), correctly failing to reject it (“B” really isn’t), a false positive, and a false negative.

There’s no way to avoid errors entirely, but it’s important to understand the likelihood of getting an incorrect result and determine what frequency of errors you’re willing to accept.
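To build intuition for how often pure chance produces a “winner”, here is a small A/A simulation in Python (all numbers hypothetical). Both buckets draw from the same 5% conversion rate, so every “significant” result is a false positive, and roughly α of the runs produce one.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, false_positives, n_tests = 0.05, 0, 2_000

for _ in range(n_tests):
    a = rng.binomial(1, 0.05, size=5_000)  # control: true 5% conversion
    b = rng.binomial(1, 0.05, size=5_000)  # "variant": identical behaviour
    _, p = stats.ttest_ind(a, b)
    false_positives += p < alpha

print(f"False positive rate: {false_positives / n_tests:.3f}")  # close to 0.05
```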

Confidence and power

When planning tests, you (or your company as a whole) will define a required level of confidence and power. Confidence is the complement of the probability of making a false positive error; power is the complement of the probability of making a false negative error. For a given sample size and effect size, there is a trade-off between confidence and power: decreasing the odds of one type of error will increase the odds of the other.

Some terminology: the required confidence level is denoted 1-α. Its complement, α, is called the statistical significance level. When people talk about the p-value of a test (the probability of getting results at least as extreme as those observed if the null hypothesis is correct), it is usually to compare it to α, the required significance level. The required level of power is denoted 1-β (its complement, β, being the probability of making a false negative error). False positive errors are also called Type I errors; false negative errors are also called Type II errors.

It is standard to require a confidence level of 95% and a power level of 80% for A/B tests. However, different companies have different requirements — we’ve seen confidence levels as low as 80%. Your A/B testing platform may be set up with defaults. For instance, your dashboard may show you the 95% confidence intervals for the expected value of the metric in the “A” version versus the “B” version. If these intervals do not overlap, the difference is statistically significant at the α = .05 level and you have met your required confidence level. (The converse doesn’t quite hold: intervals can overlap slightly while the difference is still significant, so a proper two-sample test is the more precise check.)
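If you want to reproduce what such a dashboard is computing, here is a minimal sketch with statsmodels and made-up counts; the interval method is an assumption, and a proper two-sample test remains the more precise check.

```python
from statsmodels.stats.proportion import proportion_confint

# Hypothetical results: clicks out of visitors for each bucket.
results = {"A": (450, 10_000), "B": (540, 10_000)}

for bucket, (clicks, visitors) in results.items():
    low, high = proportion_confint(clicks, visitors, alpha=0.05, method="wilson")
    print(f"{bucket}: {clicks / visitors:.2%} (95% CI {low:.2%} to {high:.2%})")
```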

You may also wish to vary your requirements by test, depending on how costly you’d expect each type of error to be for the changes in question. However, you should always define confidence and power requirements before beginning your test and stick to them in judging the results.

Guardrail KPIs

The impact of an A/B test on the metric(s) in your hypothesis isn’t the only thing that matters. You should have 2-3 core product KPIs that you monitor for all tests; a negative impact on these metrics is a red flag.

Let’s say you changed the text on a button with the goal of getting more people to click on that button, and it worked! But users in the group that saw the new button were less likely to actually complete a purchase, and purchase conversion rate was one of your product KPIs. Although your test was nominally successful, you shouldn’t launch the new text.
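A guardrail check is just another hypothesis test, pointed in the direction of harm. Here is a hedged sketch with hypothetical purchase counts:

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical guardrail data: purchases and visitors in each bucket.
purchases = [820, 760]        # A, B
visitors = [10_000, 10_000]

# One-sided test for harm: the alternative is that "B"'s purchase rate is
# lower than "A"'s. With the arrays ordered [A, B], "larger" tests p_A > p_B.
_, p_harm = proportions_ztest(purchases, visitors, alternative="larger")
if p_harm < 0.05:
    print("Guardrail breached: 'B' significantly hurts purchase conversion")
```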

Checking representation, sub-populations and test interactions

There are a few other things you should check before declaring a test successful and rolling the change out to your entire user base:

Was your sample truly representative? It’s worth confirming that the traffic directed to the “B” version matched your overall traffic breakdown in terms of gender, country, time of day, new versus existing users, and any other categories you track.
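One simple way to run this check, sketched here with a hypothetical country breakdown, is a chi-square test of independence between bucket and segment:

```python
from scipy.stats import chi2_contingency

# Hypothetical traffic counts: rows are buckets, columns are countries.
#                 IN       US       UK
observed = [[47_500, 28_400, 19_100],   # control ("A")
            [ 2_430,  1_520,  1_050]]   # variant ("B")

# A low p-value suggests the variant's traffic mix differs from the control's,
# which would undermine the comparison.
chi2, p_value, dof, _ = chi2_contingency(observed)
print(f"chi2 = {chi2:.1f}, p = {p_value:.3f}")
```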

How did specific segments of users respond to the test?

It’s also worth looking at how specific categories of users responded to the test. For instance, if your user base is overwhelmingly male, you might see a strong positive test result for a change that performed well with male users but badly with female users. If one of your strategic goals for the product is appealing more to female users, rolling out that change might be a bad plan even though it would boost your metrics now.

Did the test interact with other concurrent tests?

We’ll look at this in more detail in the pitfalls section below, but if you were running multiple tests at once, you should see how users responded to combinations of changes, rather than considering each test in isolation.

Pitfalls of A/B tests

When you first get A/B testing working for your product, it feels like magic. What a great way to make data-driven decisions! But the exhilaration of this new power can lead to some common errors.

Mistake #1: Trying to A/B test too many things

Your product only gets so much traffic. And the activities you care about (e.g., visits to a particular page or clicks on a particular button) only happen for a fraction of that traffic.

This means your sample size for experiments is limited! These limitations show up in three ways:

You can’t afford to test too many variants on one feature. You may be able to try an A/B test of a red versus green button, but not an A/B/C/D/E/F/G test of seven different button colors.

You can’t afford to manipulate too many variables in multivariate testing. Let’s say you want to test out two button colors and two button sizes. That’s four combinations — probably doable. But if you want to test out two colors, two sizes, two text options, two borders, and two placements, that’s thirty-two combinations. Unless your product has millions of daily users, you can’t support that.

You can’t afford to run too many tests at once. Layering tries to avoid each test’s biasing the results of the others, but it’s still important to examine interactions between experiments rather than considering each in isolation. Thus, the same problems as in multivariate testing can emerge if you run too many tests at once, even if they’re manipulating different parts of your product.

Trying to test too many things at once means the sample size for each test is small, and the test must run for a long time to get significant results. If you’re waiting months for the results of your tests, that slows down your launch cadence.

Like all aspects of product management, successful A/B testing is about ruthless prioritization. You may have ideas for hundreds of product variants you could test, but you must use your intuition and customer understanding to narrow that list down. Test the changes you suspect will yield the best results first.

Mistake #2: Ignoring interactions between tests

This is related to the third limitation above. If you’re running multiple tests at once, layering them correctly is necessary to get accurate results — but it’s not sufficient. You need to also look at which combinations of treatments yield the best results.

Take an example: suppose you’re changing both your page header and the text on a call-to-action (CTA) button, in hopes of increasing clicks on that button. Your page headers are A1 and B1; your CTA button texts are A2 and B2.

You might see the following results (note: we’re assuming 50/50 bucketing for ease of math):

If you consider each test independently, you will pick B1 over A1 (7% CTR > 5% CTR) and B2 over A2 (6.75% CTR > 5.25% CTR). But the combination (B1, B2) is actually only your third-best option! The (B1, A2) combination is much better. The moral of the story: always look for interactions between the tests. If the data show they’re not entirely independent, you need to treat them like a single multivariate test and pick the best combination.
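The results table from the original example isn’t reproduced here, so the per-combination click-through rates in the sketch below are hypothetical numbers chosen only to be consistent with the marginal rates quoted above. It shows how judging each test by its marginals picks a combination that isn’t actually the best:

```python
# Hypothetical per-combination click-through rates (50/50 bucketing in both layers).
ctr = {
    ("A1", "A2"): 0.025,
    ("A1", "B2"): 0.075,
    ("B1", "A2"): 0.080,
    ("B1", "B2"): 0.060,
}

# Marginal view: what you would see judging each test in isolation.
for header in ("A1", "B1"):
    rate = sum(v for (h, _), v in ctr.items() if h == header) / 2
    print(f"{header}: {rate:.2%}")   # A1: 5.00%, B1: 7.00%
for text in ("A2", "B2"):
    rate = sum(v for (_, t), v in ctr.items() if t == text) / 2
    print(f"{text}: {rate:.2%}")     # A2: 5.25%, B2: 6.75%

# Interaction view: the marginal winners (B1, B2) are only third-best.
best = max(ctr, key=ctr.get)
print("Best combination:", best, f"at {ctr[best]:.2%}")  # ('B1', 'A2') at 8.00%
```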

Mistake #3: Changing too many variables at once

If you want to experiment with multiple variables (e.g., the size and color of a button) or with multiple aspects of your product (e.g., various page headers and various button treatments), then you need to either run a multivariate test (in the first case) or several layered tests (in the second). If you try to do a simple head-to-head comparison of two very different versions of the product, your data will be garbage. You won’t know which changes made the product better, which made it worse, and which had no impact.

Mistake #4: Failing to validate the data before testing

If you want to know what impact your test will have on particular user actions or metrics, you’d better make sure you’ve instrumented them correctly. It’s surprising how often A/B tests get built and launched, only for people to realize they aren’t gathering the data they need to judge the results. Save yourself some time and test your instrumentation first!

Mistake #5: Ending tests early

We’ve discussed this elsewhere, but it bears repeating because it’s one of the most common errors people running A/B tests make. Do not end a test early simply because the initial results are promising (or unpromising)! To make decisions with good data, you need to stick to your established criteria for statistical significance, which means letting tests run for their full duration.

In addition, the effects you see early in an experiment may be driven by the novelty effect rather than permanent shifts in customer behavior. If you end tests early, you will frequently roll out features that showed promise in tests, only to see no impact on your numbers once the feature is fully released. (You’ll sometimes encounter this frustrating phenomenon even when you let tests run for their planned duration, but it’s a lot more likely to happen if you make this error.)

Mistake #6: Looking at too many numbers

If you have a dashboard of dozens of metrics, your tests will always move some in the direction you want and others in the opposite direction. This creates two big problems:

  • You never know whether to launch something; the impact is always “mixed”.
  • If you look at enough metrics, every test will have a statistically significant impact on some metric, but that might be purely by chance.

Your hypothesis should specify which 1-2 numbers you expect to see improved. You can also have another 2-3 product KPIs to monitor — these are numbers where you don’t necessarily expect to see an impact, but where if you did see a negative effect, that would be a reason not to launch.
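The arithmetic behind “some metric will move by chance” is worth seeing once. A quick sketch:

```python
# Probability that at least one of k unrelated metrics shows a "significant"
# move purely by chance, at alpha = 0.05 per metric.
alpha = 0.05
for k in (1, 5, 10, 20):
    print(f"{k:>2} metrics -> {1 - (1 - alpha) ** k:.0%}")
# 1 -> 5%, 5 -> 23%, 10 -> 40%, 20 -> 64%
```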

Mistake #7: Not retesting

You’ll quickly have a long list of product variations to A/B test, so the concept of retesting something might seem silly — why bother? But there are two reasons this is important:

False positives (and false negatives) are inevitable. Even if you set a fairly stringent significance standard of p < .05, roughly one in twenty tests of changes that truly have no effect will yield a false positive result! (People generally require lower power, say 80%, than confidence, say 95%, so false negatives will be even more common.)

Your product changes. A year ago, you may have learned that a green button was better than a red or blue button. But in the past year, you may have changed the text and images on the page, as well as the surrounding color scheme. A different color button might work better now.

Of course, you shouldn’t run every test twice — that’s a waste of precious testing space. But if your intuition or lab-based user research says an A/B test result doesn’t seem quite right, don’t blindly trust the numbers. Remember that you always have the option to run the test again.

Mistake #8: Replacing product management with growth hacking

A/B testing is part of your job as a PM. It’s not all of your job. Because the results are fast and visible, some people get addicted to the thrill of nudging KPIs up by running lots of tests. But your job is to deliver a superb user experience, in a way that helps your company succeed. Your KPIs capture some part of that experience and its impact on your company, but not all of it. Don’t lose sight of the broader mission. Don’t trust numbers to the point of ignoring people.

Conclusion

To win against the competition, businesses of all kinds and across every industry must embrace the digital-first mindset. The COVID-19 pandemic accelerated a transition already underway: customers looking to the digital experience as their primary point of contact with a business. Now, customers have elevated expectations for product experience. They expect companies to deliver high-value features at a rapid pace, and if the digital product or service fails to meet their needs, they will look elsewhere.

Data and product analytics are the epicenter of how digital-first companies figure out customer needs and measure the impact of their products. The sooner we take advantage of this ecosystem, the quicker we can adapt, meet our customers’ needs, and achieve mastery in the digital world. At Jio, we find it less risky to run a large number of experiments than a small number.

If you are curious to learn more about experimentation and A/B testing, you may want to try my book, Win The Digital Age with Data: How To Use Analytics To Build Products That Customers Love.
