The vibes about A/B testing are wrong: Why the backlash is a big mistake
There's a wave of anti-A/B testing sentiment in the air. More leaders are talking about the need for taste-based decision-making; many are actively deriding testing as a crutch that stifles creativity and sidesteps leadership.
They point to tweetable statements like:
As a data leader who's led large-scale testing initiatives, here's my take: of course taste is important for decision-making; you were kidding yourself if you ever thought otherwise.
But that doesn't mean you should throw the baby out with the bath water. A/B testing is an invaluable tool that enables your company to truly learn and scale by systematically separating the wheat from the chaff. You're also kidding yourself if you think otherwise.
Let’s start by unpacking what’s gone wrong with experimentation.
Where A/B testing goes wrong
Testing gets a bad rap for two reasons:
How testing gets misappropriated for strategy
In the early days at Uber, the company was laser-focused on driving trip volume. This was beautiful in how easy it was to measure and to communicate. But here's the problem: "more trips" isn't a strategy; it's a metric.
Focusing on trips led us to double down on creating trips at any cost, for example with short and cheap trips in products like Uber Pool. That's different from building repeated, high-quality experiences, and led to a tradeoff that nobody intended to make: short-term growth vs. long-term customer value.
The problem wasn't our testing — it was the lack of coherent strategy. When this happens in your organization, data leaders need to have a hard conversation. If you're being asked to 'test what features users want', that's a sign to push back. What kind of experience are teams trying to create? What's the long-term product vision?
It may feel uncomfortably close to saying, "You need to do your job first, so I can do mine." But the strategy needs to come first, and then A/B testing can help you test your path toward it.
There's a helpful analogy here: A/B testing isn't going to tell you what hill to climb. Rather, once you pick a hill, A/B testing will help you find your way to the top.
How teams screw up A/B testing
I’ve also seen the call coming from inside the house — tests are developed, run, or reported on poorly. The problem is that if testing isn’t executed well, then business leaders will justifiably ask, what’s the point?
There are a few common ways teams fall down on testing. These are obvious in hindsight, but easy to trip up on in practice:
1/ Not accounting for seasonality
Teams sometimes run tests during atypical time periods and fail to contextualize the results accordingly. For example, a promotion test that happens to fall over a holiday might show spectacular conversion rates that would be impossible to maintain year-round. If you then extrapolate those results as if they're representative of normal conditions, business stakeholders immediately start questioning your judgement.
2/ Missing long-term effects
Short-term metrics frequently hide longer-term consequences that aren't captured in the initial testing window. A feature might drive an immediate conversion uplift of 15% — while simultaneously increasing negative reviews or return rates that only become apparent weeks later. Without proper longer-term measurement, these tests can lead to features that optimize immediate results while quietly eroding brand equity and customer lifetime value.
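To make that tradeoff concrete, here's a minimal back-of-envelope sketch. All of the numbers (conversion, return rates, order value) are hypothetical; the point is only to show how a headline lift can shrink once later returns show up.

```python
# Hypothetical illustration: an immediate conversion lift can be offset by a
# rise in returns that the initial test window never sees.
baseline_conversion = 0.050   # 5.0% of visitors buy today
lift = 0.15                   # +15% relative conversion lift reported by the test
baseline_return_rate = 0.08   # 8% of orders currently get returned
later_return_rate = 0.14      # hypothetical: returns creep up weeks after launch
avg_order_value = 80.0        # dollars per order

def net_revenue_per_visitor(conversion, return_rate, order_value):
    """Revenue per visitor after refunding returned orders."""
    return conversion * order_value * (1 - return_rate)

before = net_revenue_per_visitor(baseline_conversion, baseline_return_rate, avg_order_value)
after = net_revenue_per_visitor(baseline_conversion * (1 + lift), later_return_rate, avg_order_value)

print(f"net revenue per visitor: before ${before:.2f}, after ${after:.2f}")
# ~$3.68 before vs ~$3.96 after: still a win here, but once the return rate
# passes ~20% the "15% winner" quietly becomes a loss that the two-week readout never showed.
```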
3/ The sum of the parts is just too big
When multiple teams run parallel experiments, each claiming significant improvements, the reported combined impact often exceeds what seems possible. I've seen it firsthand: six teams each claim 5% improvements, while the entire business only grew 20% during that period. There are lots of reasons this can happen (we won't unpack those here), and it's critical that data leaders be careful when rolling these effects up together. Again, claiming nonsensical victories creates serious credibility issues.
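A quick sanity check makes the mismatch obvious. The sketch below just reuses the made-up numbers from the example above (six 5% claims against 20% actual growth); it isn't a decomposition method, only arithmetic.

```python
# Six teams each claim a 5% lift; what do those claims imply in aggregate?
claimed_lifts = [0.05] * 6

additive_total = sum(claimed_lifts)        # naive sum of claims: 30%

compounded_total = 1.0
for lift in claimed_lifts:
    compounded_total *= 1 + lift
compounded_total -= 1                      # compounded claims: ~34%

actual_growth = 0.20                       # what the business actually grew

print(f"additive claim:   {additive_total:.0%}")
print(f"compounded claim: {compounded_total:.0%}")
print(f"actual growth:    {actual_growth:.0%}")
# Either way the claimed total overshoots reality, which is the cue to haircut,
# check for overlapping audiences, and audit each team's methodology.
```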
Six common-sense practices for strong A/B testing
After seeing both the successes and failures of testing programs firsthand, I've found these six best practices make the difference between testing that drives good decisions and testing that drives skepticism.
1/ Define strong success metrics
When your experiments don't have predefined success criteria, you'll likely end up cherry-picking whatever looks good in the data. Establish 1-2 primary metrics that directly tie to your hypothesis, along with several secondary metrics to catch potential negative impacts in other areas.
So if your primary metric is conversion rate, track secondary metrics like time-on-page, user satisfaction scores, and 30-day retention to ensure you're not creating downstream problems.
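One lightweight way to enforce this is to write the metrics into the test definition before launch. The structure and metric names below are purely illustrative, not a prescribed schema.

```python
# Illustrative pre-registered test definition: success criteria are written down
# before launch, so nobody cherry-picks a favorite metric after results come in.
experiment = {
    "name": "search_results_density_test",
    "primary_metrics": [
        {"metric": "conversion_rate", "expected_direction": "up"},
    ],
    "secondary_metrics": [  # guardrails to catch downstream damage
        {"metric": "time_on_page", "expected_direction": "flat"},
        {"metric": "user_satisfaction_score", "expected_direction": "flat_or_up"},
        {"metric": "retention_30d", "expected_direction": "flat_or_up"},
    ],
}
```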
2/ Write down a clear hypothesis
Vague experimentation without clear direction wastes resources and creates confusion about what insights to extract. Instead of approaching tests with a generic "Let's see what happens if we change X" mindset, frame each experiment with a specific hypothesis: "We believe changing X will improve Y because Z."
For example, "We believe showing fewer search results per page will increase conversion because it reduces cognitive load for customers." This structure forces teams to articulate their reasoning and creates natural guardrails for interpretation when results come in.
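If it helps to make the habit stick, the same template can be captured programmatically. This is only a sketch of the "We believe X will improve Y because Z" framing, with hypothetical field names.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    change: str      # X: what we are changing
    metric: str      # Y: what we expect to improve
    mechanism: str   # Z: why we believe it will work

    def statement(self) -> str:
        return f"We believe {self.change} will improve {self.metric} because {self.mechanism}."

h = Hypothesis(
    change="showing fewer search results per page",
    metric="conversion",
    mechanism="it reduces cognitive load for customers",
)
print(h.statement())
```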
3/ Run tests long enough
The pressure to move quickly often leads teams to cut testing windows short, missing critical medium and long-term effects. Make sure your tests have sufficient time to capture the full impact of your changes, particularly for features that might influence customer behavior patterns over time.
Consider that a pricing change might show an immediate uptick in conversions — but lead to decreased customer lifetime value that only becomes apparent after a few months.
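Working backwards from the effect you care about makes the window less arbitrary. Below is a rough power-calculation sketch using the standard normal-approximation formula for comparing two proportions; the baseline conversion, target lift, and traffic figures are hypothetical.

```python
from scipy.stats import norm

def sample_size_per_arm(p_control, p_treatment, alpha=0.05, power=0.8):
    """Approximate visitors needed per arm to detect p_control -> p_treatment
    in a two-sided test of two proportions (normal approximation)."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    return (z_alpha + z_beta) ** 2 * variance / (p_control - p_treatment) ** 2

# Hypothetical numbers: 5.2% baseline conversion, hoping to detect a lift to 5.8%.
n = sample_size_per_arm(0.052, 0.058)
daily_visitors_per_arm = 1_500   # hypothetical: 3,000 visitors/day, split 50/50
days_needed = n / daily_visitors_per_arm

print(f"~{n:,.0f} visitors per arm, roughly {days_needed:.0f} days at current traffic")
# About 23,000 visitors per arm, i.e. ~15 days, and that is before adding time
# to observe slower-moving effects like repeat purchases or churn.
```

The point isn't the exact formula; it's that the duration gets decided before the test starts, not when the dashboard happens to look good.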
4/ Share both positive and negative results
The nature of experimentation means some hypotheses won't pan out. But don't sweep failed tests under the rug — create a culture where every test is valued as a learning opportunity.
For example: "Our test to simplify the checkout flow actually decreased conversion by 2%, teaching us that users value security indicators more than we expected." This transparency builds credibility with stakeholders and creates institutional knowledge that prevents teams from repeatedly testing bad ideas.
As Ramesh Johari explained on our podcast, High Signal, this is critical to becoming what he calls "a self-learning organization".
5/ Haircut appropriately
No test exists in a perfect vacuum, and pretending otherwise undermines trust. Acknowledge when a test might be impacted by external factors, and apply appropriate "haircuts" to results when reporting up the chain.
For instance, "We saw a 10% improvement, but since it was during our peak season, we're conservatively estimating a 5% annual impact." This honest approach builds confidence in your reporting and establishes your team as trustworthy partners rather than metric chasers who don’t understand business context.
6/ Lastly: Run your program tightly
When different teams use different methodologies and reporting approaches, it's impossible to compare results across experiments — and easy for the entire testing program to lose credibility. Instead, ensure all teams use the same standardized approach to measuring and reporting impact.
For example, require every test report to include both the relative improvement ("conversion improved by 12%") and the absolute change ("from 5.2% to 5.8%") so business stakeholders don't misread one as the other.
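A small shared helper is one way to bake that convention in. The function below is a hypothetical sketch, but it shows the idea: every team computes and phrases both numbers the same way.

```python
def format_lift(control_rate: float, treatment_rate: float) -> str:
    """Report both the relative and the absolute change so readers can't confuse the two."""
    absolute = treatment_rate - control_rate
    relative = absolute / control_rate
    return (
        f"conversion improved by {relative:.1%} "
        f"(from {control_rate:.1%} to {treatment_rate:.1%}, "
        f"{absolute * 100:+.1f} percentage points)"
    )

print(format_lift(0.052, 0.058))
# "conversion improved by 11.5% (from 5.2% to 5.8%, +0.6 percentage points)"
```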
Running a tight, high-quality testing program is such an easy way to drive significant impact and credibility in your organization.
Striking the balance: taste for strategy, testing for tactics
Back at Uber, when the company finally figured out that we needed to be more deliberate about our strategy, we had a realization: a trip isn't just a trip. Not all trips are created equal, and we shouldn't indiscriminately optimize for trips alone. We needed to define a comprehensive strategy, and then a basket of metrics that reflected those goals.
Leaders sometimes complain that A/B testing encourages focusing on the numbers rather than the big picture, but that's actually the whole point. It just needs to happen within the right context:
Once you’re using the right metrics, A/B testing can do what it’s supposed to: enable more reliable (and much easier) tactical decision-making. It allows teams to quickly, independently, and consistently learn what works and make decisions accordingly — allowing a large organization to get far more done.
So the next time you hear leaders criticizing A/B testing, listen carefully to what they're actually saying. It doesn't need to be an OR here; it should be an AND. Testing is a compass, not a roadmap. Pick the right hill, and use testing to chart the fastest path.