A/B testing at Netflix, Uber, Pinterest, Google, LinkedIn, and Spotify - in easy words
An A-to-Z understanding of A/B testing


What is A/B Testing?

A/B testing is a data-driven approach to making business decisions instead of relying on guesswork. It is also known as split testing: users on the platform are split by a randomized experimentation process into two or more variants, the control group (the original or older version) and the treatment group (the new version).

The crux of A/B Testing

A/B testing is simple but not easy. It is one of the toughest data skills to acquire.

1. Many think it is just a simple random 50-50 split of users, keeping one set in the control variant and the other in the treatment variant.

2. Run the experiment for a business cycle or until you reach the required sample size from power analysis.

3. Conduct hypothesis testing and find the p-value; if the p-value is less than the significance level, reject H0 (the null hypothesis), as sketched below.
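As a minimal sketch of step 3 (the conversion counts below are hypothetical, and statsmodels' two-proportion z-test is just one common choice), the decision comes down to comparing the p-value against the significance level:

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical results: conversions and sample sizes for [control, treatment]
conversions = [480, 530]
sample_sizes = [10000, 10000]

stat, p_value = proportions_ztest(conversions, sample_sizes)

alpha = 0.05  # significance level
if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject H0, the variants differ")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject H0")
```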

It looks simple in theory, but there is a lot to check: whether control and treatment got contaminated, whether the random split was done right, novelty and primacy effects, verifying the setup with A/A testing, etc. In this newsletter, we will explore the pitfalls of A/B testing.

I’ll walk you through common mistakes that happen at the pre-, during-, and post-experiment stages.

1. A/A Testing Ignored

An A/A test pits two exactly identical pages against each other. The goal of an A/A test is to check that there is no difference between your control and variation versions. A/A testing is a good way to run a sanity check before you run an A/B test. It should be done whenever you start using a new tool or roll out a new implementation; in those cases, A/A testing helps check whether there is any discrepancy in the data.

Pitfall 0: Skipping A/A testing. An A/A test gives us a sanity check that there are no internal variant breaks and no issues with the engineering team's instrumentation.
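A rough way to make this sanity check concrete (the traffic numbers below are made up) is to simulate many A/A splits through the same analysis you would run for the real test and confirm that only about α of them come out "significant" by chance:

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(0)
n_simulations, n_per_group, true_rate, alpha = 2000, 5000, 0.05, 0.05

false_positives = 0
for _ in range(n_simulations):
    # Both "variants" are identical: they share the same true conversion rate
    a = rng.binomial(n_per_group, true_rate)
    b = rng.binomial(n_per_group, true_rate)
    _, p = proportions_ztest([a, b], [n_per_group, n_per_group])
    false_positives += p < alpha

# Should land near alpha (~5%); a much larger value hints at a broken setup
print(f"A/A false positive rate: {false_positives / n_simulations:.3f}")
```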

2. Power Analysis and Data Peeking

Power

  • Power analysis is a method that helps us calculate the required sample size before we conclude the results of an A/B test.


Consider the inputs of a typical online sample size calculator: the baseline conversion rate, the minimum detectable effect (MDE), the significance level, and the statistical power.

The required sample size increases as the minimum detectable effect decreases. The MDE is a number set in discussion with business and product managers; it answers the question: what is the smallest difference between the treatment and control groups that we care about? It captures the minimum effect that is worth acting on, considering the business cost of shipping the change (e.g., engineering time, business cycle, bad UX).
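A minimal sketch of the same calculation in Python with statsmodels; the baseline rate and MDE below are illustrative values only:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.10   # hypothetical current conversion rate
mde = 0.02             # smallest absolute lift worth detecting (10% -> 12%)
alpha, power = 0.05, 0.80

# Cohen's h effect size for the two proportions
effect_size = proportion_effectsize(baseline_rate + mde, baseline_rate)

n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=alpha, power=power, alternative="two-sided"
)
print(f"Required sample size per variant: {n_per_group:.0f}")
```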

Pitfall 1: Not understanding power analysis and ending the experiment prematurely is a very common mistake.

Data peeking is checking experiment results prematurely, before meeting the sample size requirement. In classical A/B testing, you cannot peek and rely on early results; you have to wait until the required sample size is collected.

Don’t end the experiment prematurely just because you have seen a positive uptick in the treatment group in the first few days (a likely false positive).
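A quick simulation sketch (traffic numbers made up) shows why: if you check the p-value after every batch of users and stop at the first "significant" result, the false positive rate climbs well above the nominal 5% even when there is no real effect:

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(1)
n_experiments, n_batches, batch_size, true_rate, alpha = 1000, 20, 500, 0.05, 0.05

early_stops = 0
for _ in range(n_experiments):
    conv_a = conv_b = n_a = n_b = 0
    for _ in range(n_batches):
        # No real effect exists: both variants share the same conversion rate
        conv_a += rng.binomial(batch_size, true_rate)
        conv_b += rng.binomial(batch_size, true_rate)
        n_a += batch_size
        n_b += batch_size
        _, p = proportions_ztest([conv_a, conv_b], [n_a, n_b])
        if p < alpha:          # peek and stop at the first "significant" result
            early_stops += 1
            break

# Far above the nominal 5%, even though there is no true effect
print(f"False positive rate with peeking: {early_stops / n_experiments:.2f}")
```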

I understand that data peeking has a business appeal: stopping the experiment early, before the full sample is collected, avoids the opportunity cost of delaying a winning variant (or of keeping a losing one running).

Solution: For early stopping and continuous monitoring, adopt sequential A/B testing instead of the classical fixed-sample setup.
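One classical sequential approach is Wald's SPRT. Below is a deliberately simplified, hypothetical one-sample sketch that monitors the treatment group's conversions against a known baseline rate p0; a production setup would typically use a two-sample or mixture sequential test instead:

```python
import math

def sprt_decision(conversions, p0=0.10, p1=0.12, alpha=0.05, beta=0.20):
    """Wald's SPRT on a stream of 0/1 conversions: test H0: p = p0 vs H1: p = p1.

    Returns "accept_h1", "accept_h0", or "continue" after scanning the stream,
    so the experiment can stop as soon as a boundary is crossed.
    """
    upper = math.log((1 - beta) / alpha)   # cross above -> accept H1
    lower = math.log(beta / (1 - alpha))   # cross below -> accept H0
    llr = 0.0
    for x in conversions:
        llr += x * math.log(p1 / p0) + (1 - x) * math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept_h1"
        if llr <= lower:
            return "accept_h0"
    return "continue"
```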

3. Randomization

Often this 50-50 split is done at the level of the entire user base, which can be the reason an A/B test fails. (This problem is very common; the majority of data scientists get it wrong, and a fellow data scientist pointed out the issue aptly.)


I will share my experience: I have seen folks do A/B testing with a purely random split where the user distribution is not stratified. Only 5% of the user base were heavy users, and a random 50-50 split of all users does not guarantee an equal split of heavy users between the control and treatment groups.

Pitfall 2: Heavy users (power users) are the major drivers of the platform. If this split goes wrong, a key principle of A/B testing fails: a simple random assignment does not guarantee that heavy users are distributed equally.

People assume that an equal random split over the whole population is the same as an equal random split within each group, which is not true.

For the pitfall shared above, I would solve this with stratified sampling within each user activity group: first split the heavy-user group 50-50, then the medium-activity group, then the less active users, as sketched below.
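A minimal pandas sketch of that idea, assuming a hypothetical users table with an activity_segment column (heavy / medium / light):

```python
import numpy as np
import pandas as pd

def stratified_split(users: pd.DataFrame, stratum_col: str = "activity_segment",
                     seed: int = 42) -> pd.DataFrame:
    """Assign a 50-50 control/treatment split independently within each stratum,
    so heavy, medium, and light users are balanced across variants."""
    rng = np.random.default_rng(seed)
    users = users.copy()
    users["variant"] = "control"
    for _, idx in users.groupby(stratum_col).groups.items():
        shuffled = rng.permutation(np.asarray(idx))
        treatment_idx = shuffled[: len(shuffled) // 2]
        users.loc[treatment_idx, "variant"] = "treatment"
    return users

# Example usage with a toy table
users = pd.DataFrame({
    "user_id": range(8),
    "activity_segment": ["heavy", "heavy", "medium", "medium",
                         "light", "light", "light", "light"],
})
print(stratified_split(users).groupby(["activity_segment", "variant"]).size())
```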

4. Contamination Issues

  • Ceteris Paribus
  • Spillover effect

Pitfall 3: Not taking ceteris paribus into account.

Scenario: Let us say we are running an A/B test on a webpage. In the control variant, the web page load time is 5 seconds, and in the treatment variant, the web page load time is 8 seconds.

This is not an equal comparison of the variants. If you want to measure the impact of the control vs. treatment web-page UI, you need to keep all other parameters equal, i.e., add a 3-second delay to the control variant's load time as well (in the case above).
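As a lightweight guardrail, you can also verify after the fact that load times (or any other factor that should be held constant) are balanced across the variants. A hypothetical sketch using Welch's t-test on logged per-request load times:

```python
import numpy as np
from scipy import stats

# Hypothetical per-request page load times (seconds) logged for each variant
load_control = np.array([5.1, 4.9, 5.2, 5.0, 4.8, 5.3])
load_treatment = np.array([8.2, 7.9, 8.1, 8.0, 8.3, 7.8])

# Welch's t-test: a tiny p-value flags that the variants differ on load time,
# so a metric difference may be driven by latency rather than the UI change
t_stat, p_value = stats.ttest_ind(load_control, load_treatment, equal_var=False)
print(f"Load-time balance check: t = {t_stat:.2f}, p = {p_value:.4f}")
```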

Pitfall 4: Cross-contamination between the treatment and control groups.

Cross-contamination happens when a visitor clicks on different content, has an unplanned experience, or picks up other impressions between arriving at your site and landing on the A or B test page, or when the same user ends up exposed to both variants.

Example: You are running an A/B test in the app with users split at the city level. If I travel from New Delhi to Gurgaon (Haryana), I get exposed to two different variants.
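One common mitigation, sketched below with a hypothetical experiment salt and split, is to bucket deterministically at the user level, for example by hashing the user ID together with an experiment salt, so the same user always lands in the same variant no matter which city they are in:

```python
import hashlib

def assign_variant(user_id: str, experiment_salt: str = "city_ui_test_v1",
                   treatment_share: float = 0.5) -> str:
    """Deterministic user-level bucketing: hashing (salt, user_id) maps every
    user to a stable bucket, so travel or device changes do not flip variants."""
    digest = hashlib.md5(f"{experiment_salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000 / 10_000  # uniform-ish value in [0, 1)
    return "treatment" if bucket < treatment_share else "control"

# The same user gets the same variant on every call
print(assign_variant("user_42"), assign_variant("user_42"))
```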

5. Post-Experiment Analysis

Pitfall 5: Wrong significance level set in multiple-variant testing; getting fooled by false positives in the multiple testing problem.

Multiple Testing Problem

- Set the significance level α to 5%
- P(false positive) = 5%: there is no real effect, but we falsely conclude there is one
- P(no false positive) = 1 - 5% = 95%
- For 3 comparisons, P(at least one false positive) = 1 - (95%)^3 = 14.3% ≈ 14%

For 3 groups, the probability of at least one false positive becomes ~14%; this is the multiple testing problem. To solve it, we can apply the Bonferroni correction, i.e., new significance level = α/n = 0.05/3 ≈ 0.0167.
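A short sketch of both the family-wise error rate calculation and the Bonferroni correction with statsmodels; the per-comparison p-values are hypothetical:

```python
from statsmodels.stats.multitest import multipletests

alpha, n_comparisons = 0.05, 3

# Probability of at least one false positive across 3 independent comparisons
fwer = 1 - (1 - alpha) ** n_comparisons
print(f"Family-wise error rate: {fwer:.3f}")   # ~0.143

# Bonferroni-adjusted decisions on hypothetical per-comparison p-values
p_values = [0.030, 0.200, 0.012]
reject, p_adjusted, _, _ = multipletests(p_values, alpha=alpha, method="bonferroni")
print(f"Reject H0 per comparison: {list(reject)}")
print(f"Adjusted p-values: {[round(p, 3) for p in p_adjusted]}")
```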

Simpson's Paradox

Pitfall 6: Segment-level and aggregated patterns look different, as often happens in ramp-up experiments where the traffic split varies across segments.
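A toy numeric sketch of the paradox (all counts are made up): the treatment wins within both the heavy-user and light-user segments, yet loses in aggregate because it was over-exposed to the low-converting light segment:

```python
import pandas as pd

# Hypothetical conversions / users per segment and variant
data = pd.DataFrame([
    ("heavy", "control",   200, 1000),
    ("heavy", "treatment",  50,  200),
    ("light", "control",    10,  200),
    ("light", "treatment",  80, 1000),
], columns=["segment", "variant", "conversions", "users"])

by_segment = data.groupby(["segment", "variant"]).sum(numeric_only=True)
by_segment["rate"] = by_segment["conversions"] / by_segment["users"]
print(by_segment["rate"])   # treatment wins in BOTH segments (25% vs 20%, 8% vs 5%)

overall = data.groupby("variant").sum(numeric_only=True)
overall["rate"] = overall["conversions"] / overall["users"]
print(overall["rate"])      # yet control wins overall (~17.5% vs ~10.8%)
```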

Novelty and Primacy Effect

Novelty Effect: Existing users want to try out the new functionality. Because the feature has just launched, users try it for the first few days and then stop using it.

Primacy Effect: The primacy effect is the tendency to remember the first piece of information we encounter better than information presented later on.

Solution: The best way of teasing out both effects is to restrict the analysis to new users who have not been exposed to the old version.

Pitfall 7: Failing to check for primacy and novelty effects, which biases the treatment effect.
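To make the cohort check above concrete, here is a hypothetical pandas sketch that compares the lift separately for new and existing users; the table schema and numbers are illustrative only:

```python
import pandas as pd

# Hypothetical per-user experiment results
results = pd.DataFrame({
    "variant":     ["control", "treatment"] * 4,
    "is_new_user": [True, True, True, True, False, False, False, False],
    "converted":   [0, 0, 1, 1, 0, 1, 0, 1],
})

# Compare conversion rates separately for new vs. existing users;
# a lift concentrated in existing users hints at a novelty effect
lift = (
    results.groupby(["is_new_user", "variant"])["converted"]
           .mean()
           .unstack("variant")
)
lift["abs_lift"] = lift["treatment"] - lift["control"]
print(lift)
```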

Conclusion

Learning from mistakes and understanding where things can go wrong is the best way of mastering a skill. To repeat and conclude: A/B testing is simple but not easy. It comes with many caveats and checks; there are many ways to do it wrong and only one right path, and ideal A/B testing is hard to achieve.

Recommended Posts: Learning about p-hacking

I hope you learned something new from this post. If you liked it, hit like and share it with others. Stay tuned for the next one!

Connect, follow, or endorse me on LinkedIn if you found this read useful. To learn more about me, visit: Here


Disclaimer: I don’t endorse any brand. Netflix, Uber, Pinterest, Google, LinkedIn, and Spotify names are used to make the Newsletter more relatable to the audience.

Pitfall Summary:

0. A/A testing ignored
1. Peeking: continuous monitoring of the experiment in a fixed-sample-size test
2. A random split on the entire user base is not truly random with respect to power users
3. Ceteris paribus issue: delay effect
4. Contamination / spillover effect between control and treatment
5. Multiple testing problem
6. Simpson's paradox: an A/B test on a subset population shows a different trend post full release
7. Novelty and primacy effects ignored
