A/B testing at Netflix, Uber, Pinterest, Google, LinkedIn, and Spotify - in easy words
An A-to-Z understanding of A/B testing


What is A/B Testing?

A/B testing is a data-driven approach to making business decisions instead of relying on guesswork. It is also known as split testing: users on the platform are split by a randomized experimentation process into two or more variants, the control group (the original or older version) and the treatment group (the new version).

The crux of A/B Testing

A/B testing is simple but not easy. It is one of the toughest data skills to acquire.

1. Many think it is just a simple random 50-50 split of users, keeping one set in the control variant and the other in the treatment variant.

2. Run the experiment for a business cycle or until you reach the required sample size from power analysis.

3. Conduct hypothesis testing and find the p-value; if the p-value is less than the significance level, reject H0 (the null hypothesis), as sketched below.
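As a minimal sketch of step 3 (the conversion counts below are hypothetical, and statsmodels' two-proportion z-test is just one common choice), the decision comes down to comparing the p-value against the significance level:

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical results: conversions and sample sizes for [control, treatment]
conversions = [480, 530]
sample_sizes = [10000, 10000]

stat, p_value = proportions_ztest(conversions, sample_sizes)

alpha = 0.05  # significance level
if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject H0, the variants differ")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject H0")
```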

It looks simple in theory, but there is a lot to check: whether control and treatment got contaminated, whether the random split was done right, novelty and primacy effects, verifying the setup with A/A testing, etc. In this newsletter, we will explore the pitfalls of A/B testing.

I’ll walk you through common mistakes that happen at the pre-, during-, and post-experiment stages.

1. A/A Testing Ignored

An A/A test pits two exactly identical pages against each other. The goal of an A/A test is to check that there is no difference between your control and variation versions. A/A testing is a good way to run a sanity check before you run an A/B test. It should be done whenever you start using a new tool or roll out a new implementation; in those cases, A/A testing helps check whether there is any discrepancy in the data.

Pitfall 0: Skipping A/A testing. An A/A test gives us a sanity check that there are no internal variant breaks and no issues with the engineering team's instrumentation.
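A rough way to make this sanity check concrete (the traffic numbers below are made up) is to simulate many A/A splits through the same analysis you would run for the real test and confirm that only about α of them come out "significant" by chance:

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(0)
n_simulations, n_per_group, true_rate, alpha = 2000, 5000, 0.05, 0.05

false_positives = 0
for _ in range(n_simulations):
    # Both "variants" are identical: they share the same true conversion rate
    a = rng.binomial(n_per_group, true_rate)
    b = rng.binomial(n_per_group, true_rate)
    _, p = proportions_ztest([a, b], [n_per_group, n_per_group])
    false_positives += p < alpha

# Should land near alpha (~5%); a much larger value hints at a broken setup
print(f"A/A false positive rate: {false_positives / n_simulations:.3f}")
```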

2. Power Analysis and Data Peeking

Power

  • Power analysis is a method that helps us calculate the required sample size before we conclude the results of an A/B test.


Consider the inputs of a typical online sample size calculator: the baseline conversion rate, the minimum detectable effect (MDE), the significance level, and the statistical power.

The required sample size increases as the minimum detectable effect decreases. The MDE is a number set in discussion with business and product managers; it answers the question: what is the smallest difference between the treatment and control groups that we care about? It captures the minimum effect that is worth acting on, considering the business cost of shipping the change (e.g., engineering time, business cycle, bad UX).
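A minimal sketch of the same calculation in Python with statsmodels; the baseline rate and MDE below are illustrative values only:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.10   # hypothetical current conversion rate
mde = 0.02             # smallest absolute lift worth detecting (10% -> 12%)
alpha, power = 0.05, 0.80

# Cohen's h effect size for the two proportions
effect_size = proportion_effectsize(baseline_rate + mde, baseline_rate)

n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=alpha, power=power, alternative="two-sided"
)
print(f"Required sample size per variant: {n_per_group:.0f}")
```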

Pitfall 1: Not understanding power analysis and ending the experiment prematurely is a very common mistake.

Data peeking is checking experiment results prematurely, before meeting the sample size requirement. In classical A/B testing, you cannot peek and rely on early results; you have to wait until the required sample size is collected.

Don’t end the experiment prematurely just because you have seen a positive uptick in the treatment group in the first few days (a likely false positive).
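A quick simulation sketch (traffic numbers made up) shows why: if you check the p-value after every batch of users and stop at the first "significant" result, the false positive rate climbs well above the nominal 5% even when there is no real effect:

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(1)
n_experiments, n_batches, batch_size, true_rate, alpha = 1000, 20, 500, 0.05, 0.05

early_stops = 0
for _ in range(n_experiments):
    conv_a = conv_b = n_a = n_b = 0
    for _ in range(n_batches):
        # No real effect exists: both variants share the same conversion rate
        conv_a += rng.binomial(batch_size, true_rate)
        conv_b += rng.binomial(batch_size, true_rate)
        n_a += batch_size
        n_b += batch_size
        _, p = proportions_ztest([conv_a, conv_b], [n_a, n_b])
        if p < alpha:          # peek and stop at the first "significant" result
            early_stops += 1
            break

# Far above the nominal 5%, even though there is no true effect
print(f"False positive rate with peeking: {early_stops / n_experiments:.2f}")
```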

I understand that data peeking has a business appeal: stopping the experiment early, before the full sample is collected, avoids the opportunity cost of delaying a winning variant (or of keeping a losing one running).

Solution: For early stopping and continuous monitoring, adopt sequential A/B testing instead of the classical fixed-sample setup.
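One classical sequential approach is Wald's SPRT. Below is a deliberately simplified, hypothetical one-sample sketch that monitors the treatment group's conversions against a known baseline rate p0; a production setup would typically use a two-sample or mixture sequential test instead:

```python
import math

def sprt_decision(conversions, p0=0.10, p1=0.12, alpha=0.05, beta=0.20):
    """Wald's SPRT on a stream of 0/1 conversions: test H0: p = p0 vs H1: p = p1.

    Returns "accept_h1", "accept_h0", or "continue" after scanning the stream,
    so the experiment can stop as soon as a boundary is crossed.
    """
    upper = math.log((1 - beta) / alpha)   # cross above -> accept H1
    lower = math.log(beta / (1 - alpha))   # cross below -> accept H0
    llr = 0.0
    for x in conversions:
        llr += x * math.log(p1 / p0) + (1 - x) * math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept_h1"
        if llr <= lower:
            return "accept_h0"
    return "continue"
```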

3. Randomization

Often this 50-50 split is done at the level of the entire user base, which can be the reason an A/B test fails. (This problem is very common; the majority of data scientists get it wrong, and a fellow data scientist pointed out the issue aptly.)


I will share my experience: I have seen folks do A/B testing with a purely random split where the user distribution is not stratified. Only 5% of the user base were heavy users, and a random 50-50 split of all users does not guarantee an equal split of heavy users between the control and treatment groups.

Pitfall 2: Heavy users (power users) are the major drivers of the platform. If this split goes wrong, a key principle of A/B testing fails: a simple random assignment does not guarantee that heavy users are distributed equally.

People assume that an equal random split over the whole population is the same as an equal random split within each group, which is not true.

For the pitfall shared above, I would solve this with stratified sampling within each user activity group: first split the heavy-user group 50-50, then the medium-activity group, then the less active users, as sketched below.
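A minimal pandas sketch of that idea, assuming a hypothetical users table with an activity_segment column (heavy / medium / light):

```python
import numpy as np
import pandas as pd

def stratified_split(users: pd.DataFrame, stratum_col: str = "activity_segment",
                     seed: int = 42) -> pd.DataFrame:
    """Assign a 50-50 control/treatment split independently within each stratum,
    so heavy, medium, and light users are balanced across variants."""
    rng = np.random.default_rng(seed)
    users = users.copy()
    users["variant"] = "control"
    for _, idx in users.groupby(stratum_col).groups.items():
        shuffled = rng.permutation(np.asarray(idx))
        treatment_idx = shuffled[: len(shuffled) // 2]
        users.loc[treatment_idx, "variant"] = "treatment"
    return users

# Example usage with a toy table
users = pd.DataFrame({
    "user_id": range(8),
    "activity_segment": ["heavy", "heavy", "medium", "medium",
                         "light", "light", "light", "light"],
})
print(stratified_split(users).groupby(["activity_segment", "variant"]).size())
```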

4. Contamination Issues

  • Ceteris Paribus
  • Spillover effect

Pitfall 3: Not taking ceteris paribus into account.

Scenario: Let us say we are running an A/B test on a webpage. In the control variant, the web page load time is 5 seconds, and in the treatment variant, the web page load time is 8 seconds.

This is not an equal comparison of the variants. If you want to measure the impact of the control vs. treatment web-page UI, you need to keep all other parameters equal, i.e., add a 3-second delay to the control variant's load time as well (in the case above).
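As a lightweight guardrail, you can also verify after the fact that load times (or any other factor that should be held constant) are balanced across the variants. A hypothetical sketch using Welch's t-test on logged per-request load times:

```python
import numpy as np
from scipy import stats

# Hypothetical per-request page load times (seconds) logged for each variant
load_control = np.array([5.1, 4.9, 5.2, 5.0, 4.8, 5.3])
load_treatment = np.array([8.2, 7.9, 8.1, 8.0, 8.3, 7.8])

# Welch's t-test: a tiny p-value flags that the variants differ on load time,
# so a metric difference may be driven by latency rather than the UI change
t_stat, p_value = stats.ttest_ind(load_control, load_treatment, equal_var=False)
print(f"Load-time balance check: t = {t_stat:.2f}, p = {p_value:.4f}")
```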

Pitfall 4: Cross-contamination between the treatment and control groups.

Cross-contamination happens when a visitor clicks on different content, has an unplanned experience, or picks up other impressions between arriving at your site and landing on the A or B test page, or when the same user ends up exposed to both variants.

Example: You are running an A/B test in the app with users split at the city level. If I travel from New Delhi to Gurgaon (Haryana), I get exposed to two different variants.
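One common mitigation, sketched below with a hypothetical experiment salt and split, is to bucket deterministically at the user level, for example by hashing the user ID together with an experiment salt, so the same user always lands in the same variant no matter which city they are in:

```python
import hashlib

def assign_variant(user_id: str, experiment_salt: str = "city_ui_test_v1",
                   treatment_share: float = 0.5) -> str:
    """Deterministic user-level bucketing: hashing (salt, user_id) maps every
    user to a stable bucket, so travel or device changes do not flip variants."""
    digest = hashlib.md5(f"{experiment_salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000 / 10_000  # uniform-ish value in [0, 1)
    return "treatment" if bucket < treatment_share else "control"

# The same user gets the same variant on every call
print(assign_variant("user_42"), assign_variant("user_42"))
```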

5. Post-Experiment Analysis

Pitfall 5: Wrong significance level set in multiple-variant testing; getting fooled by false positives in the multiple testing problem.

Multiple Testing Problem

- Set the significance level α to 5%
- P(false positive) = 5%: there is no real effect, but we falsely conclude there is one
- P(no false positive) = 1 - 5% = 95%
- For 3 comparisons, P(at least one false positive) = 1 - (95%)^3 = 14.3% ≈ 14%

For 3 groups, the probability of at least one false positive becomes ~14%; this is the multiple testing problem. To solve it, we can apply the Bonferroni correction, i.e., new significance level = α/n = 0.05/3 ≈ 0.0167.
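A short sketch of both the family-wise error rate calculation and the Bonferroni correction with statsmodels; the per-comparison p-values are hypothetical:

```python
from statsmodels.stats.multitest import multipletests

alpha, n_comparisons = 0.05, 3

# Probability of at least one false positive across 3 independent comparisons
fwer = 1 - (1 - alpha) ** n_comparisons
print(f"Family-wise error rate: {fwer:.3f}")   # ~0.143

# Bonferroni-adjusted decisions on hypothetical per-comparison p-values
p_values = [0.030, 0.200, 0.012]
reject, p_adjusted, _, _ = multipletests(p_values, alpha=alpha, method="bonferroni")
print(f"Reject H0 per comparison: {list(reject)}")
print(f"Adjusted p-values: {[round(p, 3) for p in p_adjusted]}")
```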

Simpson's Paradox

Pitfall 6: Segment-level and aggregated patterns look different, as often happens in ramp-up experiments where the traffic split varies across segments.
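A toy numeric sketch of the paradox (all counts are made up): the treatment wins within both the heavy-user and light-user segments, yet loses in aggregate because it was over-exposed to the low-converting light segment:

```python
import pandas as pd

# Hypothetical conversions / users per segment and variant
data = pd.DataFrame([
    ("heavy", "control",   200, 1000),
    ("heavy", "treatment",  50,  200),
    ("light", "control",    10,  200),
    ("light", "treatment",  80, 1000),
], columns=["segment", "variant", "conversions", "users"])

by_segment = data.groupby(["segment", "variant"]).sum(numeric_only=True)
by_segment["rate"] = by_segment["conversions"] / by_segment["users"]
print(by_segment["rate"])   # treatment wins in BOTH segments (25% vs 20%, 8% vs 5%)

overall = data.groupby("variant").sum(numeric_only=True)
overall["rate"] = overall["conversions"] / overall["users"]
print(overall["rate"])      # yet control wins overall (~17.5% vs ~10.8%)
```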

Novelty and Primacy Effect

Novelty Effect: Existing users want to try out the new functionality. Because the feature has just launched, users try it for the first few days and then stop using it.

Primacy Effect: The primacy effect is the tendency to remember the first piece of information we encounter better than information presented later on.

Solution: The best way of teasing out both effects is to restrict the analysis to new users who have not been exposed to the old version.

Pitfall 7: Failing to check for primacy and novelty effects, which biases the treatment effect.
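To make the cohort check above concrete, here is a hypothetical pandas sketch that compares the lift separately for new and existing users; the table schema and numbers are illustrative only:

```python
import pandas as pd

# Hypothetical per-user experiment results
results = pd.DataFrame({
    "variant":     ["control", "treatment"] * 4,
    "is_new_user": [True, True, True, True, False, False, False, False],
    "converted":   [0, 0, 1, 1, 0, 1, 0, 1],
})

# Compare conversion rates separately for new vs. existing users;
# a lift concentrated in existing users hints at a novelty effect
lift = (
    results.groupby(["is_new_user", "variant"])["converted"]
           .mean()
           .unstack("variant")
)
lift["abs_lift"] = lift["treatment"] - lift["control"]
print(lift)
```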

Conclusion

Learning from mistakes and understanding where things can go wrong is the best way of mastering a skill. To repeat and conclude: A/B testing is simple but not easy. It comes with many caveats and checks; there are many ways to do it wrong and only one right path, and ideal A/B testing is hard to achieve.

Recommended Posts: Learning about p-hacking

I hope you learned something new from this post. If you liked it, hit like and share it with others. Stay tuned for the next one!

Connect, follow, or endorse me on LinkedIn if you found this read useful. To learn more about me, visit: Here


Disclaimer: I don’t endorse any brand. Netflix, Uber, Pinterest, Google, LinkedIn, and Spotify names are used to make the Newsletter more relatable to the audience.

Pitfall Summary:

0. A/A testing ignored
1. Peeking: continuous monitoring of the experiment in a fixed-sample-size test
2. A random split on the entire user base is not truly random with respect to power users
3. Ceteris paribus issue: delay effect
4. Contamination / spillover effect between control and treatment
5. Multiple testing problem
6. Simpson's paradox: an A/B test on a subset population shows a different trend post full release
7. Novelty and primacy effects ignored
