Putting A/B Tests into Practice

Background

The developers of the mobile app Cookie Cats wanted to see how its user base responded to changes in the game. Specifically, they wanted to see the impact of changing the location of the first gate in the game from level 30 to level 40. To accomplish this, the developers randomly assigned players to one of two versions after players installed the app. After completing their A/B test, the developers published the data on Kaggle. For this project, I wanted to do my own A/B tests on the Kaggle data. Before doing so, I think it's important to explain the conditions that make it possible to do a proper A/B test:

  • Sample size: It's necessary to gather enough data points for the A/B test to detect statistically significant differences. The Cookie Cats developers gathered data from over 90,000 users, which is enough to tease out even small differences.
  • Randomization: Players should be equally likely to be randomly sorted into each version of the app.
  • Holding other factors equal: The only difference between the two app versions should be whether the first gate is on level 30 or 40.
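
To make the sample-size and randomization points concrete, a quick first step is to count how many players landed in each version. The sketch below is only illustrative: the file name cookie_cats.csv and the column name version are my assumptions about the Kaggle export, not details stated in this write-up.

    import pandas as pd

    # Load the Kaggle export (file and column names are assumptions,
    # not details confirmed in this write-up).
    df = pd.read_csv("cookie_cats.csv")

    # With proper randomization, the two versions should end up with
    # roughly the same number of players.
    print(df["version"].value_counts())
    print(f"Total players: {len(df):,}")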

Understanding the Data

Before doing any statistical tests, I wanted to see what the data looks like. The developers were considering three key performance indicators (KPIs):

  • Game rounds: the number of game rounds each user played in the first 14 days of using the app.
  • One-day retention: whether a user kept using the app one day after installation.
  • Seven-day retention: whether a user kept using the app seven days after installation.

I have three bar graphs below comparing the KPIs for each app version. For game rounds, I took an average for each version. For each retention KPI, I calculated the retention rate: the proportion of players who remained with the app after one and seven days respectively.
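
For readers who want to reproduce those summaries, a minimal pandas sketch follows. The column names (version, sum_gamerounds, retention_1, retention_7) are my assumptions about how the Kaggle file is laid out, not names quoted from it.

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("cookie_cats.csv")  # assumed file name

    # Average game rounds per version, plus retention rates (the mean of a
    # True/False column is the proportion of retained players).
    summary = df.groupby("version").agg(
        avg_game_rounds=("sum_gamerounds", "mean"),
        one_day_retention=("retention_1", "mean"),
        seven_day_retention=("retention_7", "mean"),
    )
    print(summary)

    # One bar chart per KPI, mirroring the three graphs described above.
    summary.plot(kind="bar", subplots=True, layout=(1, 3), figsize=(12, 4), legend=False)
    plt.tight_layout()
    plt.show()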


How A/B Tests Can Help

From the three graphs above, it looks like each version performs similarly across all three KPIs, with the level 30 version performing slightly better. But it can be hard to tell just by looking at the graphs whether these small differences are important or not. Maybe the level 30 version really is better, or maybe the differences are so small that they are just due to random chance. An A/B test can help address those concerns. But it's important to keep in mind what A/B tests can tell us, versus what they can't tell us.

Let's focus first on average game rounds. An A/B test answers a very specific question: assuming that the average number of game rounds is the same across both versions, how likely are we to observe the data we got, or data even more contradictory to that assumption? The assumption that there is no difference between the two versions is the null hypothesis. From the first graph, we see that there is a small difference. The A/B test quantifies how surprising a difference at least that large would be if the null hypothesis were true; that probability is the p-value. Prior to conducting the A/B test, we should decide on an alpha value. If the p-value is less than the alpha value, we reject the null hypothesis. The alpha value is commonly set at 0.05, but it doesn't have to be. For this project, I went with an alpha of 0.05.
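
As a tiny illustration of that decision rule (not code from the actual analysis), the comparison at the end of every test below boils down to:

    ALPHA = 0.05  # decided before running any test

    def decide(p_value: float, alpha: float = ALPHA) -> str:
        # Reject the null hypothesis only when the p-value falls below alpha.
        return "reject the null" if p_value < alpha else "fail to reject the null"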

An A/B test cannot tell us for certain whether the null hypothesis is true or false; it only gives a probability. An A/B test also cannot explain why any differences occur. If the level 30 version really does perform better, the A/B test does not show why.

Conducting the A/B Tests

In order to A/B test the game rounds KPI, I used the standard t-test from Python's scipy library. Depending on what kinds of assumptions we make about the data, we might want to use a different kind of A/B test. For instance, the test I used assumes that the two versions produce independent results. If the two versions' results somehow influenced each other, I would have to use a different kind of A/B test.
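
A minimal sketch of how that test could be run with scipy, assuming the column and version labels I guessed at above (sum_gamerounds, with gate_30 and gate_40 as the version values):

    import pandas as pd
    from scipy import stats

    df = pd.read_csv("cookie_cats.csv")  # assumed file name

    # Split the game-rounds column by app version (labels are assumptions).
    rounds_30 = df.loc[df["version"] == "gate_30", "sum_gamerounds"]
    rounds_40 = df.loc[df["version"] == "gate_40", "sum_gamerounds"]

    # Independent two-sample t-test comparing the two group means.
    t_stat, p_value = stats.ttest_ind(rounds_30, rounds_40)
    print(f"mean difference = {rounds_30.mean() - rounds_40.mean():.2f} rounds")
    print(f"t = {t_stat:.3f}, p = {p_value:.3f}")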

After performing the A/B test, I found that on average, players using the level 30 version played about 1 more round than players using the level 40 version. I came up with a p-value of about 0.37, which is well above the alpha of 0.05. This is not a statistically significant result, so we fail to reject the null hypothesis. We should also consider the effect size. In our sample data, the difference between the two versions is again only about one round, which seems like an exceedingly small difference. If I were actually working with the developers, I would want to ask what kind of effect size they would consider relevant. Maybe this one-round difference is important to the game's design, but that does not seem likely considering that players overall averaged about 50 rounds.

I also ran A/B tests for the one-day and seven-day retention rates. However, these A/B tests work very differently. When I tested game rounds, I compared the average number of game rounds for each version. For the retention rates, I am not comparing averages but rather the proportions of people who continued using the app, which requires a different kind of A/B test. This time, I used the proportions z-test from Python's statsmodels library.
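
A sketch of the retention test with statsmodels, again assuming the retention_1 and retention_7 column names; swapping retention_7 in for retention_1 gives the seven-day version of the test.

    import pandas as pd
    from statsmodels.stats.proportion import proportions_ztest

    df = pd.read_csv("cookie_cats.csv")  # assumed file name

    # Count retained players and total players in each version; summing the
    # True/False retention column gives the number of retained players.
    retained = df.groupby("version")["retention_1"].sum()
    totals = df.groupby("version")["retention_1"].count()

    # Two-sample z-test comparing the two retention proportions.
    z_stat, p_value = proportions_ztest(count=retained.values, nobs=totals.values)
    print(f"z = {z_stat:.3f}, p = {p_value:.3f}")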

Starting with the one-day retention rate, I found that the level 30 version performed better by about 0.6 percentage points, with a p-value of 0.07. We should be very careful with this result. Before doing the A/B test, I decided on an alpha of 0.05. With that alpha value, this result is not statistically significant, but it is close. I could cheat by resetting alpha to 0.1 after seeing the results. This new alpha would make the result "statistically significant" since 0.07 is less than 0.1, but that is a very dangerous idea. If I cheated here, nothing about the underlying data or experiment would change; I would just make the experiment look more successful than it actually was by arbitrarily changing the parameters to get the result I want. This is called p-hacking, and it should be avoided. It is important to determine the alpha value before conducting the experiment in order to avoid these arbitrary post-hoc changes. Is the 0.07 p-value statistically significant or not? That depends on what alpha I decided on at the beginning, which admittedly isn't a satisfying conclusion. However, we should look at the bigger picture. We once again have a very small effect size of only 0.6 percentage points. I would want to consult the developers to make sure, but a 0.6 percentage point difference just doesn't seem important, even if the result is "statistically significant" in terms of the p-value and alpha.

Conducting a similar A/B test for the seven-day retention rate gives a similar result. The level 30 version performs just slightly better than the level 40 version, this time with an effect size of 0.8 percentage points and a p-value of 0.002. This time the p-value is well below the alpha, so we can reject the null hypothesis. The level 30 version likely does have a higher seven-day retention rate. However, what I said before about the small effect size still applies. Regardless of the p-value, these two versions perform very similarly to each other.

Conclusion

I performed an A/B test for each of the three KPIs, and the same pattern appeared each time: the level 30 version performed just slightly better. Only the seven-day retention rate showed a statistically significant difference, and I do not think any of the differences are large enough to affect the game. The question at the start was whether the first gate should be moved from level 30 to level 40, and I think the answer is clearly no. The differences between the two versions are small, but there is modest evidence that moving the gate hurts the seven-day retention rate. Even if we disregard that evidence because of the small effect size, the two versions are so similar overall that there is no clear reason to change the gate location.

Thanks for reading. I have linked below the original dataset and the Python code I used to conduct the A/B tests.

Original Dataset: https://www.kaggle.com/datasets/mursideyarkin/mobile-games-ab-testing-cookie-cats

My Python Code: https://github.com/Zach-Nabavian/A-B-Testing-Mobile-App




