Running A/B tests on conversion rates

All businesses spend a lot of time and effort thinking about how to improve their top-line revenue and bottom-line profits. E-commerce retailers try to accomplish these goals by improving traffic, increasing the average basket size, or raising conversion rates. There are lots of changes that sites can make to improve the experience for their customers and, hopefully, their business's performance. They could improve the quality of content on product display pages, or change the organization of product listing pages or the overall product hierarchy to make it easier for customers to find products. They could work on improving their site's search engine optimization. Or they could focus on improving conversion rates, perhaps by changing the color and location of the checkout button, making the basket easier for customers to view, or reducing the number of steps required before checkout.

Regardless of the changes made to your site, it can be difficult to know what the impact of those changes is, and whether any differences you have seen are real and sustainable improvements rather than just noise and random fluctuation. Evaluating these changes using an A/B testing framework helps ensure that your company is making good business decisions. Today we are going to discuss using A/B testing to measure and improve conversion rates, but before we get there we first need to discuss what A/B testing is and how it works.

What is A/B testing?

A/B testing is part of a statistical, systematic approach to making better business decisions using data and experiments. The basic reason for using A/B testing is to ensure that any increases you may have noticed in your conversion rates are due to the changes you have made and not due to random chance. This is important because changes in conversion rate due to random chance will not result in reliably higher long-term conversion rates, any more than finding a $20 bill on the sidewalk today is going to increase your take-home pay by $100 per week. It was just good luck, and you can't count on it happening again tomorrow. Ensuring that the decisions you make today will project into the future is key to making good business decisions, and a rigorous A/B testing approach will help you accomplish that.

To perform an A/B test, your website traffic should be split into two groups: the first, called the "control group", continues to receive the current web experience, while the second, the "test group", receives a modified one. The proper way to compare the results of your A/B test is to use a statistical test designed for comparing proportions. Specifically, you should consider the chi-square test, Fisher's exact test (which has a really interesting story behind its name), or Barnard's test. In practice the subtle differences between these three tests are not important - just pick one and stick with it. I recommend using Fisher's exact test in R because of how useful the numbers it reports are.
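The split itself should be stable: the same visitor should see the same experience on every visit. A simple way to do that is to hash a visitor identifier. Here is a minimal Python sketch (the function name and the experiment label are illustrative assumptions, not part of any particular tool):

```python
import hashlib

def assign_group(visitor_id: str, experiment: str = "checkout-button") -> str:
    """Deterministically assign a visitor to 'control' or 'test'.

    Hashing the visitor id together with an experiment label gives a
    stable 50/50 split: the same visitor always lands in the same group,
    and different experiments split the traffic independently.
    """
    digest = hashlib.sha256(f"{experiment}:{visitor_id}".encode()).hexdigest()
    return "test" if int(digest, 16) % 2 else "control"
```

Because the assignment depends only on the hash, you never need to store which group a visitor belongs to - you can recompute it at any time.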

Fisher's exact test

Fisher's exact test (like the other two tests) compares the number of successes (in this case, purchases) to the number of trials (which could be the number of visits to your web page, the number of visits to the cart page, or the number of visits to the start of the purchasing funnel), and then compares the ratio of these numbers between your two conditions, A and B. In other words, it calculates the conversion rates for the two groups and compares them against each other. After you have done the test calculations (I recommend using R), you will be left with two important numbers: the odds ratio and the p-value.
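To make the mechanics concrete, here is a minimal Python sketch of a two-sided Fisher's exact test for a 2x2 table, using only the standard library. In practice you would call `fisher.test` in R or `scipy.stats.fisher_exact`; this sketch just shows what those functions are computing:

```python
from math import comb

def fisher_exact(a, b, c, d):
    """Two-sided Fisher's exact test on a 2x2 table:

                     converted   did not convert
        control          a              b
        test             c              d

    Returns the sample odds ratio and the exact p-value.
    """
    n = a + b + c + d
    row1, col1 = a + b, a + c  # the margins are held fixed

    def table_prob(x):
        # Hypergeometric probability of seeing x conversions in the
        # control row, given the fixed row and column totals
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = table_prob(a)
    lo, hi = max(0, row1 - (n - col1)), min(row1, col1)
    # Sum the probabilities of every table at least as "extreme"
    # (as improbable) as the one actually observed
    p_value = sum(table_prob(x) for x in range(lo, hi + 1)
                  if table_prob(x) <= p_obs * (1 + 1e-9))
    odds_ratio = (a * d) / (b * c) if b and c else float("inf")
    return odds_ratio, p_value
```

Running it on the classic "lady tasting tea" table, `fisher_exact(3, 1, 1, 3)`, gives an odds ratio of 9 and a p-value of about 0.49 - a huge apparent effect that is entirely compatible with chance, because there are only eight observations.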

Odds ratio

The odds ratio tells you the factor that relates one group's conversion to the other's. Strictly speaking it is a ratio of odds rather than of rates, but at the low conversion rates typical of e-commerce the two are nearly identical. So if the odds ratio was 2, that would say one group's conversion rate was roughly double the other's. And who wouldn’t want that?

P-value

The second important number is the p-value. This is the probability that an odds ratio as large as the one you observed could have happened by chance (there is a little more nuance to it than that, but those explanations would just muddy the water, so I'm going to ignore them here). If your p-value was 0.35, that would suggest that 35% of the time we would expect to see an odds ratio at least that large just due to chance. In other words, the difference between the two groups is probably not meaningful, and you should be very careful about making any changes to your site based on this information. It’s probably not meaningfully different!

On the other hand, if you had an odds ratio of 1.1 (roughly a ten percent increase in conversion rate) with a p-value less than 5% (i.e. p < 0.05), that would indicate that only 5% of the time (or less) we would expect to see an increase that large in your conversion rate due to random chance.

Interpreting your P-value

You see, the p-value is determined not only by the magnitude of the difference between the two groups but also by the number of visitors to each experience. If your website's standard experience had 10,000 visitors and a conversion rate of 1%, and your newly modified experience had 20 visitors and 2 purchases, the odds ratio would look outstanding! But it would also be close to meaningless, because it’s based on so few observations.
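You can see the sample-size effect numerically with a quick sketch. The example below compares the same 1.0% vs 1.3% conversion-rate difference at two traffic volumes, using a two-proportion z-test (a normal approximation that behaves much like the exact tests at these sample sizes); the traffic numbers are made up for illustration:

```python
from math import erf, sqrt

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a difference in conversion rates,
    using the pooled two-proportion z-test (normal approximation)."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = abs(rate_a - rate_b) / se
    # Two-sided tail probability of a standard normal
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

# The same 1.0% vs 1.3% difference at two very different traffic volumes
p_small = two_proportion_p_value(10, 1_000, 13, 1_000)        # not significant
p_large = two_proportion_p_value(1_000, 100_000, 1_300, 100_000)  # highly significant
```

The identical difference in conversion rates is nowhere near significant with 1,000 visitors per group, but overwhelmingly significant with 100,000 per group - the p-value reflects the evidence, not just the size of the effect.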

The 'standard' approach is to say that when the p-value is less than 5%, we should consider the test to be statistically significant. In other words, the measured differences probably did not occur due to chance. But this value of 5% is rather arbitrary: it stuck because Fisher suggested 5% was a good threshold, and he didn't have any particular reason for choosing it. You have to decide for yourself when you want to consider a change to be meaningful, but 5% is a reasonable starting point.

In some ways it's a trade-off between moving fast and moving carefully. A smaller p-value gives more confidence that the observed differences are real, but reaching it means taking more time to gather more data. Many decisions in business are reversible, so you should consider how easy it is to reverse a decision when figuring out what p-value threshold you want to use.

Similarly, you want to be sure that you have collected data over a long enough period of time for it to generalize to the future. If you ran your A/B test during the Black Friday sales (don’t do this!) and saw that the conversion rate in your test group was 15% higher, that difference might not scale to the rest of the year. Buying behaviour during that part of the year is so different that it can be risky to generalize from it too much. In that case, you could still get a statistically significant p-value that wouldn’t necessarily project well into the future.

Binomial test

Others have suggested that the appropriate statistical method for comparing the results of an A/B test is the binomial test, but this is bad advice. In fact, seeing this advice given repeatedly on LinkedIn is what prompted this article in the first place. Don’t do it!

A binomial test allows you to compare whether an observed series of binary outcomes differs from a known probability. People who advocate using a binomial test to study conversion rates take the conversion rate calculated from the control group and then ask whether the data from the test group differ from it. In that sense it's similar to using a contingency-table test like Fisher's exact test, but it's incorrect. Fisher's exact test explicitly accounts for the fact that the measured conversion rates for each of the groups in your test are just estimates, not fixed values, and it accomplishes this by including the total number of trials (or visits) for each of the two groups. This allows the statistical test to account for the 'certainty' of each of those two estimates.

The appropriate place to use a binomial test is when you are comparing your observed data to a known quantity: for example, the probability of getting heads from a fair coin. We know that is 50%: a fair coin toss should give you heads 50% of the time and tails 50% of the time over a long enough run. If you have run an experiment counting heads and tails for a coin and you want to know whether the coin is 'fair', then a binomial test is the most appropriate tool.
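As a concrete example of that legitimate use, here is a minimal exact binomial test in Python (standard library only) that checks whether an observed number of heads is consistent with a fair coin. This is a sketch of the standard two-sided test; in practice you could call `binom.test` in R or `scipy.stats.binomtest`:

```python
from math import comb

def binomial_test_two_sided(k, n, p0=0.5):
    """Exact two-sided binomial test: probability of an outcome as
    extreme as k successes in n trials when the true rate is p0."""
    def pmf(x):
        # Binomial probability of exactly x successes
        return comb(n, x) * p0**x * (1 - p0) ** (n - x)

    p_obs = pmf(k)
    # Sum the probabilities of every outcome at least as unlikely as k
    return sum(pmf(x) for x in range(n + 1) if pmf(x) <= p_obs * (1 + 1e-9))

# 60 heads in 100 tosses of a supposedly fair coin
p = binomial_test_two_sided(60, 100)  # ~0.057: suspicious, but not conclusive
```

Note what makes this appropriate here and not for A/B tests: the 50% reference probability is genuinely known, whereas a control group's conversion rate is only ever an estimate.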

There are other ways to calculate whether or not conversion rates differ between two groups, for example by comparing the conversion rates of the two groups directly against each other using a two-sample t-test. In fact, some very expensive and very well-known web analytics packages do exactly this (check your documentation if you want to be sure; the details are in there).

In closing

There is a lot more to understanding how to properly set up and run an A/B test for e-commerce; this article touches on just a single element. I'll write more on this and related topics soon!

Please let me know if you found value in this and whether you'd like to see more. I'm open to suggestions for topics you'd like covered. Other topics I am thinking about include comparing average basket sizes, and dynamic customer segmentation versus static rules-based customer segments.

Oh, and PS: I write most of these posts from my iPhone while I am putting my kids to bed, usually trying to get the youngest one to fall asleep on my shoulder. I do my best to re-read and edit these posts after the fact, so please forgive typos, spelling mistakes and grammatical weirdness. Point them out to me and you will gain my gratitude, and I will do my best to address them quickly. :)

Andrew Donaher

Vice President & National AI, Data & Analytics Leader at CGI | AI, Digital, Sustainability | Executive Leadership | Board Member & Advisor

6 years ago

Brad, this is a great article - brilliant and accessible explanations! Super helpful to anyone getting going in analytics. Please do write those articles you mentioned in your closing paragraph; I am sure many of us would love to hear your thoughts on them. And maybe one on recommendation engines based on images and visual preferences, not just transactions... you can write that once the little one is asleep!
