A/B Testing Mastery Review
I got a "Growth Marketing Mini-degree" scholarship from CXL Institute. And I will be writing a weekly review of what I am learning along the journey.
Here's my review for week 4.
Last week, I talked about the "PXL test prioritization framework" for prioritizing A/B tests, the value of A/B testing, and how to know whether your website is qualified to run A/B tests.
This week I will talk more about A/B testing.
As I went through this part, I came across some statistical terms that didn't make sense to me. Although the course has a dedicated section on "statistics for A/B testing", it still assumes basic prerequisite knowledge of inferential statistics, which I knew nothing about. So I had to step back and go to statistics class!
I used this book as my guide to get a reasonable intro to inferential statistics, and I decided to share it as you may find it useful. (It's not an affiliate link.)
Now, let's go back to the course, starting with some experimentation-related statistics:
Statistical power vs. significance:
- Statistical power is the likelihood that an experiment will detect an effect when there is an effect to be detected.
- Power depends on sample size, effect size, and significance level.
- Significance indicates the degree of rarity required of an observed outcome in order to reject the null hypothesis (H0).
- As a rule of thumb, set the significance level high, at 90-95%, to be able to declare a real winner rather than a false positive, and set power at 80% or higher to be able to detect a winner if there is one to be detected.
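To make this concrete, here is a minimal pre-test power calculation using statsmodels. The baseline and target conversion rates are made-up numbers for illustration, not anything from the course:

```python
# A minimal sketch of a pre-test power calculation (assumed numbers).
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.10  # current conversion rate (assumption)
target = 0.12    # smallest uplift we want to detect (assumption)
effect_size = proportion_effectsize(target, baseline)  # Cohen's h

n_per_variant = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,            # significance level -> 95% significance
    power=0.80,            # >= 80% power, per the rule of thumb above
    alternative="two-sided",
)
print(f"Visitors needed per variant: {n_per_variant:.0f}")
```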
If we put "reality" on the x-axis and the "measured hypothesis" on the y-axis, we get 4 possible outcomes:
- You reject the null while it's false -> the right decision.
- You reject the null while it's true -> a false positive -> Type I error.
- You accept the null while it's true -> the right decision.
- You accept the null while it's false -> a false negative -> Type II error.
- False positives relate to the significance level, while false negatives relate to the power level.
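As a quick illustration (my own, not from the course), a small simulation shows both error types at work: with no real difference, false positives show up at roughly the significance level, and with a real but small difference, an underpowered test misses real winners:

```python
# Simulating Type I and Type II error rates (illustrative only).
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
alpha, n, runs = 0.05, 500, 2000

# A/A: no real difference, yet some tests come out "significant".
false_positives = sum(
    ttest_ind(rng.normal(0, 1, n), rng.normal(0, 1, n)).pvalue < alpha
    for _ in range(runs)
)
# A/B: a real (small) difference, yet some tests miss it.
misses = sum(
    ttest_ind(rng.normal(0, 1, n), rng.normal(0.1, 1, n)).pvalue >= alpha
    for _ in range(runs)
)
print(f"Type I rate:  {false_positives / runs:.3f}  (expected ~ {alpha})")
print(f"Type II rate: {misses / runs:.3f}  (1 - power at this sample size)")
```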
- The chance that a Type I error actively hurts your business is relatively low; on average, a false winner is a flat line with no effect. But as a CRO who wants to improve test efficiency, you should still reduce this type of error, because false positives dilute the share of your tests that are genuinely successful.
- So when an outcome is not significant, it does not mean your challenger is bad, not working, or making no impact; it just means you were not able to detect the impact with your A/B test.
- A Type II error, on the other hand, represents a big problem for your business: you miss winners that were there to be found.
Which KPI should you pick when running an A/B test?
- "KPI" is not the term used in A/B testing; the term is "goal metric", which is what you are optimizing for in your A/B test.
- You can pick a goal metric like clicks, behavior, transactions, revenue per user, or potential LTV.
- Clicks are easy to get an uplift on, but they do not mean much for your business's bottom line or growth.
- Behavior is a good thing to tweak in your A/B testing, especially if your website has too few transactions to run A/B tests on them directly.
- Transactions are what you should go for if you're serious about growing your business, and they can be many things: purchases, leads, or, if you are a publisher, something else entirely.
- More mature companies can go for revenue per user and LTV (customer lifetime value).
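Purely as an illustration, here is how you might compute these candidate goal metrics per variant from a raw event log. The column names and data are my assumptions, not anything prescribed by the course:

```python
# Computing candidate goal metrics per variant (illustrative data).
import pandas as pd

events = pd.DataFrame({
    "variant": ["A", "A", "A", "B", "B", "B"],
    "user_id": [1, 2, 3, 4, 5, 6],
    "clicked": [1, 0, 1, 1, 1, 0],
    "revenue": [0.0, 25.0, 0.0, 40.0, 0.0, 30.0],
})

summary = events.groupby("variant").agg(
    click_rate=("clicked", "mean"),            # easy uplift, weak signal
    transaction_rate=("revenue", lambda r: (r > 0).mean()),
    revenue_per_user=("revenue", "mean"),      # for more mature programs
)
print(summary)
```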
When choosing a goal metric, make sure it works for everyone in the company, as you may otherwise be optimizing for a goal metric that hurts other teams; think of optimizing for mortgages/loans while another team suffers from a lower number of savings accounts.
- To solve the above problem, you must come up with an overall evaluation criterion (OEC). Coming up with such an overall criterion is not easy, but it really drives long-term value for your business.
- OEC, OMTM (one metric that matters), and North Star metric all name roughly the same idea; OEC is the term used in A/B testing. Once you start talking to different departments, you will conclude that it should not be one metric but "a weighted sum of metrics". You end up with rules like: "if this important metric is positive, we implement the change no matter which other metrics are negative, unless one of them drops below a certain level." That's the way it works in practice.
- For mature companies an OEC is good, but if you're just starting out, stick to the OMTM.
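Here is a minimal sketch of what such a weighted-sum OEC with a guardrail might look like in code. The metric names, weights, and threshold are invented to mirror the mortgage/savings example above:

```python
# A weighted-sum OEC with a guardrail metric (all numbers are assumptions).
def oec(deltas: dict[str, float]) -> float:
    """Combine per-metric relative changes into one score."""
    weights = {"loans": 0.7, "savings_accounts": 0.3}
    return sum(weights[m] * deltas[m] for m in weights)

def ship_decision(deltas: dict[str, float]) -> bool:
    # Guardrail: never ship if a key metric drops below a certain level,
    # no matter how positive the overall score is.
    if deltas["savings_accounts"] < -0.05:
        return False
    return oec(deltas) > 0

# Challenger lifts loans 4% but costs 2% of savings accounts: still a ship.
print(ship_decision({"loans": 0.04, "savings_accounts": -0.02}))  # True
# A bigger loans lift that breaks the guardrail: no ship.
print(ship_decision({"loans": 0.10, "savings_accounts": -0.08}))  # False
```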
User research to get insights for your A/B tests:
In the hierarchy of evidence, which medicine uses to rank ways of developing treatments by trustworthiness, it goes like this, from most to least trustworthy:
- Randomized controlled trials
- Talking to people
- Expert opinions
- Although opinions sit at the bottom and trials at the top, we need to dive into the middle stage, user research, before running randomized trials in order to improve our success rate. Expert opinions tend to be biased, so they are not a reliable starting point for an A/B test.
- In general, without an analyzed reason to start a test, it does not make sense to start one. Don't just come up with a test idea and go waste your time.
Customer/user behavior study goals are:
- Insights into the most important customer journeys.
- Understanding of basic user behavior.
- Input for setting a hypothesis.
The general scientific method is: make observations, think of interesting questions, formulate hypotheses, develop testable predictions, gather data to test the predictions, and develop general theories. For A/B testing, we can translate it into the Fact & Act model.
The Fact & Act model is: (Find, Analyze, Create, Test, Analyze, Combine, Tell)
- Find a problem.
- Analyze its cause.
- Create a hypothesis.
- Test it.
- Analyze your results.
- Combine different accumulated results.
- Tell/present your findings.
- (Find, Analyze) covers the customer journey.
- (Create, Test, Analyze) is about developing and testing the hypothesis.
- (Combine, Tell) is all about presenting the findings.
- If we take time into account: studying the customer journey takes a long time, developing a hypothesis takes less, and the test itself is much faster (but you need several tests to prove a hypothesis).
6V research model:
- In this section, Ton Wesseling uses the "6V research model" to perform user research, generate user-behavior insights, and come up with proper hypotheses for A/B tests.
6V stands for (View, Voice, Versus, Validated, Verified, Value).
Value: understand what represents value to the company: its mission, strategy, and short- and long-term goals. Know the product focus and the KPI focus.
Versus: this is competitive analysis; who are your competitors, and are there best practices in your market you can use?
- You can do this research by searching for organic or PPC competitors and by using tools like Alexa.
- Once you know your competitors, visit their websites, try their products, and go through the whole customer journey. Then track changes on their websites.
- This step is really important because the customer journey does not happen only on your website; users go from Google to a competitor's site to your website. So if your competitors make changes, add or delete content, etc., you will see an effect on your own site.
- You can use tools like "visualping", "wachete", "changetower", or "pagescreen" to track changes to your website or competitors' websites.
- You can also check what they use for A/B testing on their website with BuiltWith, then grab a Chrome plugin or app to see the experiments they are currently running. Usually, if they are testing something, it's probably a winner, especially if they have experimentation, optimization, and data science people on their team.
View: here we get insights from web analytics and web behavior data.
- Where do visitors start on the website? What are the differences between new and existing customers, and between devices?
- Where do they come from? Do they have a product in mind? Do they know the brand?
- Is there a specific flow for those visitors? What does the customer journey look like? What are the CTR, return rate, exit rate, and time on page for each step? For example, you may find a drop-off from the PDP (product detail page) to the personal-info pages.
- Are there noticeable differences between segments or products?
- What is the behavior on the most important pages?
- Once you dive deep into your analytics and gather this information about what users are doing on your website, you start building a flow/funnel.
- From that flow/funnel you want to know:
- All users on your website with enough time to take action.
- All users with at least some interaction on your website.
- All users on your website with heavy interactions.
- All users on your website with clear intent to buy.
- All users on your website who are willing to buy.
- All users on your website who succeed in buying.
- All users on your website who return with the intent to buy more.
- You then need to deploy that flow in your analytics and know how many users you have in each bucket, how they drop down to the next one, and how much time it takes to move from one to the other (see the sketch at the end of this section).
- You would also use heat maps, scroll tracking, and screen recordings.
- After checking all that, you need to report on:
- Users per segment.
- Conversion and the time it takes to move from one segment to the next.
- The important pages: the pages where decisions are made.
- The detailed behavior on those pages.
- This step is a long one, a real deep dive the first time you conduct user research, but it becomes easier after that.
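Here is a small, self-contained sketch of reporting on such a flow/funnel. The stage names follow the list above, while the counts are invented; in practice they come from your analytics:

```python
# Funnel drop-off report (stage counts are invented for illustration).
funnel = [
    ("time to take action", 100_000),
    ("some interaction", 62_000),
    ("heavy interaction", 30_000),
    ("clear intent to buy", 9_000),
    ("willing to buy", 4_500),
    ("succeeded in buying", 2_700),
]

# Pair each stage with the next one to get the step-by-step conversion.
for (stage, users), (_, next_users) in zip(funnel, funnel[1:]):
    print(f"{stage:>22}: {users:>7,}  ->  {next_users / users:6.1%} continue")
```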
Voice: talk to customer service, watch social media, conduct user interviews, focus groups, and user research, and don't ask users "why".
Verified: it's what we know from the scientific literature about decision-making in general and about the type of product sold in particular.
- There are scientific studies on every niche, product, industry, and service out there, so go look for them on Google, Google Scholar, Semantic Scholar, or DeepDyve.
- On Google Scholar, look for studies with at least 20-30 citations to treat them as reliable.
- You may find such papers hard to read, as they are written in scientific language, but the insights you get are very valuable.
Validated: what insights are validated in previous experiments or analyses.
In the end, you will have all your information recorded in a "customer behavior study report"
Ton Wesseling then previews three important theories on user behavior as examples worth mentioning:
- For a trigger/CTA to succeed (to get clicked), the motivation of the user needs to be high enough and the ability of the user needs to be high enough (BJ Fogg's behavior model).
? "system1/system2 model"; "system1" representing the emotional side of the human mind is easily involved in decision making, whether the bigger the decision the more likely that the "system 2" or rational part will start working ("Thinking fast & slow" book)
- Belongingness & conformity theory: we have an innate need to form and maintain strong, stable interpersonal relationships. More than we are often consciously aware, we want to be part of a peer group, community, and society.
Hypothesis setting:
Why you need a hypothesis:
- To get everyone aligned.
- To save time on discussions during and after the experiment.
- Warning: don't write your research plan after the outcome is known or after the A/B test is over. That's not a proper way to run an A/B test if you care about efficiency.
- A concrete hypothesis formula is: "If (I apply this), then (this behavioral change) will happen, among (this group), because of (this reason)."
- You describe a problem, propose a solution, and predict an outcome.
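A trivial way to keep every test on this structure is a shared template; the filled-in example below is invented, not from the course:

```python
# Filling in the hypothesis formula from above (example content is invented).
HYPOTHESIS = (
    "If {change}, then {effect} will happen, "
    "among {group}, because of {reason}."
)

print(HYPOTHESIS.format(
    change="we show delivery costs on the product page",
    effect="an increase in checkout completion",
    group="new mobile visitors",
    reason="unexpected costs at checkout are a known drop-off reason",
))
```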
To be continued the next week. Stay tuned!