All strata are useful, but some strata are more useful than others
Jayakrishnan Vijayaraghavan
I've mastered half a dozen hard-learned ways of what works in Data Science (and a thousand ways of what doesn’t). Happy to share, learn, and evolve!
Written in collaboration with Athulya Ganapathi Kandy
NOTE: This is the first part of a series of posts discussing variance reduction techniques in A/B testing
Introduction
The biggest adversary an analytics professional encounters in their experimentation journey is (statistical) insignificance, the antithesis of what A/B tests promised us: a scientific, "gold standard" way of determining causality. Yet this is what many of us in the online experimentation world (I'm not qualified to speak about the offline world yet) deal with day in and day out: critical "primary" metrics that seem immovable for a feature that every market analysis, opportunity sizing, and simulation promised would be a runaway hit. When asked why, we analytics professionals love to reach for the phrase "the test is probably underpowered", which is really a fancy way of saying, "Yo, this metric is all over the place; we really don't know whether we (our feature) moved it or something else did".
However, what if there were a way to break down the overall variance (the spread in the metric) into the variance attributed to the feature and the variance attributed to something else? What if, in the process of decomposing the variance into multiple components, we could eliminate one of those components? Alex Deng et al. provided a path-breaking solution to this very conundrum in their 2013 seminal paper, "Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data". The paper introduces the idea of using pre-experiment data to reduce variance, popularly known nowadays as CUPED. On their path to introducing CUPED, they touch upon an underrated concept known as stratification, which reads like a breadcrumb on the trail to that path-breaking idea. In this blog post, we are going to discuss why that breadcrumb was left there in the first place and why understanding stratification is as important as understanding the core concept of CUPED itself.
Harnessing Strata for variance reduction
What's a strata
Any analytics professional dealing with a consequential analysis will have done segmentation analysis, where they analyze key segments in order to construct a comprehensive narrative around the overarching data patterns. How they arrived at those segments is mostly a matter of their subject matter expertise (SME) or institutional memory (which again is a consequence of collective SME). It's these segments that the paper refers to as strata. The paper postulates that "strategically" selected strata can help in our endeavor to reduce variance. We are emphasizing the word "strategic" because it's an oft-overlooked assumption when trying to understand how stratification reduces variance.
The intuition behind stratification
Imagine we want to understand the impact of a supplemental diet provided to randomly selected students in a school, using height as our metric. We observe two things: the students who received the supplement appear taller on average than those who didn't, and yet the difference is not statistically significant.
We know that this is because the overall variance of the heights of the students is too large. Now, assume that the school has just two grades: 6th and 12th (that would be strange, but stay with me). Intuitively, we know that students within each grade are more likely to be of similar ages and developmental stages, which makes for similar heights and thus a smaller variance in heights within each grade.
If our treatment has impacted both groups positively, we would be able to converge to significance much sooner, since the within-group variance is much smaller, almost bypassing the need to consider the between-group variance. It is this intuition that the paper leverages to decompose the overall variance into two components:
Overall variance = Between-strata variance + Within-strata variance
More formally, by the law of total variance,
Var(Y) = Var( E[Y | Z] ) + E[ Var(Y | Z) ]
where Z denotes the stratum a unit belongs to: the first term is the between-strata variance and the second is the within-strata variance.
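To see this decomposition in action, here is a quick simulation of the two-grade school. The heights, sample sizes, and the population-variance convention (ddof=0) are illustrative assumptions, and it assumes numpy is available:

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up heights (cm): 6th graders are shorter on average than 12th graders
heights_6th = rng.normal(loc=150, scale=6, size=500)
heights_12th = rng.normal(loc=175, scale=6, size=500)
heights = np.concatenate([heights_6th, heights_12th])

strata = [heights_6th, heights_12th]
w = np.array([len(s) for s in strata]) / len(heights)   # stratum weights n_k / n
means = np.array([s.mean() for s in strata])             # per-stratum means
variances = np.array([s.var() for s in strata])          # per-stratum variances (ddof=0)

within = (w * variances).sum()                            # E[Var(Y | grade)]
between = (w * (means - heights.mean()) ** 2).sum()       # Var(E[Y | grade])

print(f"Overall variance         : {heights.var():.1f}")
print(f"Within-strata component  : {within:.1f}")
print(f"Between-strata component : {between:.1f}")
print(f"Within + Between         : {within + between:.1f}")  # equals the overall variance
```

Because the two grade means are far apart, the between-strata term dominates here, and that is exactly the part stratification lets us discard.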
Stratified mean and variance
Now that we know we can eliminate the between-strata variance using stratification, we still need to compute the Average Treatment Effect (in our case, the overall impact of the diet on heights). Generally, to compute the Average Treatment Effect we need the difference in means between treatment and control, i.e., Δ (= μ_treatment − μ_control), as well as the standard deviations of the treatment and control groups (σ_treatment and σ_control). When stratification is deployed, things change just a bit. Instead of computing the plain averages of treatment (μ_treatment) and control (μ_control), we compute the stratified mean of treatment (μ_treatment_stratified) and of control (μ_control_stratified). And instead of computing the overall standard deviations of treatment and control, we calculate the stratified variances of the treatment and control groups (σ²_stratified_treatment and σ²_stratified_control).
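Concretely, if stratum k holds a fraction w_k = n_k / n of the units, the stratified mean is the weight-averaged stratum mean, and the stratified variance keeps only the within-strata component. The notation below follows the spirit of Deng et al.'s formulation; treat it as a sketch:

```latex
\bar{Y}_{\text{strat}} = \sum_{k} w_k \, \bar{Y}_k
\qquad\qquad
\sigma^2_{\text{strat}} = \sum_{k} w_k \, \sigma^2_k
```

where \bar{Y}_k and \sigma^2_k are the mean and variance of the metric within stratum k.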
How do we do that? Well, I'm glad you asked. Here's the Python implementation for the same.
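A minimal sketch of such a function, assuming a pandas DataFrame (the function name, column handling, and printed output are illustrative):

```python
import pandas as pd


def stratified_mean_and_variance(df: pd.DataFrame, strata_col: str, target_col: str):
    """Compare the unstratified mean/variance of `target_col` with the
    stratified versions obtained by grouping on `strata_col`."""
    n = len(df)

    # Unstratified (overall) statistics
    overall_mean = df[target_col].mean()
    overall_var = df[target_col].var()

    # Per-stratum statistics: size, weight w_k = n_k / n, mean, and within-stratum variance
    per_stratum = df.groupby(strata_col)[target_col].agg(["count", "mean", "var"])
    per_stratum["weight"] = per_stratum["count"] / n
    print(per_stratum)

    # Stratified mean: weighted average of the stratum means
    stratified_mean = (per_stratum["weight"] * per_stratum["mean"]).sum()

    # Stratified variance: weighted average of the within-stratum variances
    # (the between-strata component has been eliminated)
    stratified_var = (per_stratum["weight"] * per_stratum["var"]).sum()

    print(f"Unstratified mean: {overall_mean:,.2f}  variance: {overall_var:,.2f}")
    print(f"Stratified   mean: {stratified_mean:,.2f}  variance: {stratified_var:,.2f}")
    print(f"Variance reduction: {1 - stratified_var / overall_var:.1%}")

    return stratified_mean, stratified_var
```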
To use this function, you would pass your DataFrame, the column representing the strata, and the column representing the target variable. The function calculates and displays the stratified mean and variance for each stratum, and compares the overall stratified statistics with their unstratified counterparts.
Note: The code assumes that the necessary libraries (such as pandas) are installed in your environment.
Is it time to say Hakuna Matata?
Not yet. When I ran the code for randomly (or rather, carelessly) chosen strata, my stratified variance didn't decrease by much. I'm using a sample sales dataset from Kaggle to illustrate this.
In one case, the stratified variance actually increased.
Why did this happen? I could start blaming my lack of domain knowledge about this particular dataset, or the fact that the within-group variance is sometimes larger than the between-group variance for the chosen strata (= "Country"). But the actual problem is that I overlooked a fundamental assumption behind the above-mentioned "axiom": the means of the groups need to be different for stratification to work. In the last stratification we computed, the stratum means are pretty close to each other, ultimately blunting the ability of this technique to reduce the overall variance.
Now, let's use a strata whose segment means are farther apart.
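With the sketch above, the two comparisons look roughly like this. The file name, encoding, and the SALES/COUNTRY/DEALSIZE column names are assumptions based on the Kaggle sample sales dataset:

```python
import pandas as pd

# Kaggle "Sample Sales Data" (file name and encoding are assumptions)
df = pd.read_csv("sales_data_sample.csv", encoding="latin-1")

# A carelessly chosen strata: country means are close together
stratified_mean_and_variance(df, "COUNTRY", "SALES")

# A strategically chosen strata: deal-size means are far apart
stratified_mean_and_variance(df, "DEALSIZE", "SALES")
```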
It seems that "DEALSIZE" is a much better strata to segment the data by, since it reduces the variance by 77%, which in turn means far fewer samples are needed to converge to significance.
Conclusion
In summary, understanding and harnessing stratification as a variance reduction technique, coupled with well-chosen strata, can significantly improve the sensitivity of controlled experiments, enabling more accurate and impactful insights with fewer samples required to achieve statistical significance.
Now we can say Hakuna Matata.
PART II: https://www.dhirubhai.net/pulse/capping-low-hanging-fruit-you-should-pick-unless-vijayaraghavan/
PART III: https://www.dhirubhai.net/pulse/cuped-what-you-know-before-experiment-matters-much-vijayaraghavan/