The Fault in our Sampling
Data sampling is a necessity for practically every kind of hypothesis test imaginable, from simple questions asked out of curiosity (e.g., how many people above the age of 65 like dogs?) to complex business problems (e.g., what change in a metric could convert the largest number of potential customers?).
Today we rely on a handful of basic sampling techniques, the most commonly discussed being simple random sampling, stratified sampling, systematic sampling and cluster sampling. To apply any of them, we first need a "target population" from which to draw the samples. In most cases, this target population is the most generic audience we can identify.
With that said, let me elaborate on my concern. Say a company named XYZ is running an A/B test on an important change to one metric on a social media site. The samples are split into two groups: group A is exposed to the unchanged metric, while group B sees the metric after the change. For the sake of example, let the metric in question be the colour of the "Sign up" button. The researchers at XYZ have a theory: blue tends to appeal more to the human eye than any other colour. They now want to test whether changing the colour from yellow to blue increases the conversion rate by at least 2%.
In other words, they wish to show that the conversion rate of group B is at least r(A) + 0.02, where r(A) is the conversion rate of group A; if it is, their theory can be considered correct. The ideal, best-case scenario would be to have the same group of people tested against both the changed and the unchanged button at the exact same timestamp, so as to avoid any discrepancy arising from a change in a person's mindset or situation over time. In such a scenario (which is practically impossible to achieve), we could be sure that a 2% increase in the rate is justified.
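To make the target concrete, here is a minimal sketch of the check being performed; the visitor and sign-up counts below are invented purely for illustration:

```python
# Hypothetical conversion counts for the two groups (illustrative numbers only).
visitors_a, signups_a = 10_000, 950     # group A: unchanged (yellow) button
visitors_b, signups_b = 10_000, 1_180   # group B: changed (blue) button

r_a = signups_a / visitors_a            # conversion rate of group A
r_b = signups_b / visitors_b            # conversion rate of group B

# The theory is considered supported only if group B converts at least
# 2 percentage points better than group A, i.e. r(B) >= r(A) + 0.02.
min_lift = 0.02
print(f"r(A) = {r_a:.3f}, r(B) = {r_b:.3f}, lift = {r_b - r_a:.3f}")
print("Hypothesis supported?", r_b >= r_a + min_lift)
```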
In the absence of any way to actually do this (a person cannot perform two tasks at the exact same timestamp), we need an alternative approach that makes group A and group B as similar to each other as possible (in terms of each person's mentality as well as their situation) while the test is being performed. The point is simple: we can't compare miles with kilometres.
However, if, prior to sampling, they define their target population as anyone who visits the website in a given timeframe, which they are very likely to do, then even before they draw the samples they run a very high risk of seriously misleading results, because the customer who converted in group B may have done so under very different circumstances than the person who declined to convert in group A.
If this problem sounds solvable with a multi-stage cluster sampling approach, a clarification: the purpose here is not granularity, but co-occurrence. Randomly selecting clusters from among atomic clusters may not be the best way to also tell whether the clustered entities essentially belong together. An example of multi-stage clustering, along with the problem just mentioned, is sketched below.
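As a point of contrast, here is a minimal sketch of a plain two-stage cluster sample; the cluster labels and sizes are made up, and the point is that clusters are chosen at random without any check on whether the visitors inside them actually belong together:

```python
import random

random.seed(42)

# Hypothetical clusters of visitors, keyed by some coarse grouping (e.g. region).
# Nothing about this grouping guarantees that the visitors inside a cluster
# share the attributes we actually care about (interest, visit frequency, ...).
clusters = {
    "region_1": [f"visitor_{i}" for i in range(0, 50)],
    "region_2": [f"visitor_{i}" for i in range(50, 100)],
    "region_3": [f"visitor_{i}" for i in range(100, 150)],
    "region_4": [f"visitor_{i}" for i in range(150, 200)],
}

# Stage 1: randomly pick a subset of clusters.
chosen_clusters = random.sample(list(clusters), k=2)

# Stage 2: randomly sample visitors within each chosen cluster.
sample = [v for c in chosen_clusters for v in random.sample(clusters[c], k=10)]
print(chosen_clusters, len(sample))
```

Nothing in this procedure checks whether the sampled visitors actually share the attributes that matter for conversion; the clusters were never built around those attributes in the first place.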
In other words, if yellow represents "interested visitors" and blue represents "forced to sign up", we first need to be adequately sure that these two categories should, and often do, occur together.
The Proposition
I hereby propose a new type of cluster sampling, which can be called Informed clustering.
Step 1: Clustering of the target population based on critical parameters - Apriori bucketing, based on repeating or frequently co-occurring item sets. The attributes can be chosen, for instance, as:
X1 - the visitor's age bracket (for example, 30-40)
X2 - the visitor's interests (for example, whether they contain the site's subject)
X3 - the number of visits within the given timeframe (for example, 10+)
X6 - sign-up behaviour, where X6(0) denotes a spontaneous sign-up
Careful Apriori bucketing may reveal patterns such as the following:
From this information, X1(30-40), X3(10+), X2(contains site's subject) and X6(0) seem to co-occur most of the time, while X1(30-40), X2(contains site's subject) and X6(0) seem to co-occur always.
This simply means that visitors who are between 30 and 40 years of age, whose interests contain the subject matter of the website, and who have visited the website 10 or more times (in the given timeframe, perhaps one day) are the most likely to exhibit a spontaneous sign-up.
With this in mind, we can build the samples for both group A and group B around these specific buckets.
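A minimal sketch of this Apriori bucketing step, assuming pandas and mlxtend are available; the visitor records and bucket labels are illustrative, not real data:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

# Each visitor becomes a "transaction" of bucketed attributes.
# The records below are invented purely to illustrate the bucketing.
visitors = [
    ["X1(30-40)", "X2(contains site's subject)", "X3(10+)", "X6(0)"],
    ["X1(30-40)", "X2(contains site's subject)", "X3(10+)", "X6(0)"],
    ["X1(30-40)", "X2(contains site's subject)", "X6(0)"],
    ["X1(18-30)", "X2(other)", "X3(1-3)"],
    ["X1(40-60)", "X2(contains site's subject)", "X3(4-9)"],
]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(visitors).transform(visitors), columns=te.columns_)

# Frequent item sets: buckets that co-occur in at least 50% of visitors.
itemsets = apriori(onehot, min_support=0.5, use_colnames=True)
print(itemsets.sort_values("support", ascending=False))
```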
With this type of sampling, we can say with more certainty that the selected attributes co-occur, unlike in the previous example. The most important question then is: which attributes are most closely correlated with X6(0)?
In our case, the answer is X1(30-40), X3(10+) and X2(contains site's subject). Hence, if a person from group A who is between 30 and 40 years of age chooses to ignore the sign-up button, while another person in the same age group from group B chooses to convert, then we have a real comparison in which spontaneous sign-up X6(0) wins in group B over group A, supporting the hypothesis that the colour of the button is one of the few plausible criteria that could have caused the change of heart.
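A minimal sketch of that comparison, restricting both groups to the bucket identified above; the column names and visitor records are assumptions made for illustration:

```python
import pandas as pd

# Illustrative visitor logs for both variants.
log = pd.DataFrame({
    "group":      ["A", "A", "A", "B", "B", "B"],
    "age_bucket": ["30-40", "30-40", "18-30", "30-40", "30-40", "40-60"],
    "interest":   ["site_subject", "site_subject", "other",
                   "site_subject", "site_subject", "other"],
    "visits":     [12, 15, 2, 11, 20, 3],
    "signed_up":  [0, 1, 0, 1, 1, 0],
})

# Keep only visitors who fall in the bucket most correlated with a spontaneous
# sign-up: X1(30-40), X2(contains site's subject), X3(10+).
comparable = log[(log["age_bucket"] == "30-40")
                 & (log["interest"] == "site_subject")
                 & (log["visits"] >= 10)]

# Conversion rate per group, among comparable visitors only.
print(comparable.groupby("group")["signed_up"].mean())
```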
From here, we frame normal distribution curves to determine the shift in the mean, which is ideally 0.02, as stated by the hypothesis.
The first curve (red) represents the normal distribution of group A's conversion rate, while the blue curve represents the same for group B. The paired t-test statistic is given as:

t = (x̄ − μ) / (SD / √n)

where,
x̄ − μ is the difference in means (the observed mean paired difference measured against the hypothesised one),
SD is the standard deviation (of the normal distribution N(μ, σ), which becomes N(0, 1) once standardised),
n is the sample size considered.
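A minimal sketch of computing this statistic, assuming paired conversion rates are available (one pair per matched bucket or time slice); the numbers are invented, and scipy's one-sample t-test on the paired differences is used as a cross-check:

```python
import numpy as np
from scipy import stats

# Hypothetical paired conversion rates, e.g. one pair per matched bucket or day.
rate_a = np.array([0.095, 0.102, 0.098, 0.110, 0.101])   # group A (yellow button)
rate_b = np.array([0.118, 0.121, 0.117, 0.131, 0.124])   # group B (blue button)

diff = rate_b - rate_a                     # paired differences in conversion rate
n = len(diff)

# t-statistic for H0: mean difference = 0.02, i.e. t = (x̄ − μ) / (SD / √n).
t_manual = (diff.mean() - 0.02) / (diff.std(ddof=1) / np.sqrt(n))

# The same statistic via scipy's one-sample t-test on the paired differences.
t_scipy, p_value = stats.ttest_1samp(diff, popmean=0.02, alternative="greater")

print(f"t (manual) = {t_manual:.3f}, t (scipy) = {t_scipy:.3f}, p = {p_value:.4f}")
```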
And hence we now have a more solid outcome, with evidence that the metric we considered changing (in our case, the colour) has indeed had a positive (or negative) impact on the conversion rate of the website.
This approach, as previously mentioned, protects us, to some extent, from sampling mistakes that arise from assumptions about why the new metric was favoured (or whether it was favoured at all) over the existing one.
One caveat, however, is how much data we are allowed to store per visitor in order to draw a flawless sample with no wrong assumptions behind the conversions. That is debatable, but in my opinion the amount of data a website gets to collect depends largely on its content and on its majority traffic.
In conclusion, though it may never be possible to make 100% accurate assumptions in sampling, this is one alternative to the vanilla A/B sampling technique that I found more effective and interesting. I'm open to discussions on improving this article, or to any alternate views on it! :)