Sampling using truncated hash
Suppose tens of millions of people visit your website every day and you want to do ad hoc analysis. Using all the data is computationally costly and jobs take a long time, so a better approach is to work with a representative sample. But how can you sample the data so that a visitor included in today's sample is also included in tomorrow's sample, or at any time in the future? In this short post, we will see how to achieve that using a truncated hash.
But how is a truncated hash applicable in this case?
When we apply a hash function to the IDs of the website visitors and keep only the first few characters of each hash, the decimal value of the truncated hash is approximately uniformly distributed. Because the hash of a given ID never changes, a visitor always lands in the same bucket. Therefore, we can pick a subset of truncated hash values once and use it to pull the same visitors into the sample at any time in the future. Note that we have to convert the truncated hash from hexadecimal to decimal (base 10) to generate the density and cumulative distribution plots. In this post, we will use the first three characters of the hash of each user ID. This gives 4096 (16*16*16) unique truncated hash values, which can be used as buckets for our sampling.
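To make the bucketing idea concrete, here is a minimal Python sketch. The post does not say which hash function or tooling it uses, so SHA-256 from the standard library, the made-up visitor IDs, and the names bucket_of, in_sample and sample_buckets are all assumptions chosen for illustration only.

```python
import hashlib

PREFIX_LEN = 3                    # hex characters kept from the hash
NUM_BUCKETS = 16 ** PREFIX_LEN    # 16*16*16 = 4096 possible truncated values


def bucket_of(visitor_id: str) -> int:
    """Map a visitor ID to a bucket in [0, NUM_BUCKETS).

    SHA-256 is an assumption; any well-mixed hash should behave similarly.
    """
    digest = hashlib.sha256(visitor_id.encode("utf-8")).hexdigest()
    # Keep the first three hex characters and convert them to base 10.
    return int(digest[:PREFIX_LEN], 16)


def in_sample(visitor_id: str, sample_buckets: int = 41) -> bool:
    """Keep visitors whose bucket is below the cutoff (41 of 4096 is about 1%)."""
    return bucket_of(visitor_id) < sample_buckets


# Hypothetical visitor IDs, generated only for this demonstration.
visitor_ids = [f"visitor-{i}" for i in range(100_000)]
sample = [v for v in visitor_ids if in_sample(v)]
sample_fraction = len(sample) / len(visitor_ids)
print(f"sampled {len(sample)} of {len(visitor_ids)} visitors")
```

Since the bucket depends only on the ID, running the same filter on tomorrow's data selects the same visitors, and raising sample_buckets grows the sample without dropping anyone who was already in it.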