Understanding HyperLogLog: An advanced algorithm to estimate unique count in big data

In the world of big data, one common problem is the count-distinct elements problem, also known as the cardinality estimation problem. This problem involves determining the number of distinct elements in a large dataset. A naive approach would involve storing all elements and checking for duplicates, but this requires memory proportional to the number of distinct elements, which becomes impractical when dealing with big data.
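For contrast, here is a sketch of the naive exact approach in C++ (illustrative only; the function name is mine): every distinct element has to be held in memory.

#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

std::size_t exactDistinctCount(const std::vector<uint32_t>& items) {
    std::unordered_set<uint32_t> seen(items.begin(), items.end()); // one entry per distinct element
    return seen.size();                                            // exact, but memory grows with cardinality
}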

Let’s consider an example. Suppose we have a popular social media platform with billions of users, and we want to find out how many unique users visited the platform on a given day. Storing every single user ID in memory could be impractical due to the sheer volume of data. Moreover, an estimated unique count is often acceptable in place of an exact one. In such cases, HyperLogLog is a very useful algorithm: it estimates the unique count with far less memory than an exact computation would require.

The HyperLogLog Algorithm

The HyperLogLog algorithm, proposed by Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier in 2007, is a probabilistic algorithm that provides an estimate of the number of unique elements (the cardinality) in a dataset, using significantly less memory than would be required for an exact computation.

How Does It Work?

The HyperLogLog algorithm works by dividing the input into m = 2^b buckets and using the hash of each element to determine which bucket it belongs to. The first b bits of the hash give the bucket index, and the remaining bits are used to estimate the cardinality within that bucket.

void
HyperLogLog::add(uint32_t obj) {
    uint32_t x = HyperLogLog::MurmurHash(obj);  // hash the item
    uint32_t j = HyperLogLog::extractMSB(x, b); // first b bits index the bucket
    uint32_t w = x << b;                        // remaining bits, shifted to the top of the word
    // rank = number of leading zeros in the remaining bits + 1,
    // i.e. the position of the leftmost 1-bit (rho(w) in the paper)
    bucket[j] = std::max(bucket[j], HyperLogLog::countLeadingZeros(w) + 1);
}
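The helper functions are not shown here; the following is a minimal sketch of what they might look like, assuming 32-bit hash values and a bucket-index width b of at least 1 (the exact signatures are my assumption, not taken from the article):

uint32_t
HyperLogLog::extractMSB(uint32_t x, uint32_t b) {
    return x >> (32 - b);             // top b bits of the hash select the bucket
}

uint32_t
HyperLogLog::countLeadingZeros(uint32_t w) {
    if (w == 0) return 32;            // every bit is zero
    uint32_t n = 0;
    while ((w & 0x80000000u) == 0) {  // scan from the most significant bit down
        ++n;
        w <<= 1;
    }
    return n;
}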

The algorithm then counts the number of leading zeros in the binary representation of each hashed element. The intuition is that, for uniformly distributed hashes, roughly one in two starts with a 1, one in four starts with 01, one in eight starts with 001, and so on: observing a run of k leading zeros suggests that on the order of 2^(k+1) distinct elements have been hashed. By keeping track of the longest run of leading zeros seen in each bucket, the HyperLogLog algorithm can estimate the number of unique elements.

Here is an example where 2-bit precision is used in the HyperLogLog algorithm, which creates four buckets. The first two bits of a hash value determine the bucket, and the number of leading zeros at the start of the remaining bits determines the candidate value. If the bucket's stored value is smaller than the newly calculated value, the bucket is updated, as the walkthrough below shows.

Example
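As a concrete, illustrative walkthrough: suppose an element hashes to 10010110 in binary (only 8 bits are shown for brevity; real implementations use 32- or 64-bit hashes). With b = 2, the first two bits, 10, select bucket 2. The remaining bits, 010110, begin with a single zero, so the rank is 1 + 1 = 2 and the update is bucket[2] = max(bucket[2], 2).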

Finally, the algorithm combines the estimates from all the buckets to produce a final estimate. This is done using the harmonic mean, which helps to give more weight to lower estimates, reducing the overall error rate.
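Concretely, the estimator from the paper is E = α_m · m² / Σ_j 2^(−M[j]), where m = 2^b is the number of buckets, M[j] is the rank stored in bucket j, and α_m ≈ 0.7213 / (1 + 1.079/m) is a bias-correction constant (valid for m ≥ 128). Below is a minimal sketch of this step; it assumes the class stores its buckets in a std::vector<uint32_t> named bucket (an estimate() method does not appear in the snippet above), and it omits the small- and large-range corrections the paper also describes.

// requires <cmath> for std::pow and <vector> for the bucket storage
double
HyperLogLog::estimate() const {
    double m = static_cast<double>(bucket.size());        // m = 2^b buckets
    double sum = 0.0;
    for (uint32_t rank : bucket)
        sum += std::pow(2.0, -static_cast<double>(rank)); // harmonic-mean term 2^(-M[j])
    double alpha = 0.7213 / (1.0 + 1.079 / m);            // bias correction, valid for m >= 128
    return alpha * m * m / sum;                           // E = alpha_m * m^2 / sum
}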

Example Use Case

Let’s consider a practical use case. Suppose we’re running a global online retail store and we want to estimate the number of unique visitors to our website daily. With the HyperLogLog algorithm, we can do this efficiently without needing to store every single visitor’s ID in memory.

We would hash each visitor’s ID to determine the bucket it belongs to and count the number of leading zeros. At the end of the day, we would calculate the harmonic mean of the estimates from all the buckets to get an estimate of the total number of unique visitors.
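In code, a day's run might look like the sketch below. This is illustrative only: the constructor taking b and the simulated visitor IDs are assumptions about the interface, not part of the snippets shown earlier.

#include <cstdint>
#include <iostream>

int main() {
    HyperLogLog hll(14);                      // assumed constructor: b = 14, i.e. 16384 buckets (~0.8% typical error)
    // Simulate a day's traffic: the same visitor may show up many times
    for (uint32_t visit = 0; visit < 5000000; ++visit) {
        uint32_t visitorId = visit % 1000000; // 1,000,000 distinct visitor IDs
        hll.add(visitorId);
    }
    std::cout << "Estimated unique visitors: " << hll.estimate() << "\n";
    return 0;
}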

If you are interested in more detail, see the original paper, "HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm" (Flajolet, Fusy, Gandouet, and Meunier, 2007).

I have written a basic implementation of the HyperLogLog algorithm, which can be found on GitHub.

The HyperLogLog algorithm is a powerful tool for cardinality estimation in big data scenarios. It provides a balance between accuracy and memory usage, making it possible to estimate the number of unique elements in extremely large datasets with reasonable accuracy and minimal memory footprint.

