Redis HyperLogLog: Cardinality Estimation for Massive Data Sets
Managing large datasets efficiently is a challenge we often face. One of the most common requirements is to count the number of unique elements in a dataset. Whether you’re counting unique website visitors, distinct IP addresses, or different hashtags, cardinality estimation is a key operation.
Traditional methods like hash sets or database tables work fine when the dataset is small. But what happens when you’re dealing with thousands, millions, or billions of items? Suddenly, memory usage becomes a bottleneck. This is where Redis HyperLogLog comes in: a brilliant probabilistic data structure designed for efficient cardinality estimation using a tiny amount of memory.
What is HyperLogLog?
HyperLogLog is a probabilistic data structure that estimates the cardinality (number of distinct elements) of a set. Redis, being a highly efficient in-memory data store, has integrated HyperLogLog for precisely this purpose. The key feature that makes HyperLogLog stand out is its ability to estimate cardinality with a memory footprint of only 12 KB, regardless of how many distinct elements you’re tracking—be it thousands or millions!
Sounds too good to be true, right? Well, it comes with a trade-off: HyperLogLog doesn’t give you an exact count, but rather an approximation with a standard error of 0.81%.
Why Should You Use HyperLogLog?
Three properties make it attractive:
Memory efficiency: a fixed footprint of about 12 KB per key, no matter how many distinct elements you add.
Speed: adding elements and querying the count are both constant-time operations.
Accuracy for the cost: a standard error of just 0.81%, which is good enough for most counting use cases.
How Does HyperLogLog Work?
The magic behind HyperLogLog lies in two key ideas:
Leading zeros as a rarity signal. The longer the run of leading zeros in a hash, the rarer that hash is, and the more it tells us about the spread of the data across the hash space. HyperLogLog leverages this rarity as a proxy for how many distinct elements have been encountered.
Averaging across many buckets. A single observation is noisy (one lucky hash could wildly inflate the estimate), so HyperLogLog splits the hash space into many buckets and combines their observations, smoothing out outliers.
Note: If all inserted elements have low leading zero counts and are mapped to very few buckets, HLL struggles to provide an accurate estimate. It expects a uniform distribution of leading zeros across many buckets, which doesn’t happen in this case.
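You can see the rarity intuition with a quick experiment. This is just a Python sketch, not Redis code: among n random 64-bit values, the longest run of leading zeros grows roughly like log2(n), so that longest run hints at how many values were seen.
import random

# Count the zeros before the first 1 bit in a 64-bit integer.
def leading_zeros(h, bits=64):
    return bits - h.bit_length()

# The longest run grows roughly like log2(n) as n increases.
for n in (1_000, 100_000, 1_000_000):
    longest = max(leading_zeros(random.getrandbits(64)) for _ in range(n))
    print(f"n={n:>9,}  longest leading-zero run: {longest}")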
Breaking It Down with an Example:
Let’s say you’re adding two distinct elements, "kiran" and "kamath", to a HyperLogLog.
For each element you want to add, HyperLogLog starts by running the element through a hash function (Redis uses MurmurHash internally). This hash function outputs a uniformly distributed 64-bit integer. The randomness and uniform distribution of the hash function are crucial for making sure the leading zeros are fairly distributed across different elements.
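As a rough sketch of this step (using SHA-256 from Python’s standard library as a stand-in, since MurmurHash isn’t built in), you could derive a 64-bit hash like this:
import hashlib

# Stand-in for Redis's internal MurmurHash: derive a uniformly
# distributed 64-bit integer from an element. Redis does NOT use
# SHA-256 here; any well-mixed 64-bit hash works for illustration.
def hash64(element: str) -> int:
    digest = hashlib.sha256(element.encode()).digest()
    return int.from_bytes(digest[:8], "big")

print(f'{hash64("kiran"):064b}')   # the 64-bit binary string for "kiran"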
The element "kiran" is hashed into a 64-bit binary string. For the sake of simplicity, let’s assume it hashes to:
10101100 00100101 01010010 11100101 10111010 00000010 00101100 11011111
The element "kamath" is also hashed into a 64-bit binary string, say:
00011010 11101100 01100010 01011001 00110100 11001110 10001010 11110110
Let’s say Redis takes the first 14 bits of each hash to pick one of its 16,384 buckets:
"kiran"’s hash starts with 10101100001001, which selects bucket 11017,
"kamath"’s hash starts with 00011010111011, which selects bucket 1723.
For each bucket, HyperLogLog tracks the maximum number of leading zeros seen for any element in that bucket. The more leading zeros a hash has, the less likely it is to occur by chance, indicating a larger overall cardinality. The leading-zero count is computed over the remaining 50 bits (the bits left after the 14 bucket-selection bits).
In our example:
The hash of "kiran" has 1 leading zero, so bucket 27721 is updated to store this count (assuming it was previously storing a smaller value).
The hash of "kamath" has 3 leading zeros, so bucket 6760 stores 3 (or keeps a larger value if it's seen a hash with more zeros before).
HyperLogLog updates the buckets, but it doesn’t store the actual hash values or the elements themselves. It only records the highest leading-zero count for each bucket.
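Putting the example together, here is a minimal sketch of the update step, reusing hash64 from the earlier sketch. It follows the article’s layout (first 14 bits pick the bucket, leading zeros counted over the remaining 50 bits); note that actual implementations, including Redis, store the position of the first 1 bit (leading zeros + 1), and Redis’s exact bit layout differs in detail:
NUM_BUCKETS = 16_384                   # 2**14: the first 14 bits pick a bucket
registers = [0] * NUM_BUCKETS

def hll_add(element: str) -> None:
    h = hash64(element)                # hash64 from the sketch above
    bucket = h >> 50                   # top 14 bits -> bucket index
    rest = h & ((1 << 50) - 1)         # remaining 50 bits
    rank = 50 - rest.bit_length() + 1  # leading zeros + 1 (first 1-bit position)
    registers[bucket] = max(registers[bucket], rank)

hll_add("kiran")
hll_add("kamath")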
To estimate the cardinality, HyperLogLog combines the information from all 16,384 buckets. Specifically, it takes a harmonic mean across the buckets (more precisely, over 2 raised to each bucket’s stored count). The harmonic mean gives more weight to buckets with smaller values, so a few buckets with unusually long zero runs can’t skew the estimate.
The intuition is that if many buckets have seen large numbers of leading zeros, the data set likely contains a large number of distinct elements. Conversely, if most buckets have only seen small numbers of leading zeros, the data set likely contains fewer distinct elements.
Finally, HyperLogLog applies a bias correction based on empirical analysis, making the estimate more accurate. The result is an estimate of the number of unique elements added to the HyperLogLog.
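Here is a sketch of that estimation step, using the estimator from the original HyperLogLog paper by Flajolet et al., including its small-range correction. Redis layers additional empirical corrections on top of this, so treat it as an approximation of what Redis does:
import math

def hll_estimate(regs):
    m = len(regs)
    alpha = 0.7213 / (1 + 1.079 / m)            # bias-correction constant
    raw = alpha * m * m / sum(2.0 ** -r for r in regs)
    empty = regs.count(0)
    if raw <= 2.5 * m and empty:                # small-range correction:
        return m * math.log(m / empty)          # fall back to linear counting
    return raw

print(round(hll_estimate(registers)))           # ~2 for the two elements above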
How Does HyperLogLog Achieve Memory Efficiency?
HyperLogLog’s brilliance lies in how it compresses the data. Instead of storing every distinct element like a hash set, it only tracks the maximum number of leading zeros for elements in each bucket. This allows it to approximate the cardinality without storing all the elements themselves.
Redis uses 16,384 buckets internally, and for each bucket it stores the largest leading-zero count encountered, which fits in just 6 bits (values up to 63). 16,384 buckets × 6 bits = 98,304 bits = 12 KB, which is exactly how HyperLogLog keeps its fixed memory footprint!
Adding Elements to HyperLogLog in Redis
Adding elements to a HyperLogLog in Redis is straightforward. Redis provides two key commands for working with HyperLogLog:
PFADD adds one or more elements, creating the key if it doesn’t exist:
PFADD hll:visitors "user1" "user2" "user3"
PFCOUNT returns the estimated number of distinct elements added so far:
PFCOUNT hll:visitors
Example: Tracking Unique Website Visitors
Let’s say you’re tasked with counting unique visitors to your website over a day. Instead of storing every user ID (which could be in the millions), you can simply use Redis HyperLogLog to track the cardinality efficiently.
# Add visitors to the HyperLogLog
PFADD hll:unique_visitors "user123" "user456" "user789"
# Check the number of unique visitors
PFCOUNT hll:unique_visitors
You could do this throughout the day, continually adding user IDs to hll:unique_visitors. At the end of the day, PFCOUNT will give you a good estimate of how many unique visitors you had.
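In application code you’d typically do the same through a client library. For example, a hypothetical version using the redis-py client, assuming a Redis server on localhost:6379:
import redis

r = redis.Redis(host="localhost", port=6379)

# Add visitors as they arrive throughout the day.
r.pfadd("hll:unique_visitors", "user123", "user456", "user789")
r.pfadd("hll:unique_visitors", "user123")   # duplicates don't change the count

# Estimate how many unique visitors we've seen so far.
print(r.pfcount("hll:unique_visitors"))     # -> 3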
Estimation Accuracy
It’s important to remember that HyperLogLog is an approximation. Redis HyperLogLog comes with a standard error of 0.81%. For example, if the true number of unique visitors is 1,000,000, PFCOUNT will typically land within about 8,100 of that figure.
This accuracy is more than sufficient for most real-world applications, especially considering the massive memory savings.
Limitations of HyperLogLog
HyperLogLog is not a silver bullet. It’s excellent for cardinality estimation, but there are a few limitations:
Approximate counts only: you get an estimate with a 0.81% standard error, never an exact number.
No membership queries: it can’t tell you whether a specific element was added, and you can’t list the elements back.
No removals: elements can be added but not deleted; to start over, you delete the whole key.
Conclusion
Redis HyperLogLog is an incredibly powerful tool for situations where you need to count the number of unique items in a very large dataset but can’t afford the memory for an exact count and don’t need perfect accuracy. It’s particularly useful for tracking things like website visitors, IP addresses, unique items in logs, and more.
With just 12 KB of memory and a constant-time algorithm for adding elements and querying the cardinality, Redis HyperLogLog is a game-changer when it comes to scaling applications efficiently.
Next time you find yourself thinking, "How can I efficiently count millions of unique items without blowing through my memory budget?"—consider HyperLogLog.