Understanding HyperLogLog: An advanced algorithm to estimate unique count in big data

In the world of big data, one common problem is the count-distinct elements problem, also known as the cardinality estimation problem. This problem involves determining the number of distinct elements in a large dataset. A naive approach would involve storing all elements and checking for duplicates, but this requires memory proportional to the number of distinct elements, which becomes impractical when dealing with big data.
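For contrast, here is a sketch of the naive exact approach in C++ (illustrative only; the function name is mine): every distinct element has to be held in memory.

#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

std::size_t exactDistinctCount(const std::vector<uint32_t>& items) {
    std::unordered_set<uint32_t> seen(items.begin(), items.end()); // one entry per distinct element
    return seen.size();                                            // exact, but memory grows with cardinality
}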

Let’s consider an example. Suppose we have a popular social media platform with billions of users, and we want to find out how many unique users visited the platform on a given day. Storing every single user ID in memory could be impractical due to the sheer volume of data. Moreover, an estimated unique count is often acceptable in place of an exact one. In such cases, HyperLogLog is a very useful algorithm: it estimates the unique count with far less memory than an exact computation would require.

The HyperLogLog Algorithm

The HyperLogLog algorithm, proposed by Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier in 2007, is a probabilistic algorithm that provides an estimate of the number of unique elements (the cardinality) in a dataset, using significantly less memory than would be required for an exact computation.

How Does It Work?

The HyperLogLog algorithm works by dividing the input into m = 2^b buckets and using the hash of each element to determine which bucket it belongs to. The first b bits of the hash give the bucket index, and the remaining bits are used to estimate the cardinality within that bucket.

void
HyperLogLog::add(uint32_t obj) {
    uint32_t x = HyperLogLog::MurmurHash(obj);  // hash the item
    uint32_t j = HyperLogLog::extractMSB(x, b); // first b bits index the bucket
    uint32_t w = x << b;                        // remaining bits, shifted to the top of the word
    // rank = number of leading zeros in the remaining bits + 1,
    // i.e. the position of the leftmost 1-bit (rho(w) in the paper)
    bucket[j] = std::max(bucket[j], HyperLogLog::countLeadingZeros(w) + 1);
}
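The helper functions are not shown here; the following is a minimal sketch of what they might look like, assuming 32-bit hash values and a bucket-index width b of at least 1 (the exact signatures are my assumption, not taken from the article):

uint32_t
HyperLogLog::extractMSB(uint32_t x, uint32_t b) {
    return x >> (32 - b);             // top b bits of the hash select the bucket
}

uint32_t
HyperLogLog::countLeadingZeros(uint32_t w) {
    if (w == 0) return 32;            // every bit is zero
    uint32_t n = 0;
    while ((w & 0x80000000u) == 0) {  // scan from the most significant bit down
        ++n;
        w <<= 1;
    }
    return n;
}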

The algorithm then counts the number of leading zeros in the binary representation of each hashed element. The intuition is that, for uniformly distributed hashes, roughly one in two starts with a 1, one in four starts with 01, one in eight starts with 001, and so on: observing a run of k leading zeros suggests that on the order of 2^(k+1) distinct elements have been hashed. By keeping track of the longest run of leading zeros seen in each bucket, the HyperLogLog algorithm can estimate the number of unique elements.

Here is an example where 2-bit precision is used in the HyperLogLog algorithm, which creates four buckets. The first two bits of a hash value determine the bucket, and the number of leading zeros at the start of the remaining bits determines the candidate value. If the bucket's stored value is smaller than the newly calculated value, the bucket is updated, as the walkthrough below shows.

Example
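As a concrete, illustrative walkthrough: suppose an element hashes to 10010110 in binary (only 8 bits are shown for brevity; real implementations use 32- or 64-bit hashes). With b = 2, the first two bits, 10, select bucket 2. The remaining bits, 010110, begin with a single zero, so the rank is 1 + 1 = 2 and the update is bucket[2] = max(bucket[2], 2).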

Finally, the algorithm combines the estimates from all the buckets to produce a final estimate. This is done using the harmonic mean, which helps to give more weight to lower estimates, reducing the overall error rate.
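Concretely, the estimator from the paper is E = α_m · m² / Σ_j 2^(−M[j]), where m = 2^b is the number of buckets, M[j] is the rank stored in bucket j, and α_m ≈ 0.7213 / (1 + 1.079/m) is a bias-correction constant (valid for m ≥ 128). Below is a minimal sketch of this step; it assumes the class stores its buckets in a std::vector<uint32_t> named bucket (an estimate() method does not appear in the snippet above), and it omits the small- and large-range corrections the paper also describes.

// requires <cmath> for std::pow and <vector> for the bucket storage
double
HyperLogLog::estimate() const {
    double m = static_cast<double>(bucket.size());        // m = 2^b buckets
    double sum = 0.0;
    for (uint32_t rank : bucket)
        sum += std::pow(2.0, -static_cast<double>(rank)); // harmonic-mean term 2^(-M[j])
    double alpha = 0.7213 / (1.0 + 1.079 / m);            // bias correction, valid for m >= 128
    return alpha * m * m / sum;                           // E = alpha_m * m^2 / sum
}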

Example Use Case

Let’s consider a practical use case. Suppose we’re running a global online retail store and we want to estimate the number of unique visitors to our website daily. With the HyperLogLog algorithm, we can do this efficiently without needing to store every single visitor’s ID in memory.

We would hash each visitor’s ID to determine the bucket it belongs to and count the number of leading zeros. At the end of the day, we would calculate the harmonic mean of the estimates from all the buckets to get an estimate of the total number of unique visitors.
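In code, a day's run might look like the sketch below. This is illustrative only: the constructor taking b and the simulated visitor IDs are assumptions about the interface, not part of the snippets shown earlier.

#include <cstdint>
#include <iostream>

int main() {
    HyperLogLog hll(14);                      // assumed constructor: b = 14, i.e. 16384 buckets (~0.8% typical error)
    // Simulate a day's traffic: the same visitor may show up many times
    for (uint32_t visit = 0; visit < 5000000; ++visit) {
        uint32_t visitorId = visit % 1000000; // 1,000,000 distinct visitor IDs
        hll.add(visitorId);
    }
    std::cout << "Estimated unique visitors: " << hll.estimate() << "\n";
    return 0;
}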

If you are interested in more detail, see the original paper, "HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm" (Flajolet, Fusy, Gandouet, and Meunier, 2007).

I have written a basic implementation of the HyperLogLog algorithm, which can be found on GitHub.

The HyperLogLog algorithm is a powerful tool for cardinality estimation in big data scenarios. It provides a balance between accuracy and memory usage, making it possible to estimate the number of unique elements in extremely large datasets with reasonable accuracy and minimal memory footprint.

