HyperLogLog Basics

Overview

Probabilistic data structures (PDS) stand out among traditional indexing structures for their excellent time and space complexity. They achieve these advantages by relying on hash functions to map data to deterministic, uniformly distributed values. Some well-known examples are:

  • Bloom & Cuckoo Filters: for approximate set-membership queries
  • Count-Min Sketch: for frequency estimation
  • HyperLogLog: for cardinality estimation (counting unique values)

HyperLogLog is a probabilistic cardinality estimation algorithm designed to approximate the number of distinct elements in a multiset (a set with potential duplicates). It achieves high efficiency by using a fixed, small amount of memory regardless of the size of the input set. HyperLogLog hashes each element and maps it to one of a fixed number of registers, and its distinctive feature lies in how those registers are carefully combined to estimate cardinality. Despite being probabilistic, HyperLogLog provides accurate approximations with very low memory requirements, making it particularly useful where memory efficiency matters, such as in large-scale distributed systems and data-streaming scenarios.
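As a rough intuition sketch (my own illustration, not from the original text), assuming a uniform hash: a run of k leading zero bits occurs with probability 2^-k, so the longest run observed over n distinct items grows roughly like log2(n). A minimal demonstration using Python's standard-library hashlib:

```python
import hashlib

def leading_zeros(data: str, bits: int = 32) -> int:
    """Count the leading zero bits in the first `bits` bits of SHA-256(data)."""
    digest = hashlib.sha256(data.encode()).digest()
    bitstring = ''.join(format(b, '08b') for b in digest)[:bits]
    return len(bitstring) - len(bitstring.lstrip('0'))

# The maximum run over many distinct items is a rough log2 of their count.
max_run = max(leading_zeros(f"item-{i}") for i in range(10_000))
print(max_run)  # typically near log2(10000) ~ 13, but noisy
```

This single-maximum estimate is far too noisy on its own; the bucketing and averaging described below are what make it usable.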

Structure

The main components of HyperLogLog are:

  • (Single) Hash Function: Flajolet used a single hash function to keep the algorithm cheap to run. The downside is that it must be a high-quality hash function with uniformly distributed output, such as SHA-256 or a comparable alternative.
  • Index From Value: this value is extracted from the first “x” bits of the hash value. It is used to place elements into the different buckets of the data structure.
  • Actual Value: the remaining bits are used to find the longest leading sequence of “0” bits, which drives the approximation of the cardinality of a large-scale dataset.
  • Buckets: the values derived from the hash function are stored in the buckets. The number of buckets is determined by the length of the “index from value”. For example, if the first 2 bits of the hash value are used as the index, then there are 2^2 = 4 buckets.
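To make the index/value split concrete, here is a small sketch (using the standard-library hashlib rather than any specific library; the names are illustrative) of how the first 2 bits of a hash select one of 2^2 = 4 buckets, matching the example above:

```python
import hashlib

TO_EXTRACT_BITS = 2                 # index length from the example
NUM_BUCKETS = 2 ** TO_EXTRACT_BITS  # 2^2 = 4 buckets

def split_hash(item: str):
    """Split a hash into (bucket index, remaining value bits)."""
    digest = hashlib.sha256(item.encode()).digest()
    bits = ''.join(format(b, '08b') for b in digest)
    idx = int(bits[:TO_EXTRACT_BITS], 2)    # bucket index: 0..3
    value_bits = bits[TO_EXTRACT_BITS:32]   # bits used for the zero-run count
    return idx, value_bits

idx, value_bits = split_hash("hello")
assert 0 <= idx < NUM_BUCKETS
```

Because the index bits are uniform, elements spread evenly across the buckets, which is what the averaging step below relies on.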

Once each bucket holds its count, Flajolet improved the estimate's accuracy by combining the buckets with a harmonic mean instead of a geometric mean. For m values x_1, ..., x_m the harmonic mean is:

H = m / (1/x_1 + 1/x_2 + ... + 1/x_m)

If the bucket values contain outliers, the harmonic mean dampens their effect on the overall result, yielding a more accurate approximation than the arithmetic mean would.
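A quick numeric illustration (the values are made up for the example): one large outlier dominates the arithmetic mean but barely moves the harmonic mean.

```python
from statistics import harmonic_mean, mean

values = [2, 3, 2, 4, 3, 100]  # bucket-style counts with one outlier

print(mean(values))            # 19.0 -- pulled up by the outlier
print(harmonic_mean(values))   # ~3.11 -- stays near the typical values
```

This robustness matters because a single bucket that happens to see an unusually long zero run would otherwise skew the whole estimate.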

Next is the formula used to estimate the cardinality of a dataset, where m is the number of buckets and M[j] is the maximum leading-zero count stored in bucket j:

E = alpha_m * m^2 * ( 2^(-M[1]) + ... + 2^(-M[m]) )^(-1),   with alpha_m ≈ 0.7213 / (1 + 1.079/m)
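As a toy worked example (the register values are invented, and the paper's small-range correction is skipped), plugging m = 4 buckets into the formula:

```python
registers = [3, 2, 4, 3]                     # M[j]: max leading-zero count per bucket
m = len(registers)                           # m = 4 buckets
alpha = 0.7213 / (1 + 1.079 / m)             # approximation; the exact alpha_4 differs
raw_sum = sum(2 ** -mj for mj in registers)  # 2^-3 + 2^-2 + 2^-4 + 2^-3 = 0.5625
estimate = alpha * m * m / raw_sum
print(round(estimate, 2))                    # roughly 16
```

Note that the alpha_m approximation above is intended for large m; the original paper tabulates exact constants for small bucket counts.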

The full algorithm, including its small-range and large-range corrections, is given in the original paper by Flajolet et al. (reference 10 below).

Applications

Real-world applications are found mainly in databases and at large tech companies such as Google and Reddit.

  • Redis: Redis supports HyperLogLog as a built-in data type. One example of how HyperLogLogs are used with Redis is counting the distinct license plates on busy roads in San Francisco to analyze traffic load. Redis is well known as an in-memory database, meaning the footprint of its elements has to be minimal: a Redis HyperLogLog uses at most 12 KB per key while keeping the standard error around 0.81%.
  • Google: Google uses HyperLogLog to count the number of distinct search queries made globally. It is extremely efficient because it lets Google spend only a minimal amount of resources on further analytics over the unique queries.
  • Reddit: Reddit uses HyperLogLog to count the unique users who view an article or post. Plain view counters get inflated when the same user connects from several devices; with HyperLogLog, every session belonging to the same user name hashes identically, so each user is counted only once.

Source Code [Python]

import math
import mmh3  # third-party: pip install mmh3

class HyperLogLog:
    def __init__(self, to_extract_bits=10, bit_number=32):
        self.to_extract_bits = to_extract_bits     # index bits: 2^b buckets
        self.bit_number = bit_number               # total hash bits used
        self.bucket_length = 2 ** to_extract_bits  # number of buckets (m)
        self.register = [0] * self.bucket_length   # max leading-zero count per bucket
        # Bias-correction constant alpha_m (approximation, intended for m >= 128).
        self.alpha = 0.7213 / (1 + (1.079 / self.bucket_length))

    def insert(self, item):
        # Hash the item and expand the digest into a bit string.
        hashed_bytes = mmh3.hash_bytes(str(item))
        binary_sequence = ''.join(format(byte, '08b') for byte in hashed_bytes)

        # The first b bits select the bucket; the rest feed the zero-run counter.
        idx = int(binary_sequence[:self.to_extract_bits], 2)
        val = binary_sequence[self.to_extract_bits:self.bit_number]
        zero_count = self._long_sequence_counter(val)

        # Each register stores only the maximum leading-zero count seen so far.
        if self.register[idx] < zero_count:
            self.register[idx] = zero_count

    def estimate(self):
        # E = alpha_m * m^2 / sum_j 2^(-M[j])  (harmonic-mean combination)
        m = len(self.register)
        h_avg = m / self._reciprocals_calculator(self.register)
        hll_estimate = round(self.alpha * m * h_avg, 2)

        # Expected standard error of the estimate: 1.04 / sqrt(m).
        # error_rate = 1.04 / math.sqrt(m)

        return hll_estimate

    def _long_sequence_counter(self, bits):
        # Length of the run of '0' bits before the first '1'.
        count = 0
        for bit in bits:
            if bit != '0':
                break
            count += 1
        return count

    def _reciprocals_calculator(self, register):
        # sum_j 2^(-M[j]); never zero, since empty registers hold 0 and 2^0 = 1.
        return sum(1 / (2 ** m_j) for m_j in register)

References

  1. https://engineering.fb.com/2018/12/13/data-infrastructure/hyperloglog/
  2. https://www.youtube.com/watch?v=MunL8nnwscQ&ab_channel=Redis
  3. https://towardsdatascience.com/hyperloglog-a-simple-but-powerful-algorithm-for-data-scientists-aed50fe47869
  4. https://towardsdatascience.com/big-data-with-sketchy-structures-part-2-hyperloglog-and-bloom-filters-73b1c4a2e6ad
  5. https://www.youtube.com/watch?v=2PlrMCiUN_s&ab_channel=VictorSanchesPortella
  6. https://pangaj.github.io/HyperLogLog/
  7. https://www.yld.io/blog/hyperloglog-a-probabilistic-data-structure/
  8. https://odino.org/my-favorite-data-structure-hyperloglog/
  9. https://druid.apache.org/blog/2012/05/04/fast-cheap-and-98-right-cardinality-estimation-for-big-data.html
  10. https://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf [official paper]
  11. https://redis.io/docs/data-types/probabilistic/hyperloglogs/
  12. https://blog.notdot.net/2012/09/Dam-Cool-Algorithms-Cardinality-Estimation
  13. https://www.dhirubhai.net/pulse/understanding-hyperloglog-utkarsh-upendra/


#datascience #supercomputing #datastructures #algorithms #algorithmmastery #probability






