HyperLogLog Basics

Overview

Probabilistic data structures (PDS) stand out among traditional indexing structures for their excellent time and space complexity. They achieve these advantages by relying on hash functions to map data to deterministic, uniformly distributed values. Some well-known examples are:

  • Bloom & Cuckoo Filters: for approximate set-membership queries
  • Count-Min Sketch: for frequency estimation
  • HyperLogLog: for cardinality estimation (counting unique values)

HyperLogLog is a probabilistic cardinality estimation algorithm designed to approximate the number of distinct elements in a multiset (a set with potential duplicates). It achieves high efficiency by using a fixed, small amount of memory regardless of the size of the input set. HyperLogLog hashes each element and maps it to one of a fixed number of registers, and its distinctive feature lies in how those registers are carefully combined to estimate cardinality. Despite being probabilistic, HyperLogLog provides accurate approximations with very low memory requirements, making it particularly useful where memory efficiency matters, such as in large-scale distributed systems and data-streaming scenarios.
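As a rough intuition sketch (my own illustration, not from the original text), assuming a uniform hash: a run of k leading zero bits occurs with probability 2^-k, so the longest run observed over n distinct items grows roughly like log2(n). A minimal demonstration using Python's standard-library hashlib:

```python
import hashlib

def leading_zeros(data: str, bits: int = 32) -> int:
    """Count the leading zero bits in the first `bits` bits of SHA-256(data)."""
    digest = hashlib.sha256(data.encode()).digest()
    bitstring = ''.join(format(b, '08b') for b in digest)[:bits]
    return len(bitstring) - len(bitstring.lstrip('0'))

# The maximum run over many distinct items is a rough log2 of their count.
max_run = max(leading_zeros(f"item-{i}") for i in range(10_000))
print(max_run)  # typically near log2(10000) ~ 13, but noisy
```

This single-maximum estimate is far too noisy on its own; the bucketing and averaging described below are what make it usable.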

Structure

The main components of HyperLogLog are:

  • (Single) Hash Function: Flajolet used a single hash function to keep the algorithm cheap to run. The downside is that it must be a high-quality hash function with uniformly distributed output, such as SHA-256 or a comparable alternative.
  • Index From Value: this value is extracted from the first “x” bits of the hash value. It is used to place elements into the different buckets of the data structure.
  • Actual Value: the remaining bits are used to find the longest leading sequence of “0” bits, which drives the approximation of the cardinality of a large-scale dataset.
  • Buckets: the values derived from the hash function are stored in the buckets. The number of buckets is determined by the length of the “index from value”. For example, if the first 2 bits of the hash value are used as the index, then there are 2^2 = 4 buckets.
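To make the index/value split concrete, here is a small sketch (using the standard-library hashlib rather than any specific library; the names are illustrative) of how the first 2 bits of a hash select one of 2^2 = 4 buckets, matching the example above:

```python
import hashlib

TO_EXTRACT_BITS = 2                 # index length from the example
NUM_BUCKETS = 2 ** TO_EXTRACT_BITS  # 2^2 = 4 buckets

def split_hash(item: str):
    """Split a hash into (bucket index, remaining value bits)."""
    digest = hashlib.sha256(item.encode()).digest()
    bits = ''.join(format(b, '08b') for b in digest)
    idx = int(bits[:TO_EXTRACT_BITS], 2)    # bucket index: 0..3
    value_bits = bits[TO_EXTRACT_BITS:32]   # bits used for the zero-run count
    return idx, value_bits

idx, value_bits = split_hash("hello")
assert 0 <= idx < NUM_BUCKETS
```

Because the index bits are uniform, elements spread evenly across the buckets, which is what the averaging step below relies on.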

Once each bucket holds its count, Flajolet improved the estimate's accuracy by combining the buckets with a harmonic mean instead of a geometric mean. For m values x_1, ..., x_m the harmonic mean is:

H = m / (1/x_1 + 1/x_2 + ... + 1/x_m)

If the bucket values contain outliers, the harmonic mean dampens their effect on the overall result, yielding a more accurate approximation than the arithmetic mean would.
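A quick numeric illustration (the values are made up for the example): one large outlier dominates the arithmetic mean but barely moves the harmonic mean.

```python
from statistics import harmonic_mean, mean

values = [2, 3, 2, 4, 3, 100]  # bucket-style counts with one outlier

print(mean(values))            # 19.0 -- pulled up by the outlier
print(harmonic_mean(values))   # ~3.11 -- stays near the typical values
```

This robustness matters because a single bucket that happens to see an unusually long zero run would otherwise skew the whole estimate.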

Next is the formula used to estimate the cardinality of a dataset, where m is the number of buckets and M[j] is the maximum leading-zero count stored in bucket j:

E = alpha_m * m^2 * ( 2^(-M[1]) + ... + 2^(-M[m]) )^(-1),   with alpha_m ≈ 0.7213 / (1 + 1.079/m)
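As a toy worked example (the register values are invented, and the paper's small-range correction is skipped), plugging m = 4 buckets into the formula:

```python
registers = [3, 2, 4, 3]                     # M[j]: max leading-zero count per bucket
m = len(registers)                           # m = 4 buckets
alpha = 0.7213 / (1 + 1.079 / m)             # approximation; the exact alpha_4 differs
raw_sum = sum(2 ** -mj for mj in registers)  # 2^-3 + 2^-2 + 2^-4 + 2^-3 = 0.5625
estimate = alpha * m * m / raw_sum
print(round(estimate, 2))                    # roughly 16
```

Note that the alpha_m approximation above is intended for large m; the original paper tabulates exact constants for small bucket counts.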

The full algorithm, including its small-range and large-range corrections, is given in the original paper by Flajolet et al. (reference 10 below).

Applications

Real-world applications are found mainly in databases and at large tech companies such as Google and Reddit.

  • Redis: Redis supports HyperLogLog as a built-in data type. One example of how HyperLogLogs are used with Redis is counting the distinct license plates on busy roads in San Francisco to analyze traffic load. Redis is well known as an in-memory database, meaning the footprint of its elements has to be minimal: a Redis HyperLogLog uses at most 12 KB per key while keeping the standard error around 0.81%.
  • Google: Google uses HyperLogLog to count the number of distinct search queries made globally. It is extremely efficient because it lets Google spend only a minimal amount of resources on further analytics over the unique queries.
  • Reddit: Reddit uses HyperLogLog to count the unique users who view an article or post. Plain view counters get inflated when the same user connects from several devices; with HyperLogLog, every session belonging to the same user name hashes identically, so each user is counted only once.

Source Code [Python]

import math
import mmh3  # third-party: pip install mmh3

class HyperLogLog:
    def __init__(self, to_extract_bits=10, bit_number=32):
        self.to_extract_bits = to_extract_bits     # index bits: 2^b buckets
        self.bit_number = bit_number               # total hash bits used
        self.bucket_length = 2 ** to_extract_bits  # number of buckets (m)
        self.register = [0] * self.bucket_length   # max leading-zero count per bucket
        # Bias-correction constant alpha_m (approximation, intended for m >= 128).
        self.alpha = 0.7213 / (1 + (1.079 / self.bucket_length))

    def insert(self, item):
        # Hash the item and expand the digest into a bit string.
        hashed_bytes = mmh3.hash_bytes(str(item))
        binary_sequence = ''.join(format(byte, '08b') for byte in hashed_bytes)

        # The first b bits select the bucket; the rest feed the zero-run counter.
        idx = int(binary_sequence[:self.to_extract_bits], 2)
        val = binary_sequence[self.to_extract_bits:self.bit_number]
        zero_count = self._long_sequence_counter(val)

        # Each register stores only the maximum leading-zero count seen so far.
        if self.register[idx] < zero_count:
            self.register[idx] = zero_count

    def estimate(self):
        # E = alpha_m * m^2 / sum_j 2^(-M[j])  (harmonic-mean combination)
        m = len(self.register)
        h_avg = m / self._reciprocals_calculator(self.register)
        hll_estimate = round(self.alpha * m * h_avg, 2)

        # Expected standard error of the estimate: 1.04 / sqrt(m).
        # error_rate = 1.04 / math.sqrt(m)

        return hll_estimate

    def _long_sequence_counter(self, bits):
        # Length of the run of '0' bits before the first '1'.
        count = 0
        for bit in bits:
            if bit != '0':
                break
            count += 1
        return count

    def _reciprocals_calculator(self, register):
        # sum_j 2^(-M[j]); never zero, since empty registers hold 0 and 2^0 = 1.
        return sum(1 / (2 ** m_j) for m_j in register)

References

  1. https://engineering.fb.com/2018/12/13/data-infrastructure/hyperloglog/
  2. https://www.youtube.com/watch?v=MunL8nnwscQ&ab_channel=Redis
  3. https://towardsdatascience.com/hyperloglog-a-simple-but-powerful-algorithm-for-data-scientists-aed50fe47869
  4. https://towardsdatascience.com/big-data-with-sketchy-structures-part-2-hyperloglog-and-bloom-filters-73b1c4a2e6ad
  5. https://www.youtube.com/watch?v=2PlrMCiUN_s&ab_channel=VictorSanchesPortella
  6. https://pangaj.github.io/HyperLogLog/
  7. https://www.yld.io/blog/hyperloglog-a-probabilistic-data-structure/
  8. https://odino.org/my-favorite-data-structure-hyperloglog/
  9. https://druid.apache.org/blog/2012/05/04/fast-cheap-and-98-right-cardinality-estimation-for-big-data.html
  10. https://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf [official paper]
  11. https://redis.io/docs/data-types/probabilistic/hyperloglogs/
  12. https://blog.notdot.net/2012/09/Dam-Cool-Algorithms-Cardinality-Estimation
  13. https://www.dhirubhai.net/pulse/understanding-hyperloglog-utkarsh-upendra/


#datascience #supercomputing #datastructures #algorithms #algorithmmastery #probability






