HyperLogLog Basics
Overview
Probabilistic data structures (PDS) are popular because they offer outstanding time and space complexity compared with traditional indexing algorithms. They gain these advantages by relying on hash functions, which map data to values that are deterministic yet behave as if uniformly random. Some well-known examples of PDS are the Bloom filter, the Count-Min sketch, and HyperLogLog.
HyperLogLog is a probabilistic cardinality estimation algorithm designed to approximate the number of distinct elements in a multiset, that is, a set with potential duplicates. It achieves high efficiency by using a fixed, small amount of memory regardless of the size of the input. HyperLogLog hashes each element and uses a pattern in the hashed bits, namely the length of the run of leading zeros, to update a fixed-size array of registers; since a run of k leading zeros occurs with probability 2^-k, the longest runs observed across the registers reveal roughly how many distinct elements have been seen. Despite being probabilistic, HyperLogLog provides accurate approximations (typically within a few percent) with very low memory requirements, making it particularly useful where memory efficiency matters, such as in large-scale distributed systems and data-streaming scenarios.
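To make that intuition concrete, here is a minimal sketch, separate from the full implementation further below, that hashes distinct elements and tracks the longest run of leading zeros. The helper name and the use of SHA-1 are illustrative choices, not part of the article's implementation; the single-register estimate 2^R it prints is crude, which is exactly why HyperLogLog averages over many registers.

import hashlib

def leading_zeros(bits: str) -> int:
    # Count the leading '0' characters in a binary string.
    count = 0
    for b in bits:
        if b == '1':
            break
        count += 1
    return count

# A run of k leading zeros occurs with probability 2^-k, so the longest
# run seen grows roughly like log2(n) for n distinct elements.
max_run = 0
n = 100_000
for i in range(n):
    digest = hashlib.sha1(str(i).encode()).digest()
    bits = ''.join(format(byte, '08b') for byte in digest)[:32]
    max_run = max(max_run, leading_zeros(bits))

print(f"{n} distinct elements -> longest leading-zero run: {max_run}")
print(f"crude single-register estimate: 2**{max_run} = {2 ** max_run}")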
Structure
The main components of a HyperLogLog are the following:
- A hash function that maps every element to a fixed-length binary string (the implementation below uses MurmurHash3).
- A bucket index, taken from the first b bits of the hash, which selects one of m = 2^b registers.
- An array of m registers, each storing the longest run of leading zeros observed in the remaining hash bits among all elements routed to that register.
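The split of the hash into an index part and a counting part can be sketched as follows; this is a simplified preview of the insert logic in the full implementation below, using the same mmh3 library, with the input value chosen only for illustration.

import mmh3

b = 10                                   # number of index bits (2**10 = 1024 registers)
digest = mmh3.hash_bytes("user-42")      # 128-bit MurmurHash3 digest
bits = ''.join(format(byte, '08b') for byte in digest)

index = int(bits[:b], 2)                 # first b bits pick the register
rest = bits[b:32]                        # remaining bits feed the zero counter
run = len(rest) - len(rest.lstrip('0'))  # length of the leading-zero run
print(f"register index: {index}, leading-zero run: {run}")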
Once each register holds its count, the per-register values must be combined into a single estimate. Flajolet improved the algorithm's accuracy by using a harmonic mean instead of the geometric mean used by the earlier LogLog algorithm. The formula for the harmonic mean is as follows:
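$$H = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$$

where $x_1, \dots, x_n$ are the values being averaged.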
Consider a sequence of register values in which one entry is far larger than the rest. The harmonic mean dampens the influence of such outliers, which an arithmetic or geometric mean would let dominate, and therefore yields a more accurate approximation.
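As an illustration (the specific values here are chosen for the example, not taken from a particular dataset), for the sequence 1, 2, 4, 32 the arithmetic mean is pulled strongly toward the outlier while the harmonic mean is barely affected:

$$\bar{x}_{\text{arith}} = \frac{1 + 2 + 4 + 32}{4} = 9.75, \qquad H = \frac{4}{\frac{1}{1} + \frac{1}{2} + \frac{1}{4} + \frac{1}{32}} \approx 2.25$$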
Next is the formula used to estimate the cardinality of a dataset:
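$$E = \alpha_m \cdot m^2 \cdot \left( \sum_{j=0}^{m-1} 2^{-M[j]} \right)^{-1}$$

where $m$ is the number of registers, $M[j]$ is the value stored in register $j$, and $\alpha_m \approx \frac{0.7213}{1 + 1.079/m}$ is a bias-correction constant. The standard error of the estimate is approximately $1.04/\sqrt{m}$.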
The full algorithm, including the corrections for very small and very large cardinality ranges, is given in the original paper by Flajolet, Fusy, Gandouet, and Meunier (see References).
Applications
Real-world use is concentrated in databases and large tech companies: Redis exposes HyperLogLog through its PFADD and PFCOUNT commands, Google BigQuery uses the HyperLogLog++ variant behind its APPROX_COUNT_DISTINCT function, and Reddit has used HyperLogLog to count unique post views.
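For instance, here is a short sketch using the redis-py client; PFADD and PFCOUNT are Redis's built-in HyperLogLog commands, while the server address and key name are assumptions for the example.

import redis

# Connect to a local Redis server (assumed to be running on the default port).
r = redis.Redis(host="localhost", port=6379)

# PFADD inserts elements into the HyperLogLog stored at key "visitors".
r.pfadd("visitors", "alice", "bob", "carol", "alice")

# PFCOUNT returns the approximate number of distinct elements.
print(r.pfcount("visitors"))  # -> 3 (approximately)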
Source Code [Python]
import math

import mmh3


class HyperLogLog:
    def __init__(self, to_extract_bits=10, bit_number=32):
        self.to_extract_bits = to_extract_bits        # b: bits used for the bucket index
        self.bit_number = bit_number                  # total hash bits considered
        self.bucket_length = 2 ** to_extract_bits     # m = 2^b registers
        self.register = [0] * self.bucket_length
        # The bias-correction constant alpha_m depends on the number of
        # registers m, not on the hash width.
        self.alpha = 0.7213 / (1 + (1.079 / self.bucket_length))

    def insert(self, hash_val):
        # Hash the element to 128 bits and render them as a binary string.
        hashed_bit_num = mmh3.hash_bytes(str(hash_val))
        binary_sequence = ''.join(format(byte, '08b') for byte in hashed_bit_num)
        # The first b bits select the register; the remaining bits are
        # scanned for their run of leading zeros.
        idx = int(binary_sequence[:self.to_extract_bits], 2)
        val = binary_sequence[self.to_extract_bits:self.bit_number]
        zero_count = self._long_sequence_counter(val)
        # Each register keeps the longest run observed so far.
        if zero_count > self.register[idx]:
            self.register[idx] = zero_count

    def estimate(self):
        # E = alpha * m^2 / sum(2^-M[j]) = alpha * m * harmonic_mean(2^M[j])
        m = len(self.register)
        h_avg = m / self._reciprocals_calculator(self.register)
        hll_estimate = round(self.alpha * m * h_avg, 2)
        # Small-range correction (linear counting) from the original paper.
        if hll_estimate <= 2.5 * m:
            zeros = self.register.count(0)
            if zeros != 0:
                hll_estimate = round(m * math.log(m / zeros), 2)
        error_rate = round(1.04 / math.sqrt(m) * 100, 2)  # standard error in %
        # print('HyperLogLog Estimate  :', hll_estimate)
        # print('HyperLogLog Error Rate:', error_rate, '%')
        return hll_estimate

    def count(self, hyperloglog):
        # Rebuild registers from per-bucket lists of stored hash suffixes
        # (binary strings) and return them.
        register = []
        for hash_list in hyperloglog:
            temp_register = [self._long_sequence_counter(h) for h in hash_list]
            register.append(max(temp_register) if temp_register else 0)
        return register

    def _long_sequence_counter(self, hash_val):
        # Count the leading '0' characters of a binary string.
        count = 0
        for val in hash_val:
            if val == '0':
                count += 1
            else:
                break
        return count

    def _get_alpha(self, bucket_length):
        # alpha_m for m = bucket_length registers (valid for m >= 128).
        return 0.7213 / (1 + (1.079 / bucket_length))

    def _reciprocals_calculator(self, register):
        # Sum of 2^-M[j] over all registers; iterative, so it cannot hit
        # Python's recursion limit with m = 1024 registers.
        return sum(2.0 ** -r for r in register)
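A quick usage example of the class above; with m = 1024 registers the estimate should land within a few percent of the true distinct count.

hll = HyperLogLog(to_extract_bits=10)    # m = 1024 registers, ~3.25% standard error

for i in range(100_000):
    hll.insert(f"user-{i}")
    hll.insert(f"user-{i}")              # duplicates do not change the estimate

print(hll.estimate())                    # expected to land near 100000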
References
Flajolet, P., Fusy, É., Gandouet, O., & Meunier, F. (2007). HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. Proceedings of the 2007 Conference on Analysis of Algorithms (AofA 07), DMTCS Proceedings.
#datascience #supercomputing #datastructures #algorithms #algorithmmastery #probability