A Gentle Introduction to Probabilistic Data Structures
In contrast to traditional (deterministic) data structures, probabilistic data structures use far less memory. This advantage, however, typically comes at the cost of a small, tunable error rate on queries. Herein, we discuss the basics of probabilistic data structures and explore the theory, motivation, and implementation of one of the simplest examples: the Bloom filter.
History and Overview
Burton Howard Bloom invented the first probabilistic data structure in 1969 (published in 1970), widely known today as the Bloom filter. The American computer scientist was interested in more effectively using main memory and relying less on slow reads from bulk storage devices or databases. Bloom decided to analyze data structures in a different light than usual: rather than just concerning himself with time and space efficiency, he introduced the idea of “an allowable fraction of errors.”
Bloom looked specifically at the problem of set membership, which asks, “Given an element X and a collection of elements C, is X in C?” (for example, is a given word in the dictionary?) and returns an answer: yes or no. Bloom proposed a data structure – what we now call the Bloom filter – that offers a highly succinct (space-efficient) representation of such a collection. The caveat is, instead of getting “yes” or “no” as answers, we get “no” or “probably” (there are only false positives, no false negatives). So, in short, a Bloom filter is a probabilistic set where we can trade a degree of accuracy for memory conservation. The “classic” Bloom filter supports two fundamental operations: inserting an element into the filter and checking an element for membership.
Such a representation has proved to be especially useful in cases where memory is scarce compared to the size of the dataset, and the penalty of having an occasional false positive is not severe. Numerous early spell checkers used Bloom filters – it is far more efficient to store and search from a Bloom filter representation of a dictionary rather than perform a lookup on the dictionary proper. In such cases, Bloom filters offer significant performance advantages.
This is the heart of the matter with probabilistic data structures: acceptable inaccuracy. Since memory (RAM) is much faster than bulk storage (SSD, HDD, tape) – but much more scarce – we can achieve massive performance gains by making our data structures more succinct and usable in memory, using bulk storage only as needed.
In the fifty-five years since their inception, probabilistic data structures have become virtually ubiquitous in computer science, being used in databases, artificial intelligence, distributed computing, and big data analytics. Though Bloom filters offer highly space-efficient approximate membership querying, there are also improvements, such as the cuckoo filter, which, unlike the Bloom filter, supports deletion. Beyond membership querying, there are also data structures that support the memory-efficient approximation of other quantities, such as cardinality (e.g., HyperLogLog) and element frequency (e.g., the count-min sketch).
The canonically cited modern use of Bloom filters comes from content delivery network (CDN) giant Akamai Technologies in 2015. After analyzing a typical CDN cluster, engineers at Akamai noticed that 74% of the 400 million objects (web and video from various social media platforms) in the cache were only ever accessed once. They subsequently used Bloom filters to avoid caching these "one-hit wonders" – objects requested only a single time – and, in a test run on 47 production servers, reduced disk writes by 44%, falling from 10,209 writes per second to 5,738 writes per second.
Due to this reduction in disk writes, the latency of reading from disk improved by 24% (15.6 ms to 11.9 ms) since there was less competition for disk access.
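To make the idea concrete, here is a minimal sketch of that "cache on second hit" policy, written against the BloomFilter class we build later in this article. The cache dictionary and fetch_from_origin function are hypothetical stand-ins, not a description of Akamai's actual system:

cache = {}   # stand-in for the disk cache
seen_once = BloomFilter(expected_insertions=1_000_000, fp_rate=0.01)

def fetch_from_origin(url):
    # Placeholder for actually fetching the object from the origin server
    return "<content of " + url + ">"

def handle_request(url):
    if url in cache:
        return cache[url]               # cache hit: serve from the disk cache
    content = fetch_from_origin(url)
    if seen_once.may_contain(url):
        cache[url] = content            # probably the second request: now worth caching
    else:
        seen_once.insert(url)           # first sighting: remember it, but don't cache yet
    return content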
Basics: How Does a Bloom Filter Work?
Conceptually, the Bloom filter is relatively simple. At its core are two components: an addressable array of m bits and k different hash functions. Later, we will see how we can derive optimal numbers for m and k given specific parameters, such as a desired false positive rate, but for now, we will assume they are known in advance.
To construct a Bloom filter, we initialize a bit array of size m to all zeroes. In the example below, m = 12, and the array is indexed from 0 to m - 1:

index:  0  1  2  3  4  5  6  7  8  9 10 11
bit:    0  0  0  0  0  0  0  0  0  0  0  0
Our Bloom filter will also have k independent, random hash functions. These hash functions should map uniformly over the range {0, …, m - 1}, as is typically expected of hash functions. It should be noted that these need not be cryptographic hash functions.
We will denote the k different hash functions as h1, h2, ..., hk, where each hi maps an element to an index in {0, ..., m - 1}.
For the sake of demonstration, we will choose k = 3.
If we wish to add an element E to the Bloom filter, we run E through each of the k hash functions and set the bits at the resulting indexes to 1. For example, supposing that

h1(E) = 2,  h2(E) = 5,  h3(E) = 9

after adding E to the Bloom filter, the bit array would now look like this:

index:  0  1  2  3  4  5  6  7  8  9 10 11
bit:    0  0  1  0  0  1  0  0  0  1  0  0
Now, for the sake of demonstration, let us add another element, Z, to the Bloom filter. Suppose now that

h1(Z) = 2,  h2(Z) = 7,  h3(Z) = 11

Notice that there is a collision at index 2: E already set that bit, so it remains unchanged. The updated bit array now looks like this:

index:  0  1  2  3  4  5  6  7  8  9 10 11
bit:    0  0  1  0  0  1  0  1  0  1  0  1
Querying for elements in the Bloom filter works in much the same way. If we wanted to check if E was in the Bloom filter, we would run the element through each hash function – and if every bit at those indexes is 1, then we know that E has probably been added to the filter.
Conversely, suppose we ask whether an element Q, which was never added, is in the filter, and that

h1(Q) = 10,  h2(Q) = 1,  h3(Q) = 5

Since the bits at indexes 10 and 1 are still 0, we know with certainty that Q has not been added to the Bloom filter. If it had been, the bits at indexes 10 and 1 could not possibly be 0.
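Before moving on, here is a tiny Python sketch of this walkthrough, with the hash values above hard-coded purely for illustration (a real Bloom filter computes them with actual hash functions, as we will do later):

m = 12
bits = [0] * m

# Pretend hash outputs (h1, h2, h3) matching the walkthrough above
fake_hashes = {
    "E": [2, 5, 9],
    "Z": [2, 7, 11],
    "Q": [10, 1, 5],
}

def insert(element):
    for index in fake_hashes[element]:
        bits[index] = 1

def may_contain(element):
    # Every one of the element's bits must be 1 for a "probably yes" answer
    return all(bits[index] == 1 for index in fake_hashes[element])

insert("E")
insert("Z")
print(bits)               # [0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1]
print(may_contain("E"))   # True  -> "probably in the set"
print(may_contain("Q"))   # False -> "definitely not in the set"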
An astute observer will notice a few things about this data structure. One is that, unlike a traditional set or hash map, inserting elements into a Bloom filter does not change its size in memory or storage. As elements are added, the bit array simply fills up and the density of 1’s increases. A higher density of 1’s raises the probability of false positives, since it becomes more likely that a query for a non-existent element will, purely by chance, find a 1 at all k of its bit positions.
However, the relationship between the number of hash functions and the false positive rate is slightly more complicated. Initially, increasing k will reduce the false positive rate since more checks make it harder for a non-member to match all positions. But as k increases, each insertion affects more bits, thus increasing the density of 1’s and the false positive rate. We can intuitively make these observations, but later, we must turn to mathematics to calculate optimal numbers.
In practice, we typically specify at least two parameters when creating a Bloom filter: the desired false positive rate and the expected number of insertions. More recently, a variation of the Bloom filter, the Scalable Bloom Filter, was invented, which can dynamically adapt to the number of elements stored while assuring a maximum false positive probability. Regardless, for the sake of discussion, we will remain focused on the classical Bloom filter.
Elementary Analysis: Estimating Error Rates & Finding Optimal Values
Note: if you really hate math, feel free to skip this section. For the purposes of using and broadly understanding a Bloom filter, the analytical details are optional; you can simply take the results as given.
Let us first reiterate two intuitive observations we were able to make:

1. As elements are inserted and the density of 1’s in the bit array grows, the false positive rate grows with it.
2. Increasing the number of hash functions k initially lowers the false positive rate, but past a certain point it raises it again, since every insertion sets more bits.
Let's explore observation one more analytically. Assume that the fraction of 1’s in the bit array is 1/10 and that we are querying for a non-existent element. Each of the k hash functions then has a 10% chance of landing, just by chance, on a cell that has been set to 1. This means that the probability of a false positive, ε, is the density of 1’s raised to the power k:

\varepsilon = \left(\frac{1}{10}\right)^{k}
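For instance, with the k = 3 hash functions of our earlier demonstration, this would give:

\varepsilon = \left(\frac{1}{10}\right)^{3} = 0.001

or a 0.1% chance of a false positive.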
To better understand the density of 1’s analytically, let's assume we have just one hash function (k = 1) and a bit array of size m. After inserting one element, the probability that any given bit has been set to 1 is 1/m. Equivalently, the probability that a given bit is still 0 is

1 - \frac{1}{m}
If we insert n elements into this Bloom filter with k = 1, the probability that a given bit is still 0 becomes:

\left(1 - \frac{1}{m}\right)^{n}
Rewriting this by multiplying the exponent by m/m, we get:

\left(1 - \frac{1}{m}\right)^{m \cdot \frac{n}{m}}
For algebraic purposes, the advantage of this is that the inner term

\left(1 - \frac{1}{m}\right)^{m} \qquad (1)

has a well-known approximation. As long as m is large (which, in practice, it is), equation 1 is very close to 1/e, allowing us to approximate the density of 0’s as:

\left(1 - \frac{1}{m}\right)^{n} = \left(\left(1 - \frac{1}{m}\right)^{m}\right)^{\frac{n}{m}} \approx e^{-n/m}

The same reasoning applies when there are k hash functions: inserting n elements then sets bits kn times, so the density of 0’s is approximately e^{-kn/m}.
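To get a feel for how good that approximation is, even a modest m gets very close to 1/e:

\left(1 - \frac{1}{1000}\right)^{1000} \approx 0.36770, \qquad \frac{1}{e} \approx 0.36788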
Put another way, if we have an array of 50 million bits, 10 hash functions, and we insert 5 million elements, we have m = 50 million and kn = 10 × 5 million = 50 million, giving us a 0’s density of:

e^{-kn/m} = e^{-1} \approx 0.368

Meaning that the 1’s density is 1 - 0.368 = 0.632. Hence, the probability of a false positive here is

0.632^{10} \approx 0.0102
Or about 1%.
More completely, we can express the approximate false positive rate as:

\varepsilon \approx \left(1 - e^{-kn/m}\right)^{k} \qquad (2)
To minimize ε, we can optimize k by setting the derivative of ε with respect to k to zero and solving for the critical points. Omitting some tedious calculations, and momentarily ignoring that the number of hash functions must be a positive integer, we find that the optimal number for k is:

k = \frac{m}{n} \ln 2 \qquad (3)
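As a worked example that bears out both of our earlier observations, here is equation 2 evaluated for a filter with 10 bits per element (m/n = 10), with the results rounded:

\begin{aligned}
k = 1 &: \; \varepsilon \approx (1 - e^{-0.1})^{1} \approx 0.095 \\
k = 3 &: \; \varepsilon \approx (1 - e^{-0.3})^{3} \approx 0.017 \\
k = 7 &: \; \varepsilon \approx (1 - e^{-0.7})^{7} \approx 0.008 \\
k = 10 &: \; \varepsilon \approx (1 - e^{-1.0})^{10} \approx 0.010 \\
k = 15 &: \; \varepsilon \approx (1 - e^{-1.5})^{15} \approx 0.023
\end{aligned}

The false positive rate falls at first, bottoms out near the optimum k = 10 ln 2 ≈ 7, and then rises again as the filter saturates with 1's.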
We can use this insight to find an optimal size for the bit array. Back-substituting the value for k from equation 3 into equation 2 gives:

\varepsilon \approx \left(1 - e^{-\ln 2}\right)^{\frac{m}{n}\ln 2} = \left(\frac{1}{2}\right)^{\frac{m}{n}\ln 2}

Taking the natural logarithm of both sides gives ln ε = -(m/n)(ln 2)². Isolating m, we arrive at our answer:

m = -\frac{n \ln \varepsilon}{(\ln 2)^2} \qquad (4)
It should be noted that while these equations are approximations rather than strict bounds, they are sufficiently accurate for practical purposes.
Since ε and n are specified by the programmer, we now know enough to construct a Bloom filter in its entirety. Let's walk through an example mathematically, reviewing what we know. Suppose we are making a Bloom filter where we expect to insert 1 billion elements and want a false positive rate of 1%; that is, n = 1 billion and ε = 0.01. First, we can initialize a bit array of optimal length, substituting n and ε back into equation 4:

m = -\frac{10^9 \cdot \ln 0.01}{(\ln 2)^2} \approx 9.59 \times 10^9 \text{ bits}
Which allows us to calculate the optimal number of hash functions:

k = \frac{m}{n} \ln 2 = \frac{9.59 \times 10^9}{10^9} \cdot \ln 2 \approx 6.6

which we can round up to 7.
Now, we can be reasonably sure that by initializing a bit array with 9.6 billion bits and using 7 independent, uniformly distributed hash functions, after 1 billion distinct insertions, we will maintain a false positive rate of 1% or less.
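As a quick sanity check of this arithmetic, we can compute the same numbers in a few lines of Python; this is just the math from equations 3 and 4, independent of any Bloom filter implementation:

from math import ceil, log

n = 1_000_000_000   # expected insertions
eps = 0.01          # desired false positive rate

m = ceil(-n * log(eps) / log(2) ** 2)   # optimal bit array size (equation 4)
k = ceil((m / n) * log(2))              # optimal number of hash functions (equation 3)

print(m)   # roughly 9.59 billion bits (~1.2 GB)
print(k)   # 7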
Implementation: A Bloom Filter in Python
Now that we conceptually understand Bloom filters, let's build one and see how it works in practice. For the sake of simplicity and performance in the demonstration, I will be using two third-party Python libraries: "bitarray" and "xxhash". However, if you wish to see a pure Python implementation with no external libraries used, I've written one here.
Before we write any code, we should consider the interface. We will represent a Bloom filter with a class. We know that when we create a Bloom filter, we specify two parameters: the desired false positive rate and the expected number of insertions. These, of course, will be passed into the constructor. At that point, we should allocate the proper number of bits in the bit array, figure out the optimal number of hash functions, and select/store those in some data structure.
The BloomFilter class will support two main operations: "insert" and "may_contain." Both of these operations expect one argument: the element to be added or queried.
Knowing this, let's write out the skeleton of our class:
class BloomFilter:
    def __init__(self, expected_insertions, fp_rate):
        self._expected_insertions = expected_insertions
        self._fp_rate = fp_rate
        self._bit_array = self._make_bit_array()
        self._hash_functions = self._pick_hash_functions()

    def _make_bit_array(self):
        pass

    def _pick_hash_functions(self):
        pass

    def insert(self, element):
        pass

    def may_contain(self, element):
        pass
First, we can implement "_make_bit_array", using equation 4 from above to create a bit array of the correct size:
from math import ceil, log

from bitarray import bitarray


class BloomFilter:
    def __init__(self, expected_insertions, fp_rate):
        ...

    def _make_bit_array(self):
        n = self._expected_insertions
        p = self._fp_rate
        length = ceil(-n * log(p) / log(2) ** 2)
        bit_array = bitarray(length)
        bit_array.setall(0)  # bitarray() leaves its bits uninitialized, so zero them explicitly
        return bit_array

    def _pick_hash_functions(self):
        pass

    def insert(self, element):
        pass

    def may_contain(self, element):
        pass
This provides us with the required variables to determine the optimal number of hash functions to use:
from functools import partial
from math import ceil, log

from bitarray import bitarray
from xxhash import xxh3_64_intdigest


class BloomFilter:
    def __init__(self, expected_insertions, fp_rate):
        ...

    def _make_bit_array(self):
        ...

    def _pick_hash_functions(self):
        n = self._expected_insertions
        m = len(self._bit_array)
        k = ceil((m / n) * log(2))  # could also probably use round()
        # Pick k hash functions
        return [partial(xxh3_64_intdigest, seed=i) for i in range(k)]

    def insert(self, element):
        pass

    def may_contain(self, element):
        pass
Here, I use partial function application, which I've written an article about in the past, as well as a list comprehension, to create a list of reusable but independent hash functions.
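If partial application is new to you, here is a tiny standalone illustration of what those seeded hashers look like (the byte string is arbitrary):

from functools import partial

from xxhash import xxh3_64_intdigest

# The same hash function "frozen" with two different seeds behaves like two
# independent hash functions.
h0 = partial(xxh3_64_intdigest, seed=0)
h1 = partial(xxh3_64_intdigest, seed=1)

print(h0(b"hello"))   # some 64-bit integer
print(h1(b"hello"))   # a different 64-bit integer for the same input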
At this point, we are ready to insert elements into the Bloom filter. This isn't too hard to do: just run the element through each hash function and set each affected bit to 1.
Since "xxh3_64_intdigest" expects bytes as an argument, I found it most convenient to first hash the inserted element into an integer, then call the "to_bytes" method to create a (most probably) unique byte signature for the element. "xxh3_64_intdigest" will return a random 64-bit integer, so we'll have to modulo that by the length of the bitarray to ensure we get a valid bucket in the bit array:
from functools import partial
from math import ceil, log

from bitarray import bitarray
from xxhash import xxh3_64_intdigest


class BloomFilter:
    def __init__(self, expected_insertions, fp_rate):
        ...

    def _make_bit_array(self):
        ...

    def _pick_hash_functions(self):
        ...

    def insert(self, element):
        # Create a byte signature unique to the element
        signature = hash(element).to_bytes(64, byteorder='big', signed=True)
        for hasher in self._hash_functions:
            # Find and set the proper slot in the bit array for the element
            bucket = hasher(signature) % len(self._bit_array)
            self._bit_array[bucket] = 1

    def may_contain(self, element):
        pass
The "may_contain" method is similar; we just need to ensure that every relevant bit is set to 1:
from functools import partial
from math import ceil, log

from bitarray import bitarray
from xxhash import xxh3_64_intdigest


class BloomFilter:
    def __init__(self, expected_insertions, fp_rate):
        ...

    def _make_bit_array(self):
        ...

    def _pick_hash_functions(self):
        ...

    def insert(self, element):
        ...

    def may_contain(self, element):
        # Create a byte signature unique to the element
        signature = hash(element).to_bytes(64, byteorder='big', signed=True)
        for hasher in self._hash_functions:
            # Find the proper slot in the bit array for the element
            bucket = hasher(signature) % len(self._bit_array)
            # An already inserted element would necessarily have all positions set to 1
            if self._bit_array[bucket] == 0:
                return False
        return True
Of course, one must take care to only pass hashable types into "insert" or "may_contain." At this point, we have completed the implementation of our basic Bloom filter!
Now, we can do operations like:
>>> bf = BloomFilter(expected_insertions=1_000_000, fp_rate=0.01)
>>> bf.insert('hello')
>>> bf.may_contain('hello')
True
>>> bf.may_contain('world')
False
>>> bf.insert('world')
>>> bf.may_contain('world')
True
If you benchmark it by adding UUIDs, this implementation is ~16 - 32 times more space efficient than an equivalently sized set (recall that the size of sets periodically doubles). For example, while a set with 7 million UUIDs added used 268.4 MB of heap space, the Bloom filter with a false positive rate of 1% used only 8.39 MB – a 32x improvement. This is in excellent agreement with the theoretical bound we derived, which predicts that a minimum of 8.3814 MB of space is required.
The speed performance, however, could likely be improved upon. In the same test, the Bloom filter took 11.9 times longer to insert and check the same number of elements. This is rather unsurprising. It will necessarily be slower than "set" since we are sequentially calling k + 1 hash functions. Readers may notice, however, that although the hash functions are processed serially, the ones in the loop are completely independent of each other, and more sophisticated implementations may take advantage of this.
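One widely used trick along these lines is the "double hashing" scheme of Kirsch and Mitzenmacher, in which two base hash values are combined arithmetically to stand in for all k hash functions. The sketch below shows the idea in isolation; it is not part of the implementation above, and m and k are assumed to come from the same formulas we derived earlier:

from xxhash import xxh3_64_intdigest

def bucket_indexes(signature, m, k):
    # Derive all k bucket indexes from two base hashes: g_i(x) = h1(x) + i * h2(x) (mod m)
    h1 = xxh3_64_intdigest(signature, seed=0)
    h2 = xxh3_64_intdigest(signature, seed=1)
    return [(h1 + i * h2) % m for i in range(k)]

# For example, the 7 buckets an element would occupy in a 1,000-bit filter:
print(bucket_indexes(b"some element", m=1_000, k=7))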
For reference, here is the complete code I used for benchmarking. I also implemented "__sizeof__" for convenience: while it is not 100% accurate (we are storing a few extra constants after all), the deviation is negligible for large bit arrays.
from functools import partial
from math import ceil, log

from bitarray import bitarray
from xxhash import xxh3_64_intdigest


class BloomFilter:
    def __init__(self, expected_insertions, fp_rate):
        self._expected_insertions = expected_insertions
        self._fp_rate = fp_rate
        self._bit_array = self._make_bit_array()
        self._hash_functions = self._pick_hash_functions()

    def _make_bit_array(self):
        n = self._expected_insertions
        p = self._fp_rate
        length = ceil(-n * log(p) / log(2) ** 2)
        bit_array = bitarray(length)
        bit_array.setall(0)  # bitarray() leaves its bits uninitialized, so zero them explicitly
        return bit_array

    def _pick_hash_functions(self):
        n = self._expected_insertions
        m = len(self._bit_array)
        k = ceil((m / n) * log(2))  # could also probably use round()
        # Pick k hash functions
        return [partial(xxh3_64_intdigest, seed=i) for i in range(k)]

    def insert(self, element):
        signature = hash(element).to_bytes(64, byteorder='big', signed=True)
        for hasher in self._hash_functions:
            bucket = hasher(signature) % len(self._bit_array)
            self._bit_array[bucket] = 1

    def may_contain(self, element):
        signature = hash(element).to_bytes(64, byteorder='big', signed=True)
        for hasher in self._hash_functions:
            bucket = hasher(signature) % len(self._bit_array)
            if self._bit_array[bucket] == 0:
                return False
        return True

    def __sizeof__(self):
        return self._bit_array.__sizeof__()


if __name__ == "__main__":
    from uuid import uuid4
    from time import perf_counter

    size = 7_000_000
    uuids = [uuid4() for _ in range(size)]

    start = perf_counter()
    b = BloomFilter(size, fp_rate=0.01)
    for uuid in uuids:
        b.insert(uuid)
    end = perf_counter()
    print(round(end - start, 3), "seconds to add", size, "uuids")
    bf_time = end - start

    start = perf_counter()
    s = set()
    for uuid in uuids:
        s.add(uuid)
    end = perf_counter()
    s_time = end - start
    print(round(bf_time / s_time, 3), "times slower than set")

    ssize = s.__sizeof__() / 1000 / 1000
    bsize = b.__sizeof__() / 1000 / 1000
    print(ssize, "MB used by set")
    print(bsize, "MB used by Bloom filter")
    print(round(ssize / bsize, 3), "times more space efficient")
Tested on Python 3.9.4, bitarray version 2.9.2, and xxhash version 3.4.1. I have a slightly more complete implementation written here, as well as some unit tests.
And finally, we should get some level of assurance that our false positive estimation is accurate. I'm sure there are many ways to do this, but I don't think anything beats the simplicity of an experimental approach. Let's set that up by making a Bloom filter and filling it to capacity:
size = 200_000
input_fp_rate = 0.01

bloom = BloomFilter(size, input_fp_rate)
for i in range(size):
    bloom.insert(i)
Now, we can check a large number of non-inserted elements for membership and measure the false positive rate. Ideally, this will be close to what we asked for when initializing the Bloom filter.
size = 200_000
input_fp_rate = 0.01

bloom = BloomFilter(size, input_fp_rate)
for i in range(size):
    # Insert 0 through 199,999 into the Bloom filter
    bloom.insert(i)

false_positives = 0
for i in range(size, size * 2):
    # Check 200,000 through 399,999 (none of which were inserted) for false positivity
    if bloom.may_contain(i):
        false_positives += 1

print(false_positives / size)
In my test, the false positive rate came out to be 0.99% — quite close to our input value of 1%. The results are reproducible with other numbers, hashable data types, and Bloom filter sizes.
Recap
We've covered quite a bit here, so let's review some crucial points:

- Probabilistic data structures trade a small, controllable amount of accuracy for large savings in memory.
- A Bloom filter is a probabilistic set supporting two operations: inserting an element and checking membership. Queries answer "no" or "probably": false positives are possible, false negatives are not.
- A Bloom filter consists of an m-bit array and k hash functions; as the density of 1's grows, so does the false positive rate.
- Given an expected number of insertions n and a desired false positive rate ε, the optimal parameters are m = -n ln ε / (ln 2)² and k = (m/n) ln 2.
- The classic Bloom filter cannot delete elements or grow; variants such as the cuckoo filter and the Scalable Bloom Filter address these limitations.
I didn't want to bring this up too early to avoid confusion, but it may be worth clarifying something now. You may see the term "probabilistic data structure" used in two different ways in the wild. One is the sense we've already seen: a data structure that provides inexact answers to some extent. The term is also sometimes used for data structures that do provide exact answers but whose internal mechanics rely on randomness (for example, a skip list).
Until next time!
References
Bloom, B. (1970). Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7), 422–426. https://doi.org/10.1145/362686.362692.
Mitzenmacher, M. & Broder, A. (2004). Network Applications of Bloom Filters: A Survey. Internet Mathematics 1(4), 485–509. https://doi.org/10.1080/15427951.2004.10129096.
Clayton, D., Patton, C., & Shrimpton, T. (2019). Probabilistic Data Structures in Adversarial Environments. Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, 1317–1334. https://doi.org/10.1145/3319535.3354235.
Tuning Bloom Filters | Apache Cassandra 3.x - DataStax Docs. https://docs.datastax.com/en/cassandra-oss/3.x/cassandra/operations/opsTuningBloomFilters.html.
Leskovec, J., Rajaraman, A., & Ullman, J. (2022). Mining of Massive Datasets. Cambridge University Press.
Fan, B., Andersen, D., Kaminsky, M., & Mitzenmacher, M. (2014). Cuckoo Filter: Practically Better Than Bloom. Proceedings of the 10th ACM International Conference on Emerging Networking Experiments and Technologies, 75–88. https://doi.org/10.1145/2674005.2674994.
Maggs, B. & Sitaraman R. (2015). Algorithmic Nuggets in Content Delivery. SIGCOMM Comput. Commun. Rev. 45, 3, 52–66. https://doi.org/10.1145/2805789.2805800.
Almeida, P. S., Baquero, C., Preguiça, N. M., & Hutchison, D. (2007). Scalable Bloom Filters. Information Processing Letters, 101, 255–261. https://doi.org/10.1016/j.ipl.2006.10.007.
Mitzenmacher, M. & Upfal, E. (2005). Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press.