A Gentle Introduction to Probabilistic Data Structures

In contrast to traditional (deterministic) data structures, their probabilistic counterparts offer far better memory efficiency. This advantage, however, comes with a trade-off: queries carry a small, tunable error rate. Herein, we discuss the basics of probabilistic data structures and explore the theory, motivation, and implementation of one of the simplest examples: the Bloom filter.

History and Overview

Burton Howard Bloom invented the first probabilistic data structure in 1969 (published in 1970), widely known today as the Bloom filter. The American computer scientist was interested in more effectively using main memory and relying less on slow reads from bulk storage devices or databases. Bloom decided to analyze data structures in a different light than usual: rather than just concerning himself with time and space efficiency, he introduced the idea of “an allowable fraction of errors.”

Bloom looked specifically at the problem of set membership, which asks, “Given an element X and a collection of elements C, is X in C?” (for example, is a given word in the dictionary?) and returns an answer: yes or no. Bloom proposed a data structure – what we now call the Bloom filter – that offers a highly succinct (space-efficient) representation of such a collection. The caveat is, instead of getting “yes” or “no” as answers, we get “no” or “probably” (there are only false positives, no false negatives). So, in short, a Bloom filter is a probabilistic set where we can trade a degree of accuracy for memory conservation. The “classic” Bloom filter supports two fundamental operations: inserting an element into the filter and checking an element for membership.

Such a representation has proved to be especially useful in cases where memory is scarce compared to the size of the dataset, and the penalty of having an occasional false positive is not severe. Numerous early spell checkers used Bloom filters – it is far more efficient to store and search from a Bloom filter representation of a dictionary rather than perform a lookup on the dictionary proper. In such cases, Bloom filters offer significant performance advantages.

This is the heart of the matter with probabilistic data structures: acceptable inaccuracy. Since memory (RAM) is much faster than bulk storage (SSD, HDD, tape) – but much more scarce – we can achieve massive performance gains by making our data structures more succinct and usable in memory, using bulk storage only as needed.

In the fifty-five years since their inception, probabilistic data structures have become virtually ubiquitous in computer science, being used in databases, artificial intelligence, distributed computing, and big data analytics. Though Bloom filters offer highly space-efficient approximate membership querying, there are also improvements, such as the cuckoo filter, which, unlike the Bloom filter, supports deletion operations. In addition to membership querying, there are also data structures that support the memory-efficient approximation of:

  • The cardinality (size) of sets/streams (e.g., HyperLogLog)
  • The similarity of sets (e.g., MinHash)
  • Frequency of occurrence in sets/streams (e.g., Count-min sketch)

The canonically cited modern use of Bloom filters comes from content delivery network (CDN) giant Akamai Technologies in 2015. After analyzing a typical CDN cluster, engineers at Akamai noticed that 74% of the 400 million objects (web and video from various social media platforms) in the cache were only ever accessed once. They subsequently used Bloom filters to prevent caching of rarely visited websites – so-called “one-hit wonders” – and, in a test run on 47 production servers, reduced their disk writes by 44%, falling from 10209 writes per second to 5738 writes per second.

A graph of disk writes per second over time for the test servers; after the Bloom filter is turned on, writes fall from 10209 per second to 5738 per second.
Figure 1 - Turning on the Bloom filter nearly halves the rate of disk writes, since "one-hit wonders" are not written to disk.

Due to this reduction in disk writes, the latency of reading from disk improved by 24% (15.6 ms to 11.9 ms) since there was less competition for disk access.

A graph of disk access latency in milliseconds over time for the test servers; after the Bloom filter is turned on, average latency falls from 15.6 milliseconds to 11.9 milliseconds.
Figure 2 - Leaving the disk more open for popular objects results in better access times to those objects.


Basics: How Does a Bloom Filter Work?

Conceptually, the Bloom filter is relatively simple. At its core are two components: an addressable array of m bits and k different hash functions. Later, we will see how we can derive optimal numbers for m and k given specific parameters, such as a desired false positive rate, but for now, we will assume they are known in advance.

To construct a Bloom filter, we initialize a bit array of size m to all zeroes. In the example below, m = 12, and the array is indexed from 0 to m - 1.

A diagram of an empty bit array (i.e., initialized to all zeroes) with 12 slots is shown. The first slot is index 0 and the last slot is index m - 1.

Our Bloom filter will also have k independent, random hash functions. These hash functions should map uniformly over the range {0, …, m - 1}, as is typically expected of hash functions. It should be noted that these need not be cryptographic hash functions.

We will denote the k different hash functions with the following convention:

h_1, ..., h_k

For the sake of demonstration, we will choose k = 3.

A diagram depicting the Bloom filter at this point in time. We see that it consists of 3 hash functions and 12 bits all initialized to 0.

If we wish to add an element E to the Bloom filter, we run it through each hash function and set the corresponding bits to 1. For example, suppose that

 h_1(E) = 2, h_2(E) = 5, and h_3(E) = 7

After adding E to the Bloom filter, the bit array would now look different:

The bits at index 2, 5, and 7 are now set to 1 in the Bloom filter.

Now, for the sake of demonstration, let us add another element, Z, to the Bloom filter. Suppose now that

h_1(Z) = 8, h_2(Z) = 0, and h_3(Z) = 2

Notice that there was a collision at index 2. The bit at that index remains unchanged. The updated bit array now looks like this:

The bits at indexes 0, 2, 5, 7, and 8 are now set to 1.

Querying for elements in the Bloom filter works in much the same way. If we wanted to check if E was in the Bloom filter, we would run the element through each hash function – and if every bit at those indexes is 1, then we know that E has probably been added to the filter.

Since the bits at indexes 2, 5, and 7 are set to 1, we say that E has probably been added to the Bloom filter.

Conversely, suppose we query for an element Q, where

h_1(Q) = 2, h_2(Q) = 1, and h_3(Q) = 10

Then we know with certainty that Q has not been added to the Bloom filter. If it had been, the bits at indexes 1 and 10 could not possibly be 0.

Since the bits at 1 and 10 are set to 0, we know it's impossible that Q could have been added to the Bloom filter.
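To make the walk-through concrete, here is a tiny sketch of the example above, with the three hash values for E, Z, and Q hard-coded to match the diagrams (a real filter would, of course, compute them with actual hash functions):

m = 12
bits = [0] * m

# Insert E: h_1(E) = 2, h_2(E) = 5, h_3(E) = 7
for index in (2, 5, 7):
    bits[index] = 1

# Insert Z: h_1(Z) = 8, h_2(Z) = 0, h_3(Z) = 2 (index 2 collides and simply stays 1)
for index in (8, 0, 2):
    bits[index] = 1

# Query E: every index is set, so E has "probably" been added
print(all(bits[i] for i in (2, 5, 7)))    # True

# Query Q: indexes 1 and 10 are still 0, so Q is definitely not in the filter
print(all(bits[i] for i in (2, 1, 10)))   # False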

An astute observer will notice a few things about this data structure. One is that, unlike a traditional set or hashmap structure, we can insert elements into the Bloom filter without altering its size in memory/storage. As elements are added, the bit array fills up, and the density of 1’s increases. Higher density increases the probability of false positives because more bits are set to 1, making it more likely that a query for a non-existent element will falsely match all k bit positions.

However, the relationship between the number of hash functions and the false positive rate is slightly more complicated. Initially, increasing k will reduce the false positive rate since more checks make it harder for a non-member to match all positions. But as k increases, each insertion affects more bits, thus increasing the density of 1’s and the false positive rate. We can intuitively make these observations, but later, we must turn to mathematics to calculate optimal numbers.
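To see this trade-off play out, here is a small, self-contained simulation sketch (entirely separate from the implementation we build later, with illustrative parameter choices of my own). It approximates "hash function i" by hashing the tuple (element, i) with Python's built-in hash. With m = 10,000 bits and n = 1,000 insertions, the measured false positive rate should first fall as k grows and then slowly creep back up as the array becomes denser:

def false_positive_rate(m, n, k, trials=50_000):
    bits = [0] * m

    def positions(element):
        # Approximate k independent hash functions by hashing (element, i)
        return [hash((element, i)) % m for i in range(k)]

    # Insert the integers 0 .. n - 1
    for element in range(n):
        for p in positions(element):
            bits[p] = 1

    # Query integers that were never inserted and count how many falsely "hit"
    false_positives = sum(
        all(bits[p] for p in positions(element))
        for element in range(n, n + trials)
    )
    return false_positives / trials


for k in range(1, 11):
    print(k, round(false_positive_rate(m=10_000, n=1_000, k=k), 4))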

In practice, we typically specify at least two parameters when creating a Bloom filter: the desired false positive rate and the expected number of insertions. More recently, a variation of the Bloom filter, the Scalable Bloom Filter, was invented, which can dynamically adapt to the number of elements stored while assuring a maximum false positive probability. Regardless, for the sake of discussion, we will remain focused on the classical Bloom filter.


Elementary Analysis: Estimating Error Rates & Finding Optimal Values

Note: if you really hate math, feel free to skip this section. For the purposes of using and broadly understanding a Bloom filter, it's optional to consider all of the analytical details. You can just take the results as given.

Let us first reiterate two intuitive observations we were able to make:

  1. The probability of false positives depends on the density of 1’s in the array and the number of hash functions.
  2. The number of 1’s is approximately the number of elements inserted times the number of hash functions. Collisions lower this number slightly.

Let’s explore observation one more analytically. Assume that the fraction of 1’s in the bit array is 1/10, and we are querying for a non-existent element. With k hash functions, each hash function has a 10% chance of landing, just by chance, on a cell that has been set to 1. This means that the probability of a false positive, ε, is the density of 1’s raised to the power k; in this case, ε = (1/10)^k.

To better understand the density of 1’s analytically, let’s assume we only have one hash function and a bit array of size m. After inserting one element, the probability that a random bit is set to 1 is 1/m. Equivalently, the probability that a given bit is set to 0 is

1 - 1/m

If we insert n elements into a Bloom filter where k = 1, the probability that a given bit is set to 0 becomes:

(1 - 1/m)^n

Rewriting this by multiplying the exponent by m/m, we get:

(1 - 1/m)^(m · (n/m))

For algebraic purposes, the advantage of this is

(1 - 1/m)^m
Equation 1

has a well-known approximation. As long as m is large (which, in practice, it is), equation 1 is very close to 1/e, allowing us to approximate the density of 0’s as:

e^(-n/m)
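If you want to convince yourself of the quality of this approximation, a one-liner suffices; for any realistically sized m, equation 1 is already extremely close to 1/e:

from math import e

m = 1_000_000
print((1 - 1/m) ** m)   # ≈ 0.367879...
print(1 / e)            # ≈ 0.367879...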

Put another way, suppose we have an array of 50 million bits, 10 hash functions, and we insert 5 million elements. Each insertion runs all 10 hash functions, so the bits are set by k · n = 10 × 5 million = 50 million hash evaluations in total. With m = 50 million, the exponent in our formula becomes k · n / m = 1, giving us a 0’s density of:

e^(-1) = 1/e ≈ 0.368

Meaning that the 1’s density is 1 - 0.368 = 0.632. Hence, the probability of a false positive here is

0.632^10 ≈ 0.0101

Or about 1%.

More completely, we can express the approximate false positive rate as:

ε ≈ (1 - e^(-k·n/m))^k
Equation 2
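As a quick sanity check, here is a small snippet (the function and parameter names are my own) that evaluates equation 2 for the worked example above, k = 10 hash functions, n = 5 million elements, and m = 50 million bits, and reproduces the roughly 1% figure:

from math import exp

def estimated_fp_rate(k, n, m):
    # Equation 2: epsilon ≈ (1 - e^(-k·n/m))^k
    return (1 - exp(-k * n / m)) ** k

print(estimated_fp_rate(k=10, n=5_000_000, m=50_000_000))   # ≈ 0.0101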

To minimize ε, we can optimize k by setting the derivative of ε with respect to k to zero and solving for the critical points. Omitting some tedious calculations and ignoring momentarily that the number of hash functions must be a positive integer, we find the optimal number for k is:

k = (m/n) · ln(2)
Equation 3

We can use this insight to find an optimal size for the bit array. Substituting the value of k from equation 3 back into equation 2, the exponent k · n / m becomes ln(2), so the false positive rate simplifies to ε ≈ (1/2)^k = (1/2)^((m/n) · ln(2)). Taking logarithms and isolating m, we arrive at our answer:

m = -n · ln(ε) / (ln(2))^2
Equation 4

It should be noted that while these equations are not strict bounds, they are sufficiently accurate for practical purposes.




Since ε and n are specified by the programmer, we now know enough to construct a Bloom filter in its entirety. Let’s walk through an example mathematically, reviewing what we know. Suppose we are making a Bloom filter where we expect to insert 1 billion elements and want a false positive rate of 1%. That is, n = 1 billion and ε = 0.01. First, we can initialize a bit array of optimal length for this, substituting n and ε back into the equation above:

m = -(10^9) · ln(0.01) / (ln(2))^2 ≈ 9.6 billion bits

Which allows us to calculate the optimal number of hash functions:

k = (m/n) · ln(2) ≈ 9.59 × 0.693 ≈ 6.6, which we can round up to 7.

Now, we can be reasonably sure that by initializing a bit array with 9.6 billion bits and using 7 independent, uniformly distributed hash functions, after 1 billion distinct insertions, we will maintain a false positive rate of 1% or less.
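For convenience, here is a short sketch (the function name and structure are my own) that turns equations 3 and 4 into code; the same two formulas reappear in the implementation below:

from math import ceil, log

def optimal_parameters(expected_insertions, fp_rate):
    # Equation 4: optimal number of bits
    m = ceil(-expected_insertions * log(fp_rate) / log(2) ** 2)
    # Equation 3: optimal number of hash functions, rounded up to an integer
    k = ceil((m / expected_insertions) * log(2))
    return m, k

m, k = optimal_parameters(1_000_000_000, 0.01)
print(m)   # ≈ 9.6 billion bits
print(k)   # 7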


Implementation: A Bloom Filter in Python

Now that we conceptually understand Bloom filters, let's build one and see how it works in practice. For the sake of simplicity and performance in the demonstration, I will be using two third-party Python libraries: "bitarray" and "xxhash". However, if you wish to see a pure Python implementation with no external libraries used, I've written one here.

Before we write any code, we should consider the interface. We will represent a Bloom filter with a class. We know that when we create a Bloom filter, we specify two parameters: the desired false positive rate and the expected number of insertions. These, of course, will be passed into the constructor. At that point, we should allocate the proper number of bits in the bit array, figure out the optimal number of hash functions, and select/store those in some data structure.

The BloomFilter class will support two main operations: "insert" and "may_contain." Both of these operations expect one argument: the element to be added or queried.

Knowing this, let's write out the skeleton of our class:

class BloomFilter:
    def __init__(self, expected_insertions, fp_rate):
        self._expected_insertions = expected_insertions
        self._fp_rate = fp_rate
        self._bit_array = self._make_bit_array()
        self._hash_functions = self._pick_hash_functions()

    def _make_bit_array(self):
        pass

    def _pick_hash_functions(self):
        pass

    def insert(self, element):
       pass

    def may_contain(self, element):
       pass        

First, we can implement "_make_bit_array" using equation 4 from above to create a bit array of the correct size:

from math import ceil, log
from bitarray import bitarray


class BloomFilter:
    def __init__(self, expected_insertions, fp_rate):
        ...

    def _make_bit_array(self):
        n = self._expected_insertions
        p = self._fp_rate
        length = ceil(-n * log(p) / log(2) ** 2)
        bit_array = bitarray(length)
        bit_array.setall(0)   # bitarray(length) does not guarantee zero-initialized bits
        return bit_array

    def _pick_hash_functions(self):
        pass

    def insert(self, element):
       pass

    def may_contain(self, element):
       pass        

This provides us with the required variables to determine the optimal number of hash functions to use:

from functools import partial
from math import ceil, log
from bitarray import bitarray
from xxhash import xxh3_64_intdigest


class BloomFilter:
    def __init__(self, expected_insertions, fp_rate):
        ...

    def _make_bit_array(self):
        ...

    def _pick_hash_functions(self):
        n = self._expected_insertions
        m = len(self._bit_array)
        k = ceil((m / n) * log(2))   # could also probably use round()
        # Pick k hash functions
        return [partial(xxh3_64_intdigest, seed=i) for i in range(k)]

    def insert(self, element):
       pass

    def may_contain(self, element):
       pass        

Here, I use partial function application, which I've written an article about in the past, as well as a list comprehension, to create a list of reusable but independent hash functions.

At this point, we are ready to insert elements into the Bloom filter. This isn't too hard to do: just run the element through each hash function and set each affected bit to 1.

Since "xxh3_64_intdigest" expects bytes as an argument, I found it most convenient to first hash the inserted element into an integer, then call the "to_bytes" method to create a (most probably) unique byte signature for the element. "xxh3_64_intdigest" will return a random 64-bit integer, so we'll have to modulo that by the length of the bitarray to ensure we get a valid bucket in the bit array:

from functools import partial
from math import ceil, log
from bitarray import bitarray
from xxhash import xxh3_64_intdigest


class BloomFilter:
    def __init__(self, expected_insertions, fp_rate):
        ...

    def _make_bit_array(self):
        ...

    def _pick_hash_functions(self):
        ...

    def insert(self, element):
        # Create a byte signature unique to the element
        signature = hash(element).to_bytes(64, byteorder='big', signed=True)

        for hasher in self._hash_functions:
            # Find and set the proper slot in the bit array for the element
            bucket = hasher(signature) % len(self._bit_array)
            self._bit_array[bucket] = 1

    def may_contain(self, element):
       pass        

The "may_contain" method is similar; we just need to ensure that every relevant bit is set to 1:

from functools import partial
from math import ceil, log
from bitarray import bitarray
from xxhash import xxh3_64_intdigest


class BloomFilter:
    def __init__(self, expected_insertions, fp_rate):
        ...

    def _make_bit_array(self):
        ...

    def _pick_hash_functions(self):
        ...

    def insert(self, element):
        ...

    def may_contain(self, element):
        # Create a byte signature unique to the element
        signature = hash(element).to_bytes(64, byteorder='big', signed=True)

        for hasher in self._hash_functions:
            # Find the proper slot in the bit array for the element
            bucket = hasher(signature) % len(self._bit_array)

            # An already inserted element would necessarily have all positions set to 1
            if self._bit_array[bucket] == 0:
                return False

        return True        

Of course, one must take care to only pass hashable types into "insert" or "may_contain." At this point, we have completed the implementation of our basic Bloom filter!

Now, we can do operations like:

>>> bf = BloomFilter(expected_insertions=1_000_000, fp_rate=0.01)
>>> bf.insert('hello')
>>> bf.may_contain('hello')
True
>>> bf.may_contain('world')
False
>>> bf.insert('world')
>>> bf.may_contain('world')
True        

If you benchmark it by adding UUIDs, this implementation is ~16 - 32 times more space efficient than an equivalently sized set (recall that the size of sets periodically doubles). For example, while a set with 7 million UUIDs added used 268.4 MB of heap space, the Bloom filter with a false positive rate of 1% used only 8.39 MB – a 32x improvement. This is in excellent agreement with the theoretical bound we derived, which predicts that a minimum of 8.3814 MB of space is required.

The speed performance, however, could likely be improved upon. In the same test, the Bloom filter took 11.9 times longer to insert and check the same number of elements. This is rather unsurprising. It will necessarily be slower than "set" since we are sequentially calling k + 1 hash functions. Readers may notice, however, that although the hash functions are processed serially, the ones in the loop are completely independent of each other, and more sophisticated implementations may take advantage of this.
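One well-known way to reduce the cost of the hash loop (separate from parallelizing it) is the Kirsch–Mitzenmacher "double hashing" technique: compute only two base hashes and derive all k indexes as h1 + i·h2 mod m, which is known to leave the asymptotic false positive rate unchanged. The sketch below is my own illustration of the idea and is not part of the class above, but it could be wired into "insert" and "may_contain" in place of the per-seed loop:

from xxhash import xxh3_64_intdigest

def bloom_indexes(element, k, m):
    # Derive k bit positions from just two hash evaluations (Kirsch–Mitzenmacher)
    signature = hash(element).to_bytes(64, byteorder='big', signed=True)
    h1 = xxh3_64_intdigest(signature, seed=0)
    h2 = xxh3_64_intdigest(signature, seed=1)
    return [(h1 + i * h2) % m for i in range(k)]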

For reference, here is the complete code I used for benchmarking. I also implemented "__sizeof__" for convenience: while it is not 100% accurate (we are storing a few extra constants after all), the deviation is negligible for large bit arrays.

from functools import partial
from math import ceil, log
from bitarray import bitarray
from xxhash import xxh3_64_intdigest


class BloomFilter:
    def __init__(self, expected_insertions, fp_rate):
        self._expected_insertions = expected_insertions
        self._fp_rate = fp_rate
        self._bit_array = self._make_bit_array()
        self._hash_functions = self._pick_hash_functions()

    def _make_bit_array(self):
        n = self._expected_insertions
        p = self._fp_rate
        length = ceil(-n * log(p) / log(2) ** 2)
        bit_array = bitarray(length)
        bit_array.setall(0)   # bitarray(length) does not guarantee zero-initialized bits
        return bit_array

    def _pick_hash_functions(self):
        n = self._expected_insertions
        m = len(self._bit_array)
        k = ceil((m / n) * log(2))   # could also probably use round()
        # Pick k hash functions
        return [partial(xxh3_64_intdigest, seed=i) for i in range(k)]

    def insert(self, element):
        signature = hash(element).to_bytes(64, byteorder='big', signed=True)

        for hasher in self._hash_functions:
            bucket = hasher(signature) % len(self._bit_array)
            self._bit_array[bucket] = 1

    def may_contain(self, element):
        signature = hash(element).to_bytes(64, byteorder='big', signed=True)

        for hasher in self._hash_functions:
            bucket = hasher(signature) % len(self._bit_array)

            if self._bit_array[bucket] == 0:
                return False

        return True

    def __sizeof__(self):
        return self._bit_array.__sizeof__()


if __name__ == "__main__":
    from uuid import uuid4
    from time import perf_counter

    size = 7_000_000
    uuids = [uuid4() for _ in range(size)]
    start = perf_counter()

    b = BloomFilter(size, fp_rate=0.01)

    for uuid in uuids:
        b.insert(uuid)

    end = perf_counter()

    print(round(end - start, 3), "seconds to add", size, "uuids")
    bf_time = end - start

    start = perf_counter()

    s = set()
    for uuid in uuids:
        s.add(uuid)

    end = perf_counter()
    s_time = end - start

    print(round(bf_time / s_time, 3), "times slower than set")
    ssize = s.__sizeof__() / 1000 / 1000
    bsize = b.__sizeof__() / 1000 / 1000
    print(ssize, "MB used by set")
    print(bsize, "MB used by Bloom filter")
    print(round(ssize / bsize, 3), "times more space efficient")        

Tested on Python 3.9.4, bitarray version 2.9.2, and xxhash version 3.4.1. I have a slightly more complete implementation written here, as well as some unit tests.

And finally, we should get some level of assurance that our false positive estimation is accurate. I'm sure there are many ways to do this, but I don't think anything can beat the simplicity of taking an experimental approach. Let's set that up by making a Bloom filter and filling it to capacity:

size = 200_000
input_fp_rate = 0.01
bloom = BloomFilter(size, input_fp_rate)

for i in range(size):
    bloom.insert(i)        

Now, we can check a large number of non-inserted elements for membership and measure the false positive rate. Ideally, this will be close to what we asked for when initializing the Bloom filter.

size = 200_000
input_fp_rate = 0.01
bloom = BloomFilter(size, input_fp_rate)

for i in range(size):
    # Insert 0 through 199,999 into the Bloom filter
    bloom.insert(i)

false_positives = 0

for i in range(size, size * 2):
    # Check 200,000 through 399,999 (none of which were inserted) for false positives
    if bloom.may_contain(i):
        false_positives += 1

print(false_positives / size)        

In my test, the false positive rate came out to be 0.99% — quite close to our input value of 1%. The results are reproducible with other numbers, hashable data types, and Bloom filter sizes.


Recap

We've covered quite a bit here, so let's review some crucial points:

  1. Though incredibly useful, the Bloom filter is just one of many probabilistic data structures. They have been applied to an enormous variety of problems and can be found in essentially any data-intensive computer science application.
  2. Many probabilistic data structures use the same underlying principles we have seen here, using hash functions to elegantly summarize data while introducing some randomness. For example, see count-min sketch, HyperLogLog, or MinHash.
  3. The interfaces for probabilistic data structures are often straightforward and very similar to those we already use—their formal analysis, however, is slightly less so.

I didn't want to do this too early on to avoid confusion, but it may be worth clarifying something now. You may see the term "probabilistic data structure" used in two different ways out in the wild. One of them is the sense we've already seen: a data structure that trades exactness for efficiency and returns approximate answers. The term is also sometimes used to describe data structures that do provide exact answers but whose internal mechanics rely on randomness (for example, a skip list).

Until next time!


References

Bloom, B. (1970). Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7), 422–426. https://doi.org/10.1145/362686.362692.

Mitzenmacher, M. & Broder, A. (2004). Network Applications of Bloom Filters: A Survey. Internet Mathematics 1(4), 485–509. https://doi.org/10.1080/15427951.2004.10129096.

Clayton, D., Patton, C., & Shrimpton, T. (2019). Probabilistic Data Structures in Adversarial Environments. Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, 1317–1334. https://doi.org/10.1145/3319535.3354235.

Tuning Bloom Filters | Apache Cassandra 3.x - DataStax Docs. https://docs.datastax.com/en/cassandra-oss/3.x/cassandra/operations/opsTuningBloomFilters.html.

Leskovec, J., Rajaraman, A., & Ullman, J. (2022). Mining of Massive Datasets. Cambridge University Press.

Fan, B., Andersen, D., Kaminsky, M., & Mitzenmacher, M. (2014). Cuckoo Filter: Practically Better Than Bloom. Proceedings of the 10th ACM International Conference on Emerging Networking Experiments and Technologies. Association for Computing Machinery, New York, NY, USA, 75–88. https://doi.org/10.1145/2674005.2674994.

Maggs, B. & Sitaraman R. (2015). Algorithmic Nuggets in Content Delivery. SIGCOMM Comput. Commun. Rev. 45, 3, 52–66. https://doi.org/10.1145/2805789.2805800.

Almeida, P.S., Baquero, C., Preguiça, N.M., & Hutchison, D. (2007). Scalable Bloom Filters. Inf. Process. Lett., 101, 255–261. https://doi.org/10.1016/J.IPL.2006.10.007.

Mitzenmacher, M & Upfal, E. (2005). Probability and computing: Randomized algorithms and probabilistic analysis. Cambridge University Press.
