Counting Bloom Filter

Overview:

As developers, we are constantly on the lookout for data structures and algorithms that can enhance the speed and efficiency of our applications. One such powerful tool is the Counting Bloom Filter. A Counting Bloom Filter is a probabilistic data structure used for testing whether an element is a member of a set. It is a variation of the standard Bloom filter that also allows elements to be removed from the set. This makes it helpful for probabilistically checking whether an element is present in a very large dataset. For example: evaluating whether a user's login ID is new or already taken by looking it up against the existing users in Gmail's database, and also tracking approximately how many times that user has logged in.

Counting Bloom Filters are used when there is a need for a memory-efficient structure that can quickly check whether an element is in a set. They are especially useful when dealing with large datasets, because they significantly reduce the need for disk or network I/O operations, which leads to improved performance. The counting aspect of the filter also enables delete operations, which a traditional Bloom filter cannot support; this makes it useful in scenarios where the dataset is dynamic and elements need to be removed over time.

Counting Bloom Filters use the same hash functions as Bloom Filters, but the main difference is that each slot holds a counter instead of a single bit (0 or 1). That means we can delete elements from the data structure, which is not possible in a standard Bloom Filter.

Why?

A Counting Bloom filter offers several advantages, making it a good choice in certain scenarios:

Space Efficiency

Counting Bloom filters are space-efficient data structures, making them suitable for use in systems with space constraints.

Deletion Capability

Unlike standard Bloom filters, Counting Bloom filters allow for deletions, making them useful in dynamic scenarios where items need to be added and removed frequently.

Fast Operations

Counting Bloom filters offer constant time complexity for operations like insertions, deletions, and lookups, which is particularly advantageous when dealing with large datasets.

Network Friendly

They are network-friendly as they reduce the need for data transfers in distributed systems. Instead of transmitting a large set of data, one can send a Counting Bloom filter representation.

Handling Duplicate Entries

Counting Bloom filters can handle duplicate entries, making them ideal for counting distinct items in a stream.

Tolerating Errors

They can tolerate false positives, meaning that they might sometimes indicate that an item is in the set when it's not. This is tolerable in scenarios where a small error rate is acceptable in exchange for significantly reduced memory usage.

However, one should be aware that Counting Bloom Filters can have false positives (indicating an element is present when it is not) and they do not store the actual elements, so they cannot be used to retrieve the original data.

In short, they combine hashing with counters to manage the count of appearances of each element.

Where?

Web Caching

To keep track of web pages that are in cache, and remove them when they are no longer needed or when the cache becomes full.

Network Routing

In a network router, it can be used to keep track of packets or routes that have been used recently and remove them when they become irrelevant.

Internet Traffic Management

Counting Bloom filters can be used to identify malicious IP addresses and remove them from the network.

Database Systems

To manage a list of transactions, and remove them once they have been processed.

Big Data

In a distributed system, Counting Bloom Filters can be used to efficiently check the existence of an element without actually moving the data.

How?

A Counting Bloom filter allows approximating the number of times each element has been seen by incrementing the corresponding counters every time the element is added. The CountingBloomFilter data structure contains an array of m counters (conceptually replacing the bit array of a plain Bloom filter), all initialized to zero. When a new element is inserted, we first compute its corresponding positions using the hash functions, and then increment the counter at each of those positions. These basic ideas can be coded in any preferred language.
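As a rough illustration of that insert-and-lookup logic, here is a minimal sketch (the class name CountingSketch, the slot count, and the seed-based index function are illustrative assumptions, not part of any library):

package org.example;

// Minimal sketch of the counting idea: one counter per slot, k hash positions per element.
public class CountingSketch {

    private final int[] counters = new int[1 << 20];          // all counters start at zero
    private final int[] seeds = {7, 11, 13, 31};               // one seed per hash function

    // Maps the value to a slot for the given seed (illustrative hash mixing).
    private int index(String value, int seed) {
        int h = value.hashCode();
        return Math.floorMod(seed * (h ^ (h >>> 16)), counters.length);
    }

    public void add(String value) {
        for (int seed : seeds) {
            counters[index(value, seed)]++;                     // increment every corresponding counter
        }
    }

    public boolean mightContain(String value) {
        for (int seed : seeds) {
            if (counters[index(value, seed)] == 0) {
                return false;                                   // definitely not present
            }
        }
        return true;                                            // possibly present (false positives possible)
    }
}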

In Java, you can use a library such as Google Guava (which provides a plain, non-counting BloomFilter) or write the structure yourself. Here is a basic hand-rolled example:

Add the Apache Commons Collections dependency in pom.xml if you want its collection utilities available (the hand-rolled classes below do not require it):

<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-collections4</artifactId>
    <version>4.4</version>
</dependency>

Create CountingBloomFilter class

The CountingBloomFilter in this code is a bit of a misnomer, as it actually implements a basic Bloom filter. A true counting Bloom filter would allow for the removal of elements and would use an array of counters instead of a BitSet: each time an element is added, the counters at each of the hash function indices are incremented, and when an element is removed, those counters are decremented. (A sketch of such a counting variant follows the SimpleHash class below.)

The contains(String value) method checks whether a string might be in the Bloom filter. It hashes the string with each hash function in func and checks whether the corresponding bit in bits is true. If any bit is not set (`false`), it means the string definitely isn’t in the set. If all bits are set (`true`), the string might be in the set, but it could also be a false positive.

package org.example;

import java.util.BitSet;

public class CountingBloomFilter {

    // Size of the bit array (2^25 slots) and the seeds for the hash functions.
    private static final int DEFAULT_SIZE = 2 << 24;
    private static final int[] seeds = new int[]{7, 11, 13, 31, 37, 61};

    // Bit array marking which positions have been hit by added elements.
    private BitSet bits = new BitSet(DEFAULT_SIZE);
    // One SimpleHash instance per seed, i.e. one hash function per seed.
    private SimpleHash[] func = new SimpleHash[seeds.length];

    public CountingBloomFilter() {
        for (int i = 0; i < seeds.length; i++) {
            func[i] = new SimpleHash(DEFAULT_SIZE, seeds[i]);
        }
    }

    // Adds a value by setting the bit at every hash position.
    public void add(String value) {
        for (SimpleHash f : func) {
            bits.set(f.hash(value), true);
        }
    }

    // Returns true if the value might be present (all bits set),
    // false if it is definitely absent (at least one bit unset).
    public boolean contains(String value) {
        if (value == null) {
            return false;
        }
        boolean ret = true;
        for (SimpleHash f : func) {
            ret = ret && bits.get(f.hash(value));
        }
        return ret;
    }
}

Create SimpleHash class

The code defines a simple custom hash function in Java. It performs a bitwise XOR (`^`) between h and h shifted right by 16 bits (`h >>> 16`). This bitwise mixing spreads the higher bits of the hash code into the lower ones, which is beneficial when the table size is much smaller than the maximum integer value and reduces the risk of collisions.

package org.example;

public class SimpleHash {

    private int cap;   // capacity the index is derived from
    private int seed;  // distinct seed so each instance behaves like a different hash function

    public SimpleHash(int cap, int seed) {
        this.cap = cap;
        this.seed = seed;
    }

    // Spreads the high bits of the hash code into the low bits, then combines
    // the result with the seed-scaled capacity via bitwise AND to produce a
    // non-negative index.
    public int hash(Object value) {
        int h;
        return (value == null) ? 0 : Math.abs(seed * (cap - 1) & ((h = value.hashCode()) ^ (h >>> 16)));
    }
}
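
As noted above, the class just shown is effectively a plain Bloom filter. The sketch below shows one possible counting variant that reuses SimpleHash but stores an int counter per slot, which is what makes remove() and a count estimate possible. The class name, the SIZE constant, and the approximateCount() method are illustrative choices, not part of any library:

package org.example;

// Illustrative counting variant of the filter above (a sketch, not library code).
public class CountingBloomFilterWithCounters {

    private static final int SIZE = 1 << 20;                     // number of counter slots (illustrative)
    private static final int[] seeds = new int[]{7, 11, 13, 31, 37, 61};

    private final int[] counters = new int[SIZE];                // all counters start at zero
    private final SimpleHash[] func = new SimpleHash[seeds.length];

    public CountingBloomFilterWithCounters() {
        for (int i = 0; i < seeds.length; i++) {
            func[i] = new SimpleHash(SIZE, seeds[i]);
        }
    }

    // SimpleHash can produce indices larger than SIZE, so fold them back into range.
    private int index(SimpleHash f, String value) {
        return f.hash(value) % SIZE;
    }

    public void add(String value) {
        for (SimpleHash f : func) {
            counters[index(f, value)]++;                          // increment every corresponding counter
        }
    }

    // Removal is only safe for values that were actually added; removing a value
    // that was never added can introduce false negatives.
    public void remove(String value) {
        if (!contains(value)) {
            return;                                               // avoid driving counters negative
        }
        for (SimpleHash f : func) {
            counters[index(f, value)]--;                          // decrement every corresponding counter
        }
    }

    public boolean contains(String value) {
        if (value == null) {
            return false;
        }
        for (SimpleHash f : func) {
            if (counters[index(f, value)] == 0) {
                return false;                                     // definitely not present
            }
        }
        return true;                                              // possibly present
    }

    // Rough upper bound on how many times the value was added: the minimum counter value.
    public int approximateCount(String value) {
        if (value == null) {
            return 0;
        }
        int min = Integer.MAX_VALUE;
        for (SimpleHash f : func) {
            min = Math.min(min, counters[index(f, value)]);
        }
        return min;
    }
}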

Sample Result
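
One way to exercise these classes (the Demo class and the sample strings are illustrative assumptions; exact output depends on hash collisions):

package org.example;

public class Demo {
    public static void main(String[] args) {
        CountingBloomFilter filter = new CountingBloomFilter();
        filter.add("alice@example.com");
        filter.add("bob@example.com");

        System.out.println(filter.contains("alice@example.com"));   // true (possibly present)
        System.out.println(filter.contains("carol@example.com"));   // false, barring a false positive

        // The counting variant additionally supports removal and an approximate count.
        CountingBloomFilterWithCounters counting = new CountingBloomFilterWithCounters();
        counting.add("alice@example.com");
        counting.add("alice@example.com");
        System.out.println(counting.approximateCount("alice@example.com"));  // at least 2
        counting.remove("alice@example.com");
        System.out.println(counting.approximateCount("alice@example.com"));  // at least 1
    }
}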

Important Considerations

1) False Positives: Counting Bloom filters can return false positives, indicating that an element might be present when it actually isn't. The probability of false positives can be controlled by adjusting the filter's parameters (see the estimation sketch after this list).

2) Counting Limitations: Any per-element count (such as the approximateCount() sketch above) is an approximation, and its accuracy can decrease as counts grow or the filter becomes saturated.

3) Library Choice: The example above is hand-rolled; the Apache Commons Collections dependency is optional here, and other Java libraries such as Guava provide (non-counting) Bloom filters, so check whether a library actually supports counting before relying on deletion.
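
For reference, the standard estimate for a Bloom filter's false positive probability is p ≈ (1 − e^(−k·n/m))^k, where m is the number of slots, k the number of hash functions, and n the number of inserted elements. A small helper that evaluates this estimate (the BloomMath class name is an illustrative assumption):

// Illustrative helper for reasoning about filter parameters (not from any library).
public class BloomMath {

    // Estimated false positive probability for m slots, k hash functions, n inserted elements:
    // p ~ (1 - e^(-k*n/m))^k
    public static double falsePositiveRate(long m, int k, long n) {
        return Math.pow(1.0 - Math.exp(-(double) k * n / m), k);
    }

    public static void main(String[] args) {
        // e.g. 1,000,000 slots, 6 hash functions, 100,000 inserted elements
        System.out.println(falsePositiveRate(1_000_000, 6, 100_000));
    }
}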

Counting Bloom Filters’ functions

1) Initialization — This is the first step in using a Counting Bloom Filter. It involves creating an array of a certain size (all slots initially set to 0) and deciding on the number of hash functions to be used. The number of hash functions and the size of the array are derived from the desired false positive rate and the expected number of elements to be added (see the sizing sketch after this list).

2) Hashing — This involves using multiple hash functions on the input data. Each hash function will give an index into the array. The idea behind using multiple hash functions is to distribute data evenly throughout the filter.

3) Adding — When an element is added to the Counting Bloom Filter, all the hash functions are computed for that element, and for each resulting hash value, the corresponding element in the array is incremented by 1.

4) Remove — Unlike standard Bloom Filters, Counting Bloom Filters allow for removal of elements. To remove an element, compute all the hash functions for that element, and for each resulting hash value, decrement the corresponding array element by 1. It’s important to note, however, that if you remove an element that was never added, it could lead to false negatives.

5) Lookup — To check whether an element is in the filter, compute all the hash functions for that element and check the corresponding array values. If all array values are greater than 0, the element might be in the set. If any value is 0, the element is definitely not in the set. It's important to note that Bloom Filters can have false positives but not false negatives: they may say an element is present when it's not (false positive), but they will never say an element is absent when it's actually present (false negative). In the case of Counting Bloom Filters, however, false negatives can occur if an element that was never added is removed.
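
The standard sizing formulas tie the initialization step to the target error rate: for n expected elements and a desired false positive rate p, m = −n·ln(p) / (ln 2)² slots and k = (m/n)·ln 2 hash functions are the usual optimal choices. A small helper (the BloomSizing class name is an illustrative assumption):

// Illustrative sizing helper based on the standard Bloom filter formulas.
public class BloomSizing {

    // Optimal number of slots for n elements and target false positive rate p.
    public static long optimalSlots(long n, double p) {
        return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    // Optimal number of hash functions for m slots and n elements.
    public static int optimalHashFunctions(long m, long n) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 1_000_000;        // expected elements
        double p = 0.01;           // target false positive rate
        long m = optimalSlots(n, p);
        int k = optimalHashFunctions(m, n);
        System.out.println("slots m = " + m + ", hash functions k = " + k);
    }
}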

References:

https://www.geeksforgeeks.org/counting-bloom-filters-introduction-and-implementation/
https://medium.com/analytics-vidhya/cbfs-44c66b1b4a78
https://www.baeldung.com/guava-bloom-filter
