
Demystifying Bloom Filters: Simplifying the Complexity

Manish Joshi

Senior Software Engineer

Published: February 17, 2024

Imagine you have a big pile of information, and you want to quickly figure out whether a specific thing is in there or not. This is a common challenge in computing tasks such as checking spelling, finding duplicate items, managing computer networks, and handling storage efficiently. Exact structures such as hash tables have to store every item, so with a lot of data they can consume significant memory and, when the data lives on another machine, network bandwidth.

In the world of big data and systems spread across many computers, this challenge becomes even trickier. Sometimes we just want to know whether a piece of information is in a database that is not on the same computer we are using. For instance, think about a system like tinyurl, where changing even one character in a web link produces a different link. Checking whether each modified link is in the database costs server resources, so any user can exploit this trick to slow down the servers and abuse the system.

This is where Bloom Filters come into play, offering an ingenious, probabilistic way to significantly improve efficiency.

The Bloom filter

[Figure: adding a key to a Bloom filter and querying a key from it]

A Bloom Filter is a simple and space-efficient data structure primarily consisting of a bit vector and a set of hash functions.

Let's break down its key components:

Bit Vector and Hash Functions: The core of a Bloom Filter is a bit vector, which is essentially an array of bits (0s and 1s). Each bit in the vector represents a position, and initially all bits are set to 0. The filter uses a set of hash functions (their count is typically denoted 'k') to map elements to positions in the bit vector. For a given element, each hash function produces one position in the bit vector; these positions are not guaranteed to be distinct, either from each other or from those of other elements.

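#include <functional>
#include <string>
#include <vector>

class BloomFilter {
    using HashFunction = std::function<size_t(const std::string&)>;

    std::vector<bool>         _bitVector;      // m bits, all initially 0
    std::vector<HashFunction> _hashFunctions;  // the k hash functions
    size_t                    _size;           // m, the number of bits
};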

Setting/removing Bits: When an element is added to the Bloom Filter, each hash function calculates a position in the bit vector, and the bits at these positions are set to 1. For example, if the Bloom Filter size is 32 and k = 3 (three hash functions), each hash function returns an index between 0 and 31 for the given key, and all three indexes are marked true, as demonstrated below.

    // Add an element to the Bloom Filter: set the bit chosen by each hash function
    void add(const std::string& element) {
        for (const auto& hashFunction : _hashFunctions) {
            size_t index = hashFunction(element) % _size;  // map the hash into [0, _size)
            _bitVector[index] = true;
        }
    }

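Removal is generally not supported in Bloom Filters, because clearing an element's bits may affect other elements that share those bits through hash collisions. Mechanically, it could be written just like adding a key, marking the bits false instead of true, but doing so can make the filter report a key that is still present as absent (a false negative). Variants such as counting Bloom filters, which keep a small counter per position instead of a single bit, exist to support removal safely.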
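To see why removal is risky, here is a minimal sketch of the failure. The ToyFilter type and its two deliberately weak hash functions (string length and first character) are assumptions made up for this demonstration so that a collision is guaranteed; they are not part of the BloomFilter class above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy 8-bit filter with two deliberately weak hash functions, chosen only
// to force a collision for this demonstration.
struct ToyFilter {
    std::vector<bool> bits = std::vector<bool>(8, false);

    static std::size_t h1(const std::string& s) { return s.size() % 8; }
    static std::size_t h2(const std::string& s) {
        assert(!s.empty() && "demo keys must be non-empty");
        return static_cast<unsigned char>(s[0]) % 8;
    }

    void add(const std::string& s) {
        bits[h1(s)] = true;
        bits[h2(s)] = true;
    }

    // Naive removal: clear the same bits that add() set.
    void naiveRemove(const std::string& s) {
        bits[h1(s)] = false;
        bits[h2(s)] = false;
    }

    bool mightContain(const std::string& s) const {
        return bits[h1(s)] && bits[h2(s)];
    }
};

// "ab" and "ac" have the same length and the same first character, so they
// share both bit positions. Returns whether "ac" is still reported as
// present after "ab" has been naively removed.
inline bool acSurvivesRemovalOfAb() {
    ToyFilter f;
    f.add("ab");
    f.add("ac");
    f.naiveRemove("ab");
    return f.mightContain("ac");
}
```

Calling acSurvivesRemovalOfAb() returns false: "ac" was added and never removed, yet the filter now reports it as absent. Real hash functions collide less often than these toy ones, but the same effect can occur whenever two keys share a bit.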

Collision Handling and False Positive case: Bloom Filters do not resolve collisions. Two different elements can map to some of the same bit positions, and those bits are simply set once. As a result, a membership test can find all k bits of an element already set by other elements even though the element itself was never added, which is a false positive. The likelihood of false positives is determined by the number of hash functions (k), the size of the bit vector (m), and the number of elements added to the filter (n). The probability of a false positive is approximately (1 - e^(-k * n / m))^k.
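The membership test itself mirrors add: it checks the same k positions and answers "no" as soon as one bit is unset. The following is a minimal, self-contained sketch rather than a definitive implementation. The names SimpleBloomFilter, mightContain, and makeExampleFilter are illustrative, and deriving three hash functions by salting std::hash is an assumption made for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Illustrative sketch; the class name and hash choices are assumptions.
class SimpleBloomFilter {
public:
    using HashFunction = std::function<std::size_t(const std::string&)>;

    SimpleBloomFilter(std::size_t size, std::vector<HashFunction> hashFunctions)
        : _bitVector(size, false),
          _hashFunctions(std::move(hashFunctions)),
          _size(size) {
        assert(size > 0 && "bit vector must not be empty");
    }

    // Set the bit chosen by each hash function.
    void add(const std::string& element) {
        for (const auto& hashFunction : _hashFunctions) {
            _bitVector[hashFunction(element) % _size] = true;
        }
    }

    // Returns false if the element was definitely never added.
    // Returns true if it was possibly added (false positives can occur).
    bool mightContain(const std::string& element) const {
        for (const auto& hashFunction : _hashFunctions) {
            if (!_bitVector[hashFunction(element) % _size]) {
                return false;  // one unset bit proves absence
            }
        }
        return true;
    }

private:
    std::vector<bool>         _bitVector;
    std::vector<HashFunction> _hashFunctions;
    std::size_t               _size;
};

// Hypothetical helper: builds a 32-bit filter with k = 3 toy hash functions.
// Salting std::hash is a convenience here, not a production recommendation.
inline SimpleBloomFilter makeExampleFilter() {
    std::hash<std::string> h;
    return SimpleBloomFilter(32, {
        [h](const std::string& s) { return h(s); },
        [h](const std::string& s) { return h(s + "#1"); },
        [h](const std::string& s) { return h(s + "#2"); },
    });
}
```

After filter.add("apple"), filter.mightContain("apple") always returns true. For a key that was never added, the result is usually false, but it can be true when other keys happen to have set all of its bits.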

Example Calculation: Suppose we have a Bloom Filter with a bit vector size of 128 bits (m = 128) and use 3 hash functions (k = 3). If we insert 10 elements (n = 10), we can estimate the false positive rate.

Using the formula: (1 - e^(-3 * 10 / 128))^3 ≈ 0.00912

That is a false positive rate of roughly 0.9%, which is small for a filter that occupies only 128 bits.

Note that this is a probabilistic estimation, and actual results may vary based on the characteristics of the data and hash functions used.
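The estimate is easy to reproduce in code, which also makes it simple to try other values of m, k, and n when sizing a filter. This is a small sketch of the formula above; the function name falsePositiveRate is illustrative.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Estimated false positive probability (1 - e^(-k*n/m))^k for a filter with
// m bits, k hash functions, and n inserted elements.
inline double falsePositiveRate(std::size_t m, std::size_t k, std::size_t n) {
    assert(m > 0 && "bit vector must not be empty");
    const double exponent = -static_cast<double>(k) * static_cast<double>(n)
                            / static_cast<double>(m);
    return std::pow(1.0 - std::exp(exponent), static_cast<double>(k));
}
```

Calling falsePositiveRate(128, 3, 10) gives approximately 0.00912, matching the example. Increasing m while keeping k and n fixed lowers the estimate, which reflects the trade-off between memory and accuracy.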

Advantages of Bloom Filters

Bloom Filters present a space-efficient solution in which adding or querying an element takes k hash computations, regardless of how many elements are stored. When the filter reports an element as not present, that answer is conclusive: at least one of the element's bits is unset, so the element was never added. When it reports 'present', there is a controlled probability of a false positive due to hash collisions, so the answer is quick but not entirely certain.

Additionally, Bloom Filters support parallelization, allowing efficient membership checks of multiple elements simultaneously. Their lightweight design is particularly advantageous in distributed systems, where a compact filter lets a machine quickly check whether a key may be present on another machine while minimizing communication overhead.

Existing libraries

For those seeking to use Bloom Filters without delving into low-level implementations, various libraries provide ready-to-use solutions. Among these, Google's Guava library for Java offers a BloomFilter class with an intuitive API. Similarly, the pybloom_live library provides Bloom Filter functionality for Python developers.

By integrating these libraries into your projects, you can harness the power of Bloom Filters without the need to implement them from scratch. This significantly reduces development time and ensures robust, tested solutions.

Bloom Filters, with their space-efficient design and constant-time complexity, offer a compelling solution. Whether you choose to implement one from scratch or utilize existing libraries, incorporating Bloom Filters into your toolkit can be a game-changer, streamlining operations and improving overall system efficiency.
