Bloom Filters

Bloom Filters

Imagine walking into a mobile shop that only sells Samsung and OnePlus phones. You ask the shopkeeper for an iphone. Without wasting any time, he says, “No, we don’t sell iphones.” Now, you change your mind to buy a One Plus model(a rare one). Instead of denying straightaway, the shopkeeper responds, “It might be in stock, but I can’t be 100% sure until I look.”

This is similar to how a Bloom Filter works, it can definitely tell you if something doesn’t exist(like an iphone in an android shop), but if it says something might exist, there is a certain uncertainty of the data’s presence.

To keep it more simple, Bloom Filter can give you a false positive(i.e. return true for something that is false). But it can never give you a false negative(i.e. return false for something that is true).

We all have encountered those warning messages while signing up on some platform that says, “Username already taken/exists”. That’s Bloom Filters in picture(not always as there are some more methods like querying the database or cache but that becomes inefficient when the user traffic explodes).

Let’s understand this Data Structure bit by bit:

A Bloom Filter is a probabilistic data structure. Now what does that mean?

Probabilistic in context of data structures means that this data structure provides answers with a certain possibility of being correct. This filter can definitely say when an element is not present inside it, but it might be wrong in some cases when it says, “Yes, I have this data”.

A bloom filter uses a bit array to store the data. Now when I say it stores the data, it doesn’t store the actual raw data but applies some hash functions to those data sets and accordingly marks the corresponding positions in the bit array. For example, if you want to check if “Kartik” exists inside a bloom filter it will apply multiple hash functions to this input and then mark their corresponding indexes as 1.

Let’s break this down:

  1. Input - ‘Kartik’ to be added inside a bloom filter.
  2. Hash functions applied: Since a bloom filter uses multiple hash functions, we will consider it has 3 for better understanding.
  3. Positions to be marked: Since this filter uses a bit array(a relatively large size bit arrray) where all indexes are set to 0(switch off) by default, based on the hash function’s result the positions will be marked as 1(switch on).
  4. That’s it. The data is stored now. Go ahead to check if it is present or not.

Below is a visual representation of hash functions being applied to an input:


Now, the outputs of the above hash functions(3, 5, 9) are the positions that need to be set. Below is its visual representation:


As you can see the positions 3,5 and 9 are set to 1.

Checking if the data is present in the filter or not:

  1. You give the input. Let’s say the input is again “Kartik”.
  2. The same hash functions are applied to “Kartik.”
  3. On applying the hash functions we would get the same result for obvious reasons which would fetch us the same positions inside the bit array i.e. 3,5 and 9.
  4. The filter checks if these positions in the bit array are set to 1.
  5. Oh, these indexes are already marked as 1(they are on). This means the data might be present.(Username already exists, please try using a different username).

However, if any of the hash function results point to a position that is still 0, the filter can confidently say, "No, the data is not here." This is because the same input, when hashed, will always give the same result, so if those positions aren't set, the data was never added.

A Bloom filter can give false positives (saying something might exist when it doesn't), but it can never give false negatives (saying something doesn’t exist when it actually does). So, when you check for “Kartik” and all bits are 1, it means “Kartik” might exist. If even one bit is 0, the filter is sure it doesn't exist.

To minimize the number of false positives you carefully need to decide the length of the bit array to be used(within your system’s capabilities) and also the number of hash functions.

Increasing the number of hash functions exponentially would result in slowing down the process but decreasing them won’t increase the speed but would result in an increased number of false positives.

Read more about how to calculate the above specifications here

That’s all that I know about Bloom Filters as of now. I have used it in my personal projects but not on big scale ones so the choice of using this data structure should be entirely yours. I’m attaching a link where you can experiment on this data structure and find out how many times you get a false positive?

Here you go:

https://llimllib.github.io/bloomfilter-tutorial/

If you read it till here, I’d like you to please share your story in case you used this data structure in your experience. If not used, do consider giving it a try and if you liked this post, please like, comment and share to increase the reach. You can also follow me for more such content. I post on weekends. Thanks :)

要查看或添加评论,请登录

Kartikey Srivastava的更多文章

  • Event Sourcing - A Design Pattern

    Event Sourcing - A Design Pattern

    Event sourcing is a design pattern where state changes are logged as a sequence of immutable events, rather than…

    5 条评论
  • Setting Up Apache Kafka for My Real-Time Analytics Platform

    Setting Up Apache Kafka for My Real-Time Analytics Platform

    After facing multiple issues I've finally setup Apache Kafka . So far : Set up Kafka on my local machine.

    1 条评论
  • RPC - Remote Procedure Call

    RPC - Remote Procedure Call

    Imagine you are working in a microservice architecture where there are 2 services: A and B. Service B can operate only…

  • Chaos Monkey: Unleashing Chaos for Resilient Systems

    Chaos Monkey: Unleashing Chaos for Resilient Systems

    Picture this: You’re lounging on your couch, popcorn in hand, mid-way through your favourite Netflix series when boom…

    2 条评论
  • Encryption —> Symmetric

    Encryption —> Symmetric

    In simple words, converting a readable information(plain text) into something unreadable(also known as cipher text) to…

  • Caching - An overview

    Caching - An overview

    In simple terms, caching means storing the frequently accessed data in a storage where one can quickly retrieve it…

  • How to choose the right database for your system?

    How to choose the right database for your system?

    Being a software developer isn’t just about making a system and getting it running. Imagine you’ve developed an…

    4 条评论
  • Optimizing an API Response

    Optimizing an API Response

    As backend engineers, we frequently deal with API latency issues. It's easy to create an endpoint and invoke a method…

社区洞察

其他会员也浏览了