Bloom Filters
Imagine walking into a mobile shop that only sells Samsung and OnePlus phones. You ask the shopkeeper for an iphone. Without wasting any time, he says, “No, we don’t sell iphones.” Now, you change your mind to buy a One Plus model(a rare one). Instead of denying straightaway, the shopkeeper responds, “It might be in stock, but I can’t be 100% sure until I look.”
This is similar to how a Bloom Filter works, it can definitely tell you if something doesn’t exist(like an iphone in an android shop), but if it says something might exist, there is a certain uncertainty of the data’s presence.
To keep it more simple, Bloom Filter can give you a false positive(i.e. return true for something that is false). But it can never give you a false negative(i.e. return false for something that is true).
We all have encountered those warning messages while signing up on some platform that says, “Username already taken/exists”. That’s Bloom Filters in picture(not always as there are some more methods like querying the database or cache but that becomes inefficient when the user traffic explodes).
Let’s understand this Data Structure bit by bit:
A Bloom Filter is a probabilistic data structure. Now what does that mean?
Probabilistic in context of data structures means that this data structure provides answers with a certain possibility of being correct. This filter can definitely say when an element is not present inside it, but it might be wrong in some cases when it says, “Yes, I have this data”.
A bloom filter uses a bit array to store the data. Now when I say it stores the data, it doesn’t store the actual raw data but applies some hash functions
Let’s break this down:
Below is a visual representation of hash functions being applied to an input:
Now, the outputs of the above hash functions(3, 5, 9) are the positions that need to be set. Below is its visual representation:
领英推荐
As you can see the positions 3,5 and 9 are set to 1.
Checking if the data is present in the filter or not:
However, if any of the hash function results point to a position that is still 0, the filter can confidently say, "No, the data is not here." This is because the same input, when hashed, will always give the same result, so if those positions aren't set, the data was never added.
A Bloom filter can give false positives
To minimize the number of false positives
Increasing the number of hash functions exponentially would result in slowing down the process but decreasing them won’t increase the speed but would result in an increased number of false positives.
Read more about how to calculate the above specifications here
That’s all that I know about Bloom Filters as of now. I have used it in my personal projects but not on big scale ones so the choice of using this data structure should be entirely yours. I’m attaching a link where you can experiment on this data structure and find out how many times you get a false positive?
Here you go:
If you read it till here, I’d like you to please share your story in case you used this data structure in your experience. If not used, do consider giving it a try and if you liked this post, please like, comment and share to increase the reach. You can also follow me for more such content. I post on weekends. Thanks :)