Why Bloom Filters are preferred for some Read-Heavy Systems?
Pradip Mudi
Senior Engineer at apreeHealth | NIT Warangal | Distributed Systems | Backend | Microservices | Java
Ever wondered how Google Chrome or other browsers identifies malicious websites in a?flash?
Or How any of the apps like Word or Grammarly, quickly tells you if any spelling is?wrong?
Etc etc.
Take a moment and think??
What data structures or tools comes to your mind for the above usecases?
You might think of HashMap or HashSet or you might think I’ll throw up a Redis cluster for cache and store these details, right.
Before we delve into further discussion, let’s consider the website use case and do some math.
Some back-of-the-envelope estimations for the website usecase
Each website’s URL consists of 75–120 characters. Mean character size of 76.97.
1 character = 1 byte
76.97 character * 1 byte = 76.97 bytes
Total websites: 1.13 billion * 76.97 bytes = 86.9761 GB
Or let’s take the case of, Active websites: 200 million * 76.97bytes = 15.39400 GB
Given these estimates, using traditional data structures like HashMap or HashSet to hold this data becomes impractical.
So what about Redis?cache?
Even employing Redis cache would demand significant memory resources:
Challenges and Solutions
Handling 3.22 billion daily active users(for Google) accessing Redis caches would introduce significant overheads, especially in a read-heavy system where data retrieval is frequent:
Addressing these challenges requires careful planning, robust infrastructure, and ongoing optimization.
So how does Google Chrome or other browsers address this?issue?
Addressing these challenges necessitates careful planning, robust infrastructure, and ongoing optimization. But how does Google address these issues efficiently? They leverage Bloom Filters.
What’s a Bloom?Filter?
Bloom filter is a probabilistic data structure, that quickly checks if an item might be in a set. It’s fast but may give occasional wrong answers i.e. false positives.
How It Works:
Example (Malicious Websites Detection):
Adding Malicious Websites:
Got hash values as 3, 7, and 8, so we set those positions value as 1.
领英推荐
Visualization when someone hits a URL in google?chrome:
Checking Membership:
This illustrates how Bloom filters quickly identify potential matches in a set, making them useful for tasks like malware detection.
Let’s say Google is creating an array of size 20 million (actual size is unknown) to create a Bloom filter to map the malicious websites.
The calculation would be as follows:
Total size in bytes
= (Size of one bit) * (Size of the array)
= (1 byte) * (20,000,000 bits)
= 20,000,000 bytes
Total size in MB ≈ 20,000,000 bytes / 1,048,576 bytes/MB ≈ 19.073 MB
So, the size of the bit array would be approximately 19.073 MB, that’s basically peanuts
As you can see, with an array size of 20 million bits, the size of the bit array is approximately 19.073 MB, which is peanuts.
So google stores this bloom filter in your browser cache irrespective of mobile or computer, same goes for other web browsers.
The bigger the Bloom filter array, the better the results.
What are the advantages for Google and other companies, for holding this cache on the client?end?
Benefits of holding this cache on the client end include:
Other use cases for Bloom filters?include:
Disadvantages of Bloom?filter:
Despite their benefits, Bloom filters can yield false positives, requiring additional verification for flagged items.
Implementation:
It’s usecases made me curious and I have implemented a lightweight malicious websites detector project using bloom filter. Please feel free to play with it.
Key Features:
Link to the project:
Nothing fancy, built it from pure curiosity
Conclusion
Bloom filters offer an elegant solution to the challenges posed by read-heavy systems, providing lightning-fast operations with minimal overhead. By efficiently identifying potential matches in a set, Bloom filters enhance the performance of tasks such as malware detection, spell checking, and cache management. Embracing such innovative data structures not only optimizes performance but also streamlines resource utilization in modern computing environments, particularly for read-heavy systems.
Please let me know, your thoughts in the comments.
Happy learning!?:)
Resources: