Coding Challenge  #53 - Bloom Filter Spell Checker

Coding Challenge #53 - Bloom Filter Spell Checker

This challenge is to build your own micro spell checker. The goal is to create a spell checker that can determine if a word is probably spelt correctly without having to store the full list of words. Thus the spell checker can use less storage (disk or memory). A task that is much less relevant these days, but 20 years ago was incredibly useful on low storage devices.

Whilst we don’t have this problem with spell checking today, we do have similar problems checking if an item is in a data set that would be too big to store efficiently.

So how are these problems solved? With Bloom filters.

A bloom filter is a probabilistic data structure. It is built around a bit array and one or more hash functions. It provides a fast and efficient way of handling set membership queries when the set either does not fit within the memory constraints, or querying the full set would incur a performance penalty.

Bloom filters have been used in:

???Spell checkers - like you’re going to build in this Coding Challenge.

???Network routers - speeding up packet routing protocols amongst other uses.

???Databases - a quick way of determining if if a row is likely to be in the database without performing costly disk lookups. Used in Google’s Bigtable, Apache HBase, Apache Cassandra and Postgres

???Web crawlers - don’t re-crawl a URL that’s already been seen.

???Cyber security - blacklisting, whitelisting and password / username checkers

???Web caches - preventing ‘one hit wonders’ being stored in disk caches for CDNs.

???Cryptocurrency - Speeding up wallet synchronisation and finding logs in the blockchain.

???Recommendation systems - Medium has used Bloom filters to not recommend articles you’ve already seen.

If You Enjoy Coding Challenges Here Are Three Ways You Can Help Support It

  1. Refer a friend or colleague to the newsletter - forward this one to them. ??
  2. Sign up for a paid subscription - think of it as buying me a coffee ?? twice a month, with the bonus that you also get 20% off any of my courses.
  3. Buy one of my courses that walk you through a Coding Challenge.

The Challenge - Building A Simple Spell Checker Using a Bloom Filter

In this challenge you’re going to implement a Bloom filter, and insert thousands of words from a dictionary. You’ll then test other words against the dictionary to see if they’re probably spelt correctly.

Step Zero

In this step you’re going to set your environment up ready to begin developing and testing your solution. I’ll leave you to setup your IDE / editor of choice and programming language of choice for building command line tools.

If you’re not on a Unix like operating system you’ll need to get a dictionary file for testing.

If you are on a Unix-like OS you should be able to generate one by running the command:

% cat /usr/share/dict/words >> dict.txt        

Step 1

In this step your goal is to use test driven development (TDD) to develop the Bloom filter data structure. OK you don’t have to use TDD, but I’d suggest writing some tests either before or after writing the Bloom filter, both to test that it works correctly and to help you understand how it works.

To implement a Bloom filter:

  1. Determine the number of items that are likely to be stored and the probability of false positives you system can tolerate then use that to determine the memory requirements and number of has functions needed. There’s a formula, to use for this.
  2. Create a bit array and set all bits to zero.
  3. Insert items by applying the hash functions and setting the corresponding bits to one.
  4. Query for the presence of an item by applying the hash functions. If any of the corresponding bits are zero, the item is definitely not in the set. If all of the bits are one, the item is probably in the set.

To determine the how many hash functions you should use and the interaction between the number of bits, the number of items and the number of hash functions there are a set of formulas which you’ll find documented on the Bloom filter Wikipedia page.

You can use hash functions from your programming languages standard library, implement your own or implement an existing one yourself. For the latter I suggest you check out the Fowler–Noll–Vo hash function and implement a couple of the versions of it. It’s a quick and easy hash function to implement if you’ve never written one before.

Continued...

The steps for the rest of this Coding Challenge are detailed on the website as Build Your Own Spell Checker Using A Bloom Filter.


Share Your Solutions!

If you think your solution is an example of the developers can learn from please share it, put it on GitHub, GitLab or elsewhere. Then let me know via Twitter or LinkedIn or just post about it there and tag me.

Request for Feedback

I’m writing these challenges to help you develop your skills as a software engineer based on how I’ve approached my own personal learning and development. What works for me, might not be the best way for you - so if you have suggestions for how I can make these challenges more useful to you and others, please get in touch and let me know. All feedback greatly appreciated.

You can reach me on Twitter, LinkedIn or through SubStack.

Thanks and happy coding!

John


Caleb Belk

Full-Stack Developer

1 年

What a coincidence. I’ve been building bloomfilter spell checkers a lot in Byron Sommardahl’s dev amplifier program. Then I just signed up for these coding challenges and the first one is… bloom filters. Why does everyone want me to build a bloomin bloom filter! ??

FOJLA RABBI

Full Stack Web Developer ? Helping startup companies founder save time & money ? Helping coaches at software agencies.

1 年

Building real-world applications is a fantastic way to learn and grow as a dev.

要查看或添加评论,请登录

John Crickett的更多文章

  • From The Challenges - Spotify Client

    From The Challenges - Spotify Client

    Welcome To Coding Challenges - From The Challenges! In this Coding Challenges “from the challenges” newsletter I’m…

    3 条评论
  • Coding Challenge #86 - Strace

    Coding Challenge #86 - Strace

    This challenge is to build your own version of Linux tool strace. It’s a useful tool to debug programs that you don’t…

    10 条评论
  • From The Challenges - Memcached Server

    From The Challenges - Memcached Server

    ?? ?? Happy Birthday Coding Challenges - Two Years Old! ?? ?? Coding Challenges is turning two this week! To celebrate…

    13 条评论
  • Coding Challenge #85 - Time Zone Converter

    Coding Challenge #85 - Time Zone Converter

    Coding Challenge #85 - Time Zone Converter This challenge is to build your own Time Zone Converter. That is a tool to…

    12 条评论
  • From The Challenges - IRC Client

    From The Challenges - IRC Client

    Welcome To Coding Challenges - From The Challenges! In this Coding Challenges “from the challenges” newsletter I’m…

    5 条评论
  • Coding Challenge #84 - Mandelbrot Set Explorer

    Coding Challenge #84 - Mandelbrot Set Explorer

    This challenge is to build your own Mandelbrot set explorer. The Mandelbrot set is a set of fractals that exhibit great…

    4 条评论
  • From The Challenges - Cat

    From The Challenges - Cat

    Welcome To Coding Challenges - From The Challenges! In this Coding Challenges “from the challenges” newsletter I’m…

    7 条评论
  • Coding Challenge #83 - Markdown Presentation Tool

    Coding Challenge #83 - Markdown Presentation Tool

    Coding Challenge #83 - Markdown Presentation Tool This challenge is to build your own version of Go’s Present or…

    3 条评论
  • From The Challenges - Shell

    From The Challenges - Shell

    Welcome To Coding Challenges - From The Challenges! In this Coding Challenges “from the challenges” newsletter I’m…

    3 条评论
  • Coding Challenge #82 - Markdown To PDF Editor

    Coding Challenge #82 - Markdown To PDF Editor

    Coding Challenge #82 - Markdown To PDF Editor This challenge is to build your own tool to convert markdown to PDF. It…

    14 条评论

社区洞察

其他会员也浏览了