Coding Challenge #3
Photo by Markus Spiske on Unsplash

Coding Challenge #3

This challenge is to build your own command like tool to compress text files. This is a challenge I first did in 1998 when my employer didn’t have much work for me to do between projects so suggested I pick a skill and polish it.

I picked C and decided to implement a few of the data structures and algorithms from my university course again in C (we did Pascal on the course). At the same time I wanted to start factoring in some of the ‘mechanical sympathy’ that I’d been reading about in Michael Abrash's excellent Graphics Programming Black Book.

Despite the name the book covers a lot more than just graphics programming - it really digs into understanding the architecture of a computer and how that has an impact on the performance of your code. For example when I re-wrote of part the compression tool that this challenge is about to take into account the hard disk buffer size (back before SSDs) the performance on a one megabyte file jumped 100x!

These days the equivalent would be understand the cache hierarchies in a modern CPU and how programming with an awareness of them can radically change the performance of your code. Scott Meyers did great talk that digs into this. You can find the slides on his website here.

The Challenge - Building A Huffman Encoder/Decoder

In the early 1950s David Huffman developed an algorithm to find the optimal prefix code that can be used for lossless data compression.

Given there is usually an unequal distribution of character occurrences in text this can then be used to compress data by giving the most commonly occurring characters the shortest prefix.

For example if we have the string aaabbc, it would normally take up 6 bytes, but if we assign each character a variable length code, with the most frequently occurring character has the shortest code we might give them the following codes:

a: 1
b: 01
c: 10        

and we could reduce the string aaabbc (six bytes) to 111010110 (nine bits). It’s not quite that simple though as we need to ensure that the codes are prefix-free, that is the bit string representing one character is not a prefix of a bit string representing another.

These prefix-free codes are generated by creating a binary tree. Before we get into that let’s consider the steps involved in compression using Huffman Codes:

  1. Read the text and determine the frequency of each character occurring.
  2. Build the binary tree from the frequencies.
  3. Generate the prefix-code table from the tree.
  4. Encode the text using the code table.
  5. Encode the tree - we’ll need to include this in the output file so we can decode it.
  6. Write the encoded tree and text to an output field

Step Zero

In most (Fortran and COLBOL are exceptions) programming languages we index arrays from the element zero onwards because arrays are represented as a pointer to the beginning and an offset from that beginning - i.e. if you’re familiar with C then array[index] is equivalent to *(array + index).

As usually, I’ll leave you to setup your IDE / editor of choice and programming language of choice. After that here’s what I’d like you to do to be ready to test your solution.

After that, please download Les Misérables, by Victor Hugo from Project Gutenberg here: https://www.gutenberg.org/files/135/135-0.txt which will be our test data for this challenge.

Step 1

In this step your goal is to build your tool to accept a filename as input. It should return an error if the file is not valid, otherwise your program should open it, read the text and determine the frequency of each character occurring within the text.

As a debugging aid you might like to optionally log out the table. Here’s some test values to check. There are 333 occurrences of ‘X’ and 223000 of ‘t’.

An alternate to logging the frequency table is to write some unit tests that take in some prepared data and then check that the counting code returns the correct values. Check your chosen programming language’s documentation for it’s unit testing library and if you’ve never tried it before consider giving test driven development a go.

Step 2

In this step your goal is to use the frequencies that you calculated in step 1 to build the binary tree. There’s a good explanation of how to do this complete with a visualisation here: https://opendsa-server.cs.vt.edu/ODSA/Books/CS3/html/Huffman.html

The examples used for the visualisation would be a good starting point for unit tests.

Step 3

In this step your goal is to use the tree to generate the prefix-code table. Once again there’s an explanation and visualisation at: https://opendsa-server.cs.vt.edu/ODSA/Books/CS3/html/Huffman.html

Step 4

In this step your goal is to write a header section to the output file (you’ll want a command line option to specify the filename). The header section will include enough information for your program to be decode the compressed file.

One way of doing this is to write out the tree, another option is to write out the character frequency table. Don’t forget you’ll need some way of knowing when the header ends and when the compressed data starts.

Step 5

This is the moment you’ve been building up to! In this step your goal is to encode the text using the code table and write the compressed content of the source text to the output file file, appending it after the header. Don’t forget translate the prefixes into bit strings and pack them into bytes to achieve the compression.

Step 6

In this step your goal is to read in the header of the encoded file and rebuild the prefix table ready to decode the text. In essence you’re going to do the reverse of step 4.

Step 7

In this step you will read in the remainder of the file and decode it using the prefix table before writing the decoded text back to the specified output file - the reverse of step 5.

At this point you should be able to compress the file with your tool, compare the file size of the input and output see that the output is smaller and then decompress the output file to create an identical copy of the input.

If so, congratulations you’ve written a working Huffman encode/decode tool!

Share Your Solutions!

Even though this is only the first coding challenge I’ve already been asked to share model solutions in as many programming languages as possible - which is an awesome idea, but I just don’t have the time and don’t know even half the programming languages out there so I’d like your help!

If you think your solution is an example of the developers can learn from please share it, put it on GitHub, GitLab or elsewhere. Then let me know - ping me a message via Twitter or LinkedIn or just post about it there and tag me.

I’ll take a look at every solution and I’ll mention the best solution in the follow week’s newsletter.

Request for Feedback

I’m writing these challenges to help you develop your skills as a software engineer based on how I’ve approached my own personal learning and development. What works for me, might not be the best way for you - so if you have suggestions for how I can make these challenges more useful to you and others, please get in touch and let me know. All feedback greatly appreciated.

You can reach me on Twitter, LinkedIn or through SubStack

Thanks and happy coding!

John

This is on the Coding Challenges website as: Write Your Own Compression Tool.

Ahtisham Ilyas

Top Software Testing Voice@2023 | Software Test Engineer | Automation Specialist: Web, Mobile, API | Scrum Fundamental Certified

9 个月

My Solution in java https://github.com/0kakarot0/HuffmanCodingChallenge.git It take me to almost 1 week and 3 days to have it resolved (Took help from various internet source but it was enjoyable). But it helped me to get touched with interesting topics like 1. Simple Binary Tree 2. Huffman coding 3. Linkedhashmap vs Map 4. Conversion into binary string 5. Conversion into byte array 6. a new file type .bin 7. Compression

Mohit Jain

Software Tech Lead @Zemetric | ex Co-Founder @Evy Energy (Acquired)

1 年

The bit packing was something new I learned. The toString() method in JS is a fishy one. It's good to stick with Buffer when playing with bits and characters. Here's my solution - https://github.com/jainmohit2001/coding-challenges/tree/master/src/3 I tried with a 10 MB text file, but somehow the system gave up. I need to make it compatible with large files as well. Let me know if you have suggestions here.

Mikhail Golubitsky

Senior Software Engineer

1 年

This one is a beast, John Crickett. I'm learning a ton and hoping to share a solution tomorrow! I never knew anything about packing bitstrings into bytes before; nor about Huffman Encoding (nor about compression, for that matter).

Grey Newell, M.S.E.

Distributed Systems ?? Senior Solutions Architect at Amazon ?? 12x AWS Certified ?? buildingforscale.com

1 年

We did a project like this in school that was really fun, John Crickett. Are you working towards building your own OS or filesystem?

要查看或添加评论,请登录

John Crickett的更多文章

  • Coding Challenge #86 - Strace

    Coding Challenge #86 - Strace

    This challenge is to build your own version of Linux tool strace. It’s a useful tool to debug programs that you don’t…

    9 条评论
  • From The Challenges - Memcached Server

    From The Challenges - Memcached Server

    ?? ?? Happy Birthday Coding Challenges - Two Years Old! ?? ?? Coding Challenges is turning two this week! To celebrate…

    13 条评论
  • Coding Challenge #85 - Time Zone Converter

    Coding Challenge #85 - Time Zone Converter

    Coding Challenge #85 - Time Zone Converter This challenge is to build your own Time Zone Converter. That is a tool to…

    12 条评论
  • From The Challenges - IRC Client

    From The Challenges - IRC Client

    Welcome To Coding Challenges - From The Challenges! In this Coding Challenges “from the challenges” newsletter I’m…

    5 条评论
  • Coding Challenge #84 - Mandelbrot Set Explorer

    Coding Challenge #84 - Mandelbrot Set Explorer

    This challenge is to build your own Mandelbrot set explorer. The Mandelbrot set is a set of fractals that exhibit great…

    4 条评论
  • From The Challenges - Cat

    From The Challenges - Cat

    Welcome To Coding Challenges - From The Challenges! In this Coding Challenges “from the challenges” newsletter I’m…

    7 条评论
  • Coding Challenge #83 - Markdown Presentation Tool

    Coding Challenge #83 - Markdown Presentation Tool

    Coding Challenge #83 - Markdown Presentation Tool This challenge is to build your own version of Go’s Present or…

    3 条评论
  • From The Challenges - Shell

    From The Challenges - Shell

    Welcome To Coding Challenges - From The Challenges! In this Coding Challenges “from the challenges” newsletter I’m…

    3 条评论
  • Coding Challenge #82 - Markdown To PDF Editor

    Coding Challenge #82 - Markdown To PDF Editor

    Coding Challenge #82 - Markdown To PDF Editor This challenge is to build your own tool to convert markdown to PDF. It…

    14 条评论
  • From The Challenges - Diff

    From The Challenges - Diff

    Welcome To Coding Challenges - From The Challenges! In this Coding Challenges “from the challenges” newsletter I’m…

    7 条评论

社区洞察

其他会员也浏览了