登录查看更多内容

Coding Challenge #3

John Crickett

Helping you become a better software engineer by building real-world applications.

发布日期: 2023年4月1日

This challenge is to build your own command like tool to compress text files. This is a challenge I first did in 1998 when my employer didn’t have much work for me to do between projects so suggested I pick a skill and polish it.

I picked C and decided to implement a few of the data structures and algorithms from my university course again in C (we did Pascal on the course). At the same time I wanted to start factoring in some of the ‘mechanical sympathy’ that I’d been reading about in Michael Abrash's excellent Graphics Programming Black Book.

Despite the name the book covers a lot more than just graphics programming - it really digs into understanding the architecture of a computer and how that has an impact on the performance of your code. For example when I re-wrote of part the compression tool that this challenge is about to take into account the hard disk buffer size (back before SSDs) the performance on a one megabyte file jumped 100x!

These days the equivalent would be understand the cache hierarchies in a modern CPU and how programming with an awareness of them can radically change the performance of your code. Scott Meyers did great talk that digs into this. You can find the slides on his website here.

The Challenge - Building A Huffman Encoder/Decoder

In the early 1950s David Huffman developed an algorithm to find the optimal prefix code that can be used for lossless data compression.

Given there is usually an unequal distribution of character occurrences in text this can then be used to compress data by giving the most commonly occurring characters the shortest prefix.

For example if we have the string aaabbc, it would normally take up 6 bytes, but if we assign each character a variable length code, with the most frequently occurring character has the shortest code we might give them the following codes:

a: 1
b: 01
c: 10

and we could reduce the string aaabbc (six bytes) to 111010110 (nine bits). It’s not quite that simple though as we need to ensure that the codes are prefix-free, that is the bit string representing one character is not a prefix of a bit string representing another.

These prefix-free codes are generated by creating a binary tree. Before we get into that let’s consider the steps involved in compression using Huffman Codes:

Read the text and determine the frequency of each character occurring.
Build the binary tree from the frequencies.
Generate the prefix-code table from the tree.
Encode the text using the code table.
Encode the tree - we’ll need to include this in the output file so we can decode it.
Write the encoded tree and text to an output field

Step Zero

In most (Fortran and COLBOL are exceptions) programming languages we index arrays from the element zero onwards because arrays are represented as a pointer to the beginning and an offset from that beginning - i.e. if you’re familiar with C then array[index] is equivalent to *(array + index).

As usually, I’ll leave you to setup your IDE / editor of choice and programming language of choice. After that here’s what I’d like you to do to be ready to test your solution.

After that, please download Les Misérables, by Victor Hugo from Project Gutenberg here: https://www.gutenberg.org/files/135/135-0.txt which will be our test data for this challenge.

Step 1

In this step your goal is to build your tool to accept a filename as input. It should return an error if the file is not valid, otherwise your program should open it, read the text and determine the frequency of each character occurring within the text.

As a debugging aid you might like to optionally log out the table. Here’s some test values to check. There are 333 occurrences of ‘X’ and 223000 of ‘t’.

An alternate to logging the frequency table is to write some unit tests that take in some prepared data and then check that the counting code returns the correct values. Check your chosen programming language’s documentation for it’s unit testing library and if you’ve never tried it before consider giving test driven development a go.

Step 2

In this step your goal is to use the frequencies that you calculated in step 1 to build the binary tree. There’s a good explanation of how to do this complete with a visualisation here: https://opendsa-server.cs.vt.edu/ODSA/Books/CS3/html/Huffman.html

The examples used for the visualisation would be a good starting point for unit tests.

领英推荐

Understanding Pinning in?Rust

Luis Soares 4 个月前

Extending the Power of Logic Simulations using the…

Simon Southwell 2 年前

Recursive and Tail Recursive Function in DataWeave 2.0

Shubham Shingi 1 年前

Step 3

In this step your goal is to use the tree to generate the prefix-code table. Once again there’s an explanation and visualisation at: https://opendsa-server.cs.vt.edu/ODSA/Books/CS3/html/Huffman.html

Step 4

In this step your goal is to write a header section to the output file (you’ll want a command line option to specify the filename). The header section will include enough information for your program to be decode the compressed file.

One way of doing this is to write out the tree, another option is to write out the character frequency table. Don’t forget you’ll need some way of knowing when the header ends and when the compressed data starts.

Step 5

This is the moment you’ve been building up to! In this step your goal is to encode the text using the code table and write the compressed content of the source text to the output file file, appending it after the header. Don’t forget translate the prefixes into bit strings and pack them into bytes to achieve the compression.

Step 6

In this step your goal is to read in the header of the encoded file and rebuild the prefix table ready to decode the text. In essence you’re going to do the reverse of step 4.

Step 7

In this step you will read in the remainder of the file and decode it using the prefix table before writing the decoded text back to the specified output file - the reverse of step 5.

At this point you should be able to compress the file with your tool, compare the file size of the input and output see that the output is smaller and then decompress the output file to create an identical copy of the input.

If so, congratulations you’ve written a working Huffman encode/decode tool!

Share Your Solutions!

Even though this is only the first coding challenge I’ve already been asked to share model solutions in as many programming languages as possible - which is an awesome idea, but I just don’t have the time and don’t know even half the programming languages out there so I’d like your help!

If you think your solution is an example of the developers can learn from please share it, put it on GitHub, GitLab or elsewhere. Then let me know - ping me a message via Twitter or LinkedIn or just post about it there and tag me.

I’ll take a look at every solution and I’ll mention the best solution in the follow week’s newsletter.

Request for Feedback

I’m writing these challenges to help you develop your skills as a software engineer based on how I’ve approached my own personal learning and development. What works for me, might not be the best way for you - so if you have suggestions for how I can make these challenges more useful to you and others, please get in touch and let me know. All feedback greatly appreciated.

You can reach me on Twitter, LinkedIn or through SubStack

Thanks and happy coding!

John

This is on the Coding Challenges website as: Write Your Own Compression Tool.

Coding Challenges

129,689 位关注者

Ahtisham Ilyas

Top Software Testing Voice@2023 | Software Test Engineer | Automation Specialist: Web, Mobile, API | Scrum Fundamental Certified

9 个月

My Solution in java https://github.com/0kakarot0/HuffmanCodingChallenge.git It take me to almost 1 week and 3 days to have it resolved (Took help from various internet source but it was enjoyable). But it helped me to get touched with interesting topics like 1. Simple Binary Tree 2. Huffman coding 3. Linkedhashmap vs Map 4. Conversion into binary string 5. Conversion into byte array 6. a new file type .bin 7. Compression

1 次回应

Mohit Jain

Software Tech Lead @Zemetric | ex Co-Founder @Evy Energy (Acquired)

1 年

The bit packing was something new I learned. The toString() method in JS is a fishy one. It's good to stick with Buffer when playing with bits and characters. Here's my solution - https://github.com/jainmohit2001/coding-challenges/tree/master/src/3 I tried with a 10 MB text file, but somehow the system gave up. I need to make it compatible with large files as well. Let me know if you have suggestions here.

1 次回应

Mikhail Golubitsky

Senior Software Engineer

1 年

This one is a beast, John Crickett. I'm learning a ton and hoping to share a solution tomorrow! I never knew anything about packing bitstrings into bytes before; nor about Huffman Encoding (nor about compression, for that matter).

1 次回应

Grey Newell, M.S.E.

Distributed Systems ?? Senior Solutions Architect at Amazon ?? 12x AWS Certified ?? buildingforscale.com

1 年

We did a project like this in school that was really fun, John Crickett. Are you working towards building your own OS or filesystem?

1 次回应

查看更多评论

要查看或添加评论，请登录

John Crickett的更多文章

Coding Challenge #86 - Strace

2025年3月22日

Coding Challenge #86 - Strace

This challenge is to build your own version of Linux tool strace. It’s a useful tool to debug programs that you don’t…

9 条评论
From The Challenges - Memcached Server

2025年3月15日

From The Challenges - Memcached Server

?? ?? Happy Birthday Coding Challenges - Two Years Old! ?? ?? Coding Challenges is turning two this week! To celebrate…

13 条评论
Coding Challenge #85 - Time Zone Converter

2025年3月8日

Coding Challenge #85 - Time Zone Converter

Coding Challenge #85 - Time Zone Converter This challenge is to build your own Time Zone Converter. That is a tool to…

12 条评论
From The Challenges - IRC Client

2025年3月1日

From The Challenges - IRC Client

Welcome To Coding Challenges - From The Challenges! In this Coding Challenges “from the challenges” newsletter I’m…

5 条评论
Coding Challenge #84 - Mandelbrot Set Explorer

2025年2月22日

Coding Challenge #84 - Mandelbrot Set Explorer

This challenge is to build your own Mandelbrot set explorer. The Mandelbrot set is a set of fractals that exhibit great…

4 条评论
From The Challenges - Cat

2025年2月15日

From The Challenges - Cat

Welcome To Coding Challenges - From The Challenges! In this Coding Challenges “from the challenges” newsletter I’m…

7 条评论
Coding Challenge #83 - Markdown Presentation Tool

2025年2月8日

Coding Challenge #83 - Markdown Presentation Tool

Coding Challenge #83 - Markdown Presentation Tool This challenge is to build your own version of Go’s Present or…

3 条评论
From The Challenges - Shell

2025年2月1日

From The Challenges - Shell

Welcome To Coding Challenges - From The Challenges! In this Coding Challenges “from the challenges” newsletter I’m…

3 条评论
Coding Challenge #82 - Markdown To PDF Editor

2025年1月25日

Coding Challenge #82 - Markdown To PDF Editor

Coding Challenge #82 - Markdown To PDF Editor This challenge is to build your own tool to convert markdown to PDF. It…

14 条评论
From The Challenges - Diff

2025年1月18日

From The Challenges - Diff

Welcome To Coding Challenges - From The Challenges! In this Coding Challenges “from the challenges” newsletter I’m…

7 条评论

See all articles

Coding Challenge #3

John Crickett

Helping you become a better software engineer by building real-world applications.

The Challenge - Building A Huffman Encoder/Decoder

Step Zero

Step 1

Step 2

领英推荐

Step 3

Step 4

Step 5

Step 6

Step 7

Share Your Solutions!

Request for Feedback

Coding Challenges

129,689 位关注者

John Crickett的更多文章

社区洞察

其他会员也浏览了

??Free Form Programming Language: First Beta (2024.1b0)

Decoding the 'Virtual' Function in C++ Programming Language

The Hidden Pitfall of Adding Elements to Vectors in Range-Based For Loops

Reusable Finite State Machine using C++

Day 06 to 09: Coding Problems

Bit Manipulation

The State of State

Uses of C++ In Real World

Liskov’s Substitution Principle | SOLID as a Rock

The Challenge - Building A Huffman Encoder/Decoder

Step Zero

Step 1

Step 2

领英推荐

Step 3

Step 4

Step 5

Step 6

Step 7

Share Your Solutions!

Request for Feedback

Coding Challenges

129,689 位关注者

John Crickett的更多文章

Coding Challenge #86 - Strace

From The Challenges - Memcached Server

Coding Challenge #85 - Time Zone Converter

From The Challenges - IRC Client

Coding Challenge #84 - Mandelbrot Set Explorer

From The Challenges - Cat

Coding Challenge #83 - Markdown Presentation Tool

From The Challenges - Shell

Coding Challenge #82 - Markdown To PDF Editor

From The Challenges - Diff

社区洞察

其他会员也浏览了

??Free Form Programming Language: First Beta (2024.1b0)

Decoding the 'Virtual' Function in C++ Programming Language

The Hidden Pitfall of Adding Elements to Vectors in Range-Based For Loops

Reusable Finite State Machine using C++

Day 06 to 09: Coding Problems

Bit Manipulation

The State of State

Uses of C++ In Real World

Liskov’s Substitution Principle | SOLID as a Rock