Notes on Data Compression: Part 2

Introduction

In the previous article I gave a quick review of some lossless data compression techniques, including run-length, Huffman and arithmetic coding. In this article we will continue the look at lossless techniques with a discussion of the Lempel-Ziv algorithms, before ramping things up a little and focussing in some detail on the LZW variant. An implementation of the LZW algorithm will be covered in the article to follow this one.

LZ Adaptive Coding

The Lempel-Ziv family of compression techniques are what are known as 'substitutional' compression algorithms. They are characterised by the use of 'codewords' substituted for 'strings' of symbols, with the codewords referencing a store from which the strings are reconstructed on decoding. The methods have some advantages over other methods in that they are 'universal'; i.e. they make no assumptions about the data set's symbol probability distribution, and are inherently adaptive. They also automatically encode the 'model' table within the encoded data, so it can be reconstructed by the decoder without additional data transmission.

There are two main flavours of LZ algorithm: LZ1, also known as LZ77 ("A Universal Algorithm for Sequential Data Compression", Ziv and Lempel, 1977), and LZ2, also known as LZ78 ("Compression of Individual Sequences via Variable-Rate Coding", Ziv and Lempel, 1978). The first of these is conceptually the simplest to understand, but can be slightly more complicated to implement in software (or indeed logic).

In brief, the last n symbols are kept in a buffer. As new symbols arrive, they are appended to the current 'string' of symbols. If the string is in the buffer, then the string is maintained and the buffer updated with the new symbol. Once the string has grown to a point where it is no longer in the buffer, the string as it stood before the last symbol was appended (which was in the buffer) is substituted with a code consisting of a pointer into the buffer where the string occurs, and a run-length indicating how long the string was. The symbol that caused the search to fail now becomes the single symbol string, it is appended to the buffer, and the process continues. Since only the last n symbols are remembered, the buffer can be thought of as a window on the data, n symbols wide, sliding along the data set. Hence you will often find this method called the Sliding Window technique. The buffer is a history of the last n symbols, and so is often referred to as a History Buffer.

In its purest form all strings are encoded as pointer/run-length codes, but this is inefficient if the string is a single symbol. Variants might allow single characters to be encoded as 'literals', but then the pointer/run-length codes must be distinguishable from the literals. So this is quite a simple method. The difficulty is in determining whether a particular string is in the history buffer, and ensuring the search is exhaustive. This method is, in my experience, easier to do well in logic than in software, where Content Addressable Memory can be implemented, if not easily, at least with some confidence. More on this later.
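To make the sliding window scheme concrete, here is a minimal Python sketch of it. The function names, token format, window size and minimum match length are my own illustrative choices, not anything from the original papers; tokens are kept as tuples rather than packed bits, and the history search is the naive exhaustive one.

```python
# Illustrative LZ1/LZ77 sketch. Tokens are ('lit', byte) literals or
# ('ref', distance, length) pointer/run-length pairs into the history
# buffer. A real implementation packs these into bits and uses a much
# faster search than this exhaustive scan.

def lz77_compress(data: bytes, window: int = 4096, min_match: int = 3):
    tokens = []
    pos = 0
    while pos < len(data):
        start = max(0, pos - window)          # left edge of the sliding window
        best_len, best_dist = 0, 0
        for cand in range(start, pos):        # exhaustive history search
            length = 0
            # the match may run past pos into the lookahead (overlapping copy)
            while (pos + length < len(data)
                   and data[cand + length] == data[pos + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, pos - cand
        if best_len >= min_match:
            tokens.append(('ref', best_dist, best_len))
            pos += best_len
        else:
            tokens.append(('lit', data[pos]))  # short matches become literals
            pos += 1
    return tokens

def lz77_decompress(tokens) -> bytes:
    out = bytearray()
    for tok in tokens:
        if tok[0] == 'lit':
            out.append(tok[1])
        else:
            _, dist, length = tok
            for _ in range(length):           # byte at a time: handles overlap
                out.append(out[-dist])
    return bytes(out)
```

Note that the decoder needs no separate table: the history buffer is rebuilt from the decoded output itself, which is the sense in which the 'model' travels inside the encoded data.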

The second method (LZ2) is similar to LZ1 in its substitutional nature, but uses a 'dictionary' of strings which have previously been seen in the input data, with codewords making references into the dictionary's list of strings, or 'entries.' These codewords are then substituted in place of the strings of symbols.

Initially the dictionary contains only the single symbol strings (e.g. values 0 to 255 for 8 bit bytes). As symbols arrive, they are appended to the current string, and the dictionary is searched to see whether it has an entry for that string. If it does, then a new symbol is appended to the current string. This continues until no entry is found in the dictionary for the current string. When this happens a new entry is assigned a codeword, and the string placed in the dictionary entry for that codeword. The codeword for the last matching string is output and, in the purest LZ2 form, the symbol causing the mismatch is also output. However, in the most popular variant, due to Welch ("A Technique for High-Performance Data Compression", Welch, 1984), called LZW, the leftover symbol instead becomes the current single symbol string, and the process continues.
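A minimal Python sketch of this 'purest' LZ2 behaviour might look as follows. The names and the dict-based storage are my own illustrative assumptions; a real implementation would pack the (codeword, byte) pairs into a bit stream.

```python
# Illustrative sketch of the LZ2/LZ78 behaviour described above: the
# dictionary is seeded with the 256 single-byte strings, and on each
# mismatch the codeword of the last match plus the mismatching byte is
# output. (LZW instead carries the mismatching byte forward.)

def lz78_compress(data: bytes):
    dictionary = {bytes([i]): i for i in range(256)}
    tokens = []
    current = b''
    for value in data:
        candidate = current + bytes([value])
        if candidate in dictionary:
            current = candidate                      # keep extending the match
        else:
            dictionary[candidate] = len(dictionary)  # assign next free codeword
            tokens.append((dictionary[current], value))
            current = b''                            # pure LZ78: start afresh
    if current:
        tokens.append((dictionary[current], None))   # flush trailing match
    return tokens

def lz78_decompress(tokens) -> bytes:
    strings = {i: bytes([i]) for i in range(256)}
    out = bytearray()
    for code, value in tokens:
        if value is None:
            out += strings[code]                     # trailing match only
        else:
            new = strings[code] + bytes([value])
            out += new
            strings[len(strings)] = new              # rebuild the same entry
    return bytes(out)
```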

So that rounds up the major lossless algorithms (excluding run-length encoding, which is rarely used on its own), implemented to varying degrees in both software and logic. Next I want to look in more detail at one of the more popular LZ algorithms, taking LZW as a case study, and at what it takes to make it a fully specified and operational algorithm which can be implemented as a logic codec or in software.

The Lempel-Ziv-Welch (LZW) Compression Algorithm

As has been mentioned in the section above, the LZ2 algorithm, in particular the LZW variant, has a dictionary with a number of entries. Each entry has a codeword associated with it, with a unique value. Each entry refers to a string of symbols that have either been seen in the incoming data stream before, or are the single symbol strings that the dictionary was initialised with.

From now on, symbols will be 8 bit bytes, with values 0 to 255. It doesn't matter if these bytes represent ASCII codes, or object data or whatever. The entries, with their associated codewords, which represent the single byte strings are known as root entries (or root codewords). To make this algorithm practical we will have to start imposing some restrictions on what is allowed. For example, we cannot keep building new entries in our dictionary indefinitely and it is usual, though not universal, to restrict codewords to be 12 bits wide, giving a possible 4096 codewords, 256 of which must be the root codewords. The other codewords remain unassigned at the start of a compression procedure and we are free to assign them to particular strings as we choose.

In our previous discussions, no limit has been placed on the size of the strings that the dictionary may refer to. There are two reasons why we might want to restrict this. The first is that our dictionary will be held in memory, and we cannot allow it to grow unpredictably large. Secondly, for reasons that we will come on to, on decompression we will have to hold entire strings together before outputting the bytes that make them up, and we will want to limit the holding store used to do this. These are practical restrictions, rather than algorithm necessities, but realising the algorithm in tangible form is what we're aiming to do.

Now, as each string in the dictionary is one that we have seen in the input data stream, and we only add new entries when we see a string which has not been seen before (but we do always add it), it must always be the case that a new entry consists of a string that already has an entry in the dictionary, plus the single byte which caused the string to be 'new.' Therefore, we can always make a new dictionary entry by storing a byte value and a pointer to another entry in the dictionary, where the entry pointed to, appended with the byte value, equals the new string being added. If a codeword is 12 bits, and the byte takes 8 bits, each entry occupies 20 bits, and all 4096 entries in the dictionary will need 80K bits in total. This, then, gives us a predictable maximum dictionary size which is a manageable quantity.
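This entry representation can be sketched in a few lines of Python (the dict-based table and the helper's name are my own illustrative choices): a non-root entry is just a (prefix codeword, appended byte) pair, and an entry's full string is recovered by walking the prefix pointers back to a root.

```python
# Each dictionary entry is a (prefix codeword, appended byte) pair, so the
# whole dictionary has a fixed, predictable maximum size.
CODEWORD_BITS = 12
ENTRY_BITS = CODEWORD_BITS + 8                    # 12-bit pointer + 8-bit byte
TOTAL_BITS = (1 << CODEWORD_BITS) * ENTRY_BITS    # 4096 * 20 = 81920 (80K bits)

def entry_string(table, code):
    """Recover an entry's full string by following prefix pointers.
    table maps codeword -> (prefix codeword, byte); roots are 0..255."""
    out = bytearray()
    while code >= 256:              # non-root: follow the prefix pointer
        prefix, value = table[code]
        out.append(value)
        code = prefix
    out.append(code)                # a root codeword's value is the byte itself
    return bytes(reversed(out))     # bytes were gathered back to front
```

For example, with entry 256 stored as the codeword for "R" plus the byte 'I', and entry 257 as entry 256 plus 'N', walking entry 257 recovers the three-byte string "RIN".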

Now I've gone through some of the practical aspects of implementing this algorithm before embarking on describing the actual algorithm in detail, because the algorithm is not independent of practical considerations. It was designed to be implemented in digital form, and the way it is described is based on these practical concerns. So the next section describes the algorithm in more general terms, but you should see that the method is shaped by the restrictions we have placed on our implementation. We will first look at compression, working through an example to illustrate what the process is. The example we will use is of encoding the 13 byte message:

RINGADINGDING.

Compression

The table below shows the sequence of operations when compressing the example. It is assumed that, at the start of the compression, the dictionary is in its initial state; i.e. it contains entries for the 256 single byte strings only. So as to avoid getting confused with actual numbers, the following convention will be used. All of the bytes will be represented by their ASCII character (e.g. R). A codeword for a string will have its string surrounded with parentheses; e.g. (IN) is the codeword for the string 'IN'. At this point it won't matter what value is assigned to the codewords, so long as it is understood that they will all be different. The table below shows the 14 steps in the algorithm's procedure.

Input    Current string   In dict?   New entry   Output
R        R                ✓          -           -
I        RI               ·          (R)I        (R)
N        IN               ·          (I)N        (I)
G        NG               ·          (N)G        (N)
A        GA               ·          (G)A        (G)
D        AD               ·          (A)D        (A)
I        DI               ·          (D)I        (D)
N        IN               ✓          -           -
G        ING              ·          (IN)G       (IN)
D        GD               ·          (G)D        (G)
I        DI               ✓          -           -
N        DIN              ·          (DI)N       (DI)
G        NG               ✓          -           -
(end)    NG               -          -           (NG)

From the table above, the first character input is 'R', as shown in the leftmost column. Since this is the first input of any kind, the current string we are working with must be an empty (or NULL) string. Adding the 'R' now makes it a single character string, as shown in the next column. Is this string in the dictionary? Well, yes. All single character strings are always in the dictionary; that is how it is initialised. The third column indicates this with a tick. When a match is found in the dictionary, no new entry is built (the matched entry already exists), and nothing is output (under normal circumstances). The next character is input, 'I', and the current working string becomes "RI". This string is not in the dictionary. We haven't built any new entries yet, so only single character strings can match. The dot in the third column shows that no match is to be found. So, a new string has been seen, and an entry must be built for it.

From an algorithm point of view, the entire string could be stored in any manner, so long as it can be matched against (however inefficiently) at some future point. However, we're working toward a practical implementation, and introducing a representation now (one that isn't strictly necessary) will save us a lot of trouble later. Since the entry to be built is for a string which must be composed of a string already in the dictionary, plus one character, all that is needed to store the new string is the codeword for the string already in the dictionary, and the character which, when appended, makes the new string. If it is not clear why this works, remember: we keep pulling in characters from the input, appending them to the current string, until the current string is not in the dictionary. Therefore, the current string at that point must be one character longer than a string with a dictionary entry. Thus, as shown in column 4 of the table, the new entry is '(R)I', meaning the codeword for the string "R" (a single character string in this case), followed by the character 'I', together representing the string "RI". At this point output is required. On a mismatch we output the codeword for the string last matched against. This is the same codeword as was used in the new entry we built (in this case (R)).

Now, you might ask, why not output the codeword for the new entry we just built, as this represents all the input we have had so far? Well, when we decompress the data, we'd be in the unfortunate position of receiving a codeword for a string which wasn't in the dictionary. For example, (RI) would be the first codeword for decompression, which clearly cannot exist in a dictionary initialised only for single character strings. As we shall see later, even if we don't output new entry codewords as we build them, this problem is not entirely avoided. More on this later. So (R) is output, which still leaves the input character 'I' unrepresented, and so this becomes our new string, lest we forget about it.

The next input character is 'N', making a new current string of IN. There is no match in the dictionary, and so '(I)N' is added, as for '(R)I' previously. The codeword '(I)' is output, and the new string becomes "N". Now this procedure remains the same for the next four input characters (G, A, D, I) where four new entries are built ((N)G, (G)A, (A)D and (D)I) and four codewords are output ((N), (G), (A), (D)).

Then, something more interesting happens (at last!). We have "I" as our unprocessed string, and 'N' as our input character, giving us "IN", which is (hurrah!) in the dictionary, built at row 3. Now all we have to do is remember the entry for "IN" (stored as (I)N) and input the next character. No codeword is output, and no new entry is built. The next character input is 'G', giving us "ING" as our current string, and this is not in the dictionary. In fact, no three character entries have yet been built. The procedure for a failed match applies once more. A new entry is built, stored as (IN)G. That is, the entry number for (IN) is stored, along with the character value for 'G'. The string we have matched is output (i.e. (IN)); the first dynamically generated entry codeword. And the current string is set to "G".

The rest of the input is processed in the same manner until the end. There is one further match to a dynamically generated dictionary entry, and it is processed in the same manner as described above. The only thing left to clear up is what to do with the end of the data stream. It will always be the case that at the end of processing there will be a string that is matched in the dictionary, but has not been represented in the output. This could be a single character string, or, as in our example a match with a dynamically allocated entry. Since the normal procedure is to carry it forward for appending the next input character, we must do something different, since no more input is forthcoming. In this case we simply 'pretend' that there is no match, and output the codeword for the current outstanding string. No entry is built, and the current string is effectively set to the null string. All the input has been represented in the output in an encoded form which must, therefore, contain enough information about the original data to recover it. It doesn't explicitly contain the dictionary, but, as we shall soon see, this too has been encoded in the output data; a feature which is one of the major attractions to this class of algorithm. So, without further ado, let's decompress the encoded data, and see what happens.
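Before moving on, the whole compression procedure just described can be sketched in a few lines of Python. This is an illustrative sketch with my own names: codewords are kept as plain integers (the 12-bit limit and bit-packing are ignored for now), and the dictionary maps strings directly to codewords rather than using the (prefix, byte) storage described earlier.

```python
# LZW compression following the steps above: extend the current string
# while it matches a dictionary entry; on a mismatch, output the codeword
# of the last match, build a new entry, and carry the mismatch byte
# forward; at end of data, flush the outstanding match.

def lzw_compress(data: bytes):
    dictionary = {bytes([i]): i for i in range(256)}  # the 256 root entries
    output = []
    current = b''
    for value in data:
        candidate = current + bytes([value])
        if candidate in dictionary:
            current = candidate                       # match: keep extending
        else:
            output.append(dictionary[current])        # codeword of last match
            dictionary[candidate] = len(dictionary)   # build the new entry
            current = bytes([value])                  # mismatch byte carried forward
    if current:
        output.append(dictionary[current])            # end of data: flush
    return output
```

Run on the message "RINGADINGDING" this produces ten codewords, the first six of which are roots, matching the table above.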

Decompression

In decompression, the input must consist of the codewords we have previously generated as the compressed output, in the order in which we generated them. These will be, as in our compression example, a mixture of root codewords and dynamically allocated codewords. We can only start with an initialised dictionary, as the final compression dictionary has not been stored with the data. This restricts the point at which decompression can begin within a compressed data set. Only at points where the dictionary was in its initial state during compression can decompression begin. In the simple examples here, this means at the beginning of the data.

The table below shows the decompression sequence for our previously compressed example. The first six codewords are all root codewords, and their byte values may be sent directly to the output. At each stage (except for the first codeword) a dictionary entry is built with a byte value from the current input codeword, and a pointer to the previously processed codeword's dictionary entry. That is, an entry representing the last string processed with the current codeword's first byte value appended to it. Hence the first entry built is (R)I, since, when (I) is input, the last codeword processed was (R) and the byte value of (I) is 'I'.

Input codeword   Bytes output   New entry built
(R)              R              -
(I)              I              (R)I
(N)              N              (I)N
(G)              G              (N)G
(A)              A              (G)A
(D)              D              (A)D
(IN)             IN             (D)I
(G)              G              (IN)G
(DI)             DI             (G)D
(NG)             NG             (DI)N

The seventh codeword input is (IN); a non-root codeword. It already has an entry in the dictionary at this point, as shown on the third line in the figure. This entry has a byte value of 'N', and so this is placed on a 'LIFO' (last in, first out store). As we follow this example through, you will see that the byte values recovered from a multi-byte codeword emerge in reverse order. Therefore we must store them until the entire string is decoded, and then output them in the correct order. A LIFO is used, where we can 'push' bytes into it, and then 'pop' them when we have all the bytes; the last byte pushed being the first byte popped.

The entry for (IN) also has a pointer to another entry; in this case it is the root entry (I). Its byte value is also pushed onto the LIFO and, as it is a root codeword, the contents of the LIFO are flushed to the output. At this time, an entry is built using the same rules as before. The new entry has a pointer to the previously processed codeword (i.e. (D)) and a byte value equal to the current codeword's last pushed byte value. Since we terminated the linked-list traversal pointing to (I), the byte value is 'I', as shown in the figure. The rest of the input codewords are processed in a similar manner, with the entire original data recovered, and the dictionary left in the same state as at the end of compression.
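The decompression procedure as described so far might be sketched as follows, with entries stored as (prefix codeword, byte) pairs and a LIFO reversing each recovered string. The names are my own illustrative choices, and note that this version does not yet cope with the exception case discussed in the next section.

```python
# Decompression sketch: walk each codeword's prefix pointers, pushing
# bytes onto a LIFO, then pop them out to restore the original order.
# After each codeword (except the first), build an entry pointing at the
# previous codeword with the current string's first byte appended.

def lzw_decompress_basic(codes):
    table = {i: (None, i) for i in range(256)}   # roots have no prefix
    out = bytearray()
    prev = None
    for code in codes:
        lifo = []
        c = code
        while c is not None:                     # walk prefix pointers,
            prefix, value = table[c]             # pushing bytes in reverse
            lifo.append(value)
            c = prefix
        first_byte = lifo[-1]                    # last pushed = string's first byte
        while lifo:
            out.append(lifo.pop())               # popping restores the order
        if prev is not None:                     # build (prev string) + first byte
            table[len(table)] = (prev, first_byte)
        prev = code
    return bytes(out)
```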

So, we have managed to recover the data, without loss of information, and without storing the dictionary—all was encoded uniquely within the compressed data itself. And that would be all there is to it, except there is one slight twist in the tale. There is a circumstance where, during decompression, a codeword is encountered which has no entry in the dictionary! This is the KωK exception case.

The KωK Exception Case

Welch describes in his paper a particular corner case in the LZW algorithm which, if processed as described above, will cause a breakdown of the procedure. It is called the KωK (kay-omega-kay) exception case, since inputting the sequence KωK, where K is any single character and ω is a string for which there is a dictionary entry, causes a codeword for the string KωK to be output (during compression) at the same time as it is built in the dictionary. This causes a problem on decompression since no entry will have been built on encountering it, as entries are built after the codeword is processed. More properly, the sequence KωKωK causes the problem, and the clue to its solution is in the fact that the entry needed is the very next entry we would have built at the end of processing the codeword causing the problem. So, we might have enough information to construct the missing entry uniquely, and recover the situation.

To see how, an example is in order. The simplest KωK sequence is the repeated character. This is because K can be any character, and all single character strings have entries in the dictionary by default. The table below shows a repeated character compression sequence.

Input    Current string   In dict?   New entry   Output
K        K                ✓          -           -
K        KK               ·          (K)K        (K)
K        KK               ✓          -           -
K        KKK              ·          (KK)K       (KK)
K        KK               ✓          -           -
(end)    KK               -          -           (KK)

The procedure followed is, I hope you can see, exactly that of the previous compression example. Entries are built when no match is found, and the last character input is carried forward as the new current string. Now, if we are presented with the codeword sequence shown in the output column of the figure, what will happen? We can process the first codeword, as it is a root codeword with a known byte value. No entry can be built, as no previous codeword has been processed. The next codeword is (KK). But this is a dynamically allocated codeword, and we haven't built any entries yet. There are a couple of slightly different methods for detecting, and then correcting, this problem but, basically, the problem is detected if the value of the codeword is the same as that of the next dictionary entry to be built (and they are built in ascending order). It is corrected by placing the byte value of the last processed codeword onto the LIFO, and re-submitting the last codeword value again in place of the current problematic one. This will, once the LIFO is flushed, place the right characters on the output, and will result in the 'missing' dictionary entry being built, using the normal procedures. An alternative exists where the missing entry is built first, but the result is just the same, and it is left to the reader to figure out the details. This detection and correction works for all KωK cases, and ties up the last loose end, as far as the basic algorithm goes.
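Putting detection and correction together with the earlier decompression procedure gives a sketch along these lines. This illustrative version (names my own) implements the 'build the missing entry first' alternative mentioned above: a codeword equal to the next free entry slot cannot have been built yet, and must be the previous string plus that string's own first byte.

```python
# LZW decompression with the KwK correction added.

def first_byte(table, code):
    """First byte of a codeword's string: follow prefixes to the root."""
    while table[code][0] is not None:
        code = table[code][0]
    return table[code][1]

def lzw_decompress(codes):
    table = {i: (None, i) for i in range(256)}   # roots have no prefix
    out = bytearray()
    prev = None
    for code in codes:
        if code == len(table):                   # KwK exception detected:
            # the missing entry is prev's string plus its own first byte
            table[code] = (prev, first_byte(table, prev))
        elif prev is not None:                   # normal entry build
            table[len(table)] = (prev, first_byte(table, code))
        lifo = []                                # recover the string via a LIFO
        c = code
        while c is not None:
            prefix, value = table[c]
            lifo.append(value)
            c = prefix
        while lifo:
            out.append(lifo.pop())
        prev = code
    return bytes(out)
```

Fed the repeated-character output (K)(KK)(KK) from the table above, the first (KK) triggers the correction and the original run of five Ks is recovered; a non-exceptional stream such as the RINGADINGDING codewords decodes exactly as before.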

Describing an algorithm and manipulating symbols on paper shows how an algorithm might work, but we haven't arrived yet at a method which is fully practical. There are still some details to consider before the above can become a working piece of software, or a logic implementation. Some additional (fortunately much simpler) algorithms need to be roped in, and some design constraints and limits placed on the algorithm to map the symbolic representation onto more practical units of information, such as bytes. But that must wait for the next article.
