Most of the text and image files we handle are full of redundant information, and can (and should) be made much smaller without losing anything, which cuts down the storage needed or the time taken to transmit them, or both.
One way to do this is to analyse the file, notice commonly repeated patterns, and replace them with specific codes. But to do this you need to have the whole file, or a representative chunk of it, available before you start processing. Sometimes you have a stream of characters coming at you and you need to compress them on the fly. The LZW algorithm provides a really neat way of doing this. I’ve been digging into it (for a program to create gif files, as it happens) and now that I understand it I’m so impressed I want to share it.
The input characters will have a certain size. For ASCII text this is 8 bits. For graphics you might use 4 bits to cover 16 possible colours for a simple image, or more if you’re being artistic. The codes output will each have a size, which must be bigger than the input size. Let’s suppose, to keep it simple, that you’re compressing a stream of 4-bit colours and using 7 bits for each code.
The coder has an input stream of characters, an output stream, a buffer (a string of input characters, initially empty), and a dictionary which matches codes to character strings. This is initialised with predefined codes for strings of just one character, which are simply the character values padded with extra zero bits. So in our example the characters 0 through 15 are represented by the codes x00 through x0F. Also, code x10 is known as CLEAR and x11 is known as STOP. The remaining 110 codes, x12 through x7F, can be defined by the user.
The coding algorithm is very simple:

1. Get the next character from the input.
2. Consider the string formed by appending this character to the buffer. Is it in the dictionary?
3. If so, add the character to the buffer, and go back to step 1.
4. If not, add this new string to the dictionary using the next available user code. Output the code corresponding to the existing buffer, clear the buffer and replace it with the single character just read, and go back to step 1.
5. When there are no more input characters, output the code for the string in the buffer and then the STOP code.
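The steps above can be sketched in a few lines of Python (a minimal sketch, assuming 4-bit input characters and 7-bit codes as in our example; all names are my own):

```python
CLEAR, STOP = 0x10, 0x11   # reserved codes
FIRST_USER_CODE = 0x12
MAX_CODE = 0x7F            # largest 7-bit code

def lzw_encode(chars):
    # Predefine a code for every single character: x00..x0F map to 0..15.
    dictionary = {(c,): c for c in range(16)}
    next_code = FIRST_USER_CODE
    buffer = ()   # the string of characters accumulated so far
    output = []
    for c in chars:
        candidate = buffer + (c,)
        if candidate in dictionary:
            buffer = candidate                 # step 3: keep growing the buffer
        else:
            if next_code <= MAX_CODE:          # step 4: define a new code...
                dictionary[candidate] = next_code
                next_code += 1
            output.append(dictionary[buffer])  # ...but output the OLD one
            buffer = (c,)                      # buffer becomes the new character
    if buffer:
        output.append(dictionary[buffer])      # step 5: flush the buffer
    output.append(STOP)
    return output
```

Running `lzw_encode([3, 3, 3, 3, 3, 3])` produces the codes x03, x12, x13 followed by STOP, exactly the sequence traced in the worked example below.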
It may seem odd to output the old code at step 4 rather than the new one you’ve just defined, but in fact this is very cunning, as we shall see.
Let’s see how this works in practice. Suppose we are encoding a picture which starts at the top with a lot of blue pixels of sky, and occasional white cloud pixels. Suppose that blue is colour 3 and white is colour 1.
- The first pixel is 3, blue. The buffer is empty, so we consider the string <3>. Is it in the dictionary? Yes, it is one of the single-character predefined codes, so we just add it to the buffer.
- We grab the next input, which is another 3. Is <33> in the dictionary? No, so we add it to the dictionary as code x12, the first user-defined code, output the code x03 for the buffer, and replace the buffer with the second character, <3>.
- Now we get the third input, another 3. Is <33> in the dictionary? Yes, we’ve just added it. So we put <33> in the buffer.
- The next input is another 3. <333> is not in the dictionary so we add it as x13, output the x12 code for <33>, and revert to <3> in the buffer.
- More blue sky characters take the buffer to <33> and <333>, but at <3333> we have to define a new code (x14), output x13 for <333>, and the buffer reverts to the single character <3>.
So as the monotonous blue sky continues, we output codes representing increasingly long strings of blue pixels. Eventually we encounter a character 1, the start of a white cloud. We output the code for <333…33>, define a (probably useless) code for <333…331>, stick the <1> in the buffer, and start defining codes for <11>, <111> and so on. Emerging from the cloud we are back with strings of 3s for which we already have relevant codes in the dictionary, so we don’t have to re-learn them.
So this is fine. For images with large areas of the same colour (or, indeed, with common repeated patterns) the method will build up a dictionary of useful codes which will achieve high compression: the extra bits in the length of each code are more than compensated for by the fact that a code can represent a long string of characters. The 7-bit codes in our example will in practice have to be packed into whole bytes, which is tedious but straightforward. Our large image file of characters is compressed into a much smaller file of codes, which can be saved to disk or sent over the network.
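The packing is worth a quick sketch. One common convention (the one the GIF format uses) fills each output byte starting from its least significant bit; this is an illustrative sketch under that assumption, with names of my own:

```python
def pack_codes(codes, code_size=7):
    # Pack codes of code_size bits each into a byte string,
    # least-significant bits first.
    bits = 0        # bit accumulator, LSB-aligned
    nbits = 0       # number of bits currently in the accumulator
    packed = bytearray()
    for code in codes:
        bits |= code << nbits      # append the code above the existing bits
        nbits += code_size
        while nbits >= 8:          # emit every complete byte
            packed.append(bits & 0xFF)
            bits >>= 8
            nbits -= 8
    if nbits:
        packed.append(bits & 0xFF) # flush any leftover bits, zero-padded
    return bytes(packed)
```

Four 7-bit codes make 28 bits, which pack into four bytes with the last byte only partly used.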
But, you are probably wondering, what about the dictionary? Surely any program which is going to use this compressed file, perhaps to display it on a screen, needs to be sent the dictionary so it can interpret what all these user-defined codes mean? That’s going to be quite a size – and you would think that it has to be sent ahead of the data it is going to be used to expand.
The punch line, the beautiful cunning of the method, is that you don’t have to. The sequence of compressed codes contains in itself enough information for the interpreting program, the decoder, to rebuild the dictionary on the fly, as it processes the data.
Notice that at step 4 in the encoding process a code is output and a new code is defined. So whenever the decoder reads a code, it must also define a code in its own dictionary. This is always the next available user code: in our example the first code to be defined will be x12, then x13, and so on. The string it matches will be one character longer than that of the code just read, and its first characters will be those of that code; only the final one is unknown. And that will be the first character of whatever string is read next.
So the decoding procedure is also very simple:

1. Input a code.
2. Except for the very first code, set the final character of your incomplete dictionary definition to be the first character of this code’s string.
3. Process this code – plot the characters on the screen, or output them to the expanded file, as appropriate.
4. Create a new dictionary entry for the next available user code, with a string one element longer than the string for this code, filling all elements but the last from the code you’ve just read.
In our example, we first read the code x03 which is predefined as <3>. We define a dictionary entry for code x12, the first user code, as <3?>. The second code is x12 which enables us to complete this definition as <33>, while also creating a dictionary entry for x13 as <33?>. And so it continues.
Notice how in this instance the code x12 is being used to define itself, which is why step 3 has to come after step 2. The code is not completely defined, but the important first character is.
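A matching decoder sketch in Python (same assumptions as the encoder sketch; the handling of a code that isn’t yet in the dictionary is exactly the self-defining case just described):

```python
CLEAR, STOP = 0x10, 0x11

def lzw_decode(codes):
    # Predefined single-character strings, as in the encoder.
    dictionary = {c: (c,) for c in range(16)}
    next_code = 0x12
    output = []
    previous = None   # string of the previously read code
    for code in codes:
        if code == STOP:
            break
        if code in dictionary:
            entry = dictionary[code]
        else:
            # The code being defined is used immediately (like x12 above),
            # so its unknown final character equals its first character.
            entry = previous + (previous[0],)
        output.extend(entry)          # step 3: process the decoded string
        if previous is not None:
            # Steps 2 and 4: complete the pending definition with the
            # first character of the string just decoded.
            dictionary[next_code] = previous + (entry[0],)
            next_code += 1
        previous = entry
    return output
```

Feeding it the codes from the example – x03, x12, x13, STOP – reproduces the six blue pixels, with the decoder rebuilding the definitions of x12 and x13 along the way.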
What happens if you run out of possible user codes? Maybe the picture is large or complicated and the 110 codes we get from 7 bits are not enough. Of course with careful planning you will have allowed enough space, but programmers are not infallible. There are two options for dealing with this.
The simple choice is to use the CLEAR code. You flush all the user-defined codes from the dictionary and start over. This may be appropriate if your blue/white sky at the top of the picture becomes a green/brown landscape towards the bottom. The decoder receives the CLEAR code and likewise flushes its dictionary and builds a new one.
The better choice is to expand the code size. When all the 7-bit codes are used up, the encoder switches to producing 8-bit codes on the output. The decoder can recognise this: if all the codes are used up and no CLEAR code has been sent, it assumes that subsequent input codes are 1 bit longer. This involves a little more programming, but it’s usually worth the effort.
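The switch point is easy to compute: both sides widen the code as soon as the next free code no longer fits in the current width. A tiny helper sketch (the 12-bit cap is the limit the GIF format imposes; the function name is my own):

```python
def code_size_for(next_code, initial_size=7, max_size=12):
    # Return the code width needed when next_code is the next free code:
    # widen by one bit whenever the code no longer fits, up to max_size.
    size = initial_size
    while next_code >= (1 << size) and size < max_size:
        size += 1
    return size
```

So x7F still fits in 7 bits, but defining x80 pushes both encoder and decoder to 8-bit codes.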
LZW stands for Lempel, Ziv and Welch, by the way. These three developed the method back in the dark ages of the 1980s. They patented it – let’s hope they made lots of money, they deserve it – but the patents have now expired, leaving us free to use their nice algorithm without worrying about getting sued.