LZW compression is brilliant

Most of the text and image files we handle are full of redundant information, and can (and should) be made much smaller without losing anything, which cuts down the storage needed or the time taken to transmit them, or both.

One way to do this is to analyse the file, notice commonly repeated patterns, and replace them by specific codes. But to do this you need to have the whole file, or a representative chunk of it, available before you start processing it. Sometimes you have a stream of characters coming at you and you need to compress them on the fly. The LZW algorithm provides a really neat way of doing this. I’ve been digging into this (for a program to create gif files, as it happens) and now I understand it I’m so impressed I want to share it.

The input characters will have a certain size. For ASCII text this is 8 bits. For a graphics you might use 4 bits to cover 16 possible colours for a simple image, or more if you’re being artistic. The codes output will each have a size which must be bigger than the input size. Let’s suppose, to keep it simple, you’re compressing a stream of 4 bit colours and using 7 bits each for the codes.

The coder has an input stream of characters, an output stream, a buffer, which is a string of input characters and is initially empty, and a dictionary which matches codes to character strings. This is initialised with predefined codes for strings of just 1 character, which are just the same with additional zeros. So in our example the characters 0 thru 15 are represented by the codes x00 thru x0F. Also code x10 is known as CLEAR and x11 is known as STOP. The remaining 238 codes, x12 thru x7F, can be defined by the user.

The coding algorithm is very simple

  1. Get the next character from the input
  2. Consider the string formed by appending this character to the buffer. Is it in the dictionary?
  3. If so, add the character to the string, and go back to step 1
  4. If not, add this string to the dictionary using the next available user code. Output the code corresponding to the existing buffer, clear the buffer and replace it by the single character, and go back to step 1
  5. When there are no more input characters, output the code for the string in the buffer and then the `STOP code

It may seem odd to output the old code at step 4 rather than the new one you’ve just defined, but in fact this is very cunning, as we shall see.

Let’s see how this works in practice. Suppose we are encoding a picture which starts at the top with a lot of blue pixels of sky, and occasional white cloud pixels. Suppose that blue is colour 3 and white is colour 1.

  • The first pixel is 3, blue. The buffer is empty, so we consider the string <3>. Is it in the dictionary? Yes, it is one of the single-character predefined codes, so we just add it to the buffer.
  • We grab the next input, which is another 3. Is <33> in the dictionary? No, so we add it to the dictionary as code x12, the first user-defined code, output the code x03 for the buffer and replace it with the second character, <3>
  • Now we get the 3rd input, another 3. Is <33> in the dictionary? Yes, we’ve just added it. So we put <33> in the buffer.
  • The next input is another 3. <333> is not in the dictionary so we add it as x13, output the x12 code for <33>, and revert to <3> in the buffer.
  • More blue sky characters take the buffer to <33> and <333> but at <3333> we have to define a new code, output x13 for <333> and the buffer reverts to the single character <3>

So as the monotonous blue sky continues, we output codes representing increasingly long strings of blue pixels,. Eventually we encounter a character 1, the start of a white cloud. We output the code for <333…33>, define a (probably useless) code for <333..331>, stick the <1> in the buffer, and start defining codes for <11>, <111> and so on. “Emerging from the cloud we are back with strings of 3s for which we already have relevant codes in the dictionary, we don’t have to re-learn them.

So this is fine. For images with large areas of the same colour (or, indeed, for common repeated patterns) the method will build up a dictionary of useful codes which will achieve high compression: the extra bits in the length of the code are more than compensated for by the fact that a code represents long strings of characters. The 7 bit codes in our example will in practice have to be packed into conventional 32 bit computer words, which is tedious but straightforward. Our large image file of characters is compressed into a much smaller file of codes, which can be saved to disk or sent over the network.

But, you are probably wondering, what about the dictionary? Surely any program which is going to use this compressed file, perhaps to display it on a screen, needs to be sent the dictionary so it can interpret what all these user-defined codes mean? That’s going to be quite a size – and you would think that the dictionary has to be sent ahead of the data which it is going to be expanded.

The punch line, the beautiful cunning of the method, is that you don’t have to. The sequence of compressed codes contains in itself enough information for the interpreting program, the decoder, to rebuild the dictionary on the fly, as it processes the data.

Notice that at step 4 in the encoding process a code is output and a new code is defined. So whenever the decoder reads a code, it has to define a code for the input dictionary. This code is always the next available user code. So in our example the first code to be defined will be x12, then x13, and so on. The string it matches will be one character longer than that of the code just input, and the first characters will be those of that code: only the final one is unknown. And that will be the first character of whatever string is read next.

`So the decoding procedure is also very simple

  1. Input a code
  2. Except for the very first code, set the final character of your incomplete dictionary definition to be the first character from this code
  3. Process this code – plot the characters on the screen, or output them to the expanded file, as appropriate
  4. Create a new dictionary entry for the next available user code, with string length one element longer than the string for this code, and fill all elements but the last from the code you’ve just read.


In our example, we first read the code x03 which is predefined as <3>. We define a dictionary entry for code x12, the first user code, as <3?>. The second code is x12 which enables us to complete this definition as <33>, while also creating a dictionary entry for x13 as <33?>. And so it continues.

Notice how in this instance the code x12 is being used to define itself . Which is why step 3 has to come after step 2. The code is not completely defined, but the important first character is.

`What happens if you run out of possible user codes? Maybe the picture is large or complicated and the 238 codes we get from 7 bits is not enough. Of course with careful planning you will have allowed enough space, but programmers are not always infallible. There are two options for dealing with this.

The simple choice is to use the CLEAR code. You flush all the user defined codes from the dictionary and start over. This may be appropriate if your blue/white sky at the top of the picture becomes a green/brown landscape towards the bottom. The decoder receives the code and likewise flushes its dictionary and builds a new one.

The better choice is to expand the code size. When all the 7 bit codes are used up, the encoder switches to producing 8 bit codes on the output. The decoder can recognise this: if all the codes are used up and there has been no CLEAR code sent it will assume that subsequent input codes are 1 bit longer. This involves a little more programming, but it’s usually worth the effort.

LZW stands for Lempel, Ziv and Welch, by the way. These three developed the method back in the dark ages of the 1980’s. They patented it – let’s hope they made lots of money, they deserve it – but these have now expired, leaving us free to use their nice algorithm without worries about getting sued.


Why computing can be complicated

It is amazing how simple computation can have profound complexity once you start digging.

Let’s take a simple example: finding the average (arithmetic mean) of a set of numbers. It’s the sort of thing that often turns up in real life, as well as in class exercises. Working from scratch you would write a program like (using C as an example: Python or Matlab or other languages would be very similar)

float sum=0;
for(int j=0;j<n;j++){
   sum += x[j];
float mean=sum/n;

which will compile and run and give the right answer until one day, eventually, you will spot it giving an answer that is wrong. (Well, if you’re lucky you will spot it: if you’re not then its wrong answer could have bad consequences.)

What’s the problem? You won’t find the answer by looking at the code.

The float type indicates that 32 bits are used, shared between a mantissa and and exponent and a sign, and in the usual IEE754 format that gives 24 bits of binary accuracy, corresponding to 7 to 8 decimal places. Which in most cases is plenty.

To help see what’s going on, suppose the computer worked in base 10 rather than base 2, and used 6 digits. So the number 123456 would be stored as 1.23456 x 105 . Now, in that program loop the sum gets bigger and bigger as the values are added. Take a simple case where the values all just happen to be 1.0. Then after you have worked through 1,000,000 of them, the sum is 100000, stored as 1.00000 x 106 . All fine so far. But now add the next value. The sum should be 1000001, but you only have 6 digits so this is also stored as 1.00000 x 106 . Ouch – but the sum is still accurate to 1 part in 106 . But when you add the next value, the same thing happens. If you add 2 million numbers, all ones, the program will tell you that their average is 0.5. Which is not accurate to 1 part in 106 , not nearly!

Going back to the usual but less transparent binary 24 bit precision, the same principles apply. If you add up millions of numbers to find the average, your answer can be seriously wrong. Using double precision gives 53 bit precision, roughly 16 decimal figures, which certainly reduces the problem but doesn’t eliminate it. The case we considered where the numbers are all the same is actually a best-case: if there is a spread in values then the smallest ones will be systematically discarded earlier.

And you’re quite likely to meet datasets with millions of entries. If not today then tomorrow. You may start by finding the mean height of the members of your computing class, for which the program above is fine, but you’ll soon be calculating the mean multiplicity of events in the LHC, or distances of galaxies in the Sloan Digital Sky Survey, or nationwide till receipts for Starbuck’s. And it will bite you.

Fortunately there is an easy remedy. Here’s the safe alternative

float mean=0;
for(int j=0;j<n;j++){
     mean += (x[j]-mean)/(j+1);

Which is actually one line shorter! The slightly inelegant (j+1) in the denominator arises because C arrays start from zero. Algebraically they are equivalent because

but numerically they are different and the trap is avoided. If you use the second code to average a sequence of 1.0 values, it will return an average of 1.0 forever.

So those (like me) who have once been bitten by the problem will routinely code using running averages rather than totals. Just to be safe. The trick is well known.

What is less well known is how to safely evaluate standard deviations. Here one hits a second problem. The algebra runs

where the n/(n-1) factor, Bessel’s correction, just compensates for the fact that the squared standard deviation or variance of a sample is a biassed estimator of that of the parent. We know how to calculate the mean safely, and we can calculate the mean square in the same way. However we then hit another problem if, as often happens, the mean is large compared to the standard deviation.

Suppose what we’ve got is approximately Gaussian (or normal, if you prefer) with a mean of 100 and a standard deviation of 1. Then the calculation in the right hand bracket will look like

10001 – 10000

which gives the correct value of 1. However we’ve put two five-digit numbers into the sum and got a single digit out. If we were working to 5 significant figures, we’re now only working to 1. If the mean were ~1000 rather than ~100 we’d lose two more. There’s a significant loss of precision here.

If the first rule is not to add two numbers of different magnitude, the second is not to subtract two numbers of similar magnitude. Following these rules is hard because an expression like x+y can be an addition or a subtraction depending on the signs of x and y.

This danger can be avoided by doing the calculation in two passes. On the first pass you calculate the mean, as before. On the second pass you calculate the mean of (x-μ)2 where the differences are sensible, of order of the standard deviation. If your data is in an array this is pretty easy to do, but if it’s being read from a file you have to close and re-open it – and if the values are coming from an online data acquisition system it’s not possible.

And there is a solution. It’s called the Welford Online Algorithm and the code can be written as a simple extension of the running-mean program above

 // Welford's algorithm
float mean=x[0];
float V=0;
for(int j=1;j<n;j++){
     float oldmean=mean;
     mean += (x[j]-mean)/(j+1);
     V += ((x[i]- mean)(x[i]-oldmean) - V)/j
float sigma=sqrt(V);

The subtractions and the additions are safe. The use of both the old and new values for the mean accounts algebraically, as Welford showed, for the change that the mean makes to the overall variance. The only differences from our original running average program are the need to keep track of both old and new values, and initially defining the mean as the first element (zero), so the loop starts at j=1, avoiding division by zero: the variance estimate from a single value is meaningless. (It might be good to add a check that n>1 to make it generally safe).

I had suspected such an algorithm should exist but, after searching for years, I only found it recently (thanks to Dr Manuel Schiller of Glasgow University). It’s beautiful and its useful and it deserves to be more widely known.

It is amazing how simple computation can have profound complexity once you start digging.