Just a trim off the top

With luck this will be the final post on the Bellard-LZSS compressor, allowing us to sign off on the last piece of the puzzle required to write a PIC encoder for the Railroad Tycoon Deluxe (RRDX) variant of the MicroProse PIC file format, PIC93. Last time we found a few bugs and some implementation variances that made our output differ from the reference file, and we finished with our primary test file passing a full validation. However, a quick look at a few other files showed that they were still failing. So in this post we pick up where the last one left off and try to find whatever other bugs and variances remain.

Getting our bearings

The first thing to do is determine the current state of things. I looked at a couple of files at the end of the last post; now it is time to look at our whole small sample set. I started by making a set of new test scripts based on what we wrote previously for testing the LZW compressor / PIC88 encoder. I won’t detail the scripts here, as they are essentially the same with just the program names and file extensions changed. The scripts let us quickly test all matching files in a given directory (or directory tree), and they produced the following results for the testing directory we’ve been working with.

  LABS-3.LZSS: a078e2a3c16ecbdff62a242ad2094219  [FAIL]
  LABS-2.LZSS: 3dacf6a88ca588bacbfed2c20a31b341  [FAIL]
  LABS-1.LZSS: 3dacf6a88ca588bacbfed2c20a31b341  [FAIL]
DIFFS0-0.LZSS: aa04f99c36b09d120976387bc401d012  [FAIL]
  LABS-0.LZSS: a2ec6535216d220678fc648b9447bcd4  [pass]

Well that’s not good, we’re failing on ALL the other files. Let’s start with the sister LABS-x files to the one we’ve been working with so far. It amazes me that we just happened to start with the one “easy” file. I do have a suspicion about the cause, and hinted at it in my last post, but let’s see what is actually going on.

The first thing I noticed is that three of the reference files (LABS-1, LABS-2 and LABS-3) are one byte longer than the versions we generate, while LABS-0, the one that passes, is the same size. Running a compare gives the following. The size difference certainly aligns with my suspicions. Also note that the reference files all have an even size.

Reference Files:
-rw-r--r--  2376 27 Jun 12:32 val/LABS-0.LZSS
-rw-r--r--  1888 27 Jun 12:32 val/LABS-1.LZSS
-rw-r--r--  1888 27 Jun 12:32 val/LABS-2.LZSS
-rw-r--r--   428 27 Jun 12:32 val/LABS-3.LZSS

Generated Files:
-rw-r--r--  2376  4 Jul 12:20 val/LABS-0.RLZ
-rw-r--r--  1887  4 Jul 12:20 val/LABS-1.RLZ
-rw-r--r--  1887  4 Jul 12:20 val/LABS-2.RLZ
-rw-r--r--   427  4 Jul 12:20 val/LABS-3.RLZ

Compare Results:
cmp: EOF on val/LABS-1.RLZ
cmp: EOF on val/LABS-2.RLZ
cmp: EOF on val/LABS-3.RLZ

So the difference is at the very end of the file, which strongly supports my suspicions. Looking there I can see what appears to be an extra byte of data after the compressed stream EOF marker of 00 f0 00. As I had suspected, this is likely just an uninitialized padding byte. If you remember, these files are individual compressed planes extracted from a PIC93 file, and the format may well want word alignment within the file (which makes sense). I will have to make a note of that for when we do write a PIC93 encoder.

File: val/LABS-1.LZSS  [1888 bytes]
Offset    x0 x1 x2 x3 x4 x5 x6 x7 x8 x9 xA xB xC xD xE xF  Decoded Text
0000000x: 55 55 FF FF F8 FC FF F8 FC FF F8 FC FF F8 FC FF  U U · · · · · · · · · · · · · ·
0000001x: F8 FC FF F8 FC FF F8 FC 55 55 FF F8 FC FF F8 FC  · · · · · · · · U U · · · · · ·
0000002x: FF F8 FC FF F8 FC FF F8 FC FF F8 FC FF F8 FC FF  · · · · · · · · · · · · · · · ·
0000003x: F8 FC BB F7 FF F8 EC E0 B0 F8 4E 00 3F B1 F8 4C  · · · · · · · · · · N · ? · · L
        ⋮                                                 ⋮
0000073x: FC FF F8 FC FF F8 FC FF F8 FC AA AA FF F8 FC FF  · · · · · · · · · · · · · · · ·
0000074x: F8 FC FF F8 FC FF F8 FC FF F8 FC FF F8 FC FF F8  · · · · · · · · · · · · · · · ·
0000075x: FC FF F8 FC 0A 00 FF F8 FC FF F8 C8 00 F0 00 AC  · · · · · · · · · · · · · · · ·

00 F0 00: LZSS EOF marker

If that is right, removing the padding byte from the LABS-x.LZSS files should fix them. It seems likely that the files (or rather the planes within the file) are indeed always padded out to a 16-bit boundary, meaning any odd-length compressed plane has a single byte appended to it. The evidence fits: the reference files always have an even byte length, and the ones that appear to have padding trim down to an odd length. Running an untrimmed file through the decompressor also shows that the extra byte is never processed: it exits with one byte remaining in the input file, so the byte should be safe to remove.

Bellard-LZSS Decompressor
Decompressing: 'val/LABS-1.LZSS'	File Size: 1888 bytes
Creating: 'val/LABS-1.OUT'
consumed: 1887 / 1888
wrote 32000 bytes

So with that, let’s go and trim the padding off each failing file where there appears to be an extra byte after the 00 f0 00 EOF sequence of the LZSS stream.

-rw-r--r--  2376 27 Jun 12:32 val/LABS-0.LZSS
-rw-r--r--  1887  4 Jul 12:33 val/LABS-1.LZSS
-rw-r--r--  1887  4 Jul 12:34 val/LABS-2.LZSS
-rw-r--r--   427  4 Jul 12:35 val/LABS-3.LZSS

  LABS-3.LZSS: 7ede3c5c0e4a147087ddc0e0d6fdc186  [pass]
  LABS-2.LZSS: 73fe4fca0d51771dd004252920e484ef  [pass]
  LABS-1.LZSS: 73fe4fca0d51771dd004252920e484ef  [pass]
DIFFS0-0.LZSS: aa04f99c36b09d120976387bc401d012  [FAIL]
  LABS-0.LZSS: a2ec6535216d220678fc648b9447bcd4  [pass]

Well, that seems to strongly support the padding byte theory/solution. Unfortunately it didn’t solve all our failures. What’s going on with that DIFFS0-0.LZSS file?
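
The padding theory cuts both ways: the reference planes can be trimmed, or our own output can be padded. Below is a minimal C sketch of both. It assumes, as the hex dump above suggests but this small sample does not prove, that the stream always ends with the byte-aligned marker 00 F0 00 and that at most one pad byte follows it. The names `padded_len` and `trimmed_len` are hypothetical and are not part of the compressor.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* EOF marker bytes as they appear at the end of the hex dump above. */
static const uint8_t LZSS_EOF[3] = { 0x00, 0xF0, 0x00 };

/* Length of a plane after rounding up to a 16-bit boundary
 * (assumption: PIC93 word-aligns each compressed plane). */
size_t padded_len(size_t stream_len)
{
    return stream_len + (stream_len & 1u);
}

/* Length of a plane with a suspected pad byte removed.
 * Heuristic only: returns len unchanged unless the buffer has an even
 * length and the EOF marker is followed by exactly one more byte. */
size_t trimmed_len(const uint8_t *buf, size_t len)
{
    /* Already ends on the marker: nothing to trim. */
    if (len >= 3 && memcmp(buf + len - 3, LZSS_EOF, 3) == 0) {
        return len;
    }
    /* Marker followed by a single trailing byte on an even-length file. */
    if (len >= 4 && (len % 2 == 0) &&
        memcmp(buf + len - 4, LZSS_EOF, 3) == 0) {
        return len - 1;
    }
    return len;
}
```

For the 1888-byte LABS-1 reference, whose last five bytes are C8 00 F0 00 AC, `trimmed_len` would return 1887, and `padded_len(1887)` gives back 1888. A file that matches neither case is left alone and still needs a manual look.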


Implementation variances

Looking at the file lengths of our reference file and the file we generated, we are again exactly one byte off, but there does not appear to be any padding after the EOF marker. So we need to look at where the difference is happening.

-rw-r--r--  24170 28 Jun 00:37 val/DIFFS0-0.LZSS
-rw-r--r--  24169  4 Jul 13:26 val/DIFFS0-0.RLZ

cmp: val/DIFFS0-0.LZSS val/DIFFS0-0.RLZ differ: char 665, line 2

Okay, so the difference is in the middle of the file, which is more troublesome. Sadly cmp doesn’t help us much beyond the position (line numbers are meaningless in binary files). Let’s run our instrumented version of the decompressor on the reference and on our generated file and compare the resultant logs.
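
As an aside on the cmp output above: cmp counts bytes starting from 1, so "char 665" corresponds to offset 664 (0x298) in a zero-based hex dump. A small C helper that reports the zero-based offset directly might look like the sketch below; the name `first_mismatch` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the 0-based offset of the first byte where a and b differ,
 * or n if the first n bytes are identical. The caller passes the
 * shorter of the two file lengths as n. */
size_t first_mismatch(const uint8_t *a, const uint8_t *b, size_t n)
{
    size_t i = 0;
    while (i < n && a[i] == b[i]) {
        i++;
    }
    return i;
}
```

A return value equal to n on files of different length is the same situation cmp reports as "EOF on" the shorter file.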

Diff of decompression logs. Left is reference, right is the newly compressed file

Well that’s interesting. Looks like an edge case, and again possibly an implementation error. The distance value here is exactly 256 bytes. Based on the pointer sizes the format uses, this is reachable via a short pointer, which is what my code generates. For some reason, however, the reference uses a long pointer here. I wonder if MicroProse coded this limit as 255 instead of 256. Should be an easy enough fix.

        uint16_t distance = (cur - match_pos) & TABLEMSK;
        if((distance <= 255) && (match_len <= 5)) { // short pointer range
            if(match_len < THRESHOLD) { // too short to encode, send as literal

Changing the one value from 256 to 255 in the distance comparison on the second line above (shown here with the change already made) should do the trick. Nothing to do but compile it and give it a try.

-rw-r--r--  24170 28 Jun 00:37 val/DIFFS0-0.LZSS
-rw-r--r--  24170  4 Jul 13:43 val/DIFFS0-0.RLZ

cmp: no differences detected

MD5 (val/DIFFS0-0.LZSS) = aa04f99c36b09d120976387bc401d012
MD5 (val/DIFFS0-0.RLZ)  = aa04f99c36b09d120976387bc401d012

Perfect. Another easy fix. Now the question is, did it affect our previous results?
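
To make the boundary concrete, here is a minimal C sketch of the short/long decision. The limit of 5 on the match length comes from the snippet above; the `THRESHOLD` value of 2, the `token_kind` enum, the `choose_token` name, and the decision to treat every too-short match as a literal are assumptions for illustration, since the long-pointer branch of the real code is not shown here.

```c
#include <stdint.h>

enum token_kind { TOKEN_LITERAL, TOKEN_SHORT_PTR, TOKEN_LONG_PTR };

#define THRESHOLD     2  /* hypothetical minimum match length worth encoding */
#define SHORT_MAX_LEN 5  /* longest match a short pointer can carry (from the snippet) */

/* Picks the token type for a match. short_max_dist is passed in so the
 * 255 and 256 variants can be compared side by side. */
enum token_kind choose_token(uint16_t distance, uint16_t match_len,
                             uint16_t short_max_dist)
{
    if (match_len < THRESHOLD) {
        return TOKEN_LITERAL;    /* too short to encode as a pointer */
    }
    if (distance <= short_max_dist && match_len <= SHORT_MAX_LEN) {
        return TOKEN_SHORT_PTR;  /* fits the short pointer range */
    }
    return TOKEN_LONG_PTR;       /* everything else goes long */
}
```

With a limit of 256, a match at distance 256 and length 3 is classified as a short pointer, which is what our encoder emitted. With a limit of 255 the same match becomes a long pointer, which is what the reference file contains.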


Validating our results

Hopefully we’ve seen all the implementation specifics by now. Time to validate what we have against a larger set of files. Though I don’t have too many generated at this time, so the sample size will still be relatively small.

   LABS-3.LZSS: 7ede3c5c0e4a147087ddc0e0d6fdc186  [pass]
   LABS-2.LZSS: 73fe4fca0d51771dd004252920e484ef  [pass]
   LABS-1.LZSS: 73fe4fca0d51771dd004252920e484ef  [pass]
 DIFFS0-0.LZSS: aa04f99c36b09d120976387bc401d012  [pass]
   LABS-0.LZSS: a2ec6535216d220678fc648b9447bcd4  [pass]

For the small sample size, it looks good. Let’s try a bigger sample, though it will probably take some work due to the padding issue from before. Since this is a different directory, I expect all the same failures as before, as those copies have not been trimmed yet.

 STONBRDG-2.LZSS: ed005120a725a53df18cc7b31e62471e  [FAIL]
      PAL-1.LZSS: b0b8969d3350a2c3f8412e974b64f314  [FAIL]
      PAL-0.LZSS: b0b8969d3350a2c3f8412e974b64f314  [FAIL]
 STONBRDG-3.LZSS: 407380a7bb3b6240b01997986caa0d00  [FAIL]
   DIFFS0-3.LZSS: 49a7541dc400b857df7b7b56a3b53772  [FAIL]
     LABS-3.LZSS: a078e2a3c16ecbdff62a242ad2094219  [FAIL]
    TLOGO-0.LZSS: 37b207d883cef9dff3010dc9ea84813c  [FAIL]
    TLOGO-1.LZSS: 65d51b17883a60b23dabf9eb55cd61c9  [FAIL]
     LABS-2.LZSS: 3dacf6a88ca588bacbfed2c20a31b341  [FAIL]
   DIFFS0-2.LZSS: f0d5dfadc5af2d7b3d438aae9a0b82a6  [pass]
    TLOGO-2.LZSS: dea0967bdd7080aad81cba8cab031d88  [pass]
     LABS-1.LZSS: 3dacf6a88ca588bacbfed2c20a31b341  [FAIL]
   DIFFS0-1.LZSS: c280d00dbaaf60aeebf6e8090ef7af2b  [FAIL]
   DIFFS0-0.LZSS: aa04f99c36b09d120976387bc401d012  [pass]
     LABS-0.LZSS: a2ec6535216d220678fc648b9447bcd4  [pass]
    TLOGO-3.LZSS: c29eaeb0b4e098e1e3dc8a026397bd88  [pass]
      PAL-3.LZSS: b0b8969d3350a2c3f8412e974b64f314  [FAIL]
 STONBRDG-0.LZSS: cf6a01635d2677e8dc73458de6727737  [FAIL]
 STONBRDG-1.LZSS: b55b4a3f2b95d2eb1a8ffbd311d1ce55  [FAIL]
      PAL-2.LZSS: b0b8969d3350a2c3f8412e974b64f314  [FAIL]

Ouch, that’s a lot more than I expected, but at least the ones I was expecting to fail are still failing. Sadly there is no easy/automated way to trim the necessary files (or to pad our compressed output to match), as the extra bytes are random in value. The only way is to inspect each file, locate the compressed stream EOF marker, and trim off anything that follows (though I’m sure you grep geniuses out there could do it… I am not a grep genius). So after some manual file editing we end up with the following.

 STONBRDG-2.LZSS: cff6b40434e480a1cc404912a8b41fb1  [FAIL]
      PAL-1.LZSS: 621953c4d0de754813b0610c1743450e  [pass]
      PAL-0.LZSS: 621953c4d0de754813b0610c1743450e  [pass]
 STONBRDG-3.LZSS: 407380a7bb3b6240b01997986caa0d00  [FAIL]
   DIFFS0-3.LZSS: f7c292a6262fa45142ee428a9fd5821a  [pass]
     LABS-3.LZSS: 7ede3c5c0e4a147087ddc0e0d6fdc186  [pass]
    TLOGO-0.LZSS: e8f33583bd2479f7c1b70adb3405482b  [pass]
    TLOGO-1.LZSS: 2ca3528f3f2fb892fd5ea4b42997af63  [pass]
     LABS-2.LZSS: 73fe4fca0d51771dd004252920e484ef  [pass]
   DIFFS0-2.LZSS: f0d5dfadc5af2d7b3d438aae9a0b82a6  [pass]
    TLOGO-2.LZSS: dea0967bdd7080aad81cba8cab031d88  [pass]
     LABS-1.LZSS: 73fe4fca0d51771dd004252920e484ef  [pass]
   DIFFS0-1.LZSS: 80beaf40384424c58a4b4a2880e4bc2d  [pass]
   DIFFS0-0.LZSS: aa04f99c36b09d120976387bc401d012  [pass]
     LABS-0.LZSS: a2ec6535216d220678fc648b9447bcd4  [pass]
    TLOGO-3.LZSS: c29eaeb0b4e098e1e3dc8a026397bd88  [pass]
      PAL-3.LZSS: 621953c4d0de754813b0610c1743450e  [pass]
 STONBRDG-0.LZSS: 28ed473fa81cd13e095d41103606cccf  [pass]
 STONBRDG-1.LZSS: f0f94a54dadde16faac7273edc99bd30  [pass]
      PAL-2.LZSS: 621953c4d0de754813b0610c1743450e  [pass]

Okay, much better but still a couple of failures. Looks like we may have a little more work to do here.


Digging deeper

Let’s start with the first of our continuing failures, STONBRDG-2. We’ll follow the same process as before, using our instrumented version of the decompressor to generate logs of decompressing the reference and the version we generated.

-rw-r--r--  15051  4 Jul 14:21 test-p93/STONBRDG-2.LZSS
-rw-r--r--  15051  4 Jul 15:06 test-p93/STONBRDG-2.RLZ

MD5 (test-p93/STONBRDG-2.LZSS) = cff6b40434e480a1cc404912a8b41fb1
MD5 (test-p93/STONBRDG-2.RLZ)  = 9f93672804a342a829d9b9a59e788bfe

cmp: test-p93/STONBRDG-2.LZSS test-p93/STONBRDG-2.RLZ differ: char 15046, line 104

MD5 (test-p93/STONBRDG-2.PLN)  = 7196d8f79a915cae192cb1a07ad8162e
MD5 (test-p93/STONBRDG-2.OUT)  = 7196d8f79a915cae192cb1a07ad8162e

Well, the good news here is that the decompressed output matches the reference, so we are still generating a valid compressed stream. It is also interesting that the two compressed files are exactly the same size. Looking at the decompression logs we can see the following.

Diff of decompression logs. Left is reference, right is the newly compressed file

Hmm, this one is tougher, as the MicroProse reference is using a different offset than we are here, one that is not the closest in distance. That goes against what we’ve previously seen of it using the nearest match. Let’s see if this is also what is happening with the other STONBRDG file that is failing.

-rw-r--r--  10558  4 Jun 00:36 test-p93/STONBRDG-3.LZSS
-rw-r--r--  10558  4 Jul 15:22 test-p93/STONBRDG-3.RLZ

MD5 (test-p93/STONBRDG-3.LZSS) = 407380a7bb3b6240b01997986caa0d00
MD5 (test-p93/STONBRDG-3.RLZ)  = dcd397bf363bfc9db68688783e569c0e

cmp: test-p93/STONBRDG-3.LZSS test-p93/STONBRDG-3.RLZ differ: char 10553, line 92

MD5 (test-p93/STONBRDG-3.PLN)  = 7572eb9a4d1fc5cceaed41b236e9099b
MD5 (test-p93/STONBRDG-3.OUT)  = 7572eb9a4d1fc5cceaed41b236e9099b

Diff of decompression logs. Left is reference, right is the newly compressed file

Sure enough, we see the exact same pattern here. Looking at the compression logs, this variance happens when both the window size and the maximum string length are decaying, that is, at the very end of the data. I’m not sure what logic they are using that causes a more distant matching string to be chosen. After trying a few different things in the code, I think this may be a variance we simply need to accept. Perhaps the reference compressor uses a tree-style lookup, which could explain the odd offset; it is just strange that it only shows up at the end. Regardless, it does not appear to be a critical difference. It’s an odd edge case, and our stream is still correct for the data passed in, just not identical to the MicroProse reference. I don’t anticipate this difference being a problem for the MicroProse PIC93 decoder in RRDX, so I don’t think it’s worth the effort at this time to try to get an exact match here.

 STONBRDG-2.LZSS: cff6b40434e480a1cc404912a8b41fb1  [WARN] Decompressed Output: [pass]
      PAL-1.LZSS: 621953c4d0de754813b0610c1743450e  [pass]
      PAL-0.LZSS: 621953c4d0de754813b0610c1743450e  [pass]
 STONBRDG-3.LZSS: 407380a7bb3b6240b01997986caa0d00  [WARN] Decompressed Output: [pass]
   DIFFS0-3.LZSS: f7c292a6262fa45142ee428a9fd5821a  [pass]
     LABS-3.LZSS: 7ede3c5c0e4a147087ddc0e0d6fdc186  [pass]
    TLOGO-0.LZSS: e8f33583bd2479f7c1b70adb3405482b  [pass]
    TLOGO-1.LZSS: 2ca3528f3f2fb892fd5ea4b42997af63  [pass]
     LABS-2.LZSS: 73fe4fca0d51771dd004252920e484ef  [pass]
   DIFFS0-2.LZSS: f0d5dfadc5af2d7b3d438aae9a0b82a6  [pass]
    TLOGO-2.LZSS: dea0967bdd7080aad81cba8cab031d88  [pass]
     LABS-1.LZSS: 73fe4fca0d51771dd004252920e484ef  [pass]
   DIFFS0-1.LZSS: 80beaf40384424c58a4b4a2880e4bc2d  [pass]
   DIFFS0-0.LZSS: aa04f99c36b09d120976387bc401d012  [pass]
     LABS-0.LZSS: a2ec6535216d220678fc648b9447bcd4  [pass]
    TLOGO-3.LZSS: c29eaeb0b4e098e1e3dc8a026397bd88  [pass]
      PAL-3.LZSS: 621953c4d0de754813b0610c1743450e  [pass]
 STONBRDG-0.LZSS: 28ed473fa81cd13e095d41103606cccf  [pass]
 STONBRDG-1.LZSS: f0f94a54dadde16faac7273edc99bd30  [pass]
      PAL-2.LZSS: 621953c4d0de754813b0610c1743450e  [pass]
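
The three statuses in the table above boil down to a two-step comparison: first the compressed streams, then the decompressed output. The test scripts do this with checksums; the C sketch below expresses the same idea on in-memory buffers. The `verdict` enum and the `classify` and `same_bytes` names are hypothetical and are not taken from the scripts.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum verdict { VERDICT_PASS, VERDICT_WARN, VERDICT_FAIL };

/* Nonzero when both buffers have the same length and the same contents. */
static int same_bytes(const uint8_t *a, size_t a_len,
                      const uint8_t *b, size_t b_len)
{
    return a_len == b_len && (a_len == 0 || memcmp(a, b, a_len) == 0);
}

/* PASS: compressed streams are byte-identical.
 * WARN: streams differ but decompress to the same data.
 * FAIL: the decompressed data differs as well. */
enum verdict classify(const uint8_t *ref_lz, size_t ref_lz_len,
                      const uint8_t *new_lz, size_t new_lz_len,
                      const uint8_t *ref_out, size_t ref_out_len,
                      const uint8_t *new_out, size_t new_out_len)
{
    if (same_bytes(ref_lz, ref_lz_len, new_lz, new_lz_len)) {
        return VERDICT_PASS;
    }
    if (same_bytes(ref_out, ref_out_len, new_out, new_out_len)) {
        return VERDICT_WARN;
    }
    return VERDICT_FAIL;
}
```

Under this scheme the two STONBRDG planes land in the WARN bucket: same compressed size, different bytes, identical decompressed output.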

Sadly not a fully satisfying conclusion here. I would have loved to have pure passes for all files, but sometimes that is just not possible, or practical. I think it’s fairly safe to accept this particular variance. If for whatever reason it proves to be a problem down the road, we’ve noted it here and can always revisit it then. For now I think we have a valid compressor for the PIC93 variant of the MicroProse PIC file format used with Railroad Tycoon Deluxe (RRDX). So with that we’ll wrap this one up. We now have all the pieces we need to put together a PIC93 encoder in our toolbox, but that will be the subject for another post.
