I completed the output file generation, including tail call optimisation, tail jump elimination, and inlining of repeats that do not provide any gain. The file size reductions (sizes in bytes) are as follows:
Track  Original  Compressed  Reduction  Repeats
01          348         244        30%       14
02         1827        1383        24%       93
03          552         469        15%       33
04         3899        2820        28%      329
05         2918        1662        43%      196
06         1196         776        35%       62
07         1023         360        65%       27
08         3116        1431        54%      200
09         1256        1040        17%       89
10         3121        1531        51%      192
11         4685        2748        41%      410
12         2652         901        66%      141
13         6320        3406        46%      562
14          950         774        19%       52
15         4835        3533        27%      445
16         1814         433        76%       26
17          758         690         9%       43
18         1395        1157        17%       85
19         7476        2399        68%      283
20         1169        1010        14%      100
21         5490        3863        30%      390
22          963         750        22%       55
23          486         386        21%       23
24         1758         576        67%       90
25          690         639         7%       47
26          403         388         4%        6
27          601         491        18%       12
Total     61701       35860        42%     4005
I also added a column with how many repeated sections were extracted from each file.
Since each repeat call takes 3 bytes in my implementation, encoding the call command in 2 bytes would save an additional ~4 kB. It would also be interesting to see whether the call command size could be reduced further, e.g. by using a small jump table and single-byte calls for the most frequently occurring repeats.
This is similar to JP vs JR and CALL vs RST. But this brings us back into the territory of compacting by tweaking the encoding, and with Z80 code optimisation the opportunities for that are limited since the instruction set is fixed.
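Still, a rough back-of-the-envelope sketch of the idea (Python, with made-up per-repeat call counts since I don’t have the real statistics in front of me, and ignoring the cost of the jump table itself):

# Hypothetical per-repeat call counts, most-called first; the real numbers
# would come from the compactor's own statistics.
call_counts = [120, 95, 80, 66, 51, 44, 38, 30, 25, 21, 17, 12, 9, 7, 5, 3]

def call_stream_bytes(counts, rst_slots=8, short_slots=64):
    # The most-called repeats get 1-byte RST-style calls via a small jump
    # table, the next tier gets 2-byte calls, the rest plain 3-byte CALLs.
    total = 0
    for rank, count in enumerate(sorted(counts, reverse=True)):
        size = 1 if rank < rst_slots else 2 if rank < rst_slots + short_slots else 3
        total += size * count
    return total

print(call_stream_bytes(call_counts), "bytes vs", 3 * sum(call_counts), "with 3-byte calls only")

With a distribution skewed like that, most calls would land in the 1- and 2-byte tiers, which is where the extra savings would come from.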
Cool! So, just to make sure I understand, this last table already includes the extra space for the "call/ret" (or "jp" in case of tail call), right? 42% is quite amazing!
As you mentioned, the compression of actual source code might be much smaller, since the programmer has already done part of this optimisation work, but it still looks very promising! I was also wondering if it could be turned into a general data compression scheme and how it would compare with pletter/aplib/exomizer in terms of compression rate and decompression speed.
Correct. The tail call optimisation and tail jump elimination are responsible for 2193 saved bytes.
Yes, this compression is also applicable to byte data; the challenge there is to come up with an efficient encoding of the repeats when the alphabet has 256 entries. What comes to mind initially is a fixed bit-oriented Huffman encoding, or LZEXE’s byte-oriented method of interleaving a literal-or-repeat command stream and a data stream. The compression ratio compared to e.g. Pletter would have to be tested; it depends a lot on the encoding, but it could be good.
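For illustration, a minimal sketch of the LZEXE-style variant (a toy format I made up for this post, not LZEXE’s actual encoding): a flag stream says literal-or-repeat, the literal bytes and 1-byte repeat indices live in a separate data stream, and repeats are expanded from a table of extracted sequences:

def decode(flags, data, table):
    # flags: one bit per command, 0 = copy a literal byte, 1 = expand a repeat.
    # data:  byte stream holding the literal bytes and 1-byte repeat indices.
    # table: the extracted repeats as already-decoded byte strings.
    out = bytearray()
    it = iter(data)
    for bit in flags:
        if bit == 0:
            out.append(next(it))          # literal byte
        else:
            out.extend(table[next(it)])   # repeat referenced by index
    return bytes(out)

repeats = [b"ABC"]
print(decode([0, 0, 1, 0], [65, 66, 0, 68], repeats))   # b'ABABCD'

A real encoding would also have to deal with nested repeats and with more repeats than fit in one byte, which is exactly where it gets interesting.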
I was also wondering if it could be turned into a general data compression scheme and how it would compare with pletter/aplib/exomizer in terms of compression rate and decompression speed.
The rate would not be so good as in LZ77, it would require more memory, and the speed would be comparable.
LZ77 encodes repetitions of a string as references to the previous appearance of the string. This is done by adding a pointer and a length to the stream. For example, ABCABCDABFCD could be encoded as: ABC, (0, 3), D, (0, 2), F, (5, 2). (The pointers are references to the already decompressed part of the stream, so it may not be immediately obvious that 5 points to where CD starts, but it's easy to see by examining the input string). In practice the pointers are typically encoded as negative offsets from the current position. This scheme only needs to "see" the current character or pair, therefore it can work with a very small input buffer.
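As a sketch, here is that example decoded mechanically (using absolute offsets as written above; real LZ77 variants use negative offsets from the current position and bit-packed fields):

def lz77_decode(tokens):
    # Tokens are either literal strings or (offset, length) pairs that
    # point into the already-decoded output.
    out = ""
    for t in tokens:
        if isinstance(t, tuple):
            offset, length = t
            for i in range(length):       # copying one symbol at a time allows overlaps
                out += out[offset + i]
        else:
            out += t
    return out

print(lz77_decode(["ABC", (0, 3), "D", (0, 2), "F", (5, 2)]))   # ABCABCDABFCD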
On the other hand, this method can only encode string terminations: it can't encode an arbitrary part of a previously seen string, only a portion that runs to that string's end (its RET marker). In the case above, I guess that one possible encoding would be AB, C, call (address of AB), CD, call (address of AB), F, call (address of CD). Another possible encoding would be: ABC, call (address of ABC), DABFCD. Unlike LZ77, it would need the full compressed stream to remain in memory during decompression, so that the RET markers are preserved, as they are lost in the decompressed stream.
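To make the contrast concrete, here is a toy interpreter for the call/ret idea (my own simplified model, with the repeat stored once as a RET-terminated subroutine after the main sequence; the real thing would just execute Z80 CALL/RET or the player's equivalent):

def expand(stream, pc=0):
    # The whole compressed token stream stays in memory because calls
    # jump back into it; "RET" ends both subroutines and the main sequence.
    out = []
    while stream[pc] != "RET":
        tok = stream[pc]
        if isinstance(tok, tuple):        # ("CALL", index into the stream)
            out += expand(stream, tok[1])
        else:
            out.append(tok)               # literal symbol
        pc += 1
    return out

# ABCABCDABFCD with the repeat ABC stored once as a subroutine at index 9:
stream = [("CALL", 9), ("CALL", 9), "D", "A", "B", "F", "C", "D", "RET",
          "A", "B", "C", "RET"]
print("".join(expand(stream)))            # ABCABCDABFCD

Note that expand() keeps indexing into the full compressed stream, which is exactly why it has to remain in memory.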
The rate would not be so good as in LZ77, it would require more memory, and the speed would be comparable.
I’m not sure I agree with that; from what I’ve read, the category of off-line grammar compression is competitive with, if not better than, on-line LZ77 compression (e.g. see the introduction here). Primarily because it can consider the whole file rather than limited dictionary and look-ahead windows. From what I’ve read this can outweigh the benefit of LZ77’s ability to use overlaps.
After a quick Google search I found this survey paper which, among other things, compares LZ77 to LZ78 and confirms that their performance is comparable, with the best-performing one being the LZ78-based LZFG. Note that LZ78 falls into the grammar-based compression category, but is an inefficient variant (see the comments in section II here).
I think I’ve also kind of seen this empirically in my own tests, since I got a 63% sequence size reduction similar to what zip achieved on the file size, and zip applies Huffman compression in addition to LZ77. I didn’t apply Huffman encoding on the output due to constraints of my intended use, but if I had, I think I would’ve achieved similar file sizes.
As for requiring more memory, that depends on your perspective. Off-line means that it can’t be streamed so you have to keep the entire compressed data in memory. But this is not an issue in many cases, e.g. when decompressing from ROM to RAM. However, it’s true that when used as a general decompressor which outputs to a RAM buffer, the benefit that it doesn’t need a RAM buffer for the dictionary is lost.
The longest-first context-free grammar for the ABCABCDABFCD example would be:
s -> 11D2FCD
1 -> 2C
2 -> AB
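Expanding it mechanically as a sanity check (a minimal sketch):

rules = {"s": "11D2FCD", "1": "2C", "2": "AB"}

def expand(rules, symbol="s"):
    # Replace every non-terminal by the expansion of its right-hand side.
    out = ""
    for c in rules[symbol]:
        out += expand(rules, c) if c in rules else c
    return out

print(expand(rules))   # ABCABCDABFCD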
I'm not sure either. In my experience, each compressor has some biases that favor some data over other data, since (at least from what I remember) no compression algorithm has been shown to be optimal in the general case yet (some methods are optimal under certain assumptions, though). So, as long as we have no optimal compression algorithm (which theoretically sounds like we will never have it), I don't think there's going to be any compressor better than the others for all possible data strings.
So, I was just curious to see if this could result in a new compressor to add to the current set of compressors we have. In my latest game I decided to use two separate ones (aplib and pletter), since each of them compressed different sets of files better, and the savings from using both outweighed having the source code of both unpackers. So, having more options is always good.
By the way, I tested Pletter on the set of music files prior to my compaction (60 kB), and the total compressed size added up to 17 kB, whereas zip only got down to 22 kB. After my compaction (36 kB) they compress to 24 kB and 28 kB respectively. I haven’t verified the output, but assuming it’s correct, Pletter’s performance on this data set is quite impressive given that it’s presumably a much simpler compression scheme.
So, as long as we have no optimal compression algorithm (which theoretically sounds like we will never have it), I don't think there's going to be any compressor better than the others for all possible data strings.
Compression is always relative to a model. There are compression methods that are objectively better than others. For example, arithmetic compression is always better than (or, in extreme cases, equal to) Huffman compression for every file; however, the model it is based on (zero-order Markov chains) is very simplistic and not common in real data, though it's been proven to be the best possible compression for that model.
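A small worked example of that gap, with made-up probabilities: for a zero-order source that emits A 90% of the time and B 10%, Huffman has to spend a whole bit per symbol, while arithmetic coding approaches the entropy:

from math import log2

p = {"A": 0.9, "B": 0.1}
entropy = -sum(q * log2(q) for q in p.values())   # ~0.469 bits/symbol, the arithmetic coding limit
huffman = 1.0                                     # Huffman codes here are "0" and "1": one whole bit per symbol
print(f"arithmetic ~{entropy:.3f} bits/symbol, Huffman = {huffman:.0f} bit/symbol")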
The question that remains is whether one of the methods discussed (LZ77 and grammar-based) is a subset of the other, as in the case of Huffman and arithmetic (arithmetic is like Huffman, but allowing for fractional bit lengths). I guess each one's performance depends on the details of the coding and on how hard the compressor tries to minimise the output, and that can favour one kind of file over another. Again, these details influence the compression model that the compressor applies. For example, typical LZ77 compressors tend to favour shorter lengths and offsets, in the sense that they are represented with fewer bits because they are assumed to be more common, but when asked to compress files where that's not the case, the compression will suffer in comparison to a different model. Which ones are more common in practice is what matters. It's also often the case that some methods are better suited to certain types of files than others.
Primarily because it can consider the whole file rather than limited dictionary and look-ahead windows.
I assumed typical sizes for MSX. Pletter seems to be using a 64K window, which I assume is the maximum size of a decompressed stream in MSX. Therefore the window covers the whole file.
and zip applies Huffman compression in addition to LZ77.
Yeah, zlib is not really comparable. One like Pletter is what I had in mind. LZEXE uses a 16K sliding window, which takes it off the table, as uncompressed sizes bigger than 16K are common. Anyway, did you try zlib with compression level 9? AFAIK the only difference is in how hard the compressor tries to find repetitions, so using anything other than the maximum would be unfair to the comparison of how well LZ77 is able to compress (though its use of Huffman rules it out as a fair competitor anyway). Also, better to try with gzip, as the zip header overhead is not negligible for small file sizes, unless you're talking about the compressed size reported when listing the contained files.
As for requiring more memory, that depends on your perspective. Off-line means that it can’t be streamed so you have to keep the entire compressed data in memory. But this is not an issue in many cases, e.g. when decompressing from ROM to RAM. However, it’s true that when used as a general decompressor which outputs to a RAM buffer, the benefit that it doesn’t need a RAM buffer for the dictionary is lost.
That was my point, as it's what santiontanon asked about: using it as a general decompressor. I was thinking of, e.g., using it for decompressing disk streams. You could decompress maybe a full 60K (decompressed size) file with a customized Pletter without too many problems, leaving ~5K for paging the disk handler and other stuff. I don't think you could do that with a grammar-based approach if you need to keep the grammar in memory, except in rare cases where it is extremely small, or by paging the grammar in and out, making the decompression speed suffer enormously in comparison.
Ah, I interpreted santiontanon’s question about the applicability for general data compression in a bit more limited sense: to compress opaque byte data in the context of compressing game data to fit in a (mega) ROM. So more general than the original use case of size-optimising specific command sequences like assembly code or VGM-like data with procedural abstraction, but narrower than what you had in mind. For truly general-purpose compression it is indeed important that it can compress and decompress on-line, with small-ish buffers and in O(n) running time.
If I gzip -9 the uncompacted files, the total is 18 kB. If I gzip the compacted files, it’s 24 kB total. So it’s closer to Pletter, but I’m still surprised at how competitive it is, at least on this data set. Also, I didn’t expect Pletter’s dictionary window to be 64 kB; I would’ve thought it was smaller to reduce the size of dictionary references (deflate’s is 32 kB).