You are here: The Oldskool PC/Oldskool PC Software/LZ4 Decompression for the 8088 |
LZ4_8088 is a chunk of assembly code that implements incredibly fast LZ4 decompression for 8088 and 8086 CPUs. It is specifically optimized for the 8088 and is intended for use in hobby or retrocomputing projects. Code is provided for most generic x86 assemblers, as well as an example of how to use it in Turbo Pascal.
LZ4 itself is a compresson format that is implemented as an open-source C library, with ports to other platforms. The goal of the library is speed, and it currently holds many top speed rankings in compression benchmarks. One variant of LZ4 called LZ4_HC implements optimal parsing, which creates the very best possible set of literal and match runs for a given input and coding set. This leads to compression ratios that are competitive with PKZIP/zlib, but unlike PKZIP, LZ4 doesn't implement order-0 coding (ie. bit twiddling which 8088/8086 is very bad at) so the end result decompresses at nearly memcpy() speeds.
Here are some statistics comparing LZ4 with PKWare's Data Compression Library (DCL). The DCL was chosen as a comparison because it uses an algorithm very similar to deflate (which was considered state of the art speed-wise for many years on DOS platforms), and also because the DCL could be measured with microsecond accuracy. To eliminate implementation bias, PKWare's retail DCL compiled assembly code was used.
Data Type | Filename | Original size | LZ4_HC compressed size | LZ4_HC ratio | DCL Implode size | DCL Ratio | memcpy() time in µs (REP MOVSW) | LZ4 Decompression Speed (x slower than memcpy) | DCL Decompression speed (x slower than memcpy) | ||
Sparse bitmap | shuttle.pic | 16390 | 3297 | 20% | 2552 | 16% | 44989 | 1.91 | 20.50 | ||
RLE bitmap | rle.dmp | 16384 | 436 | 3% | 387 | 2% | 44977 | 0.76 | 5.79 | ||
Small text file | text.txt | 4988 | 3037 | 61% | 2323 | 47% | 13716 | 3.17 | 61.21 | ||
Large text file | largetxt.txt | 56899 | 26890 | 47% | 25642 | 45% | 156457 | 3.17 | 54.43 | ||
Sparse compiled binary | robotron.com | 40704 | 21048 | 52% | 18178 | 45% | 112036 | 2.75 | 53.51 | ||
Dense compiled binary | linewars.exe | 61744 | 41500 | 67% | 36641 | 59% | 169924 | 2.16 | 70.08 | ||
Uncompressable data | random.bin | 64000 | 64260 | 100% | 71822 | 112% | 176113 | 1.39 | 116.14 |
Key takeaways from the above table:
For full documentation, please see the included LZ4_8088.TXT in the download distribution. It is highly recommended you read the documentation to avoid any pitfalls using the code.
LZ4_8088.ZIP contains the assembler routine, a Turbo Pascal test harness, documentation, compression samples, and compiled binaries for Win32 and DOS 16-bit.
The LZ4 library is provided under the BSD License. However, my code was not derived from any of the original LZ4 code so I can provide it via any license I choose. So, I am providing my code under what I am calling the Demoscene License. The Demoscene License grants you the following rights:
Q: Is this code faster than the LZ4 C source code? For 8088-80286 CPUs, yes. For 386+, no, because the 386 has 32-bit registers, additional segment registers, and additional instructions. Compiling the LZ4 C code with a suitable 32-bit compiler will produce code that definitely outperforms this 8088 assembly code.
Q: How is it possible to exceed the speed of a REP MOVSW? Because 1-byte and 2-byte runs in the compressed data are handled specially with REP STOSW. STOSW can set a fixed value in memory in half the time it takes to move a value from one memory location to another.
Q: Why did you bother doing this? Because I like impressing the teenage version of myself.
Currently this distribution only includes decompression code. It is possible to implement LZ4 "c0" compression on 8088 in as little as 16K memory and with reasonable speed, but I don't plan on developing it.
20130107: Initial release.
20130209: Now includes a size-optimized version of the code as well. The speed routine compiles to 446 bytes including the shift table; the size-optimized code is 30% slower but compiles to 79 bytes. Also, speed-optimized version of the code sped up an additional 1% by contributions from both Peter Ferrie and Terje Mathisen (thanks guys!)
20130210: Forgot an optimization; the size version now optimizes to 78 bytes.
20130213: Fixed a bug that could corrupt output if the matches overlapped.
20190615: Size version optimizes down to 79 bytes, not 78. Also unified the .PAS and .ASM files.
20190617: Added size fixups by Axel Kern. Refactored code into a single .ASM file; the .PAS is the test harness only.
20190630: Fixed a bug in _small; assembles to 78 bytes.
20200314: Added speed improvements by Pavel Zagrebin. These speed optimizations make Peter Ferrie's "reversed match offsets" optimization obsolete, so that version was removed.