Data Compression and Decompression in Tight Places (C) Copyright 1992, Jeffrey R. Fox By Jeffrey R. Fox (Contract Programmer) 802 Ithaca Dr. Boulder, CO 80303 303-494-0343 I describe in this report an implementation of a general purpose data compression/decompression algorithm which is simple, achieves good compression, and uses few memory resources. The primary liability is a slow compression rate, but that is not a significant factor in many applications, and optimizations which trade higher memory usage during compression for higher compression rates are referenced. The present FORTH language implementation is particularly targeted at programing environments, such as embedded FORTH systems, where RAM memory is very limited. Possible applications in such environments are loading of compressed source and binary overlays into memory, compression of logged data, transfer of compressed data over a bandwidth limited communications link, access of compressed data bases, etc. This implementation is configured to compress and subsequently decompress DOS files in order to simply illustrate the basic functionality. Interest in data compression has been rapidly increasing lately, but the subject is older than the computer revolution itself. Data compression is achieved by removal of the natural redundancies of the original information, and that is also a key step in the process of creating realiable cyphers, that is, methods of writing secret messages. The method of encryptation used by Benedict Arnold during the revolutionary war is illustative: individual words of a message were replaced by codes indicating the location of the word in a particular book possessed by Arnold and by his British contact. If one considers the book in question to be a dictionary or glossary, and the codes to be a index to the words of the dictionary, and if the average length of the index codes is less than the average length of the replaced words, a very simple compression scheme is apparent. For example, if the message or file has an average word length of 6 ascii characters and the index codes consist of two ascii characters, a dictionary of over 16000 entries can be indexed by these codes and a 3 to 1 compaction could, in principle, be achieved. Dictionary methods form one of the two main classes of compression algorithms, and the primary references of interest in the present discussion are the two ground-breaking papers of Ziv and Lempel (1977 and 1978). The other main class of algorithms are statistical in nature, and based upon the idea that since the individual characters (or words) of a file or message often occur in vastly different proportions, the codes used to represent the more frequently occuring characters (or words) can be chosen to be much shorter than those representing the less frequent characters (or words), and significant compaction can result. An historic example is the design of the Morse code; the most frequently used characters in written English have the shortest codes. In English text files in the form of a sequence of 8-bit (extended ascii) characters, one may find that as few as 10 or 15 frequent characters such a space, 'e', 't', 's', etc., may account for half of the characters in the file. If the first bit of a code indicates whether the code represents one of the most frequent 16 characters or not, then these sixteen characters could be represented by 5-bit codes, with all other characters represented by 9-bit codes. The result is an average of 7 bits per code and an 8 to 7 compaction. But the process may be continued by dividing the two groups into subgroups based upon frequency, and using a second bit to distinguish among the subgroups, etc. This is a variant of the algorithm commonly referenced as "Huffman encoding" and the primary references are Huffman (1952) and Galleger (1978). The design of present day compression technology is characterized by a number of sensitive tradeoffs between the computer resources, such as time and memory, required by the compression and decompression algorithms and the degree of compaction achieved. The highest degree of compaction is reached by approaches which combine dictionary and statistical methods, that is, schemes in which the codes output by a dictionary method are statistically processed by Huffman encoding. The popular archiving programs PKZIP and LHARC use these compound methods, and, interestingly, Japanese hobbiests have been at the forefront in the development of the best implementations. The earlier archivers, in particular ARC and PKARC and UNIX's COMPRESS and TAR were based upon the second of the algorithms of Ziv and Lempel, and in particular on the popular variations introduced by Welch (1984). In this report, we describe in detail an implementation of a modification of the first of the proposals of Ziv and Lempel. The implementation is very close to that of Fiala and Greene (1989, method A1). In this algorithm the dictionary to which an index code refers is a "sliding window" composed of the previous 4096 characters (i. e., bytes) of the file (in its uncompressed form). Thus compression is achieved by replacing the second occurrence of two identical strings of characters by a reference to the first. The "reference" is an index code consisting of 4 bits of string length information followed by 12 bits of positional or address information. Character strings for which a previous occurance cannot be found are simply incorporated literally into the compressed file. The compressed file is thus a sequence of literal strings and reference codes with format: First (upper) nibble of byte is either: 0 -- meaning a literal string follows, in which case the lower nibble is the count (minus 1) of the string and the following count bytes are the string itself (1 to 16 characters) 1-15 -- meaning the following 12 bits are an offset address into the sliding window and this nibble is the length (minus 1) of the string to be copied from that address (2 to 16 characters). In this case the lower nibble is the low 4 bits of the offset into the window and the next byte contains the upper 8 bits. Ignoring for the moment the problems associated with initializing buffers and starting the compression algorithm, consider the central loop of the FORTH word CRUNCH. While there still remain more characters in the uncompressed file as yet unprocessed, one attempts, via MATCHSTRING, to find a sequence of characters which is identical to the sequence of unprocessed characters from the input file which, for the purpose of the matching mechanism, have been moved to a short buffer (STRINGBUF). If no matching string is found, the first character of the unprocessed input is appended (by ADD_TO_LITERAL) to a literal string buffer (LITERALBUF), and if the maximum length for literal strings is thereby reached, the literal string is output by SEND_LITERAL_STRING. If a string match is found, any pending literal string is first output, and then the code for the string match is output by SEND_CODE. In either case, once a character has been processed, it is inserted into the sliding window buffer (SLIDING_WINDOW) at offset position WINDOW_POINTER by CHR->WINDOW. The window pointer is always wrapped at 4096, and thus the window is effectively a "ring buffer." Finally, REFILL obtains an appropriate number of new input characters via the I/O word GETCHR and appends them to the string in STRINGBUF. Initializing the compressor involves filling the sliding window with blanks, setting the window pointer to zero, and initializing the pointers for the literal and I/O buffers. When REFILL detects an end of file condition, the central loop in CRUNCH is exited, and all remaining unprocessed characters are sent to the output compressed file as a literal string. Decompression (UNCRUNCH) of a compressed file involves sequential decoding of literal strings and index codes into output characters strings. The central loop always starts with an input byte (via GETCHR) known to contain the bits which determine the type and length of the next output string. If the upper nibble is zero, then the lower nibble determines the length of a literal string, which is then retrieved byte by byte by GETCHR_&_CHECK. (If a file error or end of file occurs during this operation, then decompression is aborted; this is a fatal error.) If the upper nibble is non-zero, then it determines the length, and the next 12 bits of input determine the offset of a string to be copied from the sliding window to the output file via OUTSTRING. In either case, the sliding window is updated with each processed output character by CHR->WINDOW. The rate limiting step of the compressor is the brute-force string search performed by the nonstandard primitive word STRNDX. Since this word is already in machine code, there is little speed optimization one can do without introducing a more sophisticated algorithm for this search. The binary tree algorithm of Bell (1986) is used in the archiver LHARC. At another time I shall perhaps present the code for a hashing algorithm. But in any case the memory requirements for compression will rise by at least 30K when such algorithms are used. References: J. Ziv and A. Lempel, IEEE Trans., 23, 337-343 (1977). J. Ziv and A. Lempel, IEEE Trans., 24, 530-536 (1978). D. A. Huffman, Proc IRE, 40, 1098-1101 (1952). R. G. Gallager, IEEE Trans., 24, 668-674 (1978). T. C. Bell, IEEE Trans., 34, 1176-1182 (1986). T. A. Welch, Computer, 17, No. 6, 8-19 (1984). E. R. Fiala and D. H. Greene, Comm. of the ACM, 32, No. 4, 490-503 (1989). P.S. Don't use this code without thorough testing. I accept no responsibility for any usage. An effective way of debugging programs of this type is to set up both compressor and decompressor to operate simultaneously, and to feed the output of the compressor directly into the decompressor, monitoring all the information streams. .THEN DOSINT \ this loads a DOS file interface DECIMAL \ flags for conditional compilation FALSE CONSTANT DEBUGGING \ true if you want the FCOPY test loaded \ ************ FILE I/O ROUTINES ("stream i/o") *********** 0 .IF These are general purpose routines for the purpose of reading and writing character streams from/to ms-dos files. There is nothing specific to the compression/decompression algorithm here. .THEN HCB IFILE ( handle control block for input file ) HCB OFILE ( handle control block for output file ) \ necessary buffers and tables 4096 CONSTANT BUFSIZE \ make this much smaller, ie 128, if you want to save memory CREATE IBUFF BUFSIZE ALLOT ( buffer for input file ) CREATE OBUFF BUFSIZE ALLOT ( buffer for output file ) : INPUT-FILENAME ( -- ) \ read file name from console HERE 30 EXPECT HERE SPAN @ STRPCK SWAP NAME>HCB ; : GET_FILE_NAMES ( -- ) \ use an at the OK? prompt to abort BEGIN CR ." Enter source file name: " IFILE INPUT-FILENAME CR ." Enter destination file name: " OFILE INPUT-FILENAME CR ." Source file = " IFILE HCB>N .FILE CR ." Destination file = " OFILE HCB>N .FILE CR ." OK? (use esc to abort)" KEY DUP 27 ( ) = IF DROP ." aborting " QUIT THEN DUP EMIT 32 OR ASCII y = ( i.e. Y or y) UNTIL CR ; : OPEN_FILES IFILE O_RD FOPEN IF CR ." Source file not found" CR QUIT THEN OFILE 0 FMAKE IF CR ." Can't make destination file" CR IFILE FCLOSE DROP QUIT THEN ; : .FILESIZE ( hcb -- d ) \ print filesize in bytes of open file; moves file pointer! 0. 2 FSEEK D. ; : .FILESIZES ( -- ) IFILE CR ." Source file = " .FILESIZE ." bytes" OFILE CR ." Destination file = " .FILESIZE ." bytes" ; VARIABLE CHRS-IN-BUF \ the number of characters read into the input buffer VARIABLE INPOINTER \ pointer to the next input character to be processed : GETCHR ( -- c f) ( get chr from source file,f=0 is EOF) CHRS-IN-BUF @ INPOINTER @ 2DUP <= IF 2DROP IFILE IBUFF BUFSIZE FREAD DUP CHRS-IN-BUF ! DUP IF ." I" THEN INPOINTER OFF 0 THEN IBUFF + C@ SWAP 1 INPOINTER +! ; VARIABLE OUTPOINTER \ points to next unused position in output buffer : OBUFF-FLUSH ( -- ) \ flush all remaining chrs in OBUFF to output file OUTPOINTER @ IF OFILE OBUFF OUTPOINTER @ FWRITE ." O" OUTPOINTER @ <> IF CR ABORT" ERROR: output to file failed" CR THEN OUTPOINTER OFF THEN ; : PUTCHR ( c -- ) \ output byte to output file buffer, flushing when needed OUTPOINTER @ BUFSIZE >= IF OBUFF-FLUSH THEN OUTPOINTER @ OBUFF + C! 1 OUTPOINTER +! ; : CLOSE_FILES IFILE FCLOSE DROP OBUFF-FLUSH OFILE FCLOSE DROP ; : INIT-BUFFS CHRS-IN-BUF OFF INPOINTER OFF OUTPOINTER OFF ; DEBUGGING .IF : FCOPY ( -- ) \ copy one file to another; test file i/o routines GET_FILE_NAMES OPEN_FILES INIT-BUFFS BEGIN GETCHR WHILE ( not at end of file ) PUTCHR REPEAT OBUFF-FLUSH .FILESIZES CLOSE_FILES ; : PUTCHR ( c -- ) \ debugging replacement . 32 EMIT ; : GETCHR ( -- c f ) \ debugging replacement KEY DUP 26 ( ^Z) = NOT ; .THEN \ ************ END OF FILE I/O ROUTINES *********** \ the maximum size of a sliding window indexed by 12 bit codes is 4096 4096 CONSTANT MAXCODE \ create sliding window CREATE SLIDING_WINDOW MAXCODE ALLOT \ sliding window of uncompressed text 16 ALLOT \ string "spill" area -- makes string searches easier VARIABLE WINDOW_POINTER \ pointer into the SLIDING_WINDOW ring buffer CREATE STRINGBUF 16 ALLOT \ where incomming text is stored for matching \ this buffer is used as a first in, first out (FIFO) structure VARIABLE LENGTH \ length of string to be matched VARIABLE PROCESSED \ # of processed chars in STRINGBUF CREATE LITERALBUF 16 ALLOT \ where literal output string is built VARIABLE LIT_LENGTH \ length of literal string thus far built : CHR->WINDOW ( c -- ) \ move one character to the SLIDING_WINDOW at the WINDOW_POINTER & bump pointer WINDOW_POINTER @ DUP 16 < IF 2DUP SLIDING_WINDOW + MAXCODE + C! THEN \ this copies chars (0-15) to the spill area, a convenience for MATCHSTRING SLIDING_WINDOW + C! \ store the char in the SLIDING_WINDOW WINDOW_POINTER DUP @ 1+ [ MAXCODE 1- ] LITERAL AND SWAP ! \ pointer updated and wrapped @ MAXCODE ; : SEND_LITERAL_STRING LIT_LENGTH @ DUP 1- PUTCHR \ send the length-1 LITERALBUF + LITERALBUF DO I C@ PUTCHR LOOP \ send the string LIT_LENGTH OFF ; : ADD_TO_LITERAL ( -- ) \ add the 1st chr in stringbuf to the literal string STRINGBUF C@ DUP CHR->WINDOW LIT_LENGTH @ TUCK LITERALBUF + C! 1 LIT_LENGTH +! 15 = IF \ literal string at length limit SEND_LITERAL_STRING THEN STRINGBUF 1+ STRINGBUF 15 CMOVE 1 PROCESSED +! ; : MIN_MATCH_LENGTH ( --n ) LIT_LENGTH @ IF 3 ELSE 2 THEN ; : MATCHSTRING ( -- offset length ) ( length = 0 if no match) MIN_MATCH_LENGTH LENGTH ! SLIDING_WINDOW MAXCODE BEGIN LENGTH @ 17 < IF 2DUP STRINGBUF LENGTH @ STRNDX DUP -1 <> ELSE \ complete string match, look no further 0 FALSE THEN WHILE 1 LENGTH +! TUCK - >R + R> REPEAT 2DROP SLIDING_WINDOW - LENGTH @ 1- DUP MIN_MATCH_LENGTH < IF DROP 0 THEN ; : STRING_SHIFT_LEFT ( n -- ) STRINGBUF 2DUP + SWAP ROT 16 SWAP - CMOVE ; : SEND_CODE ( offset length -- ) \ send the matched string to the sliding_window, \ and output the length and offset packed into 2 bytes DUP STRINGBUF + STRINGBUF DO I C@ CHR->WINDOW LOOP DUP PROCESSED +! DUP STRING_SHIFT_LEFT 1- 4 SHIFT OVER 15 AND OR PUTCHR -4 SHIFT PUTCHR ; : REFILL ( -- f) ( replenish STRINGBUF from input file, FALSE flag means eof read) BEGIN PROCESSED @ DUP IF GETCHR 0= IF 2DROP FALSE \ it's the end of the input file ELSE TRUE THEN THEN WHILE ( not at end of file nor is STRINGBUF fully replenished) [ STRINGBUF 16 + ] LITERAL ROT - C! -1 PROCESSED +! REPEAT PROCESSED @ 0= ; : INIT-WINDOW \ fill the sliding window with blanks to start SLIDING_WINDOW [ MAXCODE 15 + ] LITERAL BLANK WINDOW_POINTER OFF ; : SEND_FINAL_STRING \ the last few input characters are sent as a literal string 16 PROCESSED @ - ?DUP IF \ any left? DUP 1- PUTCHR \ the length 0 DO I STRINGBUF + C@ PUTCHR LOOP THEN ; : CRUNCH ( -- ) INIT-BUFFS INIT-WINDOW LIT_LENGTH OFF 16 PROCESSED ! \ force buffer to refill BEGIN REFILL WHILE ( not at end of input file) MATCHSTRING ?DUP \ did we find a match? IF LIT_LENGTH @ IF SEND_LITERAL_STRING THEN SEND_CODE ELSE DROP ADD_TO_LITERAL THEN REPEAT \ at this point we've already detected the end of \ the input file, so we finish up by flushing all \ remaining characters out as literal strings LIT_LENGTH @ IF SEND_LITERAL_STRING THEN SEND_FINAL_STRING ; : GETCHR_&_CHECK ( -- c) ( aborts execution if end of file detected) GETCHR 0= IF CR ." unexpected end of file -- aborting decompression " CR QUIT THEN ; : OUTSTRING ( address length -- ) \ must be careful not to overwrite what you are reading! DUP >R STRINGBUF SWAP CMOVE STRINGBUF R> + STRINGBUF 2DUP DO I C@ PUTCHR LOOP DO I C@ CHR->WINDOW LOOP ; : UNCRUNCH ( -- ) INIT-BUFFS INIT-WINDOW BEGIN GETCHR WHILE DUP 15 > IF ( starting a code) ( -- byte [length-1]*16 ) DUP -4 SHIFT 1+ \ this is the length SWAP 15 AND \ this is the lower 4 bits of the offset GETCHR_&_CHECK 4 SHIFT OR SLIDING_WINDOW + ( -- length address ) SWAP OUTSTRING ELSE ( it's a literal string) 1+ \ the length 0 DO GETCHR_&_CHECK DUP PUTCHR CHR->WINDOW LOOP THEN REPEAT DROP ; : START ( -- ) CLS GET_FILE_NAMES OPEN_FILES ; : FINISH ( -- ) OBUFF-FLUSH .FILESIZES CLOSE_FILES ; : COMPRESS ( -- ) START CRUNCH FINISH ; : DECOMPRESS ( -- ) START UNCRUNCH FINISH ; .( LEMPEL-ZIV CODE LOADED) QUIT GLOSSARY of nonstandard words: FREAD ( hcb adr len -- len') read len bytes from file to adr, len' was read FWRITE ( hcb adr len -- len') write len bytes from adr to file, len' was written FOPEN ( hcb mode -- status) open file or device O_RD ( -- mode ) read only mode for FOPEN FCLOSE ( hcb -- status) close file indicated by hcb 0. 2 FSEEK ( hcb -- d) move file pointer to end of file, report position -4 SHIFT means 16 / 4 SHIFT means 16 * .IF ( f -- ) if true, stop interpretation until .ELSE or .THEN is parsed .ELSE stop or start interpretation .THEN mark end of conditional interpretation/compilation CLS clear and home screen 0 CONSTANT TRUE -1 CONSTANT FALSE : OFF ( adr -- ) FALSE SWAP ! ; : ON ( adr -- ) TRUE SWAP ! ; HCB make handle control block associated with : TUCK ( n1 n2 -- n2 n1 n2 ) SWAP OVER ; : NIP ( n1 n2 -- n2 ) SWAP DROP ; STRNDX ( str_addr str_len patn_addr patn_len --- n ) searches memory starting at str_addr for a string matching that at address patn_addr and of length patn_len. The argument returned is -1 if no match or the address of the match, an address between str_addr and [str_addr+str_len-patn_len]. the SLIDING_WINDOW WINDOW_POINTER DUP @ 1+ [ MAXCODE 1- ] LITERAL AND SWAP ! \ pointer updated and wrapped @ MAXCODE ; : SEND_LITERAL_STRING LIT_LENGTH @ DUP 1- PUTCHR \ send the length-1 LITERALBUF + LITERALBUF DO I C@ PUTCHR LOOP \ send the string LIT_LENGTH OFF ; : ADD_TO_LITERAL ( -- ) \ add the 1st chr in stringbuf to the literal string STRINGBUF C@ DUP CHR->WINDOW LIT_LENGTH @ TUCK LITERALBUF + C! 1 LIT_LENGTH +! 15 = IF \ literal string at length limit SEND_LITERAL_STRING THEN STRINGBUF 1+ STRINGBUF 15 CMOVE 1 PROCESSED +! ; : MIN_MATCH_LENGTH ( --n ) LIT_LENGTH @ IF 3 ELSE 2 THEN ; : MATCHSTRING ( -- offset length ) ( length = 0 if no match) MIN_MATCH_LENGTH LENGTH ! SLIDING_WINDOW MAXCODE BEGIN LENGTH @ 17 < IF 2DUP STRINGB