Introduction. Squeezing, and Huffman Code. FYNDE.A86 - FYNDE.ASM - Description FYNDE.A86 - FYNDE.ASM - Tutorial Message HJELP.A86 - HJELP.ASM - Description HJELP.A86 - HJELP.ASM - Tutorial Message TYSQ.A86 - TYSQ.ASM - Description TYSQ.A86 - TYSQ.ASM - Tutorial Message UNSQ.A86 - UNSQ.ASM - Description UNSQ.A86 - UNSQ.ASM - Tutorial Message UNSQ80.LIB UNSQ86.LIB Bibliography. :Introduction. ICUAP9 responds to the SIG/M Disk Editor's column in the June, 1984 issue of the ACG/NJ Newsletter, which requested a synthesis of USQ.COM and FIND.COM. We have had a variant of FIND, namely FYNDE (SIG/M 165.09,10,11,12), which resembles the UNIX "grep" for some time; but SQ and USQ were vaguely known by reputation since we have not used them. Since it is only a question of opening the pipeline between the disk input and the application program, it did not seem to be an exceptional challenge to generate this synthesis. The question of locating a copy of SQ.COM and USQ.COM was solved by locating them in SIG/M 58.26 and SIG/M 58.27. The ensuing disassembly revealed the fact that they had been compiled from BDS "C" but comments embedded in the object code, some computer science references, and several experiments in squeezing known text finally showed how these two programs worked. A scan of the SIG/M software guide reveals quite a few squeeze and unsqueeze programs, aside from the fact that it is now an announced policy of the SIG/M Disk Editor to include a copy of an unsqueezer on each disk bearing squeezed files. Some are long, some are short - no doubt due to the fact that compiling a program from source code written in a higher level language will result in voluminous code - perhaps 20 times more than equivalent assembly language. - SIG/M disk libraries seem to contain the following programs: SIG/M 48.13 SQ.COM 14K pack ASCII file SIG/M 48.14 USQ.COM 10K unpack ASCII file SIG/M 58.26 SQ.COM 14K SQueeze compress program SIG/M 58.27 USQ.COM 10K UnSQueeze expans program SIG/M 60.18 SQ-15.COM 14K Updated Squeeze SIG/M 60.21 USQ-15.COM 10K Updated Unsqueeze (and related prog's & doc) SIG/M 144.19 SQ.OBJ 17K Update to squeeze (other related prog's & doc) -> SIG/M 152.13 USQ.COM 2K Wildcard unsqueeze SIG/M 152.14 USQ.DOC 3K / SIG/M 156.07 SQ-17.LBR 24K Updated squeeze program SIG/M 158.20 USQ.CMD 18K wildcard unsq for CP/M86 SIG/M 175.15 SQ.CMD 20K squeeze - CP/M-86 SIG/M 175.18 USQ.CMD 18K unsqueeze files SIG/M 176.06 UNSQ12.LBR 16K Z80 unsqueeze SIG/M 176.07 USQ119.AQM 9K USQ source code PC-Blue 057.08 SQ.EXE 20K squeeze PC-Blue 057.09 USQ.EXE 27K unsqueeze - A similar survey could be made of FIND programs, but they generally appear to have been coded in assembly language. Anyway our intention was to use FYNDE, so the matter was not pursued further. The essence of USQ.COM seems to fit in 256 bytes. An envoironment in which disk files can be opened and read, and the unsqueezed text displayed at the console or deposited on a disk, brings the program up to 1K or 2K, which is consistent with the size of the program marked with an arrow in the previous panel. The files UNSQ80.LIB and UNSQ86.LIB contain this 256 byte nucleus, in a form that can be used by ED.COM or ED.CMD to insert it into any program which has to deal with a squeezed input file. If the remainder of the program has been structured so that a call to a certain subroutine will deliver a new byte, with user-transparent refreshment of the input buffer from disk when exhausted, it is only necessary to interpose the decoding subroutine between the two. An initialization subroutine is also required, to separate the dictionary from the text, record the checksum, and recover the original file name. A compact unsqueezing program should have various applications. The simplest is to restore a squeezed program to its natural state. Beyond this, there are many programs which could take their data directly from a squeezed file. - This disk contains four programs that can handle squeezed data files. They have been coded for the Intel 8080 using ASM.COM, and the Intel 8086 using ASM86.CMD. They are included on the disk with both source and binary code. UNSQ Unsqueeze a squeezed file. This is the basic program for recovering a squeezed file. It will unsqueeze only a single named file (no wildcards) and will check that it does not overwrite another program of that name already on the disk. TYSQ Examine a squeezed file. If the file is an ASCII file,it will be passed to the console according to the style of the CP/M utility TYPE. If it is a binary file (original extension .COM or .CMD, according to the version) it will be presented in the style of DUMMP (SIG/M 165.05 - 08). FYNDE Search through a family of files on the given disk to find a keyword. In contrast to the original FIND, the keyword may be a regular expression (as in the UNIX "grep"); the line number may also be reported relative to a regular expression label. HJELP Display squeezed HELP files just as HELP displays normal files. - :Squeezing, and Huffman Code. A Huffman code is a variable length code in which the most frequent symbols are represented by the shortest binary numbers. In this it is reminiscent of the Morse code used in telegraphy, by which a few dots and dashes are used for frequent letters like e or s, and longer sequences for rarer letters like q. Since it is based on a mathematically rigorous optimization procedure, it can be used to get the code of shortest average length for a given message. It is easy enough to see how it works, without having to go through the formal mathematics. Suppose you have two symbols; then, no matter whether one is used much more ofter than the other, you need a bit to distinguish them. 0 for one, 1 for the other. If you have three symbols, you need two bits, but some of the capacity of two bits is going to waste because they could represent as many as four symbols. Suppose that in a message of a thousand symbols, 700 are A's, 200 are B's, and the remaining 100 are C's. If we assign A = 00, B = 01, C = 10 and leave 11 unused, the message will require 1000*2 or 2000 bits for its representation by binary numbers. If we assign A = 0, B = 10, C = 11, the message will require 700*1 + 200*2 + 100*2 or 1300 bits if they are strung out in sequence. The variable length code saves us 700/2000 or 35% in the length of the message. - Furthermore, the message can be decoded, because a 1 stands alone, but a 0 requires us to take the next bit too. In general, none of the three codes can be part of the beginning of any of the others. In the last example, suppose that there were a fourth symbol, D, and that in a thousand symbols the frequencies were now 700 A's, 200 B's, 80 C's and 20 D's. Always using two bits to distinguish A, B, C, D would still require a total of 2000 bits. Even though four symbols fills out two bits exactly, the assignment of 0 for A, 1xxx for the rest will shift the average number of bits per symbol towards 1. If we use 10 for B, 11 will be the prefix for C and D These two can be given 0 or 1 indifferently, so we finally have the assignment A = 0, B = 10, C = 110, D = 111. That A is coded in one bit is compensated by the three bit codes. The average is still around 2, but exactly : 700*1 + 200*2 + 80*3 + 20*3 = 1420. The shortening is 29%, the average number of bits per symbol 1.42. This scheme is essentially the Huffman code, but it is usually carried out in reverse. Here the short code is given to the most common symbol, but it might be that all the rest combined were even more numerous and deserve the short symbol as a class. Instead, we assign 0 and 1 arbitrarily to the two rarest symbols as the last bit of the code. They are then pooled into a common class for the next competition for rarest and the assignment of the penultimate bit. - Repeating this procedure, the number of classes is eventually reduced to two, for which the final assignment of bits is made. It is not entirely facetious to ask what to do when there is only one class, or more likely, just one single symbol. The only thing that can be done with it is to repeat it several times, and surely it would be more economical to state how many times to repeat it - as a binary number - than to actually repeat it that many times. Rather than n symbols, we need log(2) n symbols. This idea is likely to be the first one to occur to someone interested in reducing the size of a file of data, and is akin to the idea of replacing multiple spaces by a single tab character. It is especially valuable in data with many characters repeated in sequence, such as spaces or dashes. SQ.COM combines the two ideas - replacing repeated symbols by their count, and compressing text using a Huffman code - to produce a squeezed file. Any combination of diverse compression strategies can show inconsistencies, so care has to be exercised to show that the effect of one scheme does not negate the benefits of the other. In this combination, the symbol marking a repetition, and the individual repetition counts are new, hopefully themselves rare, symbols to contend with. - The question to be asked, is, how much is to be gained by squeezing files? Empirically, many squeezed files end up at 60% of their original size. It is known that Huffman codes work best when there is a great disparity in the frequencies of individual symbols; if the frequencies are approximately the same, the result is simply a binary enumeration of them all, and the Huffman code is simply binary code. ASCII text files waste a bit because the parity bit is generally ignored. Any scheme that dropped the parity bit and encoded the ASCII file into a bit stream would compress the text by 1/8, or 12.5% without any further effort. If the text were mostly lower case, six-bit ASCII would be appropriate, and another 12.5% realized. If the text were representable in 5-bit ASCII, that is if it were mostly letter text with few numbers, upper case letters and graphical symbols, we would be up to 37.5%, which is the empirical factor. One wonders whether the Huffman code isn't just showing up this structure of the majority of ASCII files, and whether a uniform coding scheme wouldn't give pretty much the same results as creating a specially tailored code for each file? In this respect, it is worth noting that ASCII files of numerical data can be squeezed by better than 50%, because the digits, commas and spaces form a set that can be represented in four bits. If some digits - zero say - are quite numerous, the squeezing can be somewhat better still. - SQ.COM generates a Huffman code by using single bytes as symbols - with the exception of the repeated strings which it encodes as a preliminary - which is a procedure that is bound to miss out on the structure that a given kind of file may possess. For example, an .ASM file is several times longer than its .COM file, even when it has no comments. This is due to the fact that Intel mnemonics typically use three characters to describe a single byte of operation code. Two byte addresses rarely have labels with as few as two bytes. If mnemonics and addresses were treated as individual entities, the squeezed file could be much shorter. Admittedly such encoding would be very specialized, and worthless when applied to something which was not an Intel 8080 assembly listing. File squeezing, to be effective, must take into account the structures existing in the file under consideration. - :FYNDE.A86 - FYNDE.ASM - Description FIND, in our experience, dates back to the CBBS (R) disk which we obtained from Ward Christensen and Randy Suess. It was included as a utility with the explanation that it was useful for tracing errors in their long linked assemblies, as well as for reconstructing damaged message disks. In this original form, an ambiguous file reference could be searched for several alternative phrases, all of which were specified on the command line. It is not hard to see that such a utility has many applications, or that we had considerable curiosity to disassemble it to see how it worked. We had also recently acquired MicroShell(R) and found that the vertical bar which separated alternatives in the FIND command line conflicted with the pipe symbol from UNIX. If some changes were to be made, others could be made also. First, the set of alternatives XXX|YYY|XXX|ETC which FIND used could be broadened to include wildcard symbols - @ for alphanumeric, ? for any single character - and a symbol for repetition. The application at the moment was tracing symbols in a disassembly listing, for which a line number from the beginning of a long listing was not very useful. Thus a "label" field was added - literally labels in the listing - and line numbers were rendered relative to the nearest label. - With the final step of enclosing alternatives within square brackets - the example above became [XXX!YYY!ZZZ!ETC] - and allowing subexpressions inside expressions, all the machinery existed to recognize any phrase defined as a regular expression. Thus a typical label might be X{?}: and would match X:, X0005:, X37A:, or whatnot - anything beginning with X and ending with a colon. The result of all this rearrangement became the first version of FYNDE. FIND was already a well written program, which we streamlined somewhat. Thus, bytes were transferred from an input buffer to a line buffer in a very orderly manner, by calling a subroutine. This is what makes it very simple to interpose a subroutine to decode the Huffman code for squeezed text, after converting the byte stream into a bit stream. FYNDE also includes some statistics missing from FIND - the number of lines on which the specified phrase was found (not counting repetitions on the same line) - and a grand total throughout the disk. In the new version the rest of an unwanted file can be skipped rather than terminate the session. Files recognized as binary (extension .COM or .CMD) are ignored. This is appreciated when an overly generous file specification has picked up some .BAK files or the binary files of the program under study. - The following panel gives an example of the use of FYNDE. The command line given was fynde tysq.* {?}:!xchg applied to this same disk, ICUAP9. There are several TYSQ files present, .COM, .CQM, and .CMD, in addition to .ASM, .AQM, and .A86. In the interest of obtaining a short result, the rare 8080 instruction xchg was requested, identified with respect to the last label preceding it. Labels are at the beginning of a line and terminated by a colon, but FYNDE would also pick up spurious labels if there were a colon in a comment, or a . False matches of this sort are rare, and do not diminish the usefulness of FYNDE very much. Note that when the squeezed file was scanned, its original name was shown, so that the original file could be consulted if need be. Also, no file was abandoned, and the scan was allowed to procede to completion. Thus complete frequency counts are shown. In the second panel following, both kinds of interrupts were carried out, to show their effect on the console display. - fynde tysq.* {?}:!xchg FYNDE.COM 04/01/84 ICUAP ------> File TYSQ .ASM chek:+ 1 xchg 1 lines found .COM file disregarded. ------> File TYSQ .AQM [original] : TYSQ .ASM chek:+ 1 xchg 1 lines found .COM file disregarded. ------> File TYSQ .A86 0 lines found .COM file disregarded. 2 instances in the entire disk - fynde tysq.* roco FYNDE.COM 04/01/84 ICUAP ------> File TYSQ .ASM + 77 sta roco ;bit rotation counter + 428 lxi h,roco ;bit rotation counter + 504 roco: ds 1 ;rotating bit counter 3 lines found .COM file disregarded. ------> File TYSQ .AQM [original] : TYSQ .ASM + 77 sta roco ;bit rotation counter -- Remainder of File Skipped -- .COM file disregarded. ------> File TYSQ .A86 + 65 mov roco,1 ;bit rotation counter + 263 dec roco ;bit rotation counter + 265 mov roco,8 ;bit rotation counter -- Search Terminated -- - In this panel, the command line fynde tysq.asm [jp!j[z!nc]] was given to show how alternatives can be nested. gives absolute line numbers. FYNDE.COM 04/01/84 ICUAP ------> File TYSQ .ASM + 56 jz tuto ;type tutorial message + 60 jz yext + 92 jz rchk ;read checksum + 111 jz ldic ;load code dictionary + 205 jnc dncs ;skip for 1, stay for 0 + 213 jz dnct + 215 jp dncu ;p means new offset + 283 jz typp + 406 jz exit ;return to CP/M 9 lines found 9 instances in the entire disk - Several error messages can be provoked. One of the more common is for the label or pattern not to have its parentheses balanced - either square brackets or curly brackets or both. Null alternatives are also rejected, but this is not necessary except to keep {} out of a loop. If the command line of the previous panel were erroneously given as fynde tysq.asm [jp!j[z!nc] the following reaction would ensue: FYNDE.COM 04/01/84 ICUAP -- Bad Pattern -- If you want to print the whole file, give the command line fynde tysq.asm {?} - :FYNDE.A86 - FYNDE.ASM - The Tutorial Message is the following: The command line FYNDE D:FILE.EXT EXPRESSION will search through all instances of FILE.EXT (which may be an ambiguous reference) on disk D to find lines containing EXPRESSION. Such lines will be presented on the console preceded by a line number, and classified by file. EXPRESSION may have the form LABEL!PATTERN or simply the form PATTERN. Both may contain: [p1!p2!...!pn] alternative strings {p1!p2!...!pn} repeated alternatives ? any single character @ for any alphanumeric: a-z, A-Z, 0-9 _ in place of horizontal tab When a label is present, lines will be numbered relative to the label. Example: X{?}:![call!ret] will list calls and returns relative to labels like X0100: or X33:. LABEL begins in column 1, PATTERN can begin in any column. Squeezed files will be searched as well as unsqueezed ones. Use ^C to quit, any other key skips rest of file. - :HJELP.A86 - HJELP.ASM - Description. HJELP was derived from HELP8080.COM (SIG/M 122.04) by disassembling it and inserting UNSQ80.LIB according to the procedure described therein. 80T86.CNV (SIG/M 173.04) was then used to create a version for the Intel 8086. HELP loads its HELP file into memory, from which parts of it can be studied under control of the program. Sometimes the file name is saved, and an entire new file loaded into memory, for which the same process takes place. These supplementary core loads are recursive down to a maximum depth, so that the previous file can be replaced at any time, or the original file can be recovered to start over again. An alternative to squeezing the HELP file is to divide it up into many of these supplementary files, but there is no reason that both procedures cannot be used. Since HELP takes text directly from memory (with ), a subroutine had to be introduced to decode the squeezed file. However, the structure of a squeezed file is such that it can be loaded whole into memory, and its dictionary can be consulted in place. This makes it quite easy to retain the overlay structure which HELP uses. Thus replaces throughout HJELP; a lookahead is sometimes needed. - It was another matter to mark the beginning of the panels with a sign bit, because this would destroy the coding in a squeezed file. The solution was to scan each file as it is loaded and construct a dictionary showing the beginning of each information section. As the information section is being read, a pushdown list of pointers is kept to mark the panel breaks. At the price of a slight delay for the original scan just after the file has been loaded, access to any of the information sections or backing up panels will be quite instantaneous. Further modification included scanning for HQP files as well as HLP files when HJELP is called with a null command line, accepting a given extension when it is given rather than forcing .HLP, and trying .HLP, then .HQP, in order when a HELP file name is given without an extension. Aside from its ability to handle .HQP files, HJELP appears to the user pretty much as though it were HELP. Thus, HELP2.HLP (SIG/M 103.04) and HELP.HLP (SIG/M 103.03) continue to be valid sources of information for constructing HELP files and using HJELP. - :HJELP.A86 - HJELP.ASM - Sample Tutorial Message. The real tutorial is in HELP.HLP, or whatever the user chooses to put in its place. If ZCPR is not in use and the system is fixed, this is a convenient place to put a menu of the HELP files in the system and to load any of them as overlays. HELP (from ZCPR) Modified for squeezed HELP files. Default HELP Facility Invoked - Available HELP Files are - HELP REC ICUAP9 - Squeezed HELP Files - REC XXX Type Any Character for Default Info (^C to Quit) - - :TYSQ.A86 - TYSQ.ASM - Description. TYSQ is a program which allows examination of squeezed files from the console. Unless the file is recognized as a binary file because its original extension was .COM (or .CMD), TYSQ resembles TYPE. In the other case, it gives a DDT-like hexadecimal dump with a marginal attempt to interpret each byte as an ASCII character. The two examples below show this. tysq tysq.aqm TYSQ.ASM ; ----------------------------------------------------------- ; ; "Squeezed" files can be generated by the program SQ.COM, ; (SIG/M 58.26) and then restored by the program USQ.COM ; (SIG/M 58.27), both of which appear to have been compiled ; from BDS "C". The scheme is to generate a Huffman code, ; after a preliminary pass which replaces all characters ; repeated Dump interrupted. - tysq tysq.cqm TYSQ.COM 0000 31 FB 04 3A 5D 00 FE 20 CA A7 03 21 66 00 7E FE 1..:].. ...!f.~. 0010 51 CA 1D 01 11 E2 03 CD 16 03 C3 32 03 CD 38 03 Q..........2..8. 0020 21 00 00 22 00 05 22 06 05 22 0A 05 21 0F 05 22 !.."..".."..!.." 0030 0D 05 3E 00 32 08 05 3E 01 32 FE 04 3E 10 32 0C ..>.2..>.2..>.2. 0040 05 32 09 05 01 00 00 CD 73 03 FE 76 C2 57 01 CD .2......s..v.W.. 0050 73 03 FE FF CA 60 01 11 E2 03 CD 16 03 C3 32 03 s....`........2. 0060 CD 68 03 22 04 05 CD 73 03 B7 CA AD 01 FE 2E C2 .h."...s........ Dump interrupted. - :TYSQ.A86 - TYSQ.ASM - Tutorial Message. Type Squeezed File/ICUAP/July 1, 1984. TYSQ [D:]FILE.EXT will unsqueeze FILE.EXT and type it at the console. The convention for squeezed files is that EXT has the form ?Q?. Press any key to stop the display. - :UNSQ.A86 - UNSQ.ASM - Description. UNSQ, in its versions for the Intel 8080 and the Intel 8086, was constructed to see how small USQ.COM could really be, if it were not compiled from "C" source code. It serves to restore a squeezed file to its original state, and will work for binary files as well as for ASCII files. As it happens, no doubt deliberately, the dictionary prefixed to the squeezed file can be used directly to decode the latter. Bits are shifted out of the squeezed data to the right, byte by byte. When one byte is exhausted the next is picked up, and so on. Rightshifting has its inconveniences (it complicates ordering squeezed data) but given the Intel byte order for double registers (low byte, high byte) which is also not lexcographic, byte pairs can be fetched with standard procedures to be rightshifted through a 16-bit register. This is no doubt an artifact of the "C" compiler. The dictionary consists of a series of pairs of byte pairs. The first is to be used if the incoming bit is 0, the second if it is 1. The use of each pair will depend on whether it is positive or negative. If negative, the low order byte is the complement of the coded character. If positive, it is the index of the next pair-pair to be consulted, using the next incoming bit. - SQ.COM seems to code a single byte, although not all possible bytes usually occur in a given file. In particular, it seems to code files as binary files even though they are really ASCII files. Thus terminal ^Z's are coded along with the rest of the file, when they are present. In order to know where the squeezed file ends, 100H is encoded as a special end-of-file symbol. Since it occurs only once an a given file, it will have one of the lowest frequencies. Since runs of bytes are replaced by a counter and a marker indicating that a run is present, this configuration has to be recognized when the file is unsqueezed. The byte 90H is used to mark repetition, the exact configuration being <90H>. When 90H was actually part of the source file, as it often is in binary files, the count is zero. Runs of three or more are coded with 90H, includes the byte which forms part of this triple. Thus, if bytes are dispatched as they are decoded, has to be reduced by 1 for the run generator. Since no code for one byte in a Huffman code is the prefix for any other code, we know that the decoding for each byte has finished when we encounter it in the dictionary. The next incoming bit must belong to the next character, so that the correct decoding of one byte is essential for the decoding of all that follow. Erroneous bits can readily disrupt the decoding sequence. - The structure of a squeezed file seems to be: signal that the file is squeezed two byte sum of all single bytes filename.ext of source file <00H> end of filename number of bytes coded by SQ.COM the squeezed data to fill out the last record UNSQ.ASM will not overwrite another file of the same name as the source file, and it will erase the unsqueezed file that it is producing if the checksum cannot be verified. This avoids leaving a possibly defective unsqueezed file on the disk. - :UNSQ.A86 - UNSQ.ASM - Tutorial Message. UNSQ.ASM will restore files which have been squeezed by SQ.COM or some similar program. UNSQ [X:]FILE[.QQQ] will read [X:]FILE.SQZ to produce [X:]FILE.UNS. - :UNSQ80.LIB ; ------------------------------------------------------------- ; UNSQ.LIB is a library file to be used with Digital Research's ; ED.COM to insert the set of subroutines which it contains ; into another program. The purpose of these subroutines is to ; read text from a file which has been squeezed by a program ; such as SQ.COM (SIG/M 58.26). ; ; UNSQ.LIB is divided into several sections, which can be left ; as they are, or integrated into the host program, according ; to your sense of orderliness. There are two subroutines that ; will be most likely to be called upon - unii and onsq. The ; first should be called for initialization, because it reads ; in the code directory and sets it up. It also separates out ; the file's checksum and the unsqueezed file name. The second ; is to be called each time that a character is to be taken out ; of the squeezed file. ; ; UNSQ.LIB assumes that the host program contains the following ; three subroutines: ; ; ibyt - which will deliver one character from an ; external input stream each time it is called. ; the external stream can be taken from memory, ; from a buffer that is periodically replenished ; from disk, or whatever. ; ; ufil - which will dispose of the characters forming ; the unsqueezed file name one by one. It might ; ignore them, store them in some file control ; block, or make some other use of them. ; ; ferm - which will type a fatal error message on the ; console and then return to CP/M - or take any ; other action which is desired. ; ; UNSQ.LIB Copyright (C) 1984 ; Universidad Autonoma de Puebla ; July 14, 1984 ; ; [Harold V. McIntosh, 14 July 1984] ; ------------------------------------------------------------- ; ------------------------------------------------------------- ; Section 1. Equivalences defining constants. ; ; CR, LF are used in the messages, and are ; traditionally defined in each program. ; ; csiz is the expected maximum number of ; characters which have been coded; 256 ; is appropriate if all possible bytes may ; have been encoded. ; ; = but some other pair ; might sometimes be used to mark a squeezed ; file CR equ 0DH LF equ 0AH csiz equ 256 idhi equ 0FFH idlo equ 076H ; ----------------------------------------------------------- ; Section 2. Initialization. ; ; will initialize two necessary variables ; (rcnt, roco), then start reading the squeezed file. ; This file must previously have been opened, and its ; suitability checked - eg that it has extension ?Q?. ; It may even have already been loaded into memory. ; All interaction with it is through , and ; the only requirement is that ibyt start reading the ; file from the beginning. ; ; unii goes through the following sequence: ; ; check squeezed marker ; read and store checksum ; read, make available name for original file ; check the length of the code dictionary ; load and store the code dictionary ; ; unii will offer a fatal error message if ; ; the marker is not present ; space will not accomodate the code table ; ; unii requires two host subroutines: ; ; ibyt returns the next byte from the source ; ufil disposes of the original file name ; ; No assumptions should be made concerning conservation ; of the 8080 registers by unii. ; ; unii will alter the following memory registers: ; ; rcnt - repeated character - set to 0 ; roco - count bits/byte - set to 1 ; cksm - checksum - set to file's checksum ; dode - code table - loaded with dictionary ; Initializations. unii: mvi a,0 sta rcnt ;repetition count mvi a,1 sta roco ;bit rotation counter ; Set up code table. Squeezed files seem to begin with the ; hexadecimal word (FF76) stored in Intel byte order, which ; would not be likely to start an unsqueezed file. cota: call ibyt ;fetch one byte from input stream cpi idlo jnz cote ;mssg: 'not squeezed file' call ibyt ;fetch one byte from input stream cpi idhi jz rchk ;read checksum cote: lxi d,nsqz ;'not a squeezed file' jmp ferm ;fatal error message ; The "squeezed" marker is followed by a two-byte checksum, ; which is the simple sum of all the one-byte characters in ; the source file, carried as a two byte sum modulo 2**16. rchk: call iwor ;fetch two bytes from input stream shld cksm ;checksum ; Unsqueezed file name. It is an ASCII sequence, may be lower ; case if SQ.COM received it in response to a prompt, ending ; with a zero byte. Some trash may be present if SQ.COM wasn't ; used correctly. If the file name is to be used for something, ; such as defining an output file or checking the file type, ; this is the place to insert the appropriate code. luup: call ibyt ;fetch one byte from input stream push psw call ufil ;process unsqueezed filename pop psw ora a jnz luup ; Load code dictionary. It is preceded by its two-byte length, ; and consists of a series of pairs of two-byte addresses. For ; each bit in the code, select the first element (0) or the ; second (1) element of the pair. If the pair is positive, it ; is the table entry (code + 4*index) at which to continue with ; the next bit. If the pair is negative, it is the complement ; of the coded ASCII character (low order byte except for [end]). ldic: call iwor ;fetch two bytes from input stream dad h dad h mov c,l mov b,h lxi h,csiz mov a,l sub c mov a,h sbb b jnc ldii lxi d,ntab ;'insufficient dictionary' jmp ferm ;fatal error message ldii: lxi h,code ;decoding table ldij: push b push h call ibyt ;fetch one byte from input stream pop h pop b mov m,a inx h dcx b mov a,c ora b jnz ldij ret ; ---------------------------------------------------------- ; Section 3. Read next unsqueezed byte. ; ; will withdraw a sufficient ; number of bits from the source file to ; generate one byte, returning it in the ; accumulator. The end of the source file ; will be signified by c = 1; otherwise ; c = 0 will prevail. Reading of the source ; file may terminate before this condition ; is reached, for example if the source ; was an ASCII file terminating with ^Z, ; or a HEX file terminated by a final line. ; ; The preservation of registers cannot be ; guaranteed; surround with ; pushes and pops is such continuity is ; required. ; ; onby will solicit the fatal error message ; subroutine if the [eof] marker signified ; by c = 1 has been reached without the ; checksum having balanced. This protection ; can be secured for files that were not ; read in their entirity by reading out the ; remaining bytes in a dummy loop. ; ; onby alters the following memory registers: ; ; lach - last character deposited ; rcnt - repetition count ; roco - rotating bit counter ; roby - rotating byte ; cksm - checksum ; Type unsqueezed code. Beware of the [end] marker, ; and also the repeat marker 90H which occurs in the ; combination <90H>. When count is zero, ; 90H itself is intended; otherwise is to be ; repeated times, including the occurrence just ; before 90H was seen. onsq: lda rcnt ;repetition count ora a jnz onsr call dnch ;decode next character jc vchk ;verify the checksum cpi 090H ;repeat last character jnz onsu ;normal character call dnch ;decode next character ora a jz onss dcr a onsr: dcr a sta rcnt ;repetition count lda lach jmp achk onss: mvi a,090H jmp achk onsu: sta lach ;last character typed jmp achk ; Decode next character. dnch: lxi h,code ;decoding table dncr: call ibit ;read next bit jnc dncs ;skip for 1, stay for 0 inx h inx h dncs: mov e,m ;get next offset inx h mov d,m mov a,d cpi 0FEH ;FEFF means [end] jz dnct ora a jp dncu ;p means new offset mov a,e ;m means complemented char cma stc cmc ret dnct: stc ;flag [end] with carry bit ret ; Calculate +4*. dncu: lxi h,code ;decoding table dad d dad d dad d dad d jmp dncr ; Read one bit at a time. ibit: push h lxi h,roco ;bit rotation counter dcr m jnz ibiu mvi m,8 call ibyt ;fetch one byte from input stream sta roby ;rotating byte ibiu: lda roby ;rotating byte rar sta roby ;rotating byte pop h ret ; Read one word. iwor: call ibyt ;fetch one byte from input stream mov l,a push h call ibyt ;fetch one byte from input stream pop h mov h,a ret ; Accumulate checksum. achk: lxi h,cksm mov b,a mov a,m sub b mov m,a inx h mov a,m sbi 0 mov m,a mov a,b stc cmc ret ; Verify the checksum. vchk: lhld cksm ;checksum mov a,l ora h stc rz ;return to CP/M lxi d,chno ;'Checksum failure.' jmp ferm ;fatal error message ; ----------------------------------------------------------- ; Section 4. Host subroutines. ; ; The following subroutines must be provided ; by the host program: ; ; ibyt - deliver the next byte from ; the source program, in register A, ; on each call. Need not preserve any ; 8080 registers. ; ; ufil - must absorb one byte from ; register A on each call. These bytes ; will be the information that is ; nominally the name of the source ; file, SOURCE.EXT, including the dot. ; ufil should be prepared for possible ; variants, however. ufil will be given ; final zero byte to signal the end of ; the bytestream it will receive. Of ; course, ufil could be imbedded in ; place of a call to it. ; ; ferm - send a fatal error message to ; the console. Normally control would ; return to CP/M, but some other action ; may be taken. ibyt: ret ;fetch byte from the input stream ufil: ret ;process unsqueezed filename ferm: ret ;fatal error message ; ------------------------------------------------------------- ; Section 5. Messages and memory locations used by the program. ntab: db CR,LF,'Insufficient dictionary space.$' nsqz: db CR,LF,'Not a squeezed file.$' chno: db CR,LF,'Checksum failure.$' lach: ds 1 ;last character typed rcnt: ds 1 ;repetition count roco: ds 1 ;rotating bit counter roby: ds 1 ;rotating byte cksm: ds 2 ;checksum code: ds 4*csiz ;decoding table end - :UNSQ86.LIB ; ------------------------------------------------------------- ; UNSQ.LIB is a library file to be used with Digital Research's ; ED.CMD to insert the set of subroutines which it contains ; into another program. The purpose of these subroutines is to ; read text from a file which has been squeezed by a program ; such as SQ.COM (SIG/M 58.26). ; ; UNSQ.LIB is divided into several sections, which can be left ; as they are, or integrated into the host program, according ; to your sense of orderliness. There are two subroutines that ; will be most likely to be called upon - unii and onsq. The ; first should be called for initialization, because it reads ; in the code directory and sets it up. It also separates out ; the file's checksum and the unsqueezed file name. The second ; is to be called each time that a character is to be taken out ; of the squeezed file. ; ; UNSQ.LIB assumes that the host program contains the following ; three subroutines: ; ; ibyt - which will deliver one character from an ; external input stream each time it is called. ; the external stream can be taken from memory, ; from a buffer that is periodically replenished ; from disk, or whatever. ; ; ufil - which will dispose of the characters forming ; the unsqueezed file name one by one. It might ; ignore them, store them in some file control ; block, or make some other use of them. ; ; ferm - which will type a fatal error message on the ; console and then return to CP/M - or take any ; other action which is desired. ; ; UNSQ.LIB Copyright (C) 1984 ; Universidad Autonoma de Puebla ; July 14, 1984 ; ; [Harold V. McIntosh, 14 July 1984] ; ------------------------------------------------------------- ; ------------------------------------------------------------- ; Section 1. Equivalences defining constants. ; ; CR, LF are used in the messages, and are ; traditionally defined in each program. ; ; csiz is the expected maximum number of ; characters which have been coded; 256 ; is appropriate if all possible bytes may ; have been encoded. ; ; = but some other pair ; might sometimes be used to mark a squeezed ; file CR equ 0DH LF equ 0AH csiz equ 256 iden equ 0FF76H ; ----------------------------------------------------------- ; Section 2. Initialization. ; ; will initialize two necessary variables ; (rcnt, roco), then start reading the squeezed file. ; This file must previously have been opened, and its ; suitability checked - eg that it has extension ?Q?. ; It may even have already been loaded into memory. ; All interaction with it is through , and ; the only requirement is that ibyt start reading the ; file from the beginning. ; ; unii goes through the following sequence: ; ; check squeezed marker ; read and store checksum ; read, make available name for original file ; check the length of the code dictionary ; load and store the code dictionary ; ; unii will offer a fatal error message if ; ; the marker is not present ; space will not accomodate the code table ; ; unii requires two host subroutines: ; ; ibyt returns the next byte from the source ; ufil disposes of the original file name ; ; No assumptions should be made concerning conservation ; of the 8080 registers by unii. ; ; unii will alter the following memory registers: ; ; rcnt - repeated character - set to 0 ; roco - count bits/byte - set to 1 ; cksm - checksum - set to file's checksum ; dode - code table - loaded with dictionary ; Initializations. unii: mov rcnt,0 ;repetition count mov roco,1 ;bit rotation counter ; Set up code table. Squeezed files seem to begin with the ; hexadecimal word (FF76) stored in Intel byte order, which ; would not be likely to start an unsqueezed file. cota: call iwor ;fetch one byte from input stream cmp bx,iden jz rchk ;mssg: 'not squeezed file' mov dx,(offset nsqz) ;'not a squeezed file' jmp ferm ;fatal error message ; The "squeezed" marker is followed by a two-byte checksum, ; which is the simple sum of all the one-byte characters in ; the source file, carried as a two byte sum modulo 2**16. rchk: call iwor ;fetch two bytes from input stream mov cksm,bx ;checksum ; Unsqueezed file name. It is an ASCII sequence, may be lower ; case if SQ.COM received it in response to a prompt, ending ; with a zero byte. Some trash may be present if SQ.COM wasn't ; used correctly. If the file name is to be used for something, ; such as defining an output file or checking the file type, ; this is the place to insert the appropriate code. luup: call ibyt ;fetch one byte from input stream push ax call ufil ;process unsqueezed filename pop ax or al,al jnz luup ; Load code dictionary. It is preceded by its two-byte length, ; and consists of a series of pairs of two-byte addresses. For ; each bit in the code, select the first element (0) or the ; second (1) element of the pair. If the pair is positive, it ; is the table entry (code + 4*index) at which to continue with ; the next bit. If the pair is negative, it is the complement ; of the coded ASCII character (low order byte except for [end]). ldic: call iwor ;fetch two bytes from input stream cmp bx,(offset csiz) jc ldii jz ldii mov dx,(offset ntab) ;'insufficient dictionary' jmp ferm ;fatal error message ldii: add bx,bx add bx,bx mov cx,bx mov si,(offset code) ;decoding table ldij: push cx push si call ibyt ;fetch one byte from input stream mov [si],al pop si pop cx inc si loop ldij ret ; ---------------------------------------------------------- ; Section 3. Read next unsqueezed byte. ; ; will withdraw a sufficient ; number of bits from the source file to ; generate one byte, returning it in the ; accumulator. The end of the source file ; will be signified by c = 1; otherwise ; c = 0 will prevail. Reading of the source ; file may terminate before this condition ; is reached, for example if the source ; was an ASCII file terminating with ^Z, ; or a HEX file terminated by a final line. ; ; The preservation of registers cannot be ; guaranteed; surround with ; pushes and pops is such continuity is ; required. ; ; onby will solicit the fatal error message ; subroutine if the [eof] marker signified ; by c = 1 has been reached without the ; checksum having balanced. This protection ; can be secured for files that were not ; read in their entirity by reading out the ; remaining bytes in a dummy loop. ; ; onby alters the following memory registers: ; ; lach - last character deposited ; rcnt - repetition count ; roco - rotating bit counter ; roby - rotating byte ; cksm - checksum ; Type unsqueezed code. Beware of the [end] marker, ; and also the repeat marker 90H which occurs in the ; combination <90H>. When count is zero, ; 90H itself is intended; otherwise is to be ; repeated times, including the occurrence just ; before 90H was seen. onsq: mov al,rcnt ;repetition count or al,al jnz onsr call dnch ;decode next character jc vchk ;verify the checksum cmp al,090H ;repeat last character jnz onsu ;normal character call dnch ;decode next character or al,al jz onss dec al onsr: dec al mov rcnt,al ;repetition count mov al,lach jmp achk onss: mov al,090H jmp achk onsu: mov lach,al ;last character typed ; jmp achk ; Accumulate checksum. achk: mov ah,0 sub cksm,ax stc cmc achr: ret ; Verify the checksum. vchk: cmp cksm,0000 ;checksum stc jz achr mov dx,(offset chno) ;'Checksum failure.' jmp ferm ;fatal error message ; Decode next character. dnch: mov bx,(offset code) ;decoding table dncr: call ibit ;read next bit jnc dncs ;skip for 1, stay for 0 inc bx inc bx dncs: mov ax,[bx] ;get next offset cmp ax,0FEFFH ;FEFF means [end] jz dnct or ax,ax jns dncu ;p means new offset not al ;m means complemented char stc cmc ret dnct: stc ;flag [end] with carry bit ret ; Calculate +4*. dncu: mov bx,(offset code) ;decoding table add ax,ax add ax,ax add bx,ax jmp dncr ; Read one bit at a time. ibit: dec roco ;bit rotation counter jnz ibiu mov roco,8 call ibyt ;fetch one byte from input stream mov roby,al ;rotating byte ibiu: rcr roby,1 ;rotating byte ret ; Read one word. iwor: call ibyt ;fetch one byte from input stream mov bl,al push bx call ibyt ;fetch one byte from input stream pop bx mov bh,al ret ; ----------------------------------------------------------- ; Section 4. Host subroutines. ; ; The following subroutines must be provided ; by the host program: ; ; ibyt - deliver the next byte from ; the source program, in register A, ; on each call. Need not preserve any ; 8080 registers. ; ; ufil - must absorb one byte from ; register A on each call. These bytes ; will be the information that is ; nominally the name of the source ; file, SOURCE.EXT, including the dot. ; ufil should be prepared for possible ; variants, however. ufil will be given ; final zero byte to signal the end of ; the bytestream it will receive. Of ; course, ufil could be imbedded in ; place of a call to it. ; ; ferm - send a fatal error message to ; the console. Normally control would ; return to CP/M, but some other action ; may be taken. ibyt: ret ;fetch byte from the input stream ufil: ret ;process unsqueezed filename ferm: ret ;fatal error message ; ------------------------------------------------------------- ; Section 5. Messages and memory locations used by the program. ntab db CR,LF,'Insufficient dictionary space.$' nsqz db CR,LF,'Not a squeezed file.$' chno db CR,LF,'Checksum failure.$' lach rb 1 ;last character typed rcnt rb 1 ;repetition count roco rb 1 ;rotating bit counter roby rb 1 ;rotating byte cksm rw 1 ;checksum code rb 4*csiz ;decoding table end - :Bibliography. "Dr. Electric" RCPM and RPC Systems: An Overview Microsystems, v.5, #6 (June 1984) (ISSN #0199-7955) Donald E. Knuth The Art of Computer Programming Volume 1 / Fundamental Algorithms Addison-Wesley Publishing Company Reading Massachussetts - 1973 (2 ed) ISBN 0-201-03809-9 Donald E. Knuth The Art of Computer Programming Volume 3 / Sorting and Searching Addison Wesley Publishing Company Reading Massachussetts - 1973 - J.P. Tremblay and P.G. Sorenson An Introduction to Data Structures with Applications McGraw-Hill Book Company New York New York - 1976 - :[end] [Harold V. McIntosh, 24 July 1984]