PROGRAMMING FOR PERFORMANCE PART 2
by Lee A. Hart
The Computer Journal, Issue 40
Reproduced with permission of author and publisher

Know The Hardware

Truly efficient software has an intimate, almost incestuous relationship with its hardware. They merge so thoroughly as to become inseparable; neither makes any sense without the other. This requires that you, the programmer, TOTALLY understand the hardware. I cannot stress this point too strongly.

The strengths and weaknesses of the hardware influence program structure at every level, not just the low-level drivers. A system with weak disk I/O will be slow and unresponsive if your program relies on overlays. A shallow keyboard buffer requires frequent checks to avoid missing keys. The characteristics of the console device determine your whole approach to data displays. If you try to hide from these limitations in a high-level language, your program will work as if it were written in BASIC 101.

Let's consider some actual case histories of what can be gained by paying attention to the hardware.

CASE #1

A customer needed a faster way to transfer data between two computers. He had been using a serial port at 9600 baud but complained that it was too slow and tied up the computer's serial port. Hardware mods were ruled out.

After study, I found that each computer had unused handshake lines in its RS-232 port. A special "Y" cable was built to cross-connect two of these lines, providing one bit of serial I/O in each direction. A "software UART" program was then written to transfer data between the two machines. This worked to about 30K bits per second before timing dither (due to interrupts, memory refresh, etc.) caused errors.

The serial port's UART could be programmed to generate an interrupt when the handshake line went low. Therefore, an interrupt-driven protocol with handshaking was devised. A '0' was sent by pulling the output low until the other computer echoed the low on its output.
A '1' was sent by pulsing the output low and immediately back high, then waiting until the other system echoed it. The data rate increased to over 100K bits per second, and transfers were now unaffected by disk I/O, keyboard activity, etc.

CASE #2

The firmware for a CRT terminal was to be upgraded to run at 38400 bits per second without handshaking. Now, 38400 bps is fast: only 260 microseconds per character (about 75 instructions for a 3 MHz Z80).

The slowest routines need the most attention. For example, clear-line was accomplished by moving the stack pointer to the end of the line and executing 36 PUSH HL instructions. The interrupt handler needed a 4-level stack, so the last 8 bytes were cleared normally. Clear-screen used 25 iterations of clear-line.

This still isn't fast enough to complete every ESC sequence before the next one is received. This calls for an interrupt-driven system. Each character received generates an interrupt. The interrupt handler pushes the character into one end of a FIFO (First-In-First-Out) buffer in memory. The main program pops characters out the other end and processes them. The FIFO fills while we process slow commands like clear-screen and empties back out during fast commands.

But what if some idiot sends a long string of slow commands (like 100 clear-screens in a row)? The FIFO would eventually overflow, and data would be lost. I prevented this with "look-ahead" logic. When the interrupt handler spots a clear-screen command, it sets a flag so MAIN expects it. MAIN can then ignore unnecessary commands (no sense altering a screen that's about to be cleared).

Scrolling is one of the most difficult actions. The obvious algorithm is to block-move lines 2-24 up 1, then clear line 24. But that's what IBM did on the PC, and we all know how well that worked. So examine the 6845 CRT controller. The Start-Address register holds the address of the first character on the screen, the one displayed in the top left corner.
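In modern terms, the look-ahead logic is just a counter shared between the interrupt handler and MAIN. Here is a minimal Python sketch of the idea; the command code, buffer depth, and screen model are invented for illustration and are not taken from the terminal's actual firmware:

```python
from collections import deque

CLEAR_SCREEN = 0x1A      # hypothetical control code for this sketch
FIFO_SIZE = 256          # hypothetical buffer depth

fifo = deque()
clears_pending = 0       # bumped by the "interrupt" side, consumed by MAIN
screen = []              # screen modeled crudely as a list of printed characters

def rx_interrupt(byte):
    """Interrupt handler: buffer the byte, and look ahead for clear-screen."""
    global clears_pending
    if byte == CLEAR_SCREEN:
        clears_pending += 1          # tell MAIN a clear is on the way
    if len(fifo) < FIFO_SIZE:
        fifo.append(byte)            # a full FIFO would mean lost data

def main_step():
    """Main program: pop one buffered byte and process it."""
    global clears_pending
    byte = fifo.popleft()
    if byte == CLEAR_SCREEN:
        clears_pending -= 1
        screen.clear()               # the slow command itself
    elif clears_pending:
        pass                         # screen is about to be cleared; skip the work
    else:
        screen.append(chr(byte))
```

Feeding it "AB", a clear-screen, then "CD" shows the payoff: the slow work for 'A' and 'B' is skipped entirely, because the interrupt side has already seen the clear that would wipe them out.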
If we add 80 to it, line 2 instantly becomes the top line, and we've scrolled the whole screen up a line. All that remains is to clear the 80 bytes that form the new 24th line, for which we have a fast routine.

Each scroll moves the start address up another 80 bytes. This obviously can't go on indefinitely, so the original program spent a great deal of time checking for overflow outside its 2K block of screen RAM (F800-FFFF). For instance, the old code read:

    ld (hl),a    ; put character on screen
    inc hl       ; advance to next
    ld a,h       ; get new address
    or 0F8h      ; if overflow to 0000,
    ld h,a       ; force it to F800-FFFF

But is this really necessary? The schematic revealed that the 2K RAM was partially decoded and actually occupied 16K in the Z80's address space (C000-FFFF). It's far easier to ensure that an address lies within this range:

    ld (hl),a    ; put character on screen
    res 6,h      ; ensure we don't wrap to 0000
    inc hl       ; advance to next

CASE #3: Fast Disk I/O

Way back in 8 B.C. (eight years Before Clones) I had an S-100 system. Its 8080 CPU blazed along at 1.843 MHz, through 32K of RAM spread over half a dozen furnace boards. Two Shugart SA-801R single-sided 8" drives provided disk storage, with CP/M 1.4 tying it all together. That old war horse and I fought many battles together, until it finally died the Death-of-1000-Intermittents. Many of its "features" I'd rather forget, but it had one outstanding attribute: the fastest floppies I've ever seen.

Warm boots were done before your fingers were off the keys; Wordstar loaded in under a second; PIP copied files at 10K bytes/sec. All without a fast CPU, DMA, vectored interrupts, or even a disk controller IC. The "controller" was just a bunch of TTL chips implementing a parallel port, an 8-bit shift register, and a CRC checkcode generator. The real work was done by the CPU, byte-banging out the IBM 3740 SD/DD format in software.

How good was it? An 8" disk spins at 360 rpm, or 6 revs/sec.
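The start-address trick is easy to model. The following Python sketch assumes the 80x24 screen and 2K screen RAM described above, with a modulo wrap standing in for the hardware's mirrored addressing; it illustrates the technique rather than reproducing the original firmware:

```python
COLS, ROWS = 80, 24
RAM = 2048                      # 2K of screen RAM; addresses wrap modulo this

vram = [' '] * RAM              # screen memory
start = 0                       # the 6845 Start-Address register (offset into vram)

def cell(row, col):
    """Address of a character cell, relative to the current start address."""
    return (start + row * COLS + col) % RAM

def scroll_up():
    """Scroll by bumping the start address, then clear only the new bottom line."""
    global start
    start = (start + COLS) % RAM            # line 2 instantly becomes the top line
    for col in range(COLS):                 # clear the 80 bytes of the new 24th line
        vram[cell(ROWS - 1, col)] = ' '
```

Note that `scroll_up` touches only 80 cells no matter how big the screen is; the block-move approach would touch all 1,920. The modulo wrap plays the role of the `res 6,h` masking in the real hardware.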
Each track held 6.5K (26 double-density sectors of 256 bytes each). That makes the theoretical maximum transfer rate 6.5K x 6 = 39K bytes/sec. It actually achieved 50% of this, or 20K bytes/sec. Few modern micros come anywhere near this level of performance. The Kaypro I wrote this article on creeps through the disk at 4K/sec. My PC clone is closer, at 12K/sec.

The problem is that the CPU spends most of its time in wait loops: waiting for the drive motor to start, for the head to load, for an index hole, for a certain sector to come around on the disk. The capabilities of fast CPUs, elaborate interrupt systems, DMA, and fancy disk controllers are thrown away by crude software. The CPU has better things to do. If the disk isn't ready when an application program needs it, the BIOS should start the task, save the data in a buffer, and set up an interrupt to finish the task later when the disk is REALLY ready. The time lost to wait loops is thus reclaimed to run your application programs.

That's how my antique worked. The BIOS maintained a track buffer in RAM. The first read from a particular track moved the head to the desired track and read the whole thing into the buffer. Further reads from that track simply came from RAM, taking virtually no time at all. Similarly, writes to a sector on the current track just put data in the buffer and marked it as changed. The actual write was performed later, when a new track was selected for read/write, or just before the drive timed out from a lack of disk activity.

Physical track reads/writes were fast as well. The key was to simply begin wherever the head was. After seeking to the desired track, it read the ID# of each sector encountered and transferred it to/from the appropriate place in the RAM buffer. No need to find the index hole, wait for a particular sector ID#, or worry about interleave; one revolution got it all.

Such a system must be implemented carefully.
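In modern terms, the track buffer is a one-entry write-back cache keyed on the track number. Here is a hedged Python sketch of the scheme; the disk interface (`read_track`/`write_track`) is invented for illustration, not the actual BIOS entry points:

```python
class TrackCache:
    """One-track write-back cache, as in the S-100 BIOS described above."""

    def __init__(self, disk, sectors, size):
        self.disk = disk              # object with read_track/write_track methods
        self.sectors = sectors        # sectors per track
        self.size = size              # bytes per sector
        self.track = None             # which track the buffer currently holds
        self.buf = None
        self.dirty = False            # any sector changed since the last flush?

    def _load(self, track):
        if track == self.track:
            return                    # already buffered: no disk access at all
        self.flush()                  # write back changes before moving the head
        self.buf = self.disk.read_track(track)   # one revolution gets it all
        self.track = track

    def read_sector(self, track, sector):
        self._load(track)
        return self.buf[sector * self.size:(sector + 1) * self.size]

    def write_sector(self, track, sector, data):
        self._load(track)             # writes go to the buffer, not the disk
        self.buf[sector * self.size:(sector + 1) * self.size] = data
        self.dirty = True

    def flush(self):
        """Called on a track change, or just before the drive times out."""
        if self.dirty:
            self.disk.write_track(self.track, self.buf)
            self.dirty = False
```

The caveats in the text follow directly from this structure: an error inside `flush()` surfaces during some *later* read, and unwritten data sits in `buf` until the flush happens.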
CP/M does not expect delayed error messages, which can produce some odd results. For instance, a BDOS read error might be reported when the real cause was a write error in flushing the previous track buffer to disk. Also, modern drives do not have door locks to prevent disk removal while unwritten data remains in the track buffer.

The main factor limiting my S-100 system's performance was the slow CPU and lack of DMA. A double-density 8" disk has a peak data transfer rate of 500K bits/sec, which only allows 16 microseconds between bytes. This required polled I/O, where the CPU was 100% devoted to the disk during actual reads/writes.

5-1/4" disks have a slower maximum transfer rate, but modern hardware has advantages that can make up for it. A normal 5-1/4" disk spins at 300 rpm, or 5 revs/sec. Assuming 9 sectors of 512 bytes per track, the maximum transfer rate is 22.5K bytes/sec. The peak data rate is 250K bits/sec, or 32 microseconds per byte. This is slow enough for a 4 MHz Z80 to (barely) handle it on an interrupt basis. Here's an interrupt handler to read 256 bytes from a disk controller chip at 32 microseconds max. per byte:

    T-states
      23                         ; time to finish longest instruction
      13                         ; Z80 interrupt mode 0 or 1 response
      11    int:  push af        ; save registers used
      11          in a,(data)    ; read data byte from disk controller
      13    next: ld (buffer),a  ; store it in buffer (a variable)
      13          ld a,(next+1)  ; get buffer address
       4          inc a          ; increment
      13          ld (next+1),a  ; save for next time
       7          jr z,done      ; if end of page, done
      10          pop af         ; else restore registers
      10          ret            ; and return
    ----
     128 T-states max = 32 microseconds with a 4 MHz Z80

But this routine barely squeaks by. It can't use interrupt mode 2 (which adds 6 T-states to the response time) or signal Z80 peripherals that the interrupt is complete with an RETI (which adds 4 T-states). It's limited to a 256-byte sector. Worse, some disk controller chips need processing time of their own.
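The timing budget can be sanity-checked with a little arithmetic: at 4 MHz each T-state takes 0.25 microseconds, so a 32-microsecond byte interval allows exactly 128 T-states. A quick Python check of the figures in the listing above:

```python
CLOCK_HZ = 4_000_000                 # 4 MHz Z80: one T-state = 0.25 microseconds

# T-state counts from the listing: interrupt latency plus each instruction
t_states = [
    23,   # time to finish longest instruction
    13,   # interrupt mode 0 or 1 response
    11,   # push af
    11,   # in a,(data)
    13,   # ld (buffer),a
    13,   # ld a,(next+1)
    4,    # inc a
    13,   # ld (next+1),a
    7,    # jr, not taken on the common path
    10,   # pop af
    10,   # ret
]

total = sum(t_states)
microseconds = total * 1_000_000 / CLOCK_HZ
print(total, microseconds)           # 128 T-states: exactly the 32 us budget
```

There is literally zero slack, which is why the text says the routine "barely squeaks by": interrupt mode 2 or an RETI would push the total past 128.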
The popular Western Digital FD179x series only allows 27.5 microseconds for each byte. So we have to get clever again. The following example reads pairs of bytes, the first on an interrupt and the second by polled I/O. This improves performance enough to allow interrupt mode 2, larger sector sizes, and the slow response time of a FD179x chip:

    T-states
      23                         ; time to finish longest instruction
      19                         ; Z80 interrupt mode 2 response time
      11    int:  push af        ; save A and flags
      11          in a,(data)    ; read 1st byte from disk controller
      11          push hl        ; save HL
      10    next: ld hl,buffer   ; get buffer address (a variable)
       7          ld (hl),a      ; store byte in buffer
       6          inc hl         ; advance buffer pointer
       6          inc hl         ; for next interrupt
      16          ld (next+1),hl ; & store it
       6          dec hl         ; point to current address
   11+11   check: in a,(status)  ; check disk controller status
    4+ 4          rra            ; if not busy (bit 0=0),
    7+ 7          jr nc,done     ; then we're done
    4+ 4          rra            ; if next byte not ready (bit 1=0),
   12+ 7          jr nc,check    ; then loop until it is
      11          in a,(data)    ; get 2nd byte from disk controller
       7          ld (hl),a      ; & store it in buffer
      10          pop hl         ; restore registers
      10          pop af
      14          reti           ; return
    ----
    188 or 226 T-states (for 1 or 2 passes through status loop)

This routine reads bytes from the controller chip within 17.75 microseconds worst-case. Interrupt overhead averages 80% for a 4 MHz Z80, leaving 20% for main program execution. The peculiar way of incrementing the address pointer minimizes the worst-case delay from an interrupt or status flag change until the byte is read. We want to maximize the chance that the second character is ready the first time the status is checked.

Why improve your disk system? Because, as a practical matter, there's more to be gained by improving it than by any other change you could make. It's disk I/O that sets the pace, not CPU speed or memory size. Users almost never wait on CPU speed; it's the disk that keeps you twiddling your thumbs with the keyboard ignored, the screen frozen, and the disk drive emitting Bronx cheers.
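The control flow of the paired read (first byte on the interrupt, second by polling) can be sketched in Python. The status-bit meanings follow the FD179x convention used above (bit 0 = busy, bit 1 = data request); the controller object is an invented stand-in for the real I/O ports:

```python
def read_pair(controller, buf):
    """One 'interrupt': grab a byte immediately, then poll for its partner.

    Returns False when the controller reports the transfer complete.
    """
    buf.append(controller.data())        # 1st byte: taken on the interrupt itself
    while True:
        status = controller.status()
        if not status & 0x01:            # bit 0 clear: controller no longer busy
            return False                 # transfer complete
        if status & 0x02:                # bit 1 set: next byte is ready
            buf.append(controller.data())  # 2nd byte: by polled I/O
            return True
```

Pairing the bytes this way halves the number of interrupts taken, which is where the budget for mode 2 response time and the RETI comes from.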
Put a Commodore 64's tinkertoy disk system on an AT clone, and you'd have high-priced junk that only a masochist would use. Conversely, the AT's DMA-based disk I/O would transform a C64 into a fire-breathing dragon that would eat its competition alive.

Algorithms

When a hardware engineer sits down to design a circuit, he doesn't begin with a blank sheet of paper. He has a vast library of textbooks, data sheets, and catalogs of standard circuits to choose from. Most of the task is simply connecting off-the-shelf components into one of these standard configurations, modifying them as necessary to satisfy any unique requirements.

Algorithms are to programmers what IC chips are to hardware designers. Just as the engineer builds a library of standard parts and circuits, every programmer must continually build his own algorithm collection. Whether it's a shoebox full of magazine clippings or a carefully indexed series of notebooks, start NOW.

Programming textbooks tend to concentrate on traditional computer algorithms for floating-point math, transcendental functions, and sorting routines. The old standby is Knuth's "The Art of Computer Programming". Hamming's "Numerical Methods for Scientists and Engineers" explains the basics of iterative calculations. "Digital Computation and Numerical Methods" by Southworth and DeLeeuw provides detailed flowcharts and sample code as well.

Magazines are a great source and tend to be more down-to-earth and closer to the state of the art. Read carefully! Good algorithms may be expressed in BASIC listings, assembly code for some obscure processor, pocket calculator key sequences, or even disguised as circuit diagrams. Professional journals like EDN or Computer Design are often better than the popular magazines, which have pretty much abandoned education in favor of marketing. Especially check out back issues. The cruder the hardware, the trickier the algorithms had to be to make up for it.

Manufacturers' technical literature is a gold mine.
Get the manufacturers' own manuals, not some boiled-down paperback from the bookstore. They won't be models of clarity but are full of hidden gold. Read everything: hardware and software manuals, data sheets, application notes, etc.

User groups are the traditional source of solutions to specific problems. Even better, they provide actual implementations in printed listings, on disk, or even by modem. Don't waste time reinventing the wheel. Learn from others what works, and what doesn't. Some of the best (and worst) algorithms I know were found by disassembling existing programs. And once you find a good algorithm, recycle it. That clever sort routine for an antique 8008 may be the foundation of the fastest '386 sort yet!

Conclusion

These techniques are not new; in fact, old-timers will recognize many of them from the early days of computing, when hardware limitations were more severe. However, they have fallen into disuse. A whole generation of programmers has been taught that such techniques have no place in modern structured programming. The theory goes something like this: Programs written in a high-level language are faster and easier to write, debug, document, and maintain. Memory and speed are viewed as infinite resources, so the performance loss is unimportant. Programs should be totally generic; it is the compiler's or run-time library's job to worry about the hardware interface.

These rules make sense in a mainframe environment, where the hardware resources are truly awesome and teams of programmers spend years working on one application. But they impose severe penalties on a microcomputer system. The user must pay for the programmer's luxuries with higher hardware cost and lackluster performance.

It's easy to forget that "microcomputer" literally means "one millionth of a computer". Microprocessors make abysmally bad CPUs. Build a computer with one, and you'll wind up needing $5000 worth of memory and peripherals to support a $5 CPU chip.
But micros make superlative controllers. That's what they were designed for, and what they do best. A single microcomputer can replace dozens of boards and hundreds of ICs with as little as a single chip. That's why 90% of all microprocessors go into non-computer uses: calculators, auto emission controls, home entertainment equipment, industrial controls, and the like. Of 30 million Z80s sold last year, fewer than 1 million went into computers.

Programming a controller is different from programming a computer. Most applications demand real-time multi-tasking capabilities, and there is never enough speed or memory. Inputs and outputs are physical hardware devices, not abstract data structures, so the code must inevitably be hardware-dependent. Computer languages are just not cut out for this sort of thing. The question is not, "How do I write a computer program to handle this data?" Instead, you should ask yourself, "How must I manipulate this hardware to do the job?" The techniques in this article may be out of place in the bureaucracy of a large computer but are right at home in the wild-west world of a microcomputer.

Lest you think this has nothing to do with a "real" computer like your PC clone, consider this. Instead of a '286 with 1 meg of memory, suppose it contained ten Z80 controller boards, each with 64K of memory and a fast network to tie them together. Each Z80 handles a different device: keyboard, screen, printer, modem, and one for each disk. The rest are free to run application programs, several at a time!

Suppose you're doing word processing on this system. The keyboard Z80 does spelling correction on data entry. The screen Z80 displays your text in bit-mapped fonts to match those of the printer's Z80, which is simultaneously printing a file. The Z80 running the word processor itself suffers no annoying pauses or hesitations, since disk I/O is handled instantaneously via each drive's Z80 track buffer.
Meanwhile, the modem's Z80 is downloading a file while another assembles a program. Pop-up utilities are ready and waiting in still other Z80s in case they're needed. Such a system would clearly have half the hardware cost of a PC, yet would outperform it by a wide margin. True multi-tasking becomes child's play with multiple processors. More processors can be readily added for even higher performance, or removed to save cost (or to continue operation while waiting for a replacement).

If the computer scientists really want to further the micro-revolution, they should stop trying to force antiquated mainframe languages onto micros and concentrate on developing tools to maximize the use of micros as they are!

[This article was originally published in issue 40 of The Computer Journal, P.O. Box 12, South Plainfield, NJ 07080-0012 and is reproduced with the permission of the author and the publisher. Further reproduction for non-commercial purposes is authorized. This copyright notice must be retained. (c) Copyright 1989, 1991 Socrates Press and respective authors]