PROGRAMMING FOR PERFORMANCE PART 2
by Lee A. Hart
The Computer Journal, Issue 40
Reproduced with permission of author and publisher

Know The Hardware

Truly efficient software has an intimate, almost incestuous relationship with its hardware. They merge so thoroughly as to become inseparable; neither makes any sense without the other. This requires that you, the programmer, TOTALLY understand the hardware. I cannot stress this point too strongly.

The strengths and weaknesses of the hardware influence program structure at every level, not just the low-level drivers. A system with weak disk I/O will be slow and unresponsive if your program relies on overlays. A shallow keyboard buffer requires frequent checks to avoid missing keys. The characteristics of the console device determine your whole approach to data displays. If you try to hide from these limitations in a high-level language, your program will work as if it were written in BASIC 101.

Let's consider some actual case histories of what can be gained by paying attention to the hardware.

CASE #1

A customer needed a faster way to transfer data between two computers. He had been using a serial port at 9600 baud but complained that it was too slow and tied up the computer's serial port. Hardware mods were ruled out.

After study, I found that each computer had unused handshake lines in its RS-232 port. A special "Y" cable was built to cross-connect two of these lines, providing one bit of serial I/O in each direction. A "software UART" program was then written to transfer data between the two machines. This worked to about 30K bits per second before timing dither (due to interrupts, memory refresh, etc.) caused errors.

The serial port's UART could be programmed to generate an interrupt when the handshake line went low. Therefore, an interrupt-driven protocol with handshaking was devised. A '0' was sent by pulling the output low until the other computer echoed the low on its output.
A '1' was sent by pulsing the output low and immediately back high, then waiting until the other system echoed it. The data rate increased to over 100K bits per second, and transfers were now unaffected by disk I/O, keyboard activity, etc.

CASE #2

The firmware for a CRT terminal was to be upgraded to run at 38400 bits per second without handshaking. Now, 38400 bps is fast: only 260 microseconds per character (about 75 instructions for a 3 MHz Z80).

The slowest routines need the most attention. For example, clear-line was accomplished by moving the stack pointer to the end of the line and executing 36 PUSH HL instructions. The interrupt handler needed a 4-level stack, so the last 8 bytes were cleared normally. Clear-screen used 25 iterations of clear-line.

This still isn't fast enough to complete every ESC sequence before the next one is received. This calls for an interrupt-driven system. Each character received generates an interrupt. The interrupt handler pushes the character into one end of a FIFO (First-In-First-Out) buffer in memory. The main program pops characters out the other end and processes them. The FIFO fills while we process slow commands like clear-screen and empties back out during fast commands.

But what if some idiot sends a long string of slow commands (like 100 clear-screens in a row)? The FIFO would eventually overflow, and data would be lost. I prevented this with "look-ahead" logic. When the interrupt handler spots a clear-screen command, it sets a flag so MAIN expects it. MAIN can then ignore unnecessary commands (no sense altering a screen that's about to be cleared).

Scrolling is one of the most difficult actions. The obvious algorithm is to block-move lines 2-24 up 1, then clear line 24. But that's what IBM did on the PC, and we all know how well that worked. So examine the 6845 CRT controller. The Start-Address register holds the address of the first character on the screen, the one displayed in the top left corner.
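In modern terms, the look-ahead logic is just a counter shared between the interrupt handler and MAIN. Here is a minimal Python sketch of the idea; the command code, buffer depth, and screen model are invented for illustration and are not taken from the terminal's actual firmware:

```python
from collections import deque

CLEAR_SCREEN = 0x1A      # hypothetical control code for this sketch
FIFO_SIZE = 256          # hypothetical buffer depth

fifo = deque()
clears_pending = 0       # bumped by the "interrupt" side, consumed by MAIN
screen = []              # screen modeled crudely as a list of printed characters

def rx_interrupt(byte):
    """Interrupt handler: buffer the byte, and look ahead for clear-screen."""
    global clears_pending
    if byte == CLEAR_SCREEN:
        clears_pending += 1          # tell MAIN a clear is on the way
    if len(fifo) < FIFO_SIZE:
        fifo.append(byte)            # a full FIFO would mean lost data

def main_step():
    """Main program: pop one buffered byte and process it."""
    global clears_pending
    byte = fifo.popleft()
    if byte == CLEAR_SCREEN:
        clears_pending -= 1
        screen.clear()               # the slow command itself
    elif clears_pending:
        pass                         # screen is about to be cleared; skip the work
    else:
        screen.append(chr(byte))
```

Feeding it "AB", a clear-screen, then "CD" shows the payoff: the slow work for 'A' and 'B' is skipped entirely, because the interrupt side has already seen the clear that would wipe them out.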
If we add 80 to it, line 2 instantly becomes the top line, and we've scrolled the whole screen up a line. All that remains is to clear the 80 bytes that form the new 24th line, for which we have a fast routine.

Each scroll moves the start address up another 80 bytes. This obviously can't go on indefinitely, so the original program spent a great deal of time checking for overflow outside its 2K block of screen RAM (F800-FFFF). For instance, the old code read:

    ld (hl),a    ; put character on screen
    inc hl       ; advance to next
    ld a,h       ; get new address
    or 0F8h      ; if overflow to 0000,
    ld h,a       ; force it to F800-FFFF

But is this really necessary? The schematic revealed that the 2K RAM was partially decoded and actually occupied 16K in the Z80's address space (C000-FFFF). It's far easier to ensure that an address lies within this range:

    ld (hl),a    ; put character on screen
    res 6,h      ; ensure we don't wrap to 0000
    inc hl       ; advance to next

CASE #3: Fast Disk I/O

Way back in 8 B.C. (eight years Before Clones) I had an S-100 system. Its 8080 CPU blazed along at 1.843 MHz, through 32K of RAM spread over half a dozen furnace boards. Two Shugart SA-801R single-sided 8" drives provided disk storage, with CP/M 1.4 tying it all together. That old war horse and I fought many battles together, until it finally died the Death-of-1000-Intermittents. Many of its "features" I'd rather forget, but it had one outstanding attribute: the fastest floppies I've ever seen.

Warm boots were done before your fingers were off the keys; Wordstar loaded in under a second; PIP copied files at 10K bytes/sec. All without a fast CPU, DMA, vectored interrupts, or even a disk controller IC. The "controller" was just a bunch of TTL chips implementing a parallel port, an 8-bit shift register, and a CRC checkcode generator. The real work was done by the CPU, byte-banging out the IBM 3740 SD/DD format in software.

How good was it? An 8" disk spins at 360 rpm, or 6 revs/sec.
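The start-address trick is easy to model. The following Python sketch assumes the 80x24 screen and 2K screen RAM described above, with a modulo wrap standing in for the hardware's mirrored addressing; it illustrates the technique rather than reproducing the original firmware:

```python
COLS, ROWS = 80, 24
RAM = 2048                      # 2K of screen RAM; addresses wrap modulo this

vram = [' '] * RAM              # screen memory
start = 0                       # the 6845 Start-Address register (offset into vram)

def cell(row, col):
    """Address of a character cell, relative to the current start address."""
    return (start + row * COLS + col) % RAM

def scroll_up():
    """Scroll by bumping the start address, then clear only the new bottom line."""
    global start
    start = (start + COLS) % RAM            # line 2 instantly becomes the top line
    for col in range(COLS):                 # clear the 80 bytes of the new 24th line
        vram[cell(ROWS - 1, col)] = ' '
```

Note that `scroll_up` touches only 80 cells no matter how big the screen is; the block-move approach would touch all 1,920. The modulo wrap plays the role of the `res 6,h` masking in the real hardware.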
Each track held 6.5K (26 double-density sectors of 256 bytes each). That makes the theoretical maximum transfer rate 6.5K x 6 = 39K bytes/sec. It actually achieved 50% of this, or 20K bytes/sec. Few modern micros come anywhere near this level of performance. The Kaypro I wrote this article on creeps through the disk at 4K/sec. My PC clone is closer, at 12K/sec.

The problem is that the CPU spends most of its time in wait loops: waiting for the drive motor to start, for the head to load, for an index hole, for a certain sector to come around on the disk. The capabilities of fast CPUs, elaborate interrupt systems, DMA, and fancy disk controllers are thrown away by crude software. The CPU has better things to do. If the disk isn't ready when an application program needs it, the BIOS should start the task, save the data in a buffer, and set up an interrupt to finish the task later when the disk is REALLY ready. The time lost to wait loops is thus reclaimed to run your application programs.

That's how my antique worked. The BIOS maintained a track buffer in RAM. The first read from a particular track moved the head to the desired track and read the whole thing into the buffer. Further reads from that track simply came from RAM, taking virtually no time at all. Similarly, writes to a sector on the current track just put data in the buffer and marked it as changed. The actual write was performed later, when a new track was selected for read/write, or just before the drive timed out from a lack of disk activity.

Physical track reads/writes were fast as well. The key was to simply begin wherever the head was. After seeking to the desired track, it read the ID# of each sector encountered and transferred it to/from the appropriate place in the RAM buffer. No need to find the index hole, wait for a particular sector ID#, or worry about interleave; one revolution got it all.

Such a system must be implemented carefully.
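In modern terms, the track buffer is a one-entry write-back cache keyed on the track number. Here is a hedged Python sketch of the scheme; the disk interface (`read_track`/`write_track`) is invented for illustration, not the actual BIOS entry points:

```python
class TrackCache:
    """One-track write-back cache, as in the S-100 BIOS described above."""

    def __init__(self, disk, sectors, size):
        self.disk = disk              # object with read_track/write_track methods
        self.sectors = sectors        # sectors per track
        self.size = size              # bytes per sector
        self.track = None             # which track the buffer currently holds
        self.buf = None
        self.dirty = False            # any sector changed since the last flush?

    def _load(self, track):
        if track == self.track:
            return                    # already buffered: no disk access at all
        self.flush()                  # write back changes before moving the head
        self.buf = self.disk.read_track(track)   # one revolution gets it all
        self.track = track

    def read_sector(self, track, sector):
        self._load(track)
        return self.buf[sector * self.size:(sector + 1) * self.size]

    def write_sector(self, track, sector, data):
        self._load(track)             # writes go to the buffer, not the disk
        self.buf[sector * self.size:(sector + 1) * self.size] = data
        self.dirty = True

    def flush(self):
        """Called on a track change, or just before the drive times out."""
        if self.dirty:
            self.disk.write_track(self.track, self.buf)
            self.dirty = False
```

The caveats in the text follow directly from this structure: an error inside `flush()` surfaces during some *later* read, and unwritten data sits in `buf` until the flush happens.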
CP/M does not expect delayed error messages, which can produce some odd results. For instance, a BDOS read error might be reported when the real cause was a write error in flushing the previous track buffer to disk. Also, modern drives do not have door locks to prevent disk removal while unwritten data remains in the track buffer.

The main factor limiting my S-100 system's performance was the slow CPU and lack of DMA. A double-density 8" disk has a peak data transfer rate of 500K bits/sec, which only allows 16 microseconds between bytes. This required polled I/O, where the CPU was 100% devoted to the disk during actual reads/writes.

5-1/4" disks have a slower maximum transfer rate, but modern hardware has advantages that can make up for it. A normal 5-1/4" disk spins at 300 rpm, or 5 revs/sec. Assuming 9 sectors of 512 bytes per track, the maximum transfer rate is 22.5K bytes/sec. The peak data rate is 250K bits/sec, or 32 microseconds per byte. This is slow enough for a 4 MHz Z80 to (barely) handle it on an interrupt basis. Here's an interrupt handler to read 256 bytes from a disk controller chip at 32 microseconds max. per byte:

    T-states
      23                         ; time to finish longest instruction
      13                         ; Z80 interrupt mode 0 or 1 response
      11    int:  push af        ; save registers used
      11          in a,(data)    ; read data byte from disk controller
      13    next: ld (buffer),a  ; store it in buffer (a variable)
      13          ld a,(next+1)  ; get buffer address
       4          inc a          ; increment
      13          ld (next+1),a  ; save for next time
       7          jr z,done      ; if end of page, done
      10          pop af         ; else restore registers
      10          ret            ; and return
    ----
     128 T-states max = 32 microseconds with a 4 MHz Z80

But this routine barely squeaks by. It can't use interrupt mode 2 (which adds 6 T-states to the response time) or signal Z80 peripherals that the interrupt is complete with an RETI (which adds 4 T-states). It's limited to a 256-byte sector. Worse, some disk controller chips need processing time of their own.
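The timing budget can be sanity-checked with a little arithmetic: at 4 MHz each T-state takes 0.25 microseconds, so a 32-microsecond byte interval allows exactly 128 T-states. A quick Python check of the figures in the listing above:

```python
CLOCK_HZ = 4_000_000                 # 4 MHz Z80: one T-state = 0.25 microseconds

# T-state counts from the listing: interrupt latency plus each instruction
t_states = [
    23,   # time to finish longest instruction
    13,   # interrupt mode 0 or 1 response
    11,   # push af
    11,   # in a,(data)
    13,   # ld (buffer),a
    13,   # ld a,(next+1)
    4,    # inc a
    13,   # ld (next+1),a
    7,    # jr, not taken on the common path
    10,   # pop af
    10,   # ret
]

total = sum(t_states)
microseconds = total * 1_000_000 / CLOCK_HZ
print(total, microseconds)           # 128 T-states: exactly the 32 us budget
```

There is literally zero slack, which is why the text says the routine "barely squeaks by": interrupt mode 2 or an RETI would push the total past 128.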
The popular Western Digital FD179x series only allows 27.5 microseconds for each byte. So we have to get clever again. The following example reads pairs of bytes, the first on an interrupt and the second by polled I/O. This improves performance enough to allow interrupt mode 2, larger sector sizes, and the slow response time of a FD179x chip:

    T-states
      23                         ; time to finish longest instruction
      19                         ; Z80 interrupt mode 2 response time
      11    int:  push af        ; save A and flags
      11          in a,(data)    ; read 1st byte from disk controller
      11          push hl        ; save HL
      10    next: ld hl,buffer   ; get buffer address (a variable)
       7          ld (hl),a      ; store byte in buffer
       6          inc hl         ; advance buffer pointer
       6          inc hl         ; for next interrupt
      16          ld (next+1),hl ; & store it
       6          dec hl         ; point to current address
   11+11   check: in a,(status)  ; check disk controller status
    4+ 4          rra            ; if not busy (bit 0=0),
    7+ 7          jr nc,done     ; then we're done
    4+ 4          rra            ; if next byte not ready (bit 1=0),
   12+ 7          jr nc,check    ; then loop until it is
      11          in a,(data)    ; get 2nd byte from disk controller
       7          ld (hl),a      ; & store it in buffer
      10          pop hl         ; restore registers
      10          pop af
      14          reti           ; return
    ----
    188 or 226 T-states (for 1 or 2 passes through status loop)

This routine reads bytes from the controller chip within 17.75 microseconds worst-case. Interrupt overhead averages 80% for a 4 MHz Z80, leaving 20% for main program execution. The peculiar way of incrementing the address pointer minimizes the worst-case delay from an interrupt or status flag change until the byte is read. We want to maximize the chance that the second character is ready the first time the status is checked.

Why improve your disk system? Because, as a practical matter, there's more to be gained by improving it than by any other change you could make. It's disk I/O that sets the pace, not CPU speed or memory size. Users almost never wait on CPU speed; it's the disk that keeps you twiddling your thumbs with the keyboard ignored, the screen frozen, and the disk drive emitting Bronx cheers.
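The control flow of the paired read (first byte on the interrupt, second by polling) can be sketched in Python. The status-bit meanings follow the FD179x convention used above (bit 0 = busy, bit 1 = data request); the controller object is an invented stand-in for the real I/O ports:

```python
def read_pair(controller, buf):
    """One 'interrupt': grab a byte immediately, then poll for its partner.

    Returns False when the controller reports the transfer complete.
    """
    buf.append(controller.data())        # 1st byte: taken on the interrupt itself
    while True:
        status = controller.status()
        if not status & 0x01:            # bit 0 clear: controller no longer busy
            return False                 # transfer complete
        if status & 0x02:                # bit 1 set: next byte is ready
            buf.append(controller.data())  # 2nd byte: by polled I/O
            return True
```

Pairing the bytes this way halves the number of interrupts taken, which is where the budget for mode 2 response time and the RETI comes from.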
Put a Commodore 64's tinkertoy disk system on an AT clone, and you'd have high-priced junk that only a masochist would use. Conversely, the AT's DMA-based disk I/O would transform a C64 into a fire-breathing dragon that would eat its competition alive.

Algorithms

When a hardware engineer sits down to design a circuit, he doesn't begin with a blank sheet of paper. He has a vast library of textbooks, data sheets, and catalogs of standard circuits to choose from. Most of the task is simply connecting off-the-shelf components into one of these standard configurations, modifying them as necessary to satisfy any unique requirements.

Algorithms are to programmers what IC chips are to hardware designers. Just as the engineer builds a library of standard parts and circuits, every programmer must continually build his own algorithm collection. Whether it's a shoebox full of magazine clippings or a carefully indexed series of notebooks, start NOW.

Programming textbooks tend to concentrate on traditional computer algorithms for floating-point math, transcendental functions, and sorting routines. The old standby is Knuth's "The Art of Computer Programming". Hamming's "Numerical Methods for Scientists and Engineers" explains the basics of iterative calculations. "Digital Computation and Numerical Methods" by Southworth and DeLeeuw provides detailed flowcharts and sample code as well.

Magazines are a great source and tend to be more down-to-earth and closer to the state of the art. Read carefully! Good algorithms may be expressed in BASIC listings, assembly code for some obscure processor, pocket calculator key sequences, or even disguised as circuit diagrams. Professional journals like EDN or Computer Design are often better than the popular magazines, which have pretty much abandoned education in favor of marketing. Especially check out back issues. The cruder the hardware, the trickier the algorithms had to be to make up for it.

Manufacturers' technical literature is a gold mine.
Get the manufacturers' own manuals, not some boiled-down paperback from the bookstore. They won't be models of clarity but are full of hidden gold. Read everything: hardware and software manuals, data sheets, application notes, etc.

User groups are the traditional source of solutions to specific problems. Even better, they provide actual implementations in printed listings, on disk, or even by modem. Don't waste time reinventing the wheel. Learn from others what works, and what doesn't. Some of the best (and worst) algorithms I know were found by disassembling existing programs. And once you find a good algorithm, recycle it. That clever sort routine for an antique 8008 may be the foundation of the fastest '386 sort yet!

Conclusion

These techniques are not new; in fact, old-timers will recognize many of them from the early days of computing, when hardware limitations were more severe. However, they have fallen into disuse. A whole generation of programmers has been taught that such techniques have no place in modern structured programming. The theory goes something like this: Programs written in a high-level language are faster and easier to write, debug, document, and maintain. Memory and speed are viewed as infinite resources, so the performance loss is unimportant. Programs should be totally generic; it is the compiler's or run-time library's job to worry about the hardware interface.

These rules make sense in a mainframe environment, where the hardware resources are truly awesome and teams of programmers spend years working on one application. But they impose severe penalties on a microcomputer system. The user must pay for the programmer's luxuries with higher hardware cost and lackluster performance.

It's easy to forget that "microcomputer" literally means "one millionth of a computer". Microprocessors make abysmally bad CPUs. Build a computer with one, and you'll wind up needing $5000 worth of memory and peripherals to support a $5 CPU chip.
But micros make superlative controllers. That's what they were designed for, and what they do best. A single microcomputer can replace dozens of boards and hundreds of ICs with as little as a single chip. That's why 90% of all microprocessors go into non-computer uses: calculators, auto emission controls, home entertainment equipment, industrial controls, and the like. Of 30 million Z80s sold last year, fewer than 1 million went into computers.

Programming a controller is different from programming a computer. Most applications demand real-time multi-tasking capabilities, and there is never enough speed or memory. Inputs and outputs are physical hardware devices, not abstract data structures, so the code must inevitably be hardware-dependent. Computer languages are just not cut out for this sort of thing. The question is not, "How do I write a computer program to handle this data?" Instead, you should ask yourself, "How must I manipulate this hardware to do the job?" The techniques in this article may be out of place in the bureaucracy of a large computer but are right at home in the wild-west world of a microcomputer.

Lest you think this has nothing to do with a "real" computer like your PC clone, consider this. Instead of a '286 with 1 meg of memory, suppose it contained ten Z80 controller boards, each with 64K of memory and a fast network to tie them together. Each Z80 handles a different device: keyboard, screen, printer, modem, and one for each disk. The rest are free to run application programs, several at a time!

Suppose you're doing word processing on this system. The keyboard Z80 does spelling correction on data entry. The screen Z80 displays your text in bit-mapped fonts to match those of the printer's Z80, which is simultaneously printing a file. The Z80 running the word processor itself suffers no annoying pauses or hesitations, since disk I/O is handled instantaneously via each drive's Z80 track buffer.
Meanwhile, the modem's Z80 is downloading a file while another assembles a program. Pop-up utilities are ready and waiting in still other Z80s in case they're needed. Such a system would clearly have half the hardware cost of a PC, yet would outperform it by a wide margin. True multi-tasking becomes child's play with multiple processors. More processors can be readily added for even higher performance, or removed to save cost (or to continue operation while waiting for a replacement).

If the computer scientists really want to further the micro-revolution, they should stop trying to force antiquated mainframe languages onto micros and concentrate on developing tools to maximize the use of micros as they are!

[This article was originally published in issue 40 of The Computer Journal, P.O. Box 12, South Plainfield, NJ 07080-0012 and is reproduced with the permission of the author and the publisher. Further reproduction for non-commercial purposes is authorized. This copyright notice must be retained. (c) Copyright 1989, 1991 Socrates Press and respective authors]