$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$
%$%                                                                     %$%
$%$                 Electronic Switching System Faults                  $%$
%$%                                                                     %$%
$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$

"Notes from No 2 ESS Administration and Maintenance Plan,"
"BSTJ Vol 48, 1969"

"Data Maintenance"
Memory mutilation results from hardware faults and program bugs.
During nonsynchronous operation, mismatch detection is not available, so there may be a long period of time during which mutilation occurs.
Mismatch detection is useless in finding data mutilation caused by program bugs.
Data maintenance is aided by ease of communication among programs, absence of linked lists, and per-call memory allocation (call processing program addressing is relative to the allocated memory, reducing the scope of data accesses).
Defensive programming techniques:
  Range check table indexes.
  Zero check derived transfer-to addresses.
  Distinct program and data errors prevent programs being read as data.
Audit programs detect bad data.
Audits run periodically or as requested from tty.
Separate audits for different memory blocks.
Audits correct by idling memory blocks containing bad data.
System recovery is initiated by a control unit switch during simplex operation. A control unit switch can be caused by bad data or by bugs that cause a sanity timeout.
System recovery functions:
  Make call store consistent with state of periphery.
  Clear memory associated with the program in control at time of recovery.
  Run audits.
  Repeat the above with widening scope of memory initialization until sanity is obtained.

"Notes from Design of Recovery Strategies for A Fault Tolerant No. 4 ESS"
"by R. J. Willet - BSTJ vol 61, no 10, 4-13-82"

"Objectives"
616,000 call attempts/hour
100,000 active terminations
Downtime less than 2 hours in 40 years
Not cost-effective (or possible) to remove all software errors - minimize the number of service-affecting errors and analyze data for cause.
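To put the No. 4 ESS objectives above in more familiar terms, the sketch below converts the downtime and call-attempt figures into an availability fraction and per-second rates. It is only illustrative arithmetic: the function name `downtime_budget` and the 365.25-day year are assumptions made for this example, not something taken from the BSTJ paper.

```python
# Assumption: an average year of 365.25 days (the notes do not say how
# the 40-year period is counted).
HOURS_PER_YEAR = 365.25 * 24


def downtime_budget(max_outage_hours, years):
    """Convert an 'N hours in M years' objective into rates (hypothetical helper)."""
    total_hours = years * HOURS_PER_YEAR
    unavailability = max_outage_hours / total_hours
    return {
        "unavailability": unavailability,
        "availability": 1.0 - unavailability,
        "seconds_per_year": max_outage_hours * 3600 / years,
    }


# Figures from the notes: 2 hours of downtime in 40 years,
# 616,000 call attempts per hour.
budget = downtime_budget(2, 40)
attempts_per_second = 616_000 / 3600

print(f"allowed downtime per year: {budget['seconds_per_year']:.0f} s")
print(f"availability: {budget['availability']:.7f}")
print(f"call attempts per second: {attempts_per_second:.1f}")
```

Under these assumptions the objective works out to 180 seconds (3 minutes) of downtime per year, and the call load to roughly 171 attempts per second.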
"Software Recovery" Reconstruct data from associated information - slow, disturbs few calls. Reinitialize memory structure - fast, disturbs many calls. "Audit Programs" Provide for integrity of system memory Structured into mutilation detection and correction modules Detection modules run continiously in background Detection modules augmented by defensive checks in operational programs Call correction modules to correct errors found by background audits or defensive checks. "System Integrity Programs" Provide for integrity of programs Monitor job scheduling and sequencing for frequency and execution times Use sanity timers Call audits or reinitialize system to correct errors. "Recovery from software problems" Software problems caused by program errors or bad data Out-of-range accesses trigger hardware interrupt, recovery requires correction of data, or killing of call and return of control to a safe point. Inhibit (pest) interrupts while audits are correcting problem, risky, but assumes single software fault. In cases where the out-of-range error can be isolated to a single unit can use frame level pesting, otherwise use system level pesting. Software recovery does not consider the possibility of a hardware fault. Recovery cannot fix a program bug. Running pested may allows the system to operate in a degraded fashion while maintenance personnel analyze data and correct program. The buffer overflow problem - may be caused by program error. Buffers protected by hardware overflow interrupts. Recovery runs the buffer unloader program to unload the buffer and audits the task dispenser program to ensure the unloader is scheduled properly. The overflow interrupt is pested. If problem continues, hardware is suspect. "No. 4 ESS: Maintenance Software" "by M. N. Meyers, W. A. Routt and K. W. Yoder," "BSTJ Vol. 56, No. 
7, September 1977" "Software Error Recovery" Since system operation is dependent on data in memories, and memories can be written, there is a possibility the memory will be in a state that precludes operation. System must be as error-free as possibile. Since system cannot be completely error-free, it must be error tolerant. "Classification of software errors" Errors in interfaces between software modules. Non-conformity to systems rules. KsO$H@!  fw Logic errors. Coding errors. Complex man-machine interfaces that lead to procedural errors. "Effects of software errors" Loss of viability Loss of call processing Loss of a facility Loss of a functions Loss of capacity Loss of single call No effect "Error Prevention" Standardization, Simplification, Improved Documentation, I wonder why improved testing isn't listed "Error Tolerance" Error tolerance acheived through defensive programming and defensive memory. Programs attempt to remain operational in presence of errors Restrict access to data to noncritical regions where errors have minor impact. (Using special instrutions to write to critical memory). Checking state codes Range Checks Positive Decisions Symbolic Addressing Interpreting all possible stimuli Link by index rather than absolute address Special purpose memory allocators "Error Handling" "Software Integrity Control" Receives reports of all software errors, decides and activates corrective action. Corrective action: request an audit, or if history indicates repition, request a phase of system initialization. Receives reports of overload - must decide if these are true, or are the results of software errors and take appropriate action. "Audits" Detect and correct data errors. Control structure and audit routines specific to particular data structures. Control structure schedules routine audits. Demand audits run when errors found or suspect. 
Error detection through: comparison of duplicate structures, association (correct structures linked together), and semantics (data is reasonably correct).

"Integrity monitor system"
Detects scheduling and cycling irregularities and loss of major system functions.
Time monitors detect basic sanity.
Base level monitor verifies validity of base cycle.
Test call program observes progress of test calls.

"Recovery"
Remedial actions - audits - correct specific errors.
If remedial actions fail, undertake system recovery.
System recovery reconfigures hardware and initializes memory.
System recovery is initiated upon detection of:
  Mutilation of program memory,
  Loss of vital system function,
  Loss of major facility,
  Escalation of remedial actions,
  Software integrity monitor problems,
  Sanity timeout,
  Mutilation of software clock,
  Duplex failure.
System recovery phases:
  Phase 1 - initialize specific areas of transient memory - does not kill calls.
  Phase 2 - reconfigure peripheral hardware and initialize all transient memory - kills only non-stable calls.
  Phase 3 - reconfigure processors, initialize all memory - kills only non-stable calls.
  Phase 4 - totally initialize system - kills all calls - manual activation only.
More detailed and specific manual recovery procedures can be initiated.

"Notes from DS5C 3.00.00.00 - Issue 2 No. 5 ESS Maintenance Plan, 40288-500"
Up to 127 interface module processors - each is duplicated.
No simplex failure causes service loss; duplex failures cause service loss for only a few terminations.
Central Processor (CP) is duplicated.
The time multiplexed switch contains a duplicated processor.
The Control Distribution Units are not duplicated, since a single one is sufficient for inter-processor communications.
The Input/Output Processor is duplicated.
CP maintenance is under control of CP programs.
CP programs have minimal involvement in peripheral unit maintenance.
IM is responsible for its own maintenance, with few exceptions (manual requests, etc.).
IM uses self-checking hardware and operational checks.
Inter-processor messages use a reliable communications protocol.
Test for errors during use; scheduled tests for equipment not frequently used or in standby mode.
Test error detecting capabilities by inducing errors.
Operational software can detect some hardware errors through in-line checks, and keep history to isolate the error.
Audits and defensive checks to reduce software errors.
Software reliability is obtained at the expense of real time performance.
Defensive checks:
  Range checks on pointers and indexes.
  Duplicate copies of data at two different locations to detect mutilation.
  Two-way linked lists.
  Maintain and check redundant status information.
  Positive decisions.
  Limited use of GOTOs.
  One entry point per module.
  Check incoming buffered data for mutilation while in buffer.
  Ack timeouts.
Audits check for resources in an invalid state and restore the resource to a valid state.
Audits try to detect erroneous states before system performance is affected.
Loss of all of a given class of resource may halt call processing.
Long loops in programs cause them to consume all of the real time and put the system in overload.
Auditing of data resident in more than one processor leads to a race condition.
If audits find one error in a data structure, then all associated data is assumed to be bad.
Processes are audited by timing; a timeout in a transient state indicates an error.
Scheduled audits have low priority.
Errors detected by audits are output on TTY.
Audit history is maintained in CP and can be queried.
Audits are stitched together during initialization.
Craft can request audits.
CP has audits for its data, IM has audits for its data, CP controls audits for data shared by IM and CP.
Test calls are generated and monitored to detect timing irregularities.
Operating system monitors scheduling to detect overload or fault.
Hardware sanity timers to detect basic processor sanity.
Overload is caused by a shortage of resources.
In overload, give preference to terminating traffic.
During growth intervals, audits must be gracefully updated to test new resources.

"Notes from DS5D 5.00.02.00 - Issue 1 Feature Control Development Specification Case 40288-100"
Feature control programs must explicitly record state during a real time break, even if the state can be derived implicitly by the program.
Restricted control of the process stack improves reliability.
Program design is represented by a finite state machine.

"Notes from 5ESS AUDITS Description and Procedures Manual, Issue 2, 1983"
Audits are 10% of 5ESS software.
Audits are powerful enough to disconnect stable calls and initialize the system.
5ESS software is divided into 5 areas: operating system, call processing, administration, database, and maintenance (the largest).
Audits are responsible for data in each area.
The maintenance area uses five system integrity features: Asserts (defensive checks), integrity monitors (sanity timers and other checks), initializations, and audits.
Application audits run under OSDS, which provides: process management, resource allocation, process synchronization and communication, inter-processor communication, IM interrupt and fault handling, timing services.
DMERT has audits that run in the System Integrity Monitor environment at the CP, not described here.
Audit control is in charge of audits in all environments; different audits are run in different environments.
Routine audits are segmented - 20 millisecond segments for error detection, 40 milliseconds for error correction.
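The segmenting of routine audits described above can be sketched as follows. This is a minimal illustration, not the 5ESS implementation: the names `segmented_audit`, `is_valid`, `budget_ms` and `clock_ms` are hypothetical, the "real-time break" is modelled as a Python generator yield, and only the 20 millisecond detection budget comes from the notes.

```python
import itertools
import time

DETECT_SEGMENT_MS = 20  # detection segment length quoted in the notes


def _now_ms():
    """Default clock: monotonic time in milliseconds."""
    return time.monotonic() * 1000.0


def segmented_audit(records, is_valid, budget_ms=DETECT_SEGMENT_MS,
                    clock_ms=_now_ms):
    """Check records one by one, yielding after each time segment.

    Each yield stands in for a real-time break and hands back the indexes
    of the bad records found during that segment. All names here are
    assumptions made for illustration.
    """
    bad = []
    segment_start = clock_ms()
    for index, record in enumerate(records):
        if not is_valid(record):
            bad.append(index)
        if clock_ms() - segment_start >= budget_ms:
            yield bad                    # give up the processor
            bad = []
            segment_start = clock_ms()   # a new segment starts on resume
    yield bad                            # final, possibly partial, segment


# Deterministic demo: a fake clock that advances 10 ms on every reading.
fake_ticks = itertools.count(0, 10)
segments = list(segmented_audit(
    [1, -1, 2, -3, 4],               # negative values play the bad data
    lambda record: record >= 0,
    clock_ms=lambda: next(fake_ticks),
))
print(segments)
```

With the fake clock each record check costs 10 ms, so the audit takes a break after every second record. A caller would typically collect the bad indexes from each segment and schedule a correction pass, which in the notes gets its own 40 millisecond budget.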
Audits may run in 5 modes:
  routine segmented (run at lowest priority, take real-time breaks, output results to file, escalate if critical error found)
  elevated segmented (run at second lowest priority, take real-time breaks; caused by input messages, asserts, single process purges, other audits; if errors cannot be corrected in 40 msec, either escalate or return to safe point)
  nonsegmented (no real-time breaks)
  initialization (run sets of audits back-to-back with no real-time break)
  postinitialization (used to restore non-critical data after initialization)
A single event number is assigned to all audit actions resulting from the same error.
Asserts are placed in operational code (including audit code) to detect:
  Out-of-range indexes and pointers,
  Redundancy on duplicate data,
  Point-to/point-back linkage,
  Consistency between related data,
  Invalid function return codes.
Failing asserts usually trigger audits.
Asserts are a convention for defensive programming that standardizes history keeping and output messages.
Audits classify errors as: process errors, non-process errors, or audit failures.
Process errors cause a single process purge.
Non-process errors are accumulated until the end of the segment, and then an elevated audit is scheduled to correct them.
Audit failures (the error is such that the audit can't continue) result in scheduling of audits to fix the error.
Interprocessor audits check duplicate data in different processors.
Interprocessor audits run in routine and elevated segmented modes.
Interprocessor audits are triggered manually, by asserts, and routinely.
Interprocessor audits correct data by making CP agree with IM.
Interprocessor audits are divided into partitions that run on a single processor.
Interprocessor audit execution is controlled by CP, which receives completion reports from each partition.
About 70 application audits.

Current software development methodologies are incapable of producing fault-free programs.
Operator mistakes introduce incorrect data into the system.
Hardware faults cause data mutilation.
Early detection and correction of fault effects retains system availability, at a cost (usually mishandling of a single job).

"DEFINITIONS"
Fault - improper action
Error - the result of some faults - an erroneous state
Failure - incorrect system output (wrong or delayed)
Outage - failure affecting a significant number of jobs
Fault tolerant systems prevent errors from becoming failures and failures from becoming outages.
Faults that do not produce an erroneous state are not detectable.

"CLASSIFICATION"

"ESS Software Reliability"

"Objectives"
Same as achieved by electro-mechanical systems:
  No more than 2 hours total outage time in 40 years (720 seconds allocated to outages caused by software faults).
  No more than 2 calls out of 10,000 mishandled.
About 100-200 call attempts/sec.
Tolerable response times range from milliseconds to seconds.

"System Characteristics - 1ESS, 1AESS, 4ESS"
Base level cycle partitioned into A, B, C, D, E and interject priorities.
Interrupts for timing and hardware exceptions.
I/O at interrupt level.
Time shared CPU with segmented (interleaved) processing.
System memory partitioned into:
  Text
  Parameters (describe office size)
  Translations (describe connections)
  Shared Memory

Note: This file was in EMACS, I hope I edited it good enough to read, so that it provides use. -DT
----------------------------------------------------------------------------