Availability Guide for Application Design
Overview of Server and Network Fault Tolerance
Availability Guide for Application Design—525637-004
Extensive Hardware Error Checking
• Disk subsystem
• Power supplies and fans
Checking the ServerNet Fabric
Each ServerNet fabric comprises a set of data routers. Each router has input and
output connections to other routers, to a processor, or to a ServerNet addressable
controller.
Each router contains a self-checking application-specific integrated circuit. An address
validation table (AVT) ensures that data is sent to the correct destination. Routers are
self-diagnosing.
Data passing through the fabric is subjected to 32-bit cyclic redundancy checks to
ensure its integrity. A link keep-alive protocol, similar to that used for processors,
detects when links between routers go down or come back up, so that data can be
automatically rerouted through the fabric. Protocol checkers ensure that data entering
and leaving the fabric is
not lost because of communication protocol errors. Link-level flow control prevents lost
data packets when network congestion exists within the fabric. Transaction timeout
counters detect the need to restart a transaction when it cannot be completed.
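The cyclic redundancy check described above can be illustrated with a minimal sketch. It assumes the common CRC-32 polynomial provided by Python's zlib module; the source does not say which 32-bit polynomial or packet layout ServerNet uses, and the names frame_packet and check_packet are hypothetical.

```python
import zlib


def frame_packet(payload: bytes) -> bytes:
    """Append a 32-bit CRC (big-endian) to the payload.

    Assumption: zlib's CRC-32 stands in for the fabric's actual CRC.
    """
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return payload + crc.to_bytes(4, "big")


def check_packet(packet: bytes) -> bytes:
    """Return the payload if its CRC matches; raise ValueError otherwise."""
    if len(packet) < 4:
        raise ValueError("packet too short to carry a CRC")
    payload = packet[:-4]
    received = int.from_bytes(packet[-4:], "big")
    if zlib.crc32(payload) & 0xFFFFFFFF != received:
        raise ValueError("CRC mismatch: packet corrupted in transit")
    return payload
```

A receiver that gets a ValueError from check_packet would discard the packet and rely on a higher-level mechanism, such as a transaction timeout, to trigger a retry.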
Checking the Logic Boards
HP NonStop servers use the techniques listed below for checking logic boards:
1. Memory parity checks and error-correcting codes
2. Cache and bus parity checks
3. Data-control checking
4. Lock-stepped processors
These techniques are listed in order of increasing sophistication. Many PCs use none of
them; some use memory parity checks. Workstations and commodity servers typically
use memory parity and cache and bus parity. Some mainframes add data-control
checking. HP uses all four techniques.
Main Memory
Main memory is checked with parity checks and error-correcting codes
(ECC). When a word of memory has a single-bit error, the processor detects it, uses
the ECC information to derive the correct data, and rewrites the word. When a word of
memory has a double-bit error, the processor generates a hardware trap and lets the
operating system determine the response, which might be to read a fresh copy of the
data page from disk, abort the process or, in the most severe case, halt the processor.
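The single-bit-correct, double-bit-detect behavior described above can be sketched with a toy code. This example assumes an extended Hamming (8,4) code protecting a 4-bit value; the source does not state the word width or the ECC scheme the processors actually use, and the names encode, decode, and DoubleBitError are hypothetical. The exception stands in for the hardware trap.

```python
class DoubleBitError(Exception):
    """Models the hardware trap raised for an uncorrectable error."""


def encode(nibble: int) -> list[int]:
    """Encode 4 data bits into an 8-bit extended Hamming codeword.

    Index 0 holds the overall parity bit; indices 1, 2, and 4 hold the
    Hamming check bits; indices 3, 5, 6, and 7 hold the data bits.
    """
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # makes the total number of 1 bits even
    return code


def decode(code: list[int]) -> tuple[int, str]:
    """Return (data, status); correct one flipped bit, trap on two."""
    # XOR of the positions of all set bits is 0 for a valid codeword
    # and equals the position of the flipped bit after a single error.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall_odd = sum(code) % 2 == 1

    status = "ok"
    if overall_odd:
        # One bit flipped; syndrome 0 means the overall parity bit itself.
        code = code.copy()
        code[syndrome] ^= 1
        status = "corrected"
    elif syndrome != 0:
        # Parity looks even but the check bits disagree: two bits flipped.
        raise DoubleBitError("uncorrectable double-bit error")

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status
```

In this sketch the caller that catches DoubleBitError plays the role of the operating system, which decides whether to reload the data, abort the process, or halt.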
Cache and Bus
Parity checks are also performed on the caches and on the buses or ServerNet fabrics
that connect the various parts of the server system.
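As a rough illustration of a parity check, the sketch below assumes even parity over a single integer word; the source does not specify the parity scheme or word size used on the caches and buses, and the function names are hypothetical.

```python
def parity_bit(word: int) -> int:
    """Return the even-parity bit for a non-negative integer word.

    The bit is 1 when the word has an odd number of 1 bits, so that the
    word plus its parity bit always carries an even number of 1 bits.
    """
    return bin(word).count("1") % 2


def with_parity(word: int) -> tuple[int, int]:
    """Pair a word with its parity bit, as a sender would."""
    return word, parity_bit(word)


def parity_ok(word: int, parity: int) -> bool:
    """Check a received word against its parity bit."""
    return parity_bit(word) == parity
```

Parity detects any odd number of flipped bits but cannot correct them and misses an even number of flips, which is why it is combined with stronger checks such as ECC.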