Availability Guide for Application Design
Overview of Server and Network Fault Tolerance
Availability Guide for Application Design—525637-004
Fault Tolerance in the Server System
For example, if a disk drive has an MTBF of 1 million hours (114 years), you would still 
endure, on average, 8.8 failures each year in a device population of 1000 disks. An 
MTBF of 1 million hours does not mean that a particular disk drive will not fail for 1 
million hours. MTBF figures indicate the reliability performance of a device population 
during the useful life of the device.
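The disk figures above can be checked with a few lines of arithmetic. The sketch below assumes a constant failure rate during the useful life of the device and a 365-day year; the function names are illustrative and do not come from any HP tool.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years


def mtbf_in_years(mtbf_hours):
    """Convert an MTBF quoted in hours to years."""
    return mtbf_hours / HOURS_PER_YEAR


def expected_failures_per_year(mtbf_hours, population):
    """Average failures per year across a population of identical devices.

    Assumes a constant failure rate, so each device contributes
    HOURS_PER_YEAR / mtbf_hours failures per year on average.
    """
    return population * HOURS_PER_YEAR / mtbf_hours


print(round(mtbf_in_years(1_000_000)))                       # 114
print(round(expected_failures_per_year(1_000_000, 1000), 1))  # 8.8
```

The same MTBF therefore describes a device that "lasts 114 years" on paper and a population that loses a disk roughly every six weeks.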
MTBF figures for systems with massive numbers of processors are similarly 
misleading. If a processor has an MTBF of 5.5 years, you would still expect a 
processor failure about every 2 days in a system with 1000 processors. Parallel
architecture greatly improves the fault tolerance of systems as system size and
complexity increase.
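The processor example works the same way in the other direction: dividing the per-unit MTBF by the number of units gives the mean time between failures somewhere in the system. The sketch below assumes independent processors with a constant failure rate (an exponential model, which is an assumption of this example and not stated in the guide); the helper names are hypothetical.

```python
import math

DAYS_PER_YEAR = 365


def days_between_failures(mtbf_years, count):
    """Mean days between failures anywhere in a population of `count` units."""
    return mtbf_years * DAYS_PER_YEAR / count


def prob_failure_within(days, mtbf_years, count):
    """Probability of at least one failure in the population within `days`,
    under the assumed exponential (constant-rate) failure model."""
    return 1 - math.exp(-days / days_between_failures(mtbf_years, count))


interval = days_between_failures(5.5, 1000)
print(round(interval, 1))  # 2.0

# Chance that at least one of the 1000 processors fails on any given day.
print(round(prob_failure_within(1, 5.5, 1000), 2))
```

Under these assumptions the chance of seeing a processor failure on any single day is substantial, which is why the system as a whole must tolerate the loss of one unit.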
Mirrored disks, for example, substantially increase the availability of the data volume. 
When two disks hold identical copies of the same data and one disk fails, the 
other is still available. Mirrored disks are an example of how HP uses parallelism to 
raise the availability of components by eliminating single points of failure. Increased 
availability is a major benefit of the parallel architecture.
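A rough way to quantify the benefit of mirroring: if each disk is available a fraction A of the time and the two disks fail independently, the volume is unavailable only when both are down at once. The sketch below uses a hypothetical single-disk availability of 0.99; that figure and the independence assumption are illustrative and are not taken from the guide.

```python
def mirrored_availability(single, copies=2):
    """Availability of a volume that stays up while any one copy is up.

    Assumes the copies fail independently, so the volume is down only
    when every copy is down at the same time.
    """
    return 1 - (1 - single) ** copies


single_disk = 0.99  # hypothetical availability of one disk

print(f"{single_disk:.4f}")                         # 0.9900
print(f"{mirrored_availability(single_disk):.4f}")  # 0.9999
```

With these numbers, mirroring cuts expected unavailability from 1 percent to 0.01 percent. Failures that affect both disks at once (a shared controller or power source, for example) would reduce the gain, which is why the architecture also duplicates paths and modules.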
HP servers ensure both high availability and high performance because each server is 
configured with parallel hardware modules and parallel software processes. The key to 
the architecture is that there is no single hardware module or software process whose 
failure can bring the server down. Multiple paths, modules, and processes make it 
possible for the server to continue to operate despite the failure of an individual 
module. There is always an alternate hardware module or software process that takes 
over from the errant component.
To support this approach to fault tolerance, HP NonStop servers contain extensive 
logic that:
1. Detects errors
2. Isolates the errant module and contains the error
3. Masks the problem by rerouting work to an alternate module
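The three steps above can be sketched in software terms. This is only an illustration of the detect, isolate, and mask sequence; the `Module` class, the `dispatch` function, and the module names are hypothetical and do not represent actual NonStop logic.

```python
class ModuleFault(Exception):
    """Raised when a module's own error checking detects a problem."""


class Module:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.isolated = False

    def run(self, work):
        # Stand-in for hardware or software error checking.
        if not self.healthy:
            raise ModuleFault(f"{self.name} failed its error check")
        return f"{work} done by {self.name}"


def dispatch(work, modules):
    """Run `work` on the first usable module, rerouting around faults."""
    for module in modules:
        if module.isolated:
            continue                    # never send work to an isolated module
        try:
            return module.run(work)
        except ModuleFault:             # 1. detect the error
            module.isolated = True      # 2. isolate the errant module
            continue                    # 3. mask it by trying an alternate
    raise RuntimeError("no alternate module available")


modules = [Module("module-A", healthy=False), Module("module-B")]
print(dispatch("request-1", modules))  # request-1 done by module-B
print(modules[0].isolated)             # True
```

The caller sees only a completed request; the fault in `module-A` is recorded but never surfaces as a failure of the service.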
To implement this scheme, HP NonStop servers are designed around the following 
concepts:
• Parallel hardware enables a suitably connected alternate module to take over the 
  work of an errant module.
• Isolated hardware modules in a loosely coupled architecture allow faults to be 
  contained within the errant module.
• Extensive error checking detects a problem.
• System software process pairs provide a backup process that is primed, ready to 
  take over if the primary member of the pair fails.
• Instrumentation of server system components informs the operator in the event of 
  a component failure.
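The process-pair concept in the list above can be illustrated with a minimal sketch. It assumes the backup is kept "primed" by receiving a copy of the primary's state after each committed step (a checkpoint); that mechanism, and every class and method name here, is an assumption of the example and is not described in this section.

```python
import copy


class Process:
    def __init__(self, name):
        self.name = name
        self.state = {"last_committed": 0}


class ProcessPair:
    """A primary process with a backup that mirrors its committed state."""

    def __init__(self):
        self.primary = Process("primary")
        self.backup = Process("backup")

    def commit(self, txn_id):
        # The primary does the work, then checkpoints its state to the backup.
        self.primary.state["last_committed"] = txn_id
        self.backup.state = copy.deepcopy(self.primary.state)

    def fail_primary(self):
        # The backup takes over from the last checkpoint it received.
        self.primary = self.backup
        self.primary.name = "primary (was backup)"
        self.backup = None  # a real system would start a new backup here


pair = ProcessPair()
pair.commit(41)
pair.commit(42)
pair.fail_primary()
print(pair.primary.name)                     # primary (was backup)
print(pair.primary.state["last_committed"])  # 42
```

Because the backup already holds the state of the last committed step, takeover does not require replaying work from the beginning.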










