Availability Guide for Application Design (525637-004)

Overview of Server and Network Fault Tolerance
Fault Tolerance in the Server System
For example, if a disk drive has an MTBF of 1 million hours (114 years), you would still
endure, on average, 8.8 failures each year in a device population of 1000 disks. An
MTBF of 1 million hours does not mean that a particular disk drive will not fail for 1
million hours. MTBF figures indicate the reliability performance of a device population
during the useful life of the device.
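The arithmetic behind the disk example can be sketched as follows. This is a minimal illustration, assuming a constant failure rate during useful life and devices that run continuously (8760 hours per year); the function name is hypothetical and chosen only for this example.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def expected_failures_per_year(mtbf_hours: float, population: int) -> float:
    """Expected failures per year across a population of identical devices.

    Assumes a constant failure rate (the useful-life period) and that
    every device in the population runs around the clock.
    """
    return population * HOURS_PER_YEAR / mtbf_hours


# An MTBF of 1 million hours expressed in years (about 114).
mtbf_years = 1_000_000 / HOURS_PER_YEAR

# Expected yearly failures in a population of 1000 such disks.
failures = expected_failures_per_year(1_000_000, 1000)

print(f"MTBF in years: {mtbf_years:.0f}")
print(f"Expected failures per year: {failures:.2f}")
```

The second figure comes out at 8.76, which the text rounds to 8.8.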
MTBF figures for systems with massive numbers of processors are similarly
misleading. If a processor has an MTBF of 5.5 years, you would still expect a
processor failure every 2 days in a system with 1000 processors. Parallel architecture
greatly improves the fault tolerance of systems as system size and complexity
increase.
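To see how the interval between failures shrinks as processors are added, the following sketch divides the per-processor MTBF by the number of processors. It assumes independent processors with a constant failure rate and a 365-day year; the function name and the list of system sizes are illustrative only.

```python
DAYS_PER_YEAR = 365


def days_between_failures(mtbf_years: float, processors: int) -> float:
    """Average days between failures somewhere in the system.

    Assumes independent processors, each with the same constant
    failure rate, so the system-wide interval is MTBF / processors.
    """
    return mtbf_years * DAYS_PER_YEAR / processors


# Per-processor MTBF of 5.5 years, as in the example in the text.
for size in (1, 10, 100, 1000):
    interval = days_between_failures(5.5, size)
    print(f"{size:>5} processors: one failure every {interval:.1f} days")
```

At 1000 processors the interval is roughly 2 days, matching the figure given in the text.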
Mirrored disks, for example, substantially increase the availability of the data volume.
When two disks hold identical copies of the same data and one disk fails, the
other is still available. Mirrored disks are an example of how HP uses parallelism to
raise the availability of components by eliminating single points of failure. Increased
availability is a major benefit of the parallel architecture.
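The gain from mirroring can be estimated with a short calculation. The sketch below is not taken from the guide: it assumes that the two disks fail independently, that availability is MTBF / (MTBF + MTTR), and that a failed disk is replaced and revived within 24 hours. The 24-hour repair time and both function names are assumptions made for illustration.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of a single repairable device."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def mirrored_availability(single: float) -> float:
    """Availability of a mirrored pair.

    The volume is unavailable only when both disks are down at the
    same time, assuming the two disks fail independently.
    """
    return 1 - (1 - single) ** 2


# MTBF of 1 million hours; the 24-hour repair time is an assumed value.
single = availability(1_000_000, 24)
pair = mirrored_availability(single)

print(f"Single disk unavailability:   {1 - single:.2e}")
print(f"Mirrored pair unavailability: {1 - pair:.2e}")
```

Under these assumptions the unavailability of the pair is the square of the unavailability of a single disk, which is why removing the single point of failure has such a large effect.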
HP servers ensure both high availability and high performance because each server is
configured with parallel hardware modules and parallel software processes. The key to
the architecture is that there is no single hardware module or software process whose
failure can bring the server down. Multiple paths, modules, and processes make it
possible for the server to continue to operate despite the failure of an individual
module. There is always an alternate hardware module or software process that takes
over from the errant component.
To support this approach to fault tolerance, HP NonStop servers contain extensive
logic that:
1. Detects errors
2. Isolates the errant module and contains the error
3. Masks the problem by rerouting work to an alternate module
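The three steps above can be sketched in a few lines. This is a simplified illustration only, not the logic used in HP NonStop servers: the Module class, the ModuleFault exception, and the dispatch function are hypothetical names, and the "error check" is reduced to a single health flag.

```python
class ModuleFault(Exception):
    """Raised when a module's self-check detects an error."""


class Module:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy  # stands in for real error checking
        self.online = True      # False once the module is isolated

    def process(self, request):
        # Step 1: detect. The module checks itself before doing work.
        if not self.healthy:
            raise ModuleFault(self.name)
        return f"{request} handled by {self.name}"


def dispatch(request, modules):
    """Send a request to the first online module that can handle it."""
    for module in modules:
        if not module.online:
            continue
        try:
            return module.process(request)
        except ModuleFault:
            # Step 2: isolate. Take the errant module out of service
            # so that it receives no further work.
            module.online = False
            # Step 3: mask. The loop moves on to an alternate module,
            # so the caller never sees the fault.
    raise RuntimeError("no module available")


modules = [Module("A", healthy=False), Module("B")]
print(dispatch("req-1", modules))  # req-1 handled by B
```

The caller receives a normal result even though module A failed; the failure is visible only in the module's online flag.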
To implement this scheme, HP NonStop servers are designed around the following
concepts:
• Parallel hardware enables a suitably connected alternate module to take over the
  work of an errant module.
• Isolated hardware modules in a loosely coupled architecture allow faults to be
  contained within the errant module.
• Extensive error checking detects a problem.
• System software process pairs provide a backup process that is primed, ready to
  take over if the primary member of the pair fails.
• Instrumentation of server system components informs the operator in the event of
  a component failure.
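The process-pair concept in the list above can be illustrated with a small sketch. It does not use any NonStop system interface: the Counter and ProcessPair classes are hypothetical, both members run in one Python process, and "checkpointing" is reduced to copying a single value. The point is only to show a backup that is kept primed so that it can take over without losing state.

```python
class Counter:
    """One member of a process pair; holds a running total."""

    def __init__(self):
        self.total = 0

    def checkpoint_from(self, other):
        # Copy the other member's state so this member stays primed.
        self.total = other.total


class ProcessPair:
    def __init__(self):
        self.primary = Counter()
        self.backup = Counter()

    def add(self, amount):
        self.primary.total += amount
        # Checkpoint after every update so the backup is always current.
        self.backup.checkpoint_from(self.primary)
        return self.primary.total

    def fail_primary(self):
        # The backup takes over as the new primary, and a fresh
        # backup is started and primed from it.
        self.primary = self.backup
        self.backup = Counter()
        self.backup.checkpoint_from(self.primary)


pair = ProcessPair()
pair.add(5)
pair.add(7)
pair.fail_primary()   # simulate failure of the primary member
print(pair.add(1))    # 13: no work was lost in the takeover
```

Because the backup is updated after each operation, the takeover resumes from the last completed update rather than from the beginning.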