Availability Guide for Application Design

Overview of Server and Network Fault Tolerance

Availability Guide for Application Design—525637-004


Fault Tolerance in the Server System

For example, if a disk drive has an MTBF of 1 million hours (114 years), you would still endure, on average, 8.8 failures each year in a device population of 1000 disks. An MTBF of 1 million hours does not mean that a particular disk drive will not fail for 1 million hours. MTBF figures indicate the reliability performance of a device population during the useful life of the device.
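The arithmetic behind these figures can be checked with a short sketch. It assumes a constant failure rate during useful life and a 365-day year; the function name is illustrative only.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years


def expected_failures_per_year(mtbf_hours: float, population: int) -> float:
    """Average number of failures per year across a device population.

    Assumes a constant failure rate, so each device contributes
    HOURS_PER_YEAR / mtbf_hours expected failures per year.
    """
    return population * HOURS_PER_YEAR / mtbf_hours


mtbf_years = 1_000_000 / HOURS_PER_YEAR                   # roughly 114 years
failures = expected_failures_per_year(1_000_000, 1000)    # 8.76, about 8.8
print(f"MTBF: {mtbf_years:.0f} years, failures per year: {failures:.2f}")
```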

MTBF figures for systems with massive numbers of processors are similarly misleading. If a processor has an MTBF of 5.5 years, you would still expect a processor failure every 2 days in a system with 1000 processors. Parallel architecture greatly improves the fault tolerance of systems as system size and complexity increase.
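To see how the interval between failures shrinks as a system grows, the following sketch tabulates it for several system sizes. It assumes independent processor failures at a constant rate; the system sizes other than 1000 are chosen only for illustration.

```python
DAYS_PER_YEAR = 365


def days_between_failures(mtbf_years: float, processors: int) -> float:
    """Average days between processor failures anywhere in the system.

    Assumes independent failures at a constant rate, so the system-wide
    failure rate is the per-processor rate multiplied by the processor count.
    """
    return mtbf_years * DAYS_PER_YEAR / processors


# 5.5 years is the per-processor MTBF used in the text.
intervals = {n: days_between_failures(5.5, n) for n in (1, 10, 100, 1000)}
for n, days in intervals.items():
    print(f"{n:>5} processors: one failure about every {days:,.1f} days")
```

With 1000 processors the interval works out to roughly 2 days, which matches the figure quoted above.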

Mirrored disks, for example, substantially increase the availability of the data volume. When two disks hold identical copies of the same data and one disk fails, the other is still available. Mirrored disks are an example of how HP uses parallelism to raise the availability of components by eliminating single points of failure. Increased availability is a major benefit of the parallel architecture.
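A rough sense of the gain from mirroring can be had from the standard availability formulas. The sketch below assumes that the two disks fail independently and uses a hypothetical repair time of 24 hours; neither assumption comes from this guide, and a real volume also depends on controllers, paths, and other components.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of a single device."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def mirrored_availability(single: float) -> float:
    """Availability of a mirrored pair.

    The volume is unavailable only when both copies are down at the same
    time, which (assuming independent failures) has probability (1 - single)^2.
    """
    return 1 - (1 - single) ** 2


single = availability(mtbf_hours=1_000_000, mttr_hours=24)  # MTTR is hypothetical
mirrored = mirrored_availability(single)
print(f"single disk unavailability:   {1 - single:.2e}")
print(f"mirrored pair unavailability: {1 - mirrored:.2e}")
```

Under these assumptions the expected downtime of the mirrored volume is several orders of magnitude smaller than that of a single disk.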

HP servers ensure both high availability and high performance because each server is configured with parallel hardware modules and parallel software processes. The key to the architecture is that there is no single hardware module or software process whose failure can bring the server down. Multiple paths, modules, and processes make it possible for the server to continue to operate despite the failure of an individual module. There is always an alternate hardware module or software process that takes over from the errant component.

To support this approach to fault tolerance, HP NonStop servers contain extensive logic that:

1. Detects errors
2. Isolates the errant module and contains the error
3. Masks the problem by rerouting work to an alternate module
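The three steps above can be illustrated with a small sketch. This is not NonStop code: the Module class, the ModuleFault exception, and the dispatch function are hypothetical names that only show the detect, isolate, and reroute sequence.

```python
class ModuleFault(Exception):
    """Raised by a hypothetical module when its own checks detect an error."""


class Module:
    def __init__(self, name: str, healthy: bool = True):
        self.name = name
        self.healthy = healthy
        self.in_service = True

    def run(self, work: str) -> str:
        if not self.healthy:
            raise ModuleFault(self.name)  # step 1: the error is detected
        return f"{work} done by {self.name}"


def dispatch(work: str, modules: list) -> str:
    """Try each in-service module in turn until one completes the work."""
    for module in modules:
        if not module.in_service:
            continue
        try:
            return module.run(work)
        except ModuleFault:
            # Step 2: isolate the errant module so it receives no more work.
            module.in_service = False
            # Step 3: the loop continues, rerouting work to an alternate module.
    raise RuntimeError("no alternate module available")


modules = [Module("A", healthy=False), Module("B")]
result = dispatch("request-1", modules)
print(result)
```

The caller sees only a completed request; the failure of module A is masked because module B takes over the work.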

To implement this scheme, HP NonStop servers are designed around the following concepts:

• Parallel hardware enables a suitably connected alternate module to take over the work of an errant module.

• Isolated hardware modules in a loosely coupled architecture allow faults to be contained within the errant module.

• Extensive error checking detects a problem.

• System software process pairs provide a backup process that is primed, ready to take over if the primary member of the pair fails.

• Instrumentation of server system components informs the operator in the event of a component failure.
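The process-pair concept can be sketched as follows. This is an illustration only, not the NonStop programming interface: the class and method names are hypothetical, and it assumes that the backup is kept primed by receiving a copy of the primary's state after each unit of work.

```python
class PairMember:
    """One member of a hypothetical process pair."""

    def __init__(self, name: str):
        self.name = name
        self.state = {"processed": 0}

    def receive_checkpoint(self, state: dict) -> None:
        # Keep a private copy so the backup is primed with the latest state.
        self.state = dict(state)


class ProcessPair:
    def __init__(self):
        self.active = PairMember("primary")
        self.backup = PairMember("backup")

    def handle(self, items: int, primary_fails: bool = False) -> int:
        if primary_fails and self.backup is not None:
            # Takeover: the backup resumes from the last checkpointed state.
            self.active, self.backup = self.backup, None
        self.active.state["processed"] += items
        if self.backup is not None:
            self.backup.receive_checkpoint(self.active.state)
        return self.active.state["processed"]


pair = ProcessPair()
pair.handle(3)                             # primary does the work and checkpoints
total = pair.handle(2, primary_fails=True)  # backup takes over and continues
print(pair.active.name, total)
```

In this sketch no work is lost at takeover because the backup already holds the count of items processed before the failure. A fuller model would also start a new backup after the takeover.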