Availability Guide for Application Design

Overview of Server and Network Fault Tolerance
Availability Guide for Application Design, 525637-004
Data-Control Logic
Checking of the data-control logic also involves parity and other checks.
Processors
The NonStop range of servers compares output from lock-stepped processors. Two
identical processors execute the same code at the same time. Special logic verifies
that the output of one chip is always the same as the output from the other chip.
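
The comparison described above can be illustrated in software. The following is a minimal sketch only: on the server the check is done by hardware logic, and the names run_lockstepped, shadow_step, and LockstepMismatch are hypothetical, chosen for this illustration. Two callables stand in for the two processors; each receives the same input on each cycle, and any difference in output stops execution.

```python
from typing import Callable, List, Sequence


class LockstepMismatch(Exception):
    """Raised when the two lock-stepped units produce different output."""


def run_lockstepped(step: Callable[[int], int],
                    shadow_step: Callable[[int], int],
                    inputs: Sequence[int]) -> List[int]:
    """Feed the same inputs to both units and compare every output.

    `step` and `shadow_step` are stand-ins for the two identical
    processors; they are assumptions of this sketch, not a real API.
    """
    outputs: List[int] = []
    for cycle, value in enumerate(inputs):
        a = step(value)
        b = shadow_step(value)
        if a != b:
            # Stop at the first disagreement rather than let a
            # possibly corrupt result propagate any further.
            raise LockstepMismatch(f"cycle {cycle}: {a!r} != {b!r}")
        outputs.append(a)
    return outputs
```

The sketch stops on the first mismatch instead of guessing which unit is correct, because a comparison of two outputs can show that they differ but not which one is wrong.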
Checking the Disk Subsystem
The disk subsystem uses many of the checks used by the logic boards. In addition, HP
NonStop servers perform end-to-end checksum operations to ensure that the data
read back from the disk is identical to the data that was written. The control data
written to the disk also contains the disk address, so that the system can confirm that
the correct block has been retrieved. The control data is included in the checksum
information.
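
The idea of a checksum that covers both the data and the address-bearing control data can be sketched as follows. This is an illustration under stated assumptions, not the on-disk format used by the servers: the 8-byte address field, the use of CRC-32, and the names seal_block, open_block, and BlockCheckError are all hypothetical.

```python
import struct
import zlib


class BlockCheckError(Exception):
    """Raised when a block fails its checksum or address check."""


def seal_block(address: int, data: bytes) -> bytes:
    """Build the record to write: control data, payload, checksum.

    The checksum covers the control data (here just the block
    address) as well as the payload. The layout is an assumption
    made for this sketch.
    """
    control = struct.pack(">Q", address)
    checksum = zlib.crc32(control + data)
    return control + data + struct.pack(">I", checksum)


def open_block(expected_address: int, record: bytes) -> bytes:
    """Verify a record read back from disk and return its payload."""
    if len(record) < 12:
        raise BlockCheckError("record too short")
    control, data, stored = record[:8], record[8:-4], record[-4:]
    (checksum,) = struct.unpack(">I", stored)
    if zlib.crc32(control + data) != checksum:
        raise BlockCheckError("checksum mismatch: data was altered")
    (address,) = struct.unpack(">Q", control)
    if address != expected_address:
        # The block is intact but is not the one that was requested.
        raise BlockCheckError(
            f"wrong block: wanted {expected_address}, got {address}")
    return data
```

Storing the address inside the checksummed record lets the reader distinguish two failures: a block whose contents changed, and an intact block returned from the wrong location.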
Power Supplies and Fans
The power supplies and fans have environmental sensors that verify that the voltages
remain within specified ranges, that the fans are turning, and that the temperature
remains within a specified range.
System Process Pairs
Many system processes run as process pairs so that a backup process is able to take
over processing if the primary process stops for any reason. The I/O subsystem uses
process pairs extensively.
Takeover can occur for many reasons. Hardware failure is not the only reason that the
backup process of the process pair might need to take over from the primary process.
Transient software errors can also cause a processor to fail.
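
The takeover behavior described above can be sketched in a few lines. This is a simplified, single-program illustration and not the NonStop process-pair interface: the names PairMember, ProcessPair, and PrimaryFailed are hypothetical, the failure is simulated with a flag, and a real pair would run in separate processors and exchange checkpoint messages.

```python
class PrimaryFailed(Exception):
    """Simulated failure of the primary process."""


class PairMember:
    """One member of the pair, holding the state of the service."""

    def __init__(self, total: int = 0) -> None:
        self.total = total

    def apply(self, amount: int) -> int:
        self.total += amount
        return self.total


class ProcessPair:
    """Primary plus backup; the backup takes over if the primary fails."""

    def __init__(self) -> None:
        self.primary = PairMember()
        self.backup = PairMember()
        self.takeovers = 0

    def request(self, amount: int, fail_primary: bool = False) -> int:
        try:
            if fail_primary:
                # Stand-in for a hardware fault or transient software error.
                raise PrimaryFailed("simulated primary failure")
            result = self.primary.apply(amount)
            # Checkpoint: copy the primary's state to the backup.
            self.backup.total = self.primary.total
            return result
        except PrimaryFailed:
            # The backup becomes the primary, resuming from the last
            # checkpoint, and a fresh backup is started from that state.
            self.primary = self.backup
            self.backup = PairMember(self.primary.total)
            self.takeovers += 1
            return self.request(amount)
```

In this sketch the caller of request never sees the failure: the request that was in flight is retried by the new primary from the last checkpointed state.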
Process Pairs Provide Effective Protection Against Transient Errors
A mechanism for responding to transient software errors is important because, in a
production system, they are the most likely form of software error.
Deterministic software errors—sometimes known as hard errors—can usually be
identified and fixed during software testing or, at least, by the time the software has
been in production for a short period. Transient errors, however, are harder to eliminate
because it is impractical to test all possible combinations of system code. It is therefore
inevitable that some transient errors will remain.
Process pairs provide effective protection against transient errors because the
backup process has a different processing environment from that of the primary.
Executing programs, internal timing, memory layout, queues, I/O channel, and so on