Availability Guide for Application Design
Overview of Server and Network Fault Tolerance
Availability Guide for Application Design—525637-004
Hardware modules are also independent of each other and do not share critical states
with other components. Processors do not share memory with each other. Critical
components have backup power supplies and fault-tolerant cooling. Each S-series or
NS-series ServerNet adapter or device controller has two ports; if one ServerNet fabric
or K-series I/O channel fails, the other port can be used to transfer data over the other
fabric or channel.
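The dual-port arrangement can be pictured with a small simulation. This is only a sketch under stated assumptions: the class names, the fabric labels, and the transfer method are hypothetical and are not part of any server interface.

```python
class FabricDown(Exception):
    """Raised when a simulated fabric or channel cannot carry traffic."""


class Fabric:
    """Hypothetical stand-in for one fabric or I/O channel."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def send(self, data):
        if not self.healthy:
            raise FabricDown(self.name)
        return f"{data} via {self.name}"


class DualPortedController:
    """Hypothetical controller with one port on each of two fabrics."""

    def __init__(self, first_fabric, second_fabric):
        self.ports = [first_fabric, second_fabric]

    def transfer(self, data):
        # Try each port in turn; the second path is used only if the first fails.
        for fabric in self.ports:
            try:
                return fabric.send(data)
            except FabricDown:
                continue
        raise FabricDown("both fabrics unavailable")


controller = DualPortedController(Fabric("X"), Fabric("Y"))
first = controller.transfer("block-1")
controller.ports[0].healthy = False  # simulate failure of one fabric
second = controller.transfer("block-2")
```

In this sketch the caller never chooses a path; the transfer succeeds as long as either fabric is available.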
Fail-Fast Hardware and Software
System processes and critical hardware modules are designed to be fail-fast. In other
words, they must perform to specified standards or they halt and go offline before any
problem has the chance to propagate to other modules.
Hardware and software are made fail-fast through extensive error checking. Some
hardware components also perform periodic self-tests. The operating system performs
rigorous internal consistency checks to verify its inputs, outputs, and data structures.
In the extremely rare instance where an error occurs within a system process or the
operating system detects a corrupted data structure, the operating system halts the
processor and lets the backup processes in other processors take over. Because no two
processors have identical states, the error condition is not repeated in the backup.
This way, no malfunctioning system process is allowed to continue after the error is
detected.
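The fail-fast behavior described above can be illustrated with a minimal simulation. It assumes a hypothetical Processor class that keeps a private copy of a table and verifies a checksum before each use; the names and the checksum technique are illustrative assumptions, not the operating system's actual checks.

```python
import zlib


class ProcessorHalted(Exception):
    """Signals that a simulated processor has halted itself."""


class Processor:
    """Hypothetical processor holding its own private copy of a table."""

    def __init__(self, name, table):
        self.name = name
        self.table = bytearray(table)  # private copy: no shared memory
        self.checksum = zlib.crc32(self.table)
        self.online = True

    def serve(self, index):
        # Fail fast: verify the data structure before using it.
        if zlib.crc32(self.table) != self.checksum:
            self.online = False
            raise ProcessorHalted(self.name)
        return self.table[index]


def request(pair, index):
    """Send a request to the first online processor of a simulated pair."""
    for cpu in pair:
        if not cpu.online:
            continue
        try:
            return cpu.name, cpu.serve(index)
        except ProcessorHalted:
            continue  # the backup takes over
    raise RuntimeError("no processor available")


primary = Processor("cpu0", b"\x0a\x14\x1e")
backup = Processor("cpu1", b"\x0a\x14\x1e")
primary.table[1] = 0xFF  # simulate corruption in the primary only
served_by, value = request([primary, backup], 1)
```

Because each simulated processor holds its own copy of the table, the corruption in the primary is not present in the backup, which answers the request with the uncorrupted value.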
Systems from other vendors that do not support process pairs cannot react to failures
in this way. In those systems, the operating system tries to continue rather than use
the fail-fast technique, which leaves them vulnerable to data integrity problems and
error propagation.
Protection Against Invalid Application Operation
The server’s architecture also prevents an errant application from corrupting any data
outside its environment. When such a process attempts to perform an invalid
operation, the operating system aborts the process while allowing other processes in
the same processor to continue to run.
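The isolation described above can be sketched as a toy model. The Kernel class, its methods, and the process names are hypothetical; the sketch assumes that an invalid operation means a write outside the process's own data area.

```python
class Kernel:
    """Hypothetical model of an operating system that isolates processes."""

    def __init__(self):
        self.spaces = {}   # process name -> private data area
        self.aborted = []  # names of processes aborted so far

    def spawn(self, name, size=4):
        self.spaces[name] = [0] * size

    def write(self, name, target, offset, value):
        """Perform a write on behalf of process `name`.

        A write to another process's area, or outside the bounds of its
        own area, is treated as invalid and aborts only the offender.
        """
        if name not in self.spaces:
            raise KeyError(f"{name} is not running")
        own = self.spaces[name]
        if target != name or not 0 <= offset < len(own):
            del self.spaces[name]
            self.aborted.append(name)
            return False
        own[offset] = value
        return True


kernel = Kernel()
kernel.spawn("payroll")
kernel.spawn("errant")
kernel.write("payroll", "payroll", 0, 42)
ok = kernel.write("errant", "payroll", 0, 99)  # invalid: outside its environment
still_running = kernel.write("payroll", "payroll", 1, 43)
```

Only the offending process is removed in this model; the other process keeps its data and continues to run.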
Extensive Hardware Error Checking
To perform fail-fast operations, the operating system must be able to diagnose
a fault instantly and reliably. All critical modules undergo rigorous testing. The testing is
done either by the operating system making periodic checks or by the individual
modules performing self-tests. The kind of self-test depends on the subsystem but
typically takes place as part of normal operation or when the modules are otherwise
idle; if an error occurs, the module either reports it to the operating system for
resolution or takes itself out of service.
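The two possible outcomes of a failed self-test can be sketched as follows. The Module class, its parameters, and the module names are hypothetical and serve only to illustrate the report-or-withdraw choice.

```python
class Module:
    """Hypothetical hardware module that can test itself."""

    def __init__(self, name, check, can_report=True):
        self.name = name
        self.check = check            # callable returning True when healthy
        self.can_report = can_report  # whether errors go to the operating system
        self.in_service = True

    def self_test(self, error_log):
        """Run the self-test and act on the result."""
        if self.check():
            return "ok"
        if self.can_report:
            # Report the error to the operating system for resolution.
            error_log.append(self.name)
            return "reported"
        # Otherwise the module takes itself out of service.
        self.in_service = False
        return "out of service"


error_log = []
healthy = Module("board-a", check=lambda: True)
reporting = Module("board-b", check=lambda: False)
silent = Module("board-c", check=lambda: False, can_report=False)
results = [m.self_test(error_log) for m in (healthy, reporting, silent)]
```

In this sketch a module that reports an error stays in service until the operating system decides what to do, while a module that cannot report withdraws on its own.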
Among the groups of hardware modules that are extensively checked are the following:
• The ServerNet fabric of the S-series or NS-series server
• Logic boards