Availability Guide for Application Design

Overview of Server and Network Fault Tolerance

Availability Guide for Application Design—525637-004

Hardware modules are also independent of each other and do not share critical states with other components. Processors do not share memory with each other. Critical components have backup power supplies and fault-tolerant cooling. Each S-series or NS-series ServerNet adapter or device controller has two ports; if one ServerNet fabric or K-series I/O channel fails, the other port can be used to transfer data over the other fabric or channel.
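The dual-port behavior described above can be pictured with a small simulation. This is only an illustrative sketch, not NonStop code: the `Fabric` and `DualPortAdapter` classes, the `FabricDown` exception, and the fabric names "X" and "Y" are all hypothetical names introduced for this example.

```python
class FabricDown(Exception):
    """Raised when a transfer is attempted over a failed fabric (hypothetical)."""


class Fabric:
    """One of two independent data paths; purely a simulation object."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.delivered = []

    def send(self, data):
        if not self.up:
            raise FabricDown(self.name)
        self.delivered.append(data)


class DualPortAdapter:
    """An adapter with two ports, each attached to a different fabric."""

    def __init__(self, fabric_a, fabric_b):
        self.ports = [fabric_a, fabric_b]

    def transfer(self, data):
        # Try each port in turn; the second is used only if the first fails.
        for fabric in self.ports:
            try:
                fabric.send(data)
                return fabric.name
            except FabricDown:
                continue
        raise FabricDown("both fabrics are unavailable")


x, y = Fabric("X"), Fabric("Y")
adapter = DualPortAdapter(x, y)
first = adapter.transfer("block 1")    # travels over fabric X
x.up = False                           # simulate a fabric failure
second = adapter.transfer("block 2")   # falls back to fabric Y
```

In this sketch the caller never sees the failure of the first fabric; it would see an error only if both paths were down at once.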

Fail-Fast Hardware and Software

System processes and critical hardware modules are designed to be fail-fast. In other words, they either perform to specified standards or halt and go offline before any problem has the chance to propagate to other modules.

Hardware and software are made fail-fast through extensive error checking. Some hardware components also perform periodic self-tests. The operating system performs rigorous internal consistency checks to verify its inputs, outputs, and data structures.

In the extremely rare case that an error occurs within a system process or the operating system detects a corrupted data structure, the operating system halts the processor and lets the backup processes in other processors take over. No two processors have identical states, so the error condition is not repeated in the backup. This way, no malfunctioning system process is allowed to continue after the error is detected.
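The interplay between a consistency check, a processor halt, and a backup takeover can be sketched as follows. This is a simplified model under stated assumptions, not the actual operating system mechanism: the `Processor` and `ProcessPair` classes, the CRC-based check, and the way state is copied to the backup after each update are all hypothetical choices made for illustration.

```python
import zlib


class ProcessorHalted(Exception):
    """Raised when a simulated processor has stopped itself (hypothetical)."""


class Processor:
    def __init__(self, number):
        self.number = number
        self.halted = False
        self.table = {}
        self.checksum = self._compute()

    def _compute(self):
        # Checksum over the table contents; stands in for a consistency check.
        return zlib.crc32(repr(sorted(self.table.items())).encode())

    def update(self, key, value):
        if self.halted:
            raise ProcessorHalted(f"processor {self.number} is offline")
        # Fail-fast: verify the data structure before using it, and halt
        # immediately rather than continue with corrupted state.
        if self._compute() != self.checksum:
            self.halted = True
            raise ProcessorHalted(f"processor {self.number} halted itself")
        self.table[key] = value
        self.checksum = self._compute()


class ProcessPair:
    """A primary in one processor and a backup in another (assumed model)."""

    def __init__(self, primary, backup):
        self.primary, self.backup = primary, backup

    def update(self, key, value):
        try:
            self.primary.update(key, value)
        except ProcessorHalted:
            if self.backup is None:
                raise
            # Takeover: the backup becomes the primary and redoes the work.
            self.primary, self.backup = self.backup, None
            self.primary.update(key, value)
            return
        if self.backup is not None:
            self.backup.update(key, value)   # keep the backup's copy current


cpu0, cpu1 = Processor(0), Processor(1)
pair = ProcessPair(cpu0, cpu1)
pair.update("a", 1)
cpu0.table["a"] = 999   # simulate corruption in the primary's memory only
pair.update("b", 2)     # primary detects it and halts; backup takes over
```

Because the backup holds its own copy of the state in a separate processor, the corruption injected into `cpu0` is not present in `cpu1`, which mirrors the point that the error condition is not repeated in the backup.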

Systems from other vendors that do not support process pairs cannot react to failures in this way. In those systems, the operating system tries to continue rather than use the fail-fast technique. Such systems are vulnerable to data integrity problems and error propagation.

Protection Against Invalid Application Operation

The server’s architecture also prevents an errant application from corrupting any data outside its environment. When such a process attempts to perform an invalid operation, the operating system aborts the process while allowing other processes in the same processor to continue to run.
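As a rough illustration of this isolation, the toy model below aborts one process that writes outside its own region while a neighbor in the same simulated processor keeps running. Everything here is an assumption made for the example: `MiniKernel`, `ProcessAborted`, the fixed address regions, and the process names are hypothetical and do not describe how the server actually enforces protection.

```python
class ProcessAborted(Exception):
    """Raised when the simulated operating system aborts a process."""


class MiniKernel:
    """Toy operating system that owns memory and checks every write."""

    def __init__(self, size):
        self.memory = [0] * size
        self.regions = {}      # process name -> addresses that process owns
        self.aborted = set()

    def spawn(self, name, start, length):
        self.regions[name] = range(start, start + length)

    def write(self, name, address, value):
        if name in self.aborted:
            raise ProcessAborted(f"{name} was already aborted")
        if address not in self.regions[name]:
            # Invalid operation: abort only the offending process.
            self.aborted.add(name)
            raise ProcessAborted(f"{name} wrote outside its region")
        self.memory[address] = value


kernel = MiniKernel(16)
kernel.spawn("payroll", 0, 8)
kernel.spawn("errant", 8, 8)

kernel.write("payroll", 3, 42)
try:
    kernel.write("errant", 3, -1)   # address 3 belongs to "payroll"
except ProcessAborted as exc:
    abort_reason = str(exc)
kernel.write("payroll", 4, 43)      # the other process continues to run
```

The invalid write is rejected before it reaches memory, so the data owned by the other process is left untouched.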

Extensive Hardware Error Checking

To perform fail-fast operations, the operating system must be able to diagnose a fault instantly and reliably. All critical modules undergo rigorous testing. The testing is done either by the operating system making periodic checks or by the individual modules performing self-tests. The kind of self-test depends on the subsystem but typically takes place as part of normal operation or when the modules are otherwise idle. If an error occurs, the module either reports it to the operating system for resolution or takes itself out of service.
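The two possible outcomes of a failed self-test can be sketched in a few lines. This is a hypothetical model: the `Module` class, the `can_report` flag that decides between the two outcomes, and the module names are assumptions for illustration, since the text does not say how a real module chooses between reporting and removing itself from service.

```python
class Module:
    """A hardware module modeled as an object with a pluggable self-test."""

    def __init__(self, name, self_test, can_report=True):
        self.name = name
        self.self_test = self_test     # callable returning True when healthy
        self.can_report = can_report   # assumed switch between the two outcomes
        self.in_service = True

    def check(self, error_log):
        if self.self_test():
            return "ok"
        if self.can_report:
            # Hand the error to the operating system for resolution.
            error_log.append(self.name)
            return "reported"
        # Otherwise the module takes itself out of service.
        self.in_service = False
        return "out of service"


error_log = []
healthy = Module("module 1", lambda: True)
faulty = Module("module 2", lambda: False)
silent = Module("module 3", lambda: False, can_report=False)
results = [m.check(error_log) for m in (healthy, faulty, silent)]
```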

Among the groups of hardware modules that are extensively checked are the following:

• The ServerNet fabric of the S-series or NS-series server

• Logic boards