Availability Guide for Application Design
Overview of Server and Network Fault Tolerance
Availability Guide for Application Design—525637-004
Fault Isolation
periodically sending out “I’m alive” messages to the other processors in addition to
sending the same message to itself. Each processor periodically checks for “I’m alive”
messages from all the other processors. If a processor repeatedly fails to send its “I’m
alive” message, then all the other processors declare that processor to be down and
refuse to process any messages from it until it has been reloaded and reintegrated into
the server system.
Figure 2-4 on page 2-9 shows the distribution and receipt of “I’m alive” messages.
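The detection logic described above can be sketched from the point of view of one processor. This is only an illustration, not the server system's actual implementation: the class name HeartbeatMonitor, its method names, and the max_missed threshold (standing in for "repeatedly fails") are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Tracks "I'm alive" messages as seen by one processor (illustrative only)."""
    peers: set                      # processor numbers being watched
    max_missed: int = 2             # hypothetical threshold for "repeatedly fails"
    missed: dict = field(default_factory=dict)
    heard: set = field(default_factory=set)
    down: set = field(default_factory=set)

    def receive(self, sender: int) -> bool:
        """Record an "I'm alive" message; refuse messages from a down processor."""
        if sender in self.down:
            return False
        self.heard.add(sender)
        return True

    def check(self) -> set:
        """Run one periodic check and return the processors newly declared down."""
        newly_down = set()
        for p in self.peers - self.down:
            if p in self.heard:
                self.missed[p] = 0
            else:
                self.missed[p] = self.missed.get(p, 0) + 1
                if self.missed[p] >= self.max_missed:
                    newly_down.add(p)
        self.down |= newly_down
        self.heard.clear()          # start a fresh checking interval
        return newly_down

    def reintegrate(self, p: int) -> None:
        """Accept a processor again after it has been reloaded."""
        self.down.discard(p)
        self.missed[p] = 0
```

In this sketch a processor that misses max_missed consecutive checks is declared down, and its messages are refused until reintegrate is called for it, which mirrors the reload-and-reintegrate step in the text.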
This scheme for detecting processor failures aids fault tolerance in two ways. It
prevents the failing processor from contaminating the rest of the server system. It also
allows the surviving processors to determine that they must take corrective measures,
such as taking ownership of the failed processor’s controllers and telling the backup
processes of any programs that were running in the failed processor to take over.
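The corrective measures named above can be illustrated with a small planning function. It is a sketch under stated assumptions: the function name plan_recovery, the data layout, and the idea that each controller lists one alternate processor are hypothetical and chosen only to make the two steps (controller ownership and backup notification) concrete.

```python
def plan_recovery(failed, controllers, process_pairs, survivors):
    """Return (new controller owners, backups to notify) after `failed` goes down.

    controllers:   controller name -> (owning processor, alternate processor)
    process_pairs: program name -> (primary processor, backup processor)
    survivors:     set of processors that are still up
    All of these structures are assumptions made for this illustration.
    """
    # Step 1: surviving alternates take ownership of the failed processor's controllers.
    new_owners = {}
    for name, (owner, alternate) in controllers.items():
        if owner == failed and alternate in survivors:
            new_owners[name] = alternate

    # Step 2: backups of programs that ran in the failed processor must take over.
    notify = [
        name
        for name, (primary, backup) in sorted(process_pairs.items())
        if primary == failed and backup in survivors
    ]
    return new_owners, notify
```

For example, with controllers {"disk0": (0, 1)} and process pairs {"app": (0, 2)}, a failure of processor 0 yields ownership of disk0 moving to processor 1 and the backup of app being told to take over.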
Independent Processes and Hardware Modules
The server system provides a high degree of insulation (hence error containment)
between software modules. A process does not share any state with other processes.
Instead, it communicates through messages carried on its behalf by the message
system. Thus, each system or application process executes independently.
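The no-shared-state rule can be illustrated with a toy message system that always delivers copies rather than references. The MessageSystem class and its methods are hypothetical stand-ins for this example and do not represent the server system's real message interface.

```python
import copy
from collections import deque


class MessageSystem:
    """Hypothetical stand-in for the message system: delivers copies, never references."""

    def __init__(self):
        self.mailboxes = {}

    def register(self, name):
        """Create an empty mailbox for a process."""
        self.mailboxes[name] = deque()

    def send(self, to, payload):
        # Deep-copy so the sender and receiver never share mutable state.
        self.mailboxes[to].append(copy.deepcopy(payload))

    def receive(self, name):
        """Return the oldest pending message, or None if the mailbox is empty."""
        box = self.mailboxes[name]
        return box.popleft() if box else None
```

Because the receiver gets its own copy, a sender that later corrupts its local data cannot change what the receiver already holds, which is the error-containment property the paragraph describes.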
Figure 2-4. Detecting Processor Failure
[Diagram: Processor 0, Processor 1, and Processor 2 exchanging "I'm Alive" messages]