NonStop Systems Introduction

The NonStop Kernel
NonStop Systems Introduction 527825-001
ServerNet fabrics in the stand-alone system and over external communications lines in
the network.
Support for Fault Tolerance
You have seen that the message system blurs the boundaries between the processors
in a NonStop system and enables multiple processors to function as a single system.
A process running in any processor can communicate with a process in any other
processor by using the message system to send requests and receive replies.
Similarly, you have seen that the message system can deliver messages to a process
running in another system as easily as it can deliver messages within the same
system.
But this is not all the message system does. It also supports the fault tolerance of a
NonStop system by delivering the following kinds of messages:
• It delivers “I’m alive” messages from each processor in the system to every other
  processor. This enables processors to check the status of every other processor in
  the system.
• It enables the primary process of a process pair to send checkpoint messages
  to a backup process.
Now consider each of these message system functions.
Processor Checking
A major goal of the multiprocessor architecture of NonStop systems is fault tolerance.
The basic idea is that if any processor in the system fails, other processors can take
over its workload by running copies of the processes that were running previously in
the failed processor. The individual processor failure does not stop the operation of the
system as a whole.
But how does the system know that one of its processors is failing? Facilities must be
provided for detecting a failing processor, removing it from the system, and repairing it
without bringing the rest of the system down. Because the message system can send
messages between processors, the NonStop system uses it as the basis for an
efficient failure-detection mechanism.
In planning for the failure of a processor, the designers of the NonStop system adopted
the following algorithm.
1. The operating system running in each processor sends a periodic “I’m alive” status
message to its own processor and to all other processors in the system. The
message indicates that the processor is functioning correctly.
2. At the same periodic intervals, each processor checks to see whether it has
received an “I’m alive” message from all the other processors.
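The two steps above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the NonStop operating system's implementation: the names Processor, send_alive, check_alive, and run_interval are hypothetical, message delivery is modeled as a direct update of in-memory sets, and the sketch only reports which processors were not heard from without modeling any response to a missing message.

```python
from dataclasses import dataclass, field


@dataclass
class Processor:
    """One processor in a simulated system (hypothetical model)."""
    cpu: int
    healthy: bool = True
    # Senders heard from during the current interval (includes itself).
    heard_from: set = field(default_factory=set)


def send_alive(processors):
    """Step 1: each healthy processor sends "I'm alive" to itself and all others."""
    for sender in processors:
        if not sender.healthy:
            continue  # a failed processor sends nothing
        for receiver in processors:
            if receiver.healthy:
                receiver.heard_from.add(sender.cpu)


def check_alive(processors):
    """Step 2: each healthy processor lists the processors it did not hear from."""
    all_cpus = {p.cpu for p in processors}
    missing = {}
    for p in processors:
        if p.healthy:
            missing[p.cpu] = sorted(all_cpus - p.heard_from)
            p.heard_from.clear()  # start a fresh interval
    return missing


def run_interval(processors):
    """Run one periodic interval: send status messages, then check them."""
    send_alive(processors)
    return check_alive(processors)


system = [Processor(cpu) for cpu in range(4)]
print(run_interval(system))   # {0: [], 1: [], 2: [], 3: []}

system[2].healthy = False     # simulate a failure of processor 2
print(run_interval(system))   # {0: [2], 1: [2], 3: [2]}
```

In the second interval, every surviving processor independently notices that processor 2 is silent, which reflects the point that each processor checks the status of every other processor rather than relying on a single monitor.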