NonStop Systems Introduction
The NonStop Kernel
NonStop Systems Introduction—527825-001
ServerNet fabrics in the stand-alone system and over external communications lines in 
the network.
Support for Fault Tolerance
You have seen that the message system blurs the boundaries between the processors 
in a NonStop system and enables multiple processors to function as a single system. 
A process running in any processor can communicate with a process in any other 
processor by using the message system to send requests and receive replies.
Similarly, you have seen that the message system can deliver messages to a process 
running in another system as easily as it can deliver messages within the same 
system.
But this is not all the message system does. It also supports the fault tolerance of a 
NonStop system by delivering the following kinds of messages:
• It delivers “I’m alive” messages from each processor in the system to every other 
processor. This enables processors to check the status of every other processor in 
the system.
• It enables the primary process of a process pair to send checkpoint messages 
to a backup process.
Now consider each of these message system functions.
Processor Checking
A major goal of the multiprocessor architecture of NonStop systems is fault tolerance. 
The basic idea is that if any processor in the system fails, other processors can take 
over its workload by running copies of the processes that were previously running in 
the failed processor. An individual processor failure therefore does not stop the 
operation of the system as a whole.
But how does the system know that one of its processors is failing? Facilities must be 
provided for detecting a failing processor, removing it from the system, and repairing it 
without bringing the rest of the system down. Because the message system can send 
messages between processors, the NonStop system uses it as the basis for an 
efficient failure-detection mechanism.
In planning for the failure of a processor, the designers of the NonStop system adopted 
the following algorithm.
1. The operating system running in each processor sends a periodic “I’m alive” status 
message to its own processor and to all other processors in the system. The 
message indicates that the processor is functioning correctly.
2. At the same periodic intervals, each processor checks to see whether it has 
received an “I’m alive” message from all the other processors.
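
As an illustration only, steps 1 and 2 can be sketched as a small single-threaded 
simulation. The Processor class, its fields, and the run_interval function are 
hypothetical names invented for this sketch; they are not NonStop operating system 
interfaces. The sketch assumes that messages arrive instantly and that a failed 
processor simply sends nothing. It covers only the sending and checking described 
in these two steps, not what the system does once a message is found to be missing.

```python
from dataclasses import dataclass, field


@dataclass
class Processor:
    """Hypothetical model of one processor; not an actual NonStop interface."""
    number: int
    healthy: bool = True
    heard_from: set = field(default_factory=set)  # senders seen this interval

    def send_alive(self, processors):
        # Step 1: send "I'm alive" to itself and to every other processor.
        if not self.healthy:
            return  # assumption: a failed processor sends nothing
        for target in processors:
            target.heard_from.add(self.number)

    def check_alive(self, processors):
        # Step 2: list the other processors that sent no message this interval.
        missing = sorted(
            p.number
            for p in processors
            if p.number != self.number and p.number not in self.heard_from
        )
        self.heard_from.clear()  # start the next interval with a clean record
        return missing


def run_interval(processors):
    """Run one send-then-check interval; return {checker: [silent processors]}."""
    for p in processors:
        p.send_alive(processors)
    return {p.number: p.check_alive(processors) for p in processors if p.healthy}


cpus = [Processor(n) for n in range(4)]
cpus[2].healthy = False    # simulate a failure of processor 2
print(run_interval(cpus))  # prints {0: [2], 1: [2], 3: [2]}
```

In this sketch every surviving processor independently notices that processor 2 
was silent, which mirrors the point that each processor checks all the others 
rather than relying on a single monitor.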










