NonStop Systems Introduction

The NonStop Kernel
NonStop Systems Introduction 527825-001
3. If a processor detects that two time periods have passed without receiving an
“I’m alive” message from some other processor, it compares notes with the other
running processors, and collectively they declare the “quiet” processor to be down.
4. The other processors then take over the workload of the failed processor.
Figure 6-8 on page 6-13 shows the transmission of “I’m alive” messages by the
message system in a three-processor system.
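
The detection rule in steps 3 and 4 can be sketched as a small simulation. This is only an illustrative model, not the NonStop Kernel's implementation: the names Processor, declare_down, and simulate are hypothetical, time is modeled as numbered periods, and "comparing notes" is assumed to mean that every responding processor must agree that the peer has been silent.

```python
from dataclasses import dataclass, field

# Per the text: two periods without an "I'm alive" message.
MISSED_PERIODS_LIMIT = 2


@dataclass
class Processor:
    """Hypothetical model of one processor's view of its peers."""
    cpu: int
    peers: list
    last_heard: dict = field(default_factory=dict)  # peer -> period of last message

    def receive_alive(self, sender, period):
        self.last_heard[sender] = period

    def suspects(self, current_period):
        """Peers that have been silent for at least MISSED_PERIODS_LIMIT periods."""
        return {p for p in self.peers
                if current_period - self.last_heard.get(p, 0) >= MISSED_PERIODS_LIMIT}


def declare_down(views, current_period):
    """Assumed agreement rule: a peer is down only if every responding
    processor independently considers it silent."""
    suspect_sets = [v.suspects(current_period) for v in views]
    return set.intersection(*suspect_sets) if suspect_sets else set()


def simulate(cpu_count, failed_cpu, fail_period, last_period):
    """Run message rounds; failed_cpu sends nothing from fail_period onward.
    Returns (period of declaration, set of processors declared down)."""
    cpus = [Processor(c, [p for p in range(cpu_count) if p != c])
            for c in range(cpu_count)]
    for period in range(1, last_period + 1):
        failed_now = period >= fail_period
        for sender in cpus:
            if failed_now and sender.cpu == failed_cpu:
                continue  # the failed processor is "quiet"
            for receiver in cpus:
                if receiver.cpu != sender.cpu:
                    receiver.receive_alive(sender.cpu, period)
        survivors = [c for c in cpus if not (failed_now and c.cpu == failed_cpu)]
        down = declare_down(survivors, period)
        if down:
            return period, down
    return None, set()


result = simulate(cpu_count=3, failed_cpu=2, fail_period=2, last_period=5)
print(result)
```

In this sketch, Processor 2 last sends in period 1, so periods 2 and 3 pass without a message and the two survivors declare it down in period 3. What the survivors then do to take over the workload (step 4) is outside the sketch.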
Process Pairs
You can see that the message system’s regular delivery of “I’m alive” messages to
each processor in a system guarantees that a failing processor is removed from the
system almost immediately, before it can cause any problems with processes running
in other processors. Other processors take over for the failing processor, and the
system continues to operate smoothly.
But what good would this fault containment be if the processes in the failing processor
did not have counterparts in healthy processors? When a processor fails, all the
processes that it was executing also fail. Therefore the system must allow processes
to run as process pairs.
The primary process in a process pair actively executes code in a particular processor.
The backup process occupies memory in another processor but does not actively
execute code. The primary process uses the message system to send a checkpoint
message to the backup process before it performs any critical operation (such as a
disk file update). In this way the backup process always has enough information to
take over from the primary process in case of failure.
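
As a rough illustration of the checkpoint-before-critical-operation rule, the sketch below models a process pair in a single program. It makes several assumptions that are not in the text: the class names PrimaryProcess and BackupProcess are hypothetical, the checkpoint is a direct method call rather than a message-system message, the disk is a dictionary that both processes can reach, and a processor failure is simulated by raising an exception.

```python
class BackupProcess:
    """Hypothetical backup: holds checkpointed state, does no work until takeover."""

    def __init__(self):
        self.checkpoint = None
        self.active = False

    def receive_checkpoint(self, state):
        # Copy the state so later changes in the primary do not leak in.
        self.checkpoint = dict(state)

    def take_over(self, disk):
        """Become the active process and finish the last checkpointed operation."""
        self.active = True
        if self.checkpoint is not None:
            # Re-applying the same write is harmless if the primary already did it.
            disk[self.checkpoint["key"]] = self.checkpoint["value"]


class PrimaryProcess:
    """Hypothetical primary: checkpoints to the backup before each disk update."""

    def __init__(self, backup, disk):
        self.backup = backup
        self.disk = disk

    def update(self, key, value, fail_before_write=False):
        # Checkpoint first, so the backup knows about the pending operation.
        self.backup.receive_checkpoint({"key": key, "value": value})
        if fail_before_write:
            # Simulated failure of the primary's processor (an assumption).
            raise RuntimeError("primary's processor failed")
        self.disk[key] = value


disk = {}
backup = BackupProcess()
primary = PrimaryProcess(backup, disk)

primary.update("balance", 100)
try:
    primary.update("balance", 250, fail_before_write=True)
except RuntimeError:
    backup.take_over(disk)

print(disk, backup.active)
```

Because the checkpoint is sent before the write, the backup in this sketch can complete the interrupted update even though the primary never reached the disk.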
Figure 6-9 on page 6-14 shows the implementation of a disk process as a process pair.
In normal operation, the primary process in Processor 0 periodically sends checkpoint
Figure 6-8. Processor Checking With “I’m Alive” Messages
[Diagram: Processor 0, Processor 1, and Processor 2 exchange “I’m alive” messages.]