Guardian Programmer’s Guide 421922-014
27 Fault-Tolerant Programming in C
The term “fault-tolerant” means that a single failure does not cause processing to stop.
At the hardware level, redundant hardware and duplication of paths allow systems to
tolerate a single-component failure. In many cases, multiple-component failures can
also be tolerated as long as they do not share common paths. Moreover, the
redundant paths are not duplicate backups; that is, all available resources are used for
processing—none are held in reserve for use as spare backups. The hardware
concepts used to achieve this fault tolerance are explained in the Introduction to
Tandem NonStop Systems.
Software can also be written to be fault-tolerant. Many software problems are transient;
that is, the problem is caused by an unusual environment state, typically resulting from
a transient hardware problem, an exceeded resource limit, or a race condition. In such
cases, resetting the program state to an earlier point and resuming execution often
succeeds because the environment has changed in the meantime.
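The idea of returning to an earlier program state and resuming can be illustrated with a minimal sketch in portable C. It is not taken from this guide and uses no Guardian calls: the names work_state, process_item, and sum_with_retry are hypothetical, and the transient fault is simulated by a counter (injected_faults) rather than by a real hardware or resource problem.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical application state: a running total over an input array. */
struct work_state {
    size_t next;   /* index of the next item to process */
    long   total;  /* sum of the items processed so far */
};

/* Process one item. The state is modified first; then a simulated
   transient fault may be reported, which leaves the state suspect. */
static bool process_item(struct work_state *s, const int *items,
                         int *faults_left)
{
    s->total += items[s->next];
    s->next++;
    if (*faults_left > 0) {
        (*faults_left)--;
        return false;          /* simulated transient failure */
    }
    return true;
}

/* Sum `count` items. After each successful step the state is saved;
   after a failed step the saved state is restored and the step is
   retried, up to max_retries times per step.
   Returns true and stores the sum in *result on success. */
bool sum_with_retry(const int *items, size_t count, int injected_faults,
                    int max_retries, long *result)
{
    struct work_state state = {0, 0};
    struct work_state saved = state;   /* last known-good state */
    int retries = 0;

    assert(result != NULL);
    assert(items != NULL || count == 0);

    while (state.next < count) {
        if (process_item(&state, items, &injected_faults)) {
            saved = state;             /* commit the new good state */
            retries = 0;
        } else {
            if (++retries > max_retries) {
                return false;          /* the fault does not look transient */
            }
            state = saved;             /* roll back and try again */
        }
    }
    *result = state.total;
    return true;
}
```

For example, calling sum_with_retry(items, 4, 2, 3, &sum) on the items 1, 2, 3, 4 encounters two simulated faults on the first item, rolls back each time, and still produces the sum 10. A retry limit is included because retrying helps only when the fault really is transient.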
An application does not execute in a fault-tolerant manner automatically; it must be
designed and implemented to run as a fault-tolerant program. This section describes
the approach to fault-tolerant programming known as active backup.
This section includes the following information:
• An overview of the activities an active backup program must perform.
• An overview of the tasks a programmer must complete to create an active backup
  program.
• A summary of the C language extensions that support active backup programming.
• An explanation of how to organize an active backup program.
• Two examples of active backup programs.
Overview of Active Backup Programming
In active backup programming, processes are executed in pairs: a primary process,
which performs the tasks of the underlying application, and a backup process, which is
ready to take over execution from the primary process should the primary process or
its CPU fail. Active backup programs have the following characteristics:
• Active backup uses process pairs to achieve fault tolerance.
• The primary process sends state information to the backup process. State
  information is information about the run-time environment that is required for the
  backup to take over for the primary.
• The backup process receives state information from the primary, detects a failed
  primary process or CPU, and takes over execution.
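The division of labor described above can be sketched as a single-program simulation in portable C. This is only an illustration under stated assumptions: it does not use the C language extensions or any Guardian procedure, both "processes" are ordinary structures in one program, and the names state_info, process, send_state, do_work, and run_pair are hypothetical. In a real process pair the state information would travel in interprocess messages and the failure would be reported by the system.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical state information: what the backup needs in order to resume. */
struct state_info {
    int  next_record;  /* next record the application will process */
    long balance;      /* result of the records processed so far */
};

/* Hypothetical stand-in for one member of the process pair. */
struct process {
    struct state_info state;
    bool alive;
};

/* Primary to backup: copy the current state information. */
static void send_state(const struct process *primary, struct process *backup)
{
    memcpy(&backup->state, &primary->state, sizeof backup->state);
}

/* One unit of application work: apply one record to the state. */
static void do_work(struct process *p, const int *records)
{
    p->state.balance += records[p->state.next_record];
    p->state.next_record++;
}

/* Run the pair over `count` records. The primary is made to fail just
   before it processes record `fail_at` (pass -1 for no failure).
   *backup_took_over (if not NULL) reports whether a takeover occurred.
   Returns the balance computed by whichever process finished the work. */
long run_pair(const int *records, int count, int fail_at,
              bool *backup_took_over)
{
    struct process primary = { {0, 0}, true };
    struct process backup  = { {0, 0}, true };
    struct process *active = &primary;

    assert(records != NULL || count <= 0);

    while (active->state.next_record < count) {
        if (active == &primary && primary.state.next_record == fail_at) {
            primary.alive = false;         /* simulated process or CPU failure */
        }
        if (active == &primary && !primary.alive) {
            active = &backup;              /* backup detects the failure and */
            continue;                      /* resumes from the last state sent */
        }
        do_work(active, records);
        if (active == &primary) {
            send_state(&primary, &backup); /* keep the backup up to date */
        }
    }
    if (backup_took_over != NULL) {
        *backup_took_over = (active == &backup);
    }
    return active->state.balance;
}
```

With the records 10, 20, 30 and fail_at set to 1, the primary processes the first record and sends its state, then fails; the backup resumes at the second record and finishes with a balance of 60, the same result as a run without a failure. The simulated failure falls only between units of work, after the state has been sent. A real program must also decide how to handle a failure that occurs after work is done but before the corresponding state reaches the backup.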