FORTRAN Reference Manual

Fault-Tolerant Programming
FORTRAN Reference Manual 528615-001
Overview of Fault-Tolerant Programs
The following actions occur when you run a fault-tolerant program:
1. The primary process opens the initial set of files required for its operation.
2. The primary process starts its backup process in another processor by executing a
   START BACKUP statement. In addition to starting the backup process, START BACKUP
   sends the backup process checkpoint information for the files that are open in
   the primary process. Process pairs open files in a way that permits both members
   of the pair to access the file. For disk files opened in this way, a record lock
   or file lock specified by the primary process is equivalent to a lock by the
   backup.
3. The backup process, at the start of its execution, automatically begins monitoring
   the primary process. The backup proceeds no further unless a failure occurs.
4. The primary process begins executing its main processing loop. At critical points
   in the loop (for example, just before write operations to disk files), the primary
   process executes CHECKPOINT statements to send program state and file control
   data to the backup process and to establish takeover points for the backup. A
   takeover point is established in the backup process by the most recently executed
   CHECKPOINT statement that does not specify STACK='NO'. OPEN and CLOSE
   statements also establish takeover points in the backup unless you specify
   STACK='NO' for those statements.
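The takeover-point rule described above can be illustrated with a small model. The following Python sketch is not FORTRAN and does not use any system interface; it only simulates the rule that the most recently executed CHECKPOINT, OPEN, or CLOSE statement that does not specify STACK='NO' determines where the backup resumes. The Event class, its labels, and the takeover_point function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Event:
    """One statement executed by the primary (hypothetical model)."""
    label: str          # illustrative name for the point in the program
    statement: str      # "CHECKPOINT", "OPEN", or "CLOSE"
    stack: bool = True  # False models a statement that specifies STACK='NO'


def takeover_point(events: List[Event]) -> Optional[str]:
    """Return the label of the most recent event that established a takeover point.

    Events that specify STACK='NO' are skipped. None means that none of the
    modeled statements has established a takeover point yet; this overview
    does not describe that case, so the model leaves it undefined.
    """
    for event in reversed(events):
        if event.stack:
            return event.label
    return None


history = [
    Event("after-open", "OPEN"),
    Event("loop-top", "CHECKPOINT"),
    Event("no-stack", "CHECKPOINT", stack=False),
]
print(takeover_point(history))  # loop-top
```

In this model the last CHECKPOINT specifies STACK='NO', so the takeover point remains at the earlier CHECKPOINT labeled loop-top.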
A program can contain many CHECKPOINT statements. You usually place
CHECKPOINT statements so that logical groupings of data are preserved in the
backup process.
For example, you often execute a CHECKPOINT statement immediately before a
WRITE statement. If the WRITE statement fails, or the processor in which your
primary process runs fails, all the processing up to the point of the WRITE
statement is preserved in the backup process. If the backup process takes over
processing, the first statement it executes is the WRITE statement, for which it
has all the information it needs. Here is an example:
Primary process:
    ...
    CHECKPOINT
    WRITE(6, 100) r, s
Primary’s processor fails, backup takes over:
    CHECKPOINT            <-- Backup does NOT re-execute
    WRITE(6, 100) r, s    <-- Backup begins HERE by re-executing the WRITE statement
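The same sequence can be traced in a small simulation. This Python sketch is a conceptual model only, not FORTRAN: a Python exception stands in for the processor failure, a list stands in for the output unit, and the Backup class, write_record, and run_pair are hypothetical names. It shows that after a failure the backup resumes at the WRITE with the checkpointed values of r and s and does not repeat the CHECKPOINT.

```python
import copy


class ProcessorFailure(Exception):
    """Stands in for the failure of the primary's processor (simulation only)."""


class Backup:
    """Holds the most recently checkpointed state (hypothetical model)."""

    def __init__(self):
        self.state = None
        self.resume_at = None

    def receive(self, state, resume_at):
        # Keep a private copy so later changes in the primary do not leak in.
        self.state = copy.deepcopy(state)
        self.resume_at = resume_at


def write_record(output, state, fail=False):
    """Model of WRITE(6, 100) r, s; optionally fails before writing."""
    if fail:
        raise ProcessorFailure("failure before the write completed")
    output.append((state["r"], state["s"]))


def run_pair(fail_on_write):
    output = []                      # stands in for the output unit
    backup = Backup()
    state = {"r": 1.5, "s": 2.5}     # illustrative values
    try:
        # CHECKPOINT: send state; the takeover point is the WRITE that follows.
        backup.receive(state, resume_at="WRITE")
        write_record(output, state, fail=fail_on_write)
        return "primary", output
    except ProcessorFailure:
        # Takeover: the backup starts at the WRITE, not at the CHECKPOINT.
        assert backup.resume_at == "WRITE"
        write_record(output, backup.state)
        return "backup", output


print(run_pair(fail_on_write=False))  # ('primary', [(1.5, 2.5)])
print(run_pair(fail_on_write=True))   # ('backup', [(1.5, 2.5)])
```

In both runs the record is written exactly once with the same values; only the process that performs the write differs.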