Availability Guide for Application Design

Designing Applications for Change
Availability Guide for Application Design 525637-004
Replacing the Process
The technique depends upon both V1 and V2 versions of the application program
providing the following:
Active rather than passive backup
This technique is unsuitable for many passive-backup process pairs because they
do not use restart checkpoints. A restart checkpoint carries the current
procedure call stack of the primary process, which contains the addresses of
functions called within the current object file. Changing the object file changes
these addresses, and the backup process cannot learn of the changes unless
a restart checkpoint is done.
Many passive-backup process pairs do not pass the stack content during a
checkpoint operation because the stack is an optional component of the
checkpoint data. A passive-backup process pair that uses this technique
without passing the stack content during checkpoints must be coded so that
it does not reference a stack address during the transition period, and it
should keep the transition period as brief as possible.
Code to select and enforce a suitable time for the transition
A suitable time is one at which no outstanding transactions exist and new
incoming requests can be delayed.
A “switch” command and code to report the success or failure of the transition
The Subsystem Control Facility (SCF) Kernel subsystem PRIMARY PROCESS
command is an example of this kind of command.
Code to recognize the difference between voluntary and involuntary termination of
a backup process, so that a new V1 backup is not automatically started at the
wrong time
Code in each version of the application for the backup process to exchange
context information (checkpoint data) with the prior, current, and next versions of
the primary process
This code provides both the migration mechanism and a fallback mechanism, in
case the process must be repeated to replace the newer code with a more stable
older version.
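The requirements above can be illustrated with a minimal Python sketch. It is
not NonStop code and calls no Guardian or SCF procedures. Every name in it
(Checkpoint, KNOWN_FIELDS, DEFAULTS, read_checkpoint, Primary,
should_restart_backup, and the field names request_count and last_client) is
hypothetical and chosen only for illustration. It assumes that both
"suitable time" conditions must hold together, and that a replacement backup
should be started only after an involuntary termination.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Tuple


class BackupExit(Enum):
    VOLUNTARY = "voluntary"      # backup was stopped on purpose for the switch
    INVOLUNTARY = "involuntary"  # backup failed unexpectedly


@dataclass
class Checkpoint:
    """Context passed between primary and backup (hypothetical layout)."""
    version: int                              # code version that wrote the record
    data: dict = field(default_factory=dict)  # named context fields


# Hypothetical field sets: V2 adds one field to the V1 context.
KNOWN_FIELDS = {1: {"request_count"}, 2: {"request_count", "last_client"}}
DEFAULTS = {"request_count": 0, "last_client": ""}


def read_checkpoint(record: Checkpoint, my_version: int) -> dict:
    """Interpret a record written by a prior, current, or next version.

    Fields unknown to this version are ignored (record from newer code).
    Fields the writer did not supply get defaults (record from older code).
    """
    wanted = KNOWN_FIELDS[my_version]
    return {name: record.data.get(name, DEFAULTS[name]) for name in wanted}


def should_restart_backup(how: BackupExit) -> bool:
    """Assumption: restart a backup only after an unexpected failure."""
    return how is BackupExit.INVOLUNTARY


@dataclass
class Primary:
    outstanding_transactions: int = 0
    can_delay_requests: bool = True

    def suitable_time(self) -> bool:
        """Assumes both conditions must hold before the transition."""
        return self.outstanding_transactions == 0 and self.can_delay_requests

    def handle_switch(self) -> Tuple[bool, str]:
        """Handle a "switch" command and report success or failure."""
        if not self.suitable_time():
            return False, "switch refused: not a suitable time"
        return True, "switch started"
```

In this sketch, read_checkpoint is what allows a V1 process to accept context
from V2 and the reverse, which is the property that makes both migration and
fallback possible. A real process pair would carry this context in its own
checkpoint messages rather than in Python dictionaries.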
The technique works as follows:
1. The process pair runs in its normal resilient mode, where the backup process
becomes the primary process when the original primary process fails. The primary
process is in processor 1 and the backup process is in processor 2. Both
processes are running version V1 code.
2. An operator utility such as an SCF module or a site-written interactive interface
sends a “switch” command to the primary process requesting that the following
happen: