Availability Guide for Application Design

Availability Through Process-Pairs and Monitors
Availability Guide for Application Design 525637-004
Approaches to Takeover
The challenge in designing a process pair is ensuring that, on takeover, the backup
process has the same context, data, and state information that the primary process
had when it failed. To achieve this condition, the backup process must take over
processing from a known point slightly before the point where the primary process
failed.
Code in the backup process is identical to that in the primary, so there is no problem in
reexecuting the same code. The difficulty is in synchronizing file system activities. The
backup process must repeat, retry, or safely back out any pending operations before
resuming normal operation. These tasks are especially challenging when your
application handles nowait depths greater than one; in that case, for example, you
must determine the completion status of all input and output operations.
As soon as the backup process resumes processing, it must also be ready to handle
retries of requests that were pending when the primary process failed.
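The idea of resuming from a known point can be sketched in a few lines. The following Python sketch is illustrative only: it does not use any Guardian procedure, and the names Checkpoint, Backup, and run_primary are hypothetical. It assumes the primary process sends the backup process a copy of its state before each operation, so that the backup process can repeat the operation that was pending at the time of failure.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    """State the backup needs in order to resume (hypothetical layout)."""
    next_item: int = 0                            # index of the operation about to start
    totals: list = field(default_factory=list)    # results of completed operations


class Backup:
    def __init__(self):
        self.last = Checkpoint()

    def receive(self, ckpt):
        # Keep a private copy so later changes in the primary do not leak in.
        self.last = copy.deepcopy(ckpt)

    def take_over(self, items):
        # Resume from the last checkpoint: the operation that was pending
        # when the primary failed is simply executed again.
        state = self.last
        while state.next_item < len(items):
            state.totals.append(items[state.next_item] * 2)
            state.next_item += 1
        return state.totals


def run_primary(items, backup, fail_at):
    """Process items, checkpointing before each one; stop at index fail_at."""
    state = Checkpoint()
    for i, item in enumerate(items):
        state.next_item = i
        backup.receive(state)      # checkpoint before starting the operation
        if i == fail_at:
            return                 # simulated failure in the middle of the operation
        state.totals.append(item * 2)


backup = Backup()
run_primary([1, 2, 3, 4], backup, fail_at=2)
result = backup.take_over([1, 2, 3, 4])
```

Here the pending operation is a pure computation, so repeating it is harmless. The harder case is a pending operation that is an I/O request with side effects, where the backup process cannot tell whether the request was carried out.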
Suppose, just before stopping, the primary process issued a request that a specific
action should be carried out but did not receive a response to indicate the completion
status. The backup process, on takeover, does not know whether the request was
completed. Should the backup process reissue the request and risk duplicating the
action? Or should the backup process ignore the request and risk not processing the
request at all?
Upon takeover, two strategies are possible for solving this problem and continuing
execution in a consistent state:
  - Repeat any request for which the completion status is unknown. File system
    synchronization identifiers allow the server of the request to identify and
    handle duplicate requests.
  - Roll back any changes that might have happened since a known point of
    consistency.
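The first strategy can be sketched as follows. This Python sketch is a simplified model, not the file system's actual mechanism: the class Server, its handle method, and the sync_id parameter are hypothetical names. It assumes that each request carries an identifier that stays the same when the request is reissued, and that the server saves the reply for the most recent identifier from each requester.

```python
class Server:
    """Toy server that applies each (requester, sync_id) at most once."""

    def __init__(self):
        self.balance = 0
        self.last_seen = {}    # requester -> (sync_id, saved reply)

    def handle(self, requester, sync_id, amount):
        seen = self.last_seen.get(requester)
        if seen is not None and seen[0] == sync_id:
            # Duplicate of a request that already completed:
            # return the saved reply without repeating the action.
            return seen[1]
        self.balance += amount
        reply = self.balance
        self.last_seen[requester] = (sync_id, reply)
        return reply


server = Server()
first = server.handle("pair-A", sync_id=7, amount=100)    # issued by the primary
retry = server.handle("pair-A", sync_id=7, amount=100)    # reissued by the backup
```

In this model the backup process can reissue the request without knowing whether the original completed: the retry returns the saved reply and the balance is updated only once.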
Operations that can be repeated without violating the integrity of the application are
known as retryable operations.
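As a small illustration of this definition, the Python sketch below contrasts an operation that overwrites a fixed location with one that appends. The function names and the in-memory lists standing in for a disk and a log are hypothetical. Repeating the overwrite leaves the data unchanged, while repeating the append does not.

```python
def write_sector(disk, sector, data):
    """Overwrite one whole sector; repeating the call yields the same disk."""
    disk[sector] = data


def append_record(log, record):
    """Append a record; repeating the call adds a second copy."""
    log.append(record)


disk = [b"", b"", b""]
write_sector(disk, 1, b"payload")
write_sector(disk, 1, b"payload")    # retry: disk contents do not change

log = []
append_record(log, "debit 100")
append_record(log, "debit 100")      # retry: the record is now duplicated
```

Under these assumptions, write_sector is retryable and append_record is not, unless the receiver can detect the duplicate.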
Operations That Are Retryable
For some operations, it is clearly appropriate to repeat the request. The full sector write
operation discussed in Section 2, Overview of Server and Network Fault Tolerance, is
one obvious example. Other examples include drilling a hole at (x, y) and launching a
specific missile.
In all these examples, it does not matter whether the operation successfully finished
when first requested by the primary process. If the primary request to drill a hole at
(x, y) was not carried out, then the backup process reissues the request and the hole is
drilled. If the primary request was carried out, then no harm is done by drilling the hole again
at the same location. Similarly, if the missile was not fired by the primary process, it