Availability Guide for Application Design

Availability Through Process-Pairs and Monitors

Availability Guide for Application Design—525637-004

Approaches to Takeover

The challenge in designing a process pair is making sure that the backup process on takeover has the same context, data, and state information that the primary process had when it failed. To achieve this condition, the backup process must take over processing from a known point slightly before the point where the primary process failed.

Code in the backup process is identical to that in the primary, so there is no problem in reexecuting the same code. The difficulty is in synchronizing file system activities. The backup process must repeat, retry, or safely back out any pending operations before resuming normal operation. These tasks are especially challenging when your application handles nowait depths greater than one; in that case, for example, you must determine the completion status of all input and output operations.

As soon as the backup process resumes processing, it must also be ready to handle retries of requests that were pending when the primary process failed.

Suppose, just before stopping, the primary process issued a request that a specific action should be carried out but did not receive a response to indicate the completion status. The backup process, on takeover, does not know whether the request was completed. Should the backup process reissue the request and risk duplicating the action? Or should the backup process ignore the request and risk not processing the request at all?

Upon takeover, two strategies are possible for solving this problem and continuing execution in a consistent state:

• Repeat any request for which the completion status is unknown. File system synchronization identifiers allow the server of the request to identify and handle duplicate requests.

• Roll back any changes that might have happened since a known point of consistency.

Operations that can be repeated without violating the integrity of the application are known as retryable operations.
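
The first strategy, repeating a request and letting the server detect the duplicate, can be illustrated with a minimal sketch. It is written in Python for brevity and does not use the NonStop file system API; the `Server` class, its `handle` method, and the `sync_id` parameter are hypothetical names chosen for illustration. The sketch assumes the server keeps the last synchronization identifier and reply for each requester, so a request repeated by the backup process is answered from the saved reply rather than applied a second time.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Hypothetical server that remembers the last request handled per requester."""

    balance: int = 0
    # requester name -> (last sync ID processed, reply that was sent for it)
    last_seen: dict = field(default_factory=dict)

    def handle(self, requester: str, sync_id: int, amount: int) -> int:
        seen = self.last_seen.get(requester)
        if seen is not None and seen[0] == sync_id:
            # Duplicate of a request that already completed: return the saved
            # reply instead of applying the change again.
            return seen[1]
        # New request: apply it and remember the reply for possible retries.
        self.balance += amount
        self.last_seen[requester] = (sync_id, self.balance)
        return self.balance


server = Server()

# The primary sends request 7 and fails before it sees the reply.
server.handle("pair-A", sync_id=7, amount=100)

# The backup takes over and repeats request 7 because its status is unknown.
reply = server.handle("pair-A", sync_id=7, amount=100)

print(reply, server.balance)  # prints "100 100": the update was applied once
```

With this arrangement the backup process does not need to know whether the original request completed: repeating it is safe in either case.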

Operations That Are Retryable

For some operations, it is clearly appropriate to repeat the request. The full sector write operation discussed in Section 2, Overview of Server and Network Fault Tolerance, is one obvious example. Other examples include drilling a hole at (x, y) and launching a specific missile.

In all these examples, it does not matter whether the operation successfully finished when first requested by the primary process. If the primary request to drill a hole at (x, y) was not carried out, then the backup process reissues the request and the hole is drilled. If the primary request was carried out, then no harm is done by drilling the hole again at the same location. Similarly, if the missile was not fired by the primary process, it