Availability Guide for Application Design

Overview of Server and Network Fault Tolerance

Availability Guide for Application Design—525637-004

System Process Pairs

Synchronization

The disk process and the application each maintain a synchronization block containing
a synchronization identifier. When the system is operating normally, these identifiers
are kept synchronized. In the event of a takeover by the backup process, however, they
provide a way to determine whether a write operation finished, and therefore whether
the operation needs to be retried.

Under normal operating conditions, the synchronization identifier works as follows.
When the primary disk process receives a write request, it increments its
synchronization identifier. It then checkpoints the synchronization block to the backup
process at the same time that it checkpoints the data. When the I/O operation finishes,
the primary process receives the completion status and sends a reply to the application.
The file system then increments the application’s copy of the synchronization
identifier. At this point, the identifiers held by the application and the disk process
are identical; that is, the two are synchronized.
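
The normal write path described above can be sketched in a few lines of Python. This is only an illustrative model, not the actual disk process or file system code: the names DiskProcess, Application, and normal_write are hypothetical, the disk is represented by a plain dictionary, and the completion status is simplified to the string "ok".

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class DiskProcess:
    """Hypothetical model of one member of a disk process pair."""
    sync_id: int = 0
    checkpoint: Optional[Tuple[str, str]] = None  # data received by checkpoint


@dataclass
class Application:
    """Hypothetical model of the application's synchronization block."""
    sync_id: int = 0


def normal_write(app, primary, backup, disk, key, value):
    # The primary increments its identifier on receiving the write request.
    primary.sync_id += 1
    # It checkpoints the synchronization block and the data together.
    backup.sync_id = primary.sync_id
    backup.checkpoint = (key, value)
    # The I/O finishes; the primary gets the completion status and replies.
    disk[key] = value
    status = "ok"
    # The file system increments the application's copy of the identifier.
    app.sync_id += 1
    return status


app, primary, backup, disk = Application(), DiskProcess(), DiskProcess(), {}
status = normal_write(app, primary, backup, disk, "record-1", "payload")
print(status, app.sync_id, primary.sync_id, backup.sync_id)  # ok 1 1 1
```

After the call, all three identifiers hold the same value, which is the synchronized state the text describes.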

If the primary process fails, the new primary process (the former backup) compares the
synchronization identifier of the next message it receives with the synchronization
identifier of the last processed message. If the identifiers match, the new message is
assumed to be a duplicate and the operation is not performed again; if they do not
match, the message is accepted as a new request and processed.
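
The comparison made after a takeover amounts to a single equality test. The sketch below assumes that each incoming message carries a synchronization identifier that can be compared directly with the one recorded for the last processed message; the names Disposition and classify are hypothetical and chosen only for illustration.

```python
from enum import Enum


class Disposition(Enum):
    DUPLICATE = "duplicate"      # already handled; do not perform it again
    NEW_REQUEST = "new request"  # not seen before; process it


def classify(message_sync_id: int, last_processed_sync_id: int) -> Disposition:
    """Compare the incoming identifier with that of the last processed message."""
    if message_sync_id == last_processed_sync_id:
        return Disposition.DUPLICATE
    return Disposition.NEW_REQUEST


print(classify(7, 7).value)  # duplicate
print(classify(8, 7).value)  # new request
```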

Consider how synchronization works in the following possible failure situations:

• The primary process fails before the checkpoint occurs.
• The primary process fails after the checkpoint occurs.

If the primary disk process fails before the checkpoint occurs, no write operations have
been performed against the disk. The backup disk process and the application are still
synchronized; that is, their synchronization identifiers are the same. When the file
system receives a path error on its message to the primary process, it automatically
retries the request to the backup process. On receipt of the new message, the
synchronization identifier in the backup is incremented. Because the synchronization
identifiers do not match, the backup process treats the message as a new request and
processes it.

If the primary process fails after the checkpoint occurs, the backup process has the
updated synchronization identifier and the information it needs to complete the
operation. Because there is no way of knowing whether the primary process failed before,
during, or after the write, the backup process performs the write operation in its
entirety. In this way, the write operation is assured of completion regardless of when
the primary process failed. The completion status of the operation is saved, but the
backup process cannot reply to the application because it has not received a request
from it. The backup process now begins to process new requests as they arrive on its
queue. When it receives the retry from the file system, the synchronization identifiers
match; the operation is not performed again, but the saved status is returned so that
the application and the disk process are once again synchronized.
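
The two failure situations can be compared side by side with a small simulation. This is a sketch under stated assumptions rather than a description of the real implementation: the retried request is assumed to carry the identifier the application expects the operation to receive (its own identifier plus one), the disk is a dictionary, and BackupDiskProcess, take_over, handle, and simulate are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class BackupDiskProcess:
    sync_id: int = 0                           # identifier of last processed request
    pending: Optional[Tuple[str, str]] = None  # data received by checkpoint, if any
    saved_status: Optional[str] = None
    writes_performed: int = 0

    def receive_checkpoint(self, sync_id, key, value):
        self.sync_id = sync_id
        self.pending = (key, value)

    def take_over(self, disk):
        # After a checkpoint, redo the write in full: the backup cannot know
        # whether the primary failed before, during, or after the write.
        if self.pending is not None:
            key, value = self.pending
            disk[key] = value
            self.writes_performed += 1
            self.saved_status = "ok"  # saved; no request to reply to yet
            self.pending = None

    def handle(self, request_sync_id, key, value, disk):
        if request_sync_id == self.sync_id:
            # Identifiers match: duplicate, so return the saved status only.
            return self.saved_status
        # Identifiers differ: a new request, so increment and process it.
        self.sync_id += 1
        disk[key] = value
        self.writes_performed += 1
        self.saved_status = "ok"
        return self.saved_status


def simulate(checkpoint_reached: bool):
    disk, app_sync_id, backup = {}, 0, BackupDiskProcess()
    request_sync_id = app_sync_id + 1  # assumption: identifier carried by request
    if checkpoint_reached:
        backup.receive_checkpoint(request_sync_id, "record-1", "payload")
    backup.take_over(disk)             # the primary fails at this point
    # The file system sees a path error and retries to the backup.
    status = backup.handle(request_sync_id, "record-1", "payload", disk)
    app_sync_id += 1                   # the file system increments on reply
    return status, backup.writes_performed, backup.sync_id, app_sync_id


print(simulate(checkpoint_reached=False))  # ('ok', 1, 1, 1)
print(simulate(checkpoint_reached=True))   # ('ok', 1, 1, 1)
```

In both cases the backup performs the write exactly once and the application ends up with the same identifier as the disk process; the only difference is whether the write happens at takeover or when the retry arrives.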