Availability Guide for Application Design
Overview of Server and Network Fault Tolerance
Availability Guide for Application Design—525637-004
System Process Pairs
Synchronization
The disk process and the application each maintain a synchronization block containing
a synchronization identifier. When the system is operating normally, these
synchronization identifiers are routinely kept synchronized. In the event of a takeover
by the backup process, however, they provide a way to determine whether a write
operation finished, and therefore whether there is a need to retry the operation.
Under normal operating conditions, the synchronization identifier works as follows.
When the primary disk process receives a write request, it increments its
synchronization identifier. It then checkpoints the synchronization block to the backup
process at the same time that it checkpoints the data. When the input or output is
finished, the primary process receives the completion status and sends a reply to the
application. The file system then increments the application’s version of the
synchronization identifier. At this point, the synchronization identifiers held by the
application and the disk process are identical; that is, the two are synchronized.
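The normal write path described above can be sketched as follows. This is a minimal illustration, not the disk process implementation: the names SyncBlock, DiskProcess, and normal_write are hypothetical, the synchronization block is reduced to a single integer, and checkpointing is modeled as a plain assignment.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SyncBlock:
    """Hypothetical synchronization block, reduced to its identifier."""
    sync_id: int = 0


@dataclass
class DiskProcess:
    """Hypothetical disk process state (primary or backup)."""
    sync: SyncBlock = field(default_factory=SyncBlock)
    pending: Optional[tuple] = None  # checkpointed (key, value), if any


def normal_write(app_sync, primary, backup, disk, key, value):
    """Walk one write request through the failure-free path."""
    # 1. The primary receives the request and increments its identifier.
    primary.sync.sync_id += 1
    # 2. It checkpoints the synchronization block together with the data.
    backup.sync.sync_id = primary.sync.sync_id
    backup.pending = (key, value)
    # 3. The I/O finishes and the primary replies with the completion status.
    disk[key] = value
    status = "ok"
    # 4. The file system increments the application's identifier.
    app_sync.sync_id += 1
    return status
```

After one call, the application, the primary, and the backup all hold the same identifier value, which is the synchronized state the text describes.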
If the primary process fails, the new primary process compares the synchronization
identifier of the next message it receives with the synchronization identifier of the last
processed message. If the synchronization identifiers match, the new message is
assumed to be a duplicate and the operation is not performed again; if they do not
match, the message is accepted as a new request and processed.
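The comparison made by the new primary can be sketched as a single function. The sketch assumes, for illustration only, that each request carries the identifier value it will have once the operation completes; the function name and the string results are hypothetical.

```python
def classify_request(request_sync_id, last_processed_sync_id):
    """Decide how the new primary treats the next message after a takeover."""
    if request_sync_id == last_processed_sync_id:
        # Identifiers match: the operation was already carried out.
        return "duplicate"
    # Identifiers differ: the operation has not been seen before.
    return "new"
```

For example, if the last processed identifier is 5, a retry carrying 5 is classified as a duplicate, while a request carrying 6 is classified as new.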
Consider how synchronization works in the following possible failure situations:
• The primary process fails before the checkpoint occurs.
• The primary process fails after the checkpoint occurs.
If the primary disk process fails before the checkpoint occurs, no write operations have
been performed against the disk. The backup disk process and the application are still
synchronized; that is, their synchronization identifiers are the same. The file system
automatically retries the request to the backup process when it receives a path error on
its message to the primary process. On receipt of the new message, the
synchronization identifier in the backup is incremented. Because the synchronization
identifiers do not match, the backup process processes the request as a new request.
If the primary process fails after the checkpoint occurs, the backup process has
the updated synchronization identifier and the information it needs to complete the
operation. The backup process performs the write operation in its entirety because
there is no way of knowing whether the primary process failed before, during, or after
the write. In this way, the write operation is guaranteed to complete regardless of when
the primary process failed. The completion status of the operation is saved, but the
backup process cannot reply to the application because it has not received a request
from it. The backup process now begins to process new requests as they are received
on its queue. When it receives the retry from the file system, the synchronization
identifiers match; the operation is not performed again, but the saved status is returned
so that the application and the disk process are once again synchronized.
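The two failure cases above can be simulated with a small model of the backup process. This is a sketch under stated assumptions, not the actual takeover logic: the Backup class and its fields are hypothetical, each retry is assumed to carry the identifier value the request will have once complete, and the comparison is made before the backup advances its own identifier.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Backup:
    """Hypothetical backup disk process state as of its last checkpoint."""
    sync_id: int                        # identifier from the last checkpoint
    pending: Optional[tuple] = None     # checkpointed (key, value), if any
    saved_status: Optional[str] = None  # status of a write redone at takeover
    disk: dict = field(default_factory=dict)
    writes: int = 0                     # writes performed by this process

    def write(self, key, value):
        self.disk[key] = value
        self.writes += 1
        return "ok"

    def take_over(self):
        # A checkpointed write is redone in full, because the backup cannot
        # tell whether the primary failed before, during, or after the write.
        if self.pending is not None:
            self.saved_status = self.write(*self.pending)
            self.pending = None

    def handle_retry(self, request_sync_id, key, value):
        if request_sync_id == self.sync_id:
            # Identifiers match: duplicate, so only the saved status is returned.
            return self.saved_status
        # Identifiers differ: treat as a new request, advance, and perform it.
        self.sync_id += 1
        return self.write(key, value)


# Case 1: failure before the checkpoint (backup still holds the old identifier).
before = Backup(sync_id=4)
before.take_over()
status_before = before.handle_retry(5, "k", "v")

# Case 2: failure after the checkpoint (backup holds the updated identifier).
after = Backup(sync_id=5, pending=("k", "v"))
after.take_over()
status_after = after.handle_retry(5, "k", "v")
```

In both cases the write is performed once by the backup and the application receives a completion status, which is the outcome the synchronization identifiers are meant to guarantee.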