Managing HP Serviceguard for Linux, Eighth Edition, March 2008

Designing Highly Available Cluster Applications
Controlling the Speed of Application Failover
Appendix B
Another example is an application where a clerk is entering data about a
new employee. Suppose this application requires that employee numbers
be unique, and that after the name and number of the new employee are
entered, a failure occurs. Since the employee number had been entered
before the failure, does the application refuse to allow it to be re-entered?
Does it require that the partially entered information be deleted first?
More appropriately, in a highly available environment the application
should let the clerk easily restart the entry or continue at the next
data item.
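The restartable entry described above can be sketched as follows. This is
a minimal illustration, not a Serviceguard interface: the EmployeeStore
class, its method names, and the field list are all hypothetical, and the
in-memory dictionaries stand in for a shared database that survives the
failover.

```python
from dataclasses import dataclass, field

# Hypothetical order of the data items on the entry form (an assumption).
FIELDS = ("name", "number", "department", "start_date")


@dataclass
class EmployeeStore:
    """Toy stand-in for the shared database that survives a failover."""
    complete: dict = field(default_factory=dict)  # number -> finished record
    pending: dict = field(default_factory=dict)   # number -> partial record

    def begin_or_resume(self, number, name):
        """Start an entry, or resume one that a failure interrupted."""
        if number in self.complete:
            # Uniqueness is enforced only against finished records.
            raise ValueError(f"employee number {number} is already in use")
        # A number that is merely pending may be re-entered: resume it.
        return self.pending.setdefault(number, {"number": number, "name": name})

    def next_field(self, number):
        """Return the first data item still missing, or None if none is."""
        record = self.pending[number]
        for f in FIELDS:
            if f not in record:
                return f
        return None

    def set_field(self, number, field_name, value):
        """Store one data item; promote the record once it is complete."""
        self.pending[number][field_name] = value
        if self.next_field(number) is None:
            self.complete[number] = self.pending.pop(number)
```

After a failover, the clerk re-enters the same employee number, and
next_field reports where to continue instead of the application rejecting
the number as a duplicate.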
Use Checkpoints
Design applications to checkpoint complex transactions. A single
transaction from the user's perspective may result in several actual
database transactions. Although this issue is related to restartable
transactions, here it is advisable to record progress locally on the client
so that a transaction that was interrupted by a system failure can be
completed after the failover occurs.
For example, suppose the application is calculating pi. On the original
system, the application has reached the 1,000th decimal place but has
not yet written anything to disk. At that moment, the node crashes. The
application is restarted on the second node, but it starts from scratch
and must recalculate those 1,000 decimal places. If the application had
instead written the digits to disk at regular intervals, it could have
resumed from the last checkpoint.
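The pi example can be sketched as follows. This is one possible approach
under stated assumptions, not a method prescribed by this manual: it uses
Gibbons' spigot algorithm to generate the digits, the function names and
the crash_after parameter (which only simulates a node failure) are
hypothetical, and the checkpoint file is assumed to live on storage that
the adoptive node can also reach.

```python
import json
import os
import tempfile

START = (1, 0, 1, 1, 3, 3)  # initial state of Gibbons' spigot algorithm


def _step(state):
    """Advance the spigot until it emits the next digit of pi."""
    q, r, t, k, n, l = state
    while True:
        if 4 * q + r - t < n * t:
            nxt = (10 * q, 10 * (r - n * t), t, k,
                   (10 * (3 * q + r)) // t - 10 * n, l)
            return n, nxt
        q, r, t, k, n, l = (q * k, (2 * q + r) * l, t * l, k + 1,
                            (q * (7 * k + 2) + r * l) // (t * l), l + 2)


def save_checkpoint(path, digits, state):
    """Write to a temporary file, then rename, so a crash cannot tear it."""
    # The state integers grow very large, so store them as hex strings.
    payload = {"digits": digits, "state": [hex(x) for x in state]}
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(payload, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return (digits, state); start from scratch if no checkpoint exists."""
    try:
        with open(path) as f:
            payload = json.load(f)
    except FileNotFoundError:
        return "", START
    return payload["digits"], tuple(int(x, 16) for x in payload["state"])


def compute_pi_digits(total, path, checkpoint_every=100, crash_after=None):
    """Produce `total` digits of pi, resuming from `path` when possible."""
    digits, state = load_checkpoint(path)
    produced = 0
    while len(digits) < total:
        digit, state = _step(state)
        digits += str(digit)
        produced += 1
        if len(digits) % checkpoint_every == 0:
            save_checkpoint(path, digits, state)
        if crash_after is not None and produced >= crash_after:
            raise RuntimeError("simulated node failure")
    save_checkpoint(path, digits, state)
    return digits
```

If the first run fails partway through, calling compute_pi_digits again
with the same path repeats only the digits produced since the last
checkpoint rather than the whole calculation.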
Balance Checkpoint Frequency with Performance
Checkpointing to disk costs performance, so balance checkpoint
frequency against that cost. If you checkpoint too often, the
application slows; if you do not checkpoint often enough, it takes
longer to return the application to the state it was in before the
failover. Ideally, the end user should be able to decide how often to
checkpoint. Applications should provide customizable parameters so that
the end user can tune the checkpoint frequency.
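A tunable checkpoint frequency might look like the following sketch. The
CheckpointPolicy class and the environment variable names
CKPT_EVERY_ITEMS and CKPT_EVERY_SECONDS are hypothetical, chosen only for
illustration; a real application could equally read these values from
its own configuration file.

```python
import os
import time


class CheckpointPolicy:
    """Decide when a checkpoint is due, based on user-tunable limits."""

    def __init__(self, every_n_items=1000, every_seconds=30.0,
                 clock=time.monotonic):
        if every_n_items < 1 or every_seconds <= 0:
            raise ValueError("checkpoint limits must be positive")
        self.every_n_items = every_n_items
        self.every_seconds = every_seconds
        self._clock = clock
        self._items = 0
        self._last = clock()

    def record_work(self, items=1):
        """Count units of work done since the last checkpoint."""
        self._items += items

    def due(self):
        """True once either the item limit or the time limit is reached."""
        return (self._items >= self.every_n_items
                or self._clock() - self._last >= self.every_seconds)

    def mark_saved(self):
        """Reset both counters after a checkpoint has been written."""
        self._items = 0
        self._last = self._clock()


def policy_from_env(env=None, clock=time.monotonic):
    """Build a policy from (hypothetical) end-user settings."""
    env = os.environ if env is None else env
    return CheckpointPolicy(
        every_n_items=int(env.get("CKPT_EVERY_ITEMS", "1000")),
        every_seconds=float(env.get("CKPT_EVERY_SECONDS", "30")),
        clock=clock,
    )
```

The application calls record_work as it progresses, writes a checkpoint
whenever due returns true, and then calls mark_saved. Lower limits
shorten recovery after a failover at the cost of more disk writes;
higher limits do the opposite.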