Managing HP Serviceguard for Linux, Eighth Edition, March 2008

Designing Highly Available Cluster Applications

Controlling the Speed of Application Failover

Appendix B

Another example is an application where a clerk is entering data about a
new employee. Suppose this application requires that employee numbers
be unique, and that a failure occurs after the name and number of the new
employee have been entered. Since the employee number was entered
before the failure, does the application refuse to allow it to be re-entered?
Does it require that the partially entered information be deleted first?
In a highly available environment, the application should instead allow
the clerk to easily restart the entry or to continue at the next data item.
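
One way to get this behavior is to treat a partially entered record as
resumable rather than as a duplicate. The following Python sketch is
illustrative only: EmployeeStore, EmployeeEntry, and the field list are
hypothetical names, and an in-memory dictionary stands in for the
database that would survive the failover.

```python
from dataclasses import dataclass, field

# Hypothetical order of the data items on the new-employee form.
FIELDS = ("name", "department", "start_date")


@dataclass
class EmployeeEntry:
    number: int
    values: dict = field(default_factory=dict)

    @property
    def complete(self) -> bool:
        return all(f in self.values for f in FIELDS)

    def next_field(self):
        """Return the first data item still missing, or None when complete."""
        for f in FIELDS:
            if f not in self.values:
                return f
        return None


class EmployeeStore:
    """In-memory stand-in for the shared database (an assumption)."""

    def __init__(self):
        self._entries = {}

    def open_entry(self, number: int) -> EmployeeEntry:
        entry = self._entries.get(number)
        if entry is None:
            entry = EmployeeEntry(number)
            self._entries[number] = entry
        elif entry.complete:
            # Only a finished record counts as a duplicate number.
            raise ValueError(f"employee number {number} is already in use")
        # A partial entry is returned as-is so the clerk can resume it.
        return entry

    def discard_entry(self, number: int) -> None:
        """Let the clerk start over by dropping a partial entry."""
        entry = self._entries.get(number)
        if entry is not None and not entry.complete:
            del self._entries[number]
```

After a failover, the clerk's client calls open_entry() with the same
employee number and uses next_field() to continue at the next data item,
or calls discard_entry() to restart the entry from the beginning.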

Use Checkpoints

Design applications to checkpoint complex transactions. A single
transaction from the user's perspective may result in several actual
database transactions. Although this issue is related to restartable
transactions, the advice here is to record progress locally on the client,
so that a transaction interrupted by a system failure can be completed
after the failover occurs.

For example, suppose the application is calculating pi. On the original
system, the application has reached the 1,000th decimal place but has not
yet written anything to disk. At that moment, the node crashes. The
application is restarted on the second node, but it starts from scratch
and must recalculate those 1,000 decimal places. If the application had
instead written the digits to disk at regular intervals, it could have
resumed from its last checkpoint.
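
The pi example can be sketched in Python as follows. This is a minimal
illustration, not a prescribed implementation: it assumes Gibbons'
unbounded spigot algorithm for generating the digits, a JSON checkpoint
file, and hypothetical names (compute_pi, save_checkpoint,
load_checkpoint, checkpoint_every). The checkpoint is written to a
temporary file and renamed into place, so a crash during the write
leaves the previous checkpoint intact.

```python
import json
import os
import tempfile

INITIAL_STATE = [1, 0, 1, 1, 3, 3]  # (q, r, t, k, n, l) for Gibbons' spigot


def save_checkpoint(path, state, digits):
    """Atomically write the algorithm state and the digits produced so far."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        # Hex strings avoid Python's length limit on decimal int conversion.
        json.dump({"state": [hex(v) for v in state], "digits": digits}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic rename over the old checkpoint


def load_checkpoint(path):
    """Return (state, digits) from the checkpoint, or a fresh start."""
    try:
        with open(path) as f:
            data = json.load(f)
    except FileNotFoundError:
        return list(INITIAL_STATE), ""
    return [int(s, 16) for s in data["state"]], data["digits"]


def compute_pi(path, total_digits, checkpoint_every=100):
    """Produce digits of pi, resuming from the checkpoint at `path` if any."""
    (q, r, t, k, n, l), digits = load_checkpoint(path)
    while len(digits) < total_digits:
        if 4 * q + r - t < n * t:
            # The next digit is settled: emit it and rescale the state.
            digits += str(n)
            q, r, n = 10 * q, 10 * (r - n * t), (10 * (3 * q + r)) // t - 10 * n
            if len(digits) % checkpoint_every == 0:
                save_checkpoint(path, [q, r, t, k, n, l], digits)
        else:
            # Not enough precision yet: consume one more term of the series.
            q, r, t, k, n, l = (q * k, (2 * q + r) * l, t * l, k + 1,
                                (q * (7 * k + 2) + r * l) // (t * l), l + 2)
    return digits[:total_digits]
```

If the process is killed and compute_pi() is started again with the same
path, it loses at most checkpoint_every digits of work. For the second
node to benefit, the checkpoint file would have to reside on storage
that node can also read.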

Balance Checkpoint Frequency with Performance

Checkpointing to disk costs performance, so checkpoint frequency must be
balanced against it. If you checkpoint too often, the application slows;
if you do not checkpoint often enough, it takes longer to bring the
application back to its pre-failure state after a failover. Ideally, the
end user should be able to decide how often to checkpoint, so
applications should provide customizable parameters for tuning the
checkpoint frequency.
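
A tunable checkpoint frequency might look like the following Python
sketch. The class name CheckpointPolicy and the environment variables
APP_CHECKPOINT_ITEMS and APP_CHECKPOINT_SECONDS are hypothetical, chosen
only for illustration. A checkpoint becomes due when either the item
count or the elapsed time reaches its configured limit.

```python
import os
import time


class CheckpointPolicy:
    """Decide when a checkpoint is due, based on user-tunable limits."""

    def __init__(self, every_items=1000, every_seconds=30.0,
                 clock=time.monotonic):
        self.every_items = every_items
        self.every_seconds = every_seconds
        self._clock = clock
        self._items = 0
        self._last = clock()

    @classmethod
    def from_env(cls, environ=os.environ):
        # Hypothetical variable names; the defaults are arbitrary examples.
        return cls(
            every_items=int(environ.get("APP_CHECKPOINT_ITEMS", "1000")),
            every_seconds=float(environ.get("APP_CHECKPOINT_SECONDS", "30")),
        )

    def record(self, items=1):
        """Note completed work; return True when a checkpoint is due."""
        self._items += items
        due = (self._items >= self.every_items
               or self._clock() - self._last >= self.every_seconds)
        if due:
            # Start a new interval once a checkpoint has been signalled.
            self._items = 0
            self._last = self._clock()
        return due
```

The application calls record() after each unit of work and writes a
checkpoint whenever it returns True. Lower limits shorten recovery after
a failover at the cost of more disk writes during normal operation.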