Managing HP Serviceguard for Linux, Sixth Edition, August 2006

Designing Highly Available Cluster Applications
Controlling the Speed of Application Failover
Appendix B
Keep Logs Small
Some databases permit logs to be buffered in memory to increase online
performance. Of course, when a failure occurs, any in-flight transaction
will be lost. However, minimizing the size of this in-memory log will
reduce the amount of completed transaction data that would be lost in
case of failure.
Keeping the size of the on-disk log small allows the log to be archived or
replicated more frequently, reducing the risk of data loss if a disaster
were to occur. There is, of course, a trade-off between online performance
and the size of the log.
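
As a rough illustration of the in-memory side of this trade-off, the
following sketch caps the number of log records held in memory and
flushes them to disk when the cap is reached. The class name
BoundedLogBuffer, the max_buffered parameter, and the demo function are
hypothetical names chosen for this example; they do not come from any
particular database product.

```python
import os
import tempfile


class BoundedLogBuffer:
    """Buffers log records in memory and flushes once a size limit is reached."""

    def __init__(self, path, max_buffered):
        self.path = path
        self.max_buffered = max_buffered  # hypothetical tuning knob
        self.buffer = []

    def append(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.max_buffered:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        with open(self.path, "a", encoding="utf-8") as f:
            for record in self.buffer:
                f.write(record + "\n")
            f.flush()
            os.fsync(f.fileno())  # force the records onto stable storage
        self.buffer.clear()

    def at_risk(self):
        """Number of completed records that a crash right now would lose."""
        return len(self.buffer)


def demo(max_buffered, n_records=10):
    """Return the largest number of unflushed records seen during a run."""
    path = os.path.join(tempfile.mkdtemp(), "txn.log")
    log = BoundedLogBuffer(path, max_buffered)
    worst = 0
    for i in range(n_records):
        log.append(f"txn-{i} committed")
        worst = max(worst, log.at_risk())
    return worst
```

In this sketch a smaller max_buffered means fewer completed records are
exposed to loss at any moment, at the cost of more frequent disk writes,
which is the performance trade-off described above.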
Eliminate Need for Local Data
When possible, eliminate the need for local data. In a three-tier,
client/server environment, the middle tier can often be dataless (i.e.,
there is no local data that is client specific or needs to be modified). This
“application server” tier can then provide additional levels of availability,
load-balancing, and failover. However, this scenario requires that all
data be stored either on the client (tier 1) or on the database server (tier
3).
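
A minimal sketch of a dataless middle tier follows. The Database and
AppServer classes, the deposit operation, and the node names are all
hypothetical stand-ins for this example. The point is only that the
application server keeps no client-specific state of its own, so a
second instance can take the next request without any hand-off.

```python
class Database:
    """Stand-in for the tier-3 database server (hypothetical)."""

    def __init__(self):
        self.balances = {}

    def get(self, account):
        return self.balances.get(account, 0)

    def put(self, account, value):
        self.balances[account] = value


class AppServer:
    """Dataless middle tier: holds a database handle but no client state."""

    def __init__(self, name, db):
        self.name = name
        self.db = db

    def deposit(self, account, amount):
        # Everything needed arrives in the request or is read from tier 3.
        new_balance = self.db.get(account) + amount
        self.db.put(account, new_balance)
        return new_balance


db = Database()
node_a = AppServer("node-a", db)
node_b = AppServer("node-b", db)

node_a.deposit("acct-1", 100)
# Suppose node-a fails here. The next request is routed to node-b,
# which needs nothing from node-a to continue.
balance = node_b.deposit("acct-1", 50)
```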
Use Restartable Transactions
Transactions need to be restartable so that the client does not need to
re-enter or back out of the transaction when a server fails and the
application is restarted on another system. In other words, if a failure
occurs in the middle of a transaction, there should be no need to start
over again from the beginning. This capability makes the application
more robust and reduces the visibility of a failover to the user.
A common example is a print job. Printer applications typically schedule
jobs. When one job completes, the scheduler goes on to the next job. If,
however, the system dies in the middle of a long job (say, one that
prints paychecks for 3 hours), what happens when the system comes back
up again? Does the job restart from the beginning, reprinting all the
paychecks? Does it resume from where it left off? Or does the scheduler
assume that the job was done and not print the last hour's worth of
paychecks? The correct behavior in a highly available environment is to
restart where it left off, ensuring that everyone gets one and only one
paycheck.
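
One way to get this behavior is to record a checkpoint after each item
is printed and to resume from that checkpoint on restart. The sketch
below assumes a checkpoint file that the restarted application can read
(in a cluster, this would have to live on storage shared by both nodes).
The function names, the JSON checkpoint format, and the fail_at
parameter used to simulate a crash are all hypothetical.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the index of the next item to print (0 if no checkpoint exists)."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)["next_index"]
    except FileNotFoundError:
        return 0


def save_checkpoint(path, next_index):
    """Write the checkpoint atomically so a crash never leaves it half-written."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump({"next_index": next_index}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic rename over the old checkpoint


def run_job(items, checkpoint_path, printed, fail_at=None):
    """Print items from the checkpoint onward; fail_at simulates a crash."""
    start = load_checkpoint(checkpoint_path)
    for i in range(start, len(items)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated node failure")
        printed.append(items[i])  # stand-in for sending the item to the printer
        save_checkpoint(checkpoint_path, i + 1)


paychecks = [f"paycheck-{n}" for n in range(5)]
ckpt = os.path.join(tempfile.mkdtemp(), "job.ckpt")
printed = []

try:
    run_job(paychecks, ckpt, printed, fail_at=3)  # first node dies mid-job
except RuntimeError:
    pass
run_job(paychecks, ckpt, printed)  # job restarted, resumes at item 3
```

In this sketch printing and checkpointing are two separate steps, so a
failure that lands exactly between them could still repeat one item. A
real application would need to make the two steps atomic or make the
print operation safe to repeat.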