Managing HP Serviceguard for Linux Ninth Edition, April 2009

reprinting all the paychecks, does the job start from where it left off, or does the scheduler assume that the job was done and not print the last hour's worth of paychecks? The correct behavior in a highly available environment is to restart where it left off, ensuring that everyone gets one and only one paycheck.

Another example is an application where a clerk is entering data about a new employee. Suppose this application requires that employee numbers be unique, and that a failure occurs after the name and number of the new employee are entered. Since the employee number had been entered before the failure, does the application refuse to allow it to be re-entered? Does it require that the partially entered information be deleted first? More appropriately, in a highly available environment the application will allow the clerk to easily restart the entry or to continue at the next data item.

Use Checkpoints

Design applications to checkpoint complex transactions. A single transaction from the user's perspective may result in several actual database transactions. Although this issue is related to restartable transactions, here it is advisable to record progress locally on the client so that a transaction that was interrupted by a system failure can be completed after the failover occurs.

For example, suppose the application is calculating pi. On the original system, the application has reached the 1,000th decimal place but has not yet written anything to disk. At that moment, the node crashes. The application is restarted on the second node, but it starts from scratch and must recalculate those 1,000 decimal places. However, if the application had written the digits to disk at regular intervals, it could have restarted from where it left off.
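
The pi example can be sketched in a few lines of Python. This is a minimal illustration, not part of Serviceguard: the function names, the JSON checkpoint layout, and the `fail_after` parameter (which simulates a node crash) are assumptions made for the example. The digits come from a widely circulated spigot algorithm attributed to Gibbons. The path is assumed to be on storage that the adoptive node can also reach.

```python
import json
import os

INITIAL_STATE = (1, 0, 1, 1, 3, 3)  # starting values for the spigot algorithm


def load_checkpoint(path):
    """Return (state, digits) from the checkpoint file, or a fresh start."""
    try:
        with open(path) as f:
            saved = json.load(f)
    except FileNotFoundError:
        return INITIAL_STATE, []
    # State integers are stored as hex strings because they grow very large.
    return tuple(int(s, 16) for s in saved["state"]), saved["digits"]


def save_checkpoint(path, state, digits):
    """Write the checkpoint to a temporary file, then rename it into place."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"state": [hex(v) for v in state], "digits": digits}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # a crash mid-write leaves the old checkpoint intact


def compute_pi_digits(path, target, checkpoint_every=100, fail_after=None):
    """Produce `target` digits of pi, resuming from `path` if it exists.

    fail_after (hypothetical) simulates a crash after that many new digits.
    """
    (q, r, t, k, n, l), digits = load_checkpoint(path)
    produced = 0
    while len(digits) < target:
        if 4 * q + r - t < n * t:
            digits.append(n)  # the next digit is now certain
            q, r, n = 10 * q, 10 * (r - n * t), (10 * (3 * q + r)) // t - 10 * n
            produced += 1
            if len(digits) % checkpoint_every == 0:
                save_checkpoint(path, (q, r, t, k, n, l), digits)
            if fail_after is not None and produced >= fail_after:
                raise RuntimeError("simulated node failure")
        else:
            q, r, t, k, n, l = (q * k, (2 * q + r) * l, t * l, k + 1,
                                (q * (7 * k + 2) + r * l) // (t * l), l + 2)
    save_checkpoint(path, (q, r, t, k, n, l), digits)
    return digits
```

If a run fails partway, calling `compute_pi_digits` again with the same path reloads the last saved state and continues from there, so only the digits produced since the last checkpoint are recomputed.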

Balance Checkpoint Frequency with Performance

It is important to balance checkpoint frequency with performance, because every checkpoint written to disk costs time. If you checkpoint too often, the application slows; if you don't checkpoint often enough, it takes longer to get the application back to its current state after a failover. Ideally, the end user should be able to decide how often to checkpoint. Applications should provide customizable parameters so the end user can tune the checkpoint frequency.
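
To make the trade-off concrete, the following sketch uses a deliberately rough model that is an assumption of this example, not something stated by the manual: each checkpoint takes a fixed time, and a failure is equally likely at any moment between two checkpoints. Under those assumptions, the overhead is the fraction of run time spent checkpointing, and the expected rework after a failover is half the interval. The function and parameter names are hypothetical.

```python
def checkpoint_tradeoff(interval_s, checkpoint_cost_s):
    """Estimate the cost of a checkpoint interval under a simple model.

    interval_s: seconds of useful work between checkpoints (user-tunable).
    checkpoint_cost_s: seconds one checkpoint takes (assumed constant).
    Returns (overhead_fraction, expected_rework_seconds).
    """
    overhead_fraction = checkpoint_cost_s / (interval_s + checkpoint_cost_s)
    expected_rework_s = interval_s / 2  # failure assumed uniform in the interval
    return overhead_fraction, expected_rework_s


if __name__ == "__main__":
    # Compare a few candidate intervals for a checkpoint that takes 2 seconds.
    for interval in (10, 60, 600):
        overhead, rework = checkpoint_tradeoff(interval, 2)
        print(f"interval={interval}s overhead={overhead:.1%} rework={rework:.0f}s")
```

Exposing `interval_s` as a configuration parameter lets the end user pick the point on this curve that suits the workload.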

Design for Multiple Servers

If you use multiple active servers, multiple service points can provide relatively transparent service to a client. However, this capability requires that the client know about the multiple servers and the priority for addressing them. It also requires access to the data of the failed server or to replicated data.

For example, rather than having a single application that fails over to a second system, consider having both systems running the application. After a failure of the first system, the second system simply takes over the load of the first system. This eliminates the startup time of the application. There are many ways to design this sort of architecture, and there are also many issues with this sort of design. This discussion will not go into details other than to give a few examples.
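
On the client side, knowing the servers and their priority can be as simple as the following sketch. The host names, port, and function name are hypothetical, and the `connect` parameter exists only so that the connection step can be replaced in tests; by default it uses the standard `socket.create_connection`. Access to the failed server's data or a replica is assumed to be handled elsewhere.

```python
import socket

# Hypothetical servers, listed with a priority (lower number is tried first).
SERVERS = [
    (1, "node1.example.com", 5000),
    (2, "node2.example.com", 5000),
]


def connect_by_priority(servers, connect=socket.create_connection, timeout=2.0):
    """Return (host, connection) for the highest-priority server that answers."""
    errors = []
    for _priority, host, port in sorted(servers):
        try:
            return host, connect((host, port), timeout)
        except OSError as exc:
            # This server is unreachable; remember why and try the next one.
            errors.append(f"{host}:{port}: {exc}")
    raise ConnectionError("no server reachable: " + "; ".join(errors))
```

A client would call `connect_by_priority(SERVERS)` at startup and again whenever its current connection fails, so that a surviving server picks up the load without the client waiting for an application restart.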

Controlling the Speed of Application Failover 293