Managing HP Serviceguard A.11.20.10 for Linux, December 2012

the beginning. This capability makes the application more robust and reduces the visibility of a

failover to the user.

A common example is a print job. Printer applications typically schedule jobs. When one job

completes, the scheduler goes on to the next job. If, however, the system dies in the middle of a

long job (say it is printing paychecks for 3 hours), what happens when the system comes back up

again? Does the job restart from the beginning, reprinting all the paychecks? Does the job start

from where it left off? Or does the scheduler assume that the job was done and not print the last

hours' worth of paychecks? The correct behavior in a highly available environment is to restart

where it left off, ensuring that everyone gets one and only one paycheck.
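
The restart-where-it-left-off behavior can be sketched as follows. This is a minimal illustration in Python, not part of Serviceguard: the function names, the progress file format, and the print_item callback are all hypothetical, and the progress file is assumed to live on storage that the adoptive node can also reach.

```python
import json
import os
import tempfile


def load_progress(path):
    """Return the index of the next unprinted item, or 0 if no progress was recorded."""
    try:
        with open(path) as f:
            return json.load(f)["next_index"]
    except FileNotFoundError:
        return 0


def save_progress(path, next_index):
    """Record the next index atomically so a crash never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"next_index": next_index}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # replaces the old progress file in one step


def run_print_job(items, progress_path, print_item):
    """Print items from the recorded position onward, recording progress after each one."""
    start = load_progress(progress_path)
    for i in range(start, len(items)):
        print_item(items[i])  # hypothetical callback that sends one item to the printer
        save_progress(progress_path, i + 1)
```

When the job is restarted after a failover, run_print_job skips every item already recorded as printed. A crash between printing an item and recording it could still reprint that single item, so a strict one-and-only-one guarantee would also need cooperation from the printing step itself.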

Another example is an application where a clerk is entering data about a new employee. Suppose

this application requires that employee numbers be unique, and that after the name and number

of the new employee are entered, a failure occurs. Since the employee number had been entered

before the failure, does the application refuse to allow it to be re-entered? Does it require that the

partially entered information be deleted first? More appropriately, in a highly available environment

the application will allow the clerk to easily restart the entry or to continue at the next data item.

A.2.5 Use Checkpoints

Design applications to checkpoint complex transactions. A single transaction from the user's

perspective may result in several actual database transactions. Although this issue is related to

restartable transactions, here it is advisable to record progress locally on the client so that a

transaction that was interrupted by a system failure can be completed after the failover occurs.

For example, suppose the application being used is calculating pi. On the original system, the

application has gotten to the 1,000th decimal place, but the application has not yet written anything

to disk. At that moment, the node crashes. The application is restarted on the second node,

but the application is started up from scratch. The application must recalculate those 1,000 decimal

places. However, if the application had written the decimal places to disk on a regular basis, the

application could have restarted from where it left off.
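
The pi example can be sketched in Python as shown below. This is an illustration only, under several assumptions: it uses a well-known unbounded spigot algorithm for the digits of pi (the manual does not say how the application computes them), the function names and checkpoint format are hypothetical, and the checkpoint file is assumed to be on storage visible to the second node.

```python
import os
import pickle

# Starting state of the spigot algorithm, plus the digits produced so far.
INITIAL_STATE = {"q": 1, "r": 0, "t": 1, "k": 1, "n": 3, "l": 3, "digits": ""}


def next_digit(s):
    """Advance the state in place until one more digit of pi is produced."""
    while True:
        q, r, t, k, n, l = s["q"], s["r"], s["t"], s["k"], s["n"], s["l"]
        if 4 * q + r - t < n * t:
            # The candidate digit n is now certain: emit it and rescale.
            s["q"], s["r"] = 10 * q, 10 * (r - n * t)
            s["n"] = (10 * (3 * q + r)) // t - 10 * n
            s["digits"] += str(n)
            return n
        # Otherwise consume one more term of the series.
        s["q"], s["r"], s["t"] = q * k, (2 * q + r) * l, t * l
        s["n"] = (q * (7 * k) + 2 + r * l) // (t * l)
        s["k"], s["l"] = k + 1, l + 2


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return dict(INITIAL_STATE)


def save_checkpoint(path, state):
    """Write the state to a temporary file, then replace the old checkpoint in one step."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def compute_pi(total_digits, path, checkpoint_every=100):
    """Compute digits of pi, resuming from the checkpoint at path if one exists."""
    state = load_checkpoint(path)
    while len(state["digits"]) < total_digits:
        next_digit(state)
        if len(state["digits"]) % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["digits"]
```

If the node fails partway through, calling compute_pi again with the same path loses at most checkpoint_every digits of work. The sketch stores the state with pickle rather than as decimal text because the internal integers grow very large, and recent Python versions limit decimal conversion of very large integers by default.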

A.2.5.1 Balance Checkpoint Frequency with Performance

It is important to balance checkpoint frequency with performance. The trade-off with checkpointing

to disk is the impact of this checkpointing on performance. If you checkpoint too often,

the application slows; if you don't checkpoint often enough, it will take longer to get the application

back to its current state after a failover. Ideally, the end-user should be able to decide how often

to checkpoint. Applications should provide customizable parameters so the end-user can tune the

checkpoint frequency.
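
One way to expose such tunable parameters is sketched below in Python. The class name, the default limits, and the environment variable names APP_CHECKPOINT_ITEMS and APP_CHECKPOINT_SECONDS are hypothetical; a real application would use whatever configuration mechanism it already has.

```python
import time


class CheckpointPolicy:
    """Decide when a checkpoint is due, based on user-tunable limits."""

    def __init__(self, max_items=1000, max_seconds=60.0, clock=time.monotonic):
        if max_items < 1 or max_seconds <= 0:
            raise ValueError("checkpoint limits must be positive")
        self.max_items = max_items
        self.max_seconds = max_seconds
        self._clock = clock  # injectable so the policy can be tested without waiting
        self._items = 0
        self._last = clock()

    def record_item(self):
        """Count one unit of work; return True if a checkpoint is now due."""
        self._items += 1
        return (self._items >= self.max_items
                or self._clock() - self._last >= self.max_seconds)

    def checkpoint_done(self):
        """Reset the counters after the application has written a checkpoint."""
        self._items = 0
        self._last = self._clock()


def policy_from_env(environ):
    """Build a policy from environment-style settings (variable names are hypothetical)."""
    return CheckpointPolicy(
        max_items=int(environ.get("APP_CHECKPOINT_ITEMS", "1000")),
        max_seconds=float(environ.get("APP_CHECKPOINT_SECONDS", "60")),
    )
```

Lower limits shorten recovery after a failover at the cost of more disk writes during normal operation; higher limits do the opposite. Exposing both limits leaves that choice to the end-user.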

A.2.6 Design for Multiple Servers

If you use multiple active servers, multiple service points can provide relatively transparent service

to a client. However, this capability requires that the client be smart enough to have knowledge

about the multiple servers and the priority for addressing them. It also requires access to the data

of the failed server or replicated data.
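
A client that knows about multiple servers and the priority for addressing them might look like the following Python sketch. The function name and the send callback are hypothetical stand-ins for whatever protocol the application actually uses.

```python
def call_with_failover(servers, request, send):
    """Try each server in priority order; return (server, reply) from the first that answers.

    servers is a list ordered from most to least preferred. send is a
    hypothetical callback that delivers the request to one server and
    raises ConnectionError or TimeoutError if that server cannot be reached.
    """
    errors = []
    for server in servers:
        try:
            return server, send(server, request)
        except (ConnectionError, TimeoutError) as exc:
            errors.append((server, exc))  # remember the failure and try the next server
    raise ConnectionError(f"all servers failed: {errors}")
```

This approach assumes a request can safely be sent again to another server, which is one reason the restartable-transaction guidelines earlier in this appendix matter for multi-server designs.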

For example, rather than having a single application which fails over to a second system, consider

having both systems running the application. After a failure of the first system, the second system

simply takes over the load of the first system. This eliminates the startup time of the application.

There are many ways to design this sort of architecture, and there are also many issues with this

sort of design. This discussion will not go into details other than to give a few examples.

The simplest method is to have two applications running in a master/slave relationship where the

slave is simply a hot standby application for the master. When the master fails, the slave on the

second system would still need to figure out what state the data was in (i.e., data recovery would

still take place). However, the time to fork the application and do the initial startup is saved.

Another possibility is having two applications that are both active. An example might be two

application servers which feed a database. Half of the clients connect to one application server

and half of the clients connect to the second application server. If one server fails, then all the

clients connect to the remaining application server.
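
The two-active-servers arrangement can be sketched in Python as follows. The function name, the use of a CRC-32 hash to spread clients, and the is_up health check are assumptions made for illustration; the hash spreads clients roughly, not exactly, in half.

```python
import zlib


def assign_server(client_id, servers, is_up):
    """Return the application server a client should use.

    Each client has a stable preferred server chosen by hashing its ID.
    If that server is down, the client moves to the next live server.
    is_up is a hypothetical health-check callback taking a server name.
    """
    if not servers:
        raise ValueError("at least one server is required")
    start = zlib.crc32(client_id.encode("utf-8")) % len(servers)
    for offset in range(len(servers)):
        candidate = servers[(start + offset) % len(servers)]
        if is_up(candidate):
            return candidate
    raise ConnectionError("no application server is available")
```

With both servers up, each client consistently reaches its preferred server. When one server fails, every client resolves to the remaining server, which must therefore be sized to carry the full load.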

260 Designing Highly Available Cluster Applications