Managing HP Serviceguard A.11.20.10 for Linux, December 2012


• Minimize the reentry of data.

• Engineer the system for reserve capacity to minimize the performance degradation experienced by users.

A.1.2 Define Application Startup and Shutdown

Applications must be restartable without manual intervention. If the application requires a switch to be flipped on a piece of hardware, then automated restart is impossible. Procedures for application startup, shutdown, and monitoring must be created so that the HA software can perform these functions automatically.

To ensure an automated response, define procedures for starting and stopping the application. In Serviceguard these procedures are placed in the package control script. These procedures must check for errors and return status to the HA control software. The startup and shutdown should be command-line driven and not interactive unless all of the answers can be predetermined and scripted.
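
As an illustration of such procedures, the following shell sketch wraps the start and stop commands so that each one runs without terminal input, reports a failure, and returns its exit status to the caller. The command paths (/opt/myapp/bin/myapp_start and /opt/myapp/bin/myapp_stop) and the function names are hypothetical placeholders, not part of Serviceguard; a real package control script would call the application's own commands.

```shell
#!/bin/sh
# Hypothetical application commands (assumptions for this sketch);
# replace them with the real, non-interactive start and stop commands.
APP_START_CMD=${APP_START_CMD:-/opt/myapp/bin/myapp_start}
APP_STOP_CMD=${APP_STOP_CMD:-/opt/myapp/bin/myapp_stop}

# Run one step with stdin detached so it can never wait for operator input,
# log a message if it fails, and hand the exit status back to the caller.
run_step() {
    step_name=$1
    shift
    "$@" </dev/null
    status=$?
    if [ "$status" -ne 0 ]; then
        echo "ERROR: $step_name failed with status $status" >&2
    fi
    return "$status"
}

# The command variables are left unquoted on purpose so that a value such as
# "myapp_start --config /etc/myapp.conf" is split into command and arguments.
start_app() { run_step "startup" $APP_START_CMD; }
stop_app()  { run_step "shutdown" $APP_STOP_CMD; }
```

A calling script could then use start_app || exit 1 so that a failed startup is reported to the HA software as a non-zero status instead of being silently ignored.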

In an HA failover environment, HA software restarts the application on a surviving system in the cluster that has the necessary resources, such as access to the necessary disk drives. The application must be restartable in two aspects:

• It must be able to restart and recover on the backup system (or on the same system if the application restart option is chosen).

• It must be able to restart if it fails during the startup and the cause of the failure is resolved.
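
One common obstacle to the second requirement is state left behind by a failed startup, such as a stale PID file that makes the next attempt believe the application is already running. The sketch below shows one possible pre-start check. The PID file path and the function name are hypothetical, and the script is assumed to run as a user that is allowed to signal the application's processes.

```shell
#!/bin/sh
# Hypothetical PID file location (an assumption for this sketch).
PID_FILE=${PID_FILE:-/var/run/myapp.pid}

# Remove a PID file left over from a crashed or aborted start.
# Returns 0 when it is safe to start, 1 when a live process still owns the file.
clear_stale_pid_file() {
    [ -f "$PID_FILE" ] || return 0
    pid=$(cat "$PID_FILE" 2>/dev/null)
    # kill -0 sends no signal; it only tests whether the process can be signalled.
    if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
        echo "application already running as PID $pid" >&2
        return 1
    fi
    rm -f "$PID_FILE"
}
```

Calling clear_stale_pid_file at the top of the startup procedure lets a restart proceed once the original cause of the failure has been fixed, without an operator having to clean up by hand.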

Application administrators need to learn to start up and shut down applications using the appropriate HA commands. Inadvertently shutting down the application directly will initiate an unwanted failover. Application administrators also need to be careful that they don't accidentally shut down a production instance of an application rather than a test instance in a development environment.

A mechanism to monitor whether the application is active is necessary so that the HA software knows when the application has failed. This may be as simple as a script that issues the command ps -ef | grep xxx for all the processes belonging to the application.
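
A minimal sketch of such a monitor script follows. It is a variation on the ps -ef | grep approach: it lists command names only (ps -e -o comm=) and matches them exactly, so the grep process cannot match its own command line. The function names and the polling interval are assumptions chosen for illustration.

```shell
#!/bin/sh
# Return 0 only if every named process appears in the process table.
# Note: on Linux the comm field is truncated to 15 characters, so longer
# program names must be given in their truncated form.
all_running() {
    for name in "$@"; do
        if ! ps -e -o comm= | grep -qxF -- "$name"; then
            echo "process $name is not running" >&2
            return 1
        fi
    done
    return 0
}

# Poll every <interval> seconds; return non-zero as soon as a process is
# missing, so that whatever launched the monitor can treat the exit as a failure.
monitor_app() {
    interval=$1
    shift
    while all_running "$@"; do
        sleep "$interval"
    done
    return 1
}
```

For example, monitor_app 5 myapp_server myapp_logger (both process names hypothetical) would keep running while both processes exist and return 1 when either one disappears.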

To reduce the impact on users, the application should not simply abort in case of error, since aborting would cause an unneeded failover to a backup system. Applications should determine the exact error and take specific action to recover from the error rather than, for example, aborting upon receipt of any error.
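
The difference between aborting on any error and recovering from specific ones can be sketched in shell as follows. The sketch assumes a hypothetical convention in which the wrapped command exits with status 75 (EX_TEMPFAIL in sysexits.h) for a temporary condition that is worth retrying; any other non-zero status is treated as unrecoverable and passed back unchanged. The function name and the RETRY_DELAY variable are illustrative.

```shell
#!/bin/sh
# Seconds to wait between attempts (assumed default for this sketch).
RETRY_DELAY=${RETRY_DELAY:-1}

# Usage: run_with_recovery <max_attempts> <command> [args...]
# Retries only when the command reports a temporary failure (status 75);
# success and all other failures are returned to the caller immediately.
run_with_recovery() {
    max_attempts=$1
    shift
    attempt=1
    while :; do
        "$@"
        status=$?
        case $status in
            0)
                return 0
                ;;
            75)
                if [ "$attempt" -ge "$max_attempts" ]; then
                    echo "giving up after $attempt attempts" >&2
                    return "$status"
                fi
                attempt=$((attempt + 1))
                sleep "$RETRY_DELAY"
                ;;
            *)
                return "$status"
                ;;
        esac
    done
}
```

With this structure a brief, recoverable condition is handled in place, and only a persistent or unrecoverable error surfaces as a failure that could lead to a failover.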

A.2 Controlling the Speed of Application Failover

What steps can be taken to ensure the fastest failover?

If a failure does occur, causing the application to be moved (failed over) to another node, there are many things the application can do to reduce the amount of time it takes to get the application back up and running. The topics covered are as follows:

• Replicate Non-Data File Systems

• Use Raw Volumes

• Evaluate the Use of a Journaled File System

• Minimize Data Loss

• Use Restartable Transactions

• Use Checkpoints

• Design for Multiple Servers

• Design for Replicated Data Sites

A.2.1 Replicate Non-Data File Systems

Non-data file systems should be replicated rather than shared. There can only be one copy of the application data itself. It will be located on a set of disks that is accessed by the system that is
