Optimizing Serviceguard Failover Time, Version A.11.19 and later, April 2009

If your application takes a long time to recover and restart, set the MEMBER_TIMEOUT conservatively.

If your database takes several minutes to recover, it isn’t worth risking an unnecessary failover to shave a few seconds off the Serviceguard failover time.

If your application restarts quickly, you can afford to set the MEMBER_TIMEOUT more aggressively.

When your application takes only a few seconds to restart, there is little benefit in waiting a few seconds for an interruption to recover. You can afford to try a short timeout when you have short recovery and restart times.

Small, lightly loaded systems are likely to have fewer interruptions, and they are likely to recover more quickly. Highly loaded systems with a large number of disks are likely to have more frequent interruptions, and they are likely to take longer to recover. Try to spread the load and avoid spikes in activity. Set the MEMBER_TIMEOUT to allow recovery time for interruptions when the load is heaviest.

Virtual partitions may have different latency characteristics than independent nodes due to hardware and firmware sharing. If you will be using virtual partitions in your cluster, there may be additional considerations when testing the MEMBER_TIMEOUT value for your configuration. For more information, see the white paper “Serviceguard Cluster Configuration for Partitioned Systems”, available from the HP Technical Documentation site at www.docs.hp.com/hpux/ha.

Testing

To fine-tune the parameters, it is important to test the cluster in an environment that imitates the actual production environment. Test the cluster, running all of its packages, with the heaviest expected loads on networks, CPU, and I/O.

To time failover, force each package to fail over to another node. One way to force a failover and cluster reformation is to power off the node where the package is running. Read the logs of the failover, noting the time stamps.
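As a rough illustration of turning two noted time stamps into a failover duration, the following sketch may help. The time stamp layout, the function name `failover_seconds`, and the sample values are assumptions made for this example; they are not taken from Serviceguard logs, so adjust the format string to match what your logs actually contain.

```python
from datetime import datetime

# Assumed time stamp layout for this sketch only; change it to match your logs.
TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def failover_seconds(failure_ts: str, package_up_ts: str) -> float:
    """Return the seconds elapsed between the forced failure and the
    moment the package is reported running on the adoptive node."""
    start = datetime.strptime(failure_ts, TS_FORMAT)
    end = datetime.strptime(package_up_ts, TS_FORMAT)
    return (end - start).total_seconds()


# Hypothetical time stamps noted by hand during one test run.
elapsed = failover_seconds("2009-04-01 10:15:03", "2009-04-01 10:15:41")
print(f"Failover took {elapsed:.0f} seconds")
```

Recording this figure for each package and each parameter setting makes it easier to compare test runs.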

Change parameters in small increments, then re-evaluate and re-test. Check the system log. If there are indications of interruptions or transient problems, try to determine the recovery time. In particular, look in the system log for messages like: “Warning: cmcld process was unable to run for the last <xxx> seconds.”

If you see this message, it means you have reached the lower limit of MEMBER_TIMEOUT.
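One possible way to collect these warnings from a saved copy of the system log is sketched below. The pattern is built from the message text quoted above; the exact wording and number format in a real log may differ, and the function name `cmcld_stalls` and the sample lines are made up for illustration.

```python
import re

# Pattern derived from the message quoted in the text; real log lines may
# carry extra prefixes (time stamp, host name), which search() tolerates.
STALL_PATTERN = re.compile(
    r"cmcld process was unable to run for the last (\d+(?:\.\d+)?) seconds"
)


def cmcld_stalls(lines):
    """Return the stall durations, in seconds, found in the given log lines."""
    stalls = []
    for line in lines:
        match = STALL_PATTERN.search(line)
        if match:
            stalls.append(float(match.group(1)))
    return stalls


# Illustrative lines only, not copied from an actual system log.
sample = [
    "Warning: cmcld process was unable to run for the last 3 seconds.",
    "an unrelated log line",
    "Warning: cmcld process was unable to run for the last 7.5 seconds.",
]
print(cmcld_stalls(sample))
```

The longest stall seen under peak load gives a starting point for judging how short the MEMBER_TIMEOUT can safely be.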

Try different settings for the MEMBER_TIMEOUT until you find the optimal one. You want a value that results in the shortest failover time without any unnecessary failovers from recoverable temporary problems.

Allow a margin of safety for the tested MEMBER_TIMEOUT value. How much time to allow depends on how closely your test environment reflects your actual environment at its busiest.
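The arithmetic of adding a margin can be sketched as follows. This paper does not prescribe a margin size, so the `margin_fraction` parameter and the 25% figure in the example are purely hypothetical; choose a value that reflects how far your test setup is from production.

```python
def member_timeout_with_margin(tested_timeout_s: float,
                               margin_fraction: float) -> float:
    """Return the tested timeout (in seconds) increased by a safety margin.

    margin_fraction is a hypothetical tuning knob for this sketch:
    0.25 means "add 25% on top of the value that passed testing".
    """
    if tested_timeout_s <= 0:
        raise ValueError("tested_timeout_s must be positive")
    if margin_fraction < 0:
        raise ValueError("margin_fraction must not be negative")
    return tested_timeout_s * (1.0 + margin_fraction)


# Example: a 10-second value passed testing; add an assumed 25% margin.
print(member_timeout_with_margin(10.0, 0.25))
```

A test environment that closely mirrors peak production load justifies a smaller margin; a less representative one calls for a larger margin.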

Re-test and re-evaluate your settings periodically, especially when new disks, new networks, or new applications are added to the cluster. Monitor traffic on heartbeat networks; watch for increases in traffic, especially ones that could cause temporary spikes.

Lock acquisition (cluster lock, also called tie-breaker or arbitrator)

Your choice of lock device may help you to optimize failover time. The lock acquisition time is part of failover time, so choosing the device with the lowest lock acquisition time will reduce the failover time. The lock acquisition times for different devices are:

1. Quorum Server, LVM volume group lock on Fibre Channel (FC) disk, or LOCK LUN on FC disk: an internally calculated percentage of MEMBER_TIMEOUT

2. LVM volume group lock on single SCSI disk, or LOCK LUN on single SCSI disk: 5 seconds + time calculated in option 1 above
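The relationship between the two options can be sketched as below. Serviceguard calculates the percentage internally and this paper does not state it, so `internal_fraction` is a placeholder argument and the 0.5 used in the example is not a real Serviceguard value; only the 5-second addition for a single SCSI disk comes from the list above.

```python
# Extra time for a lock on a single SCSI disk, per option 2 above.
SCSI_EXTRA_SECONDS = 5.0


def lock_acquisition_seconds(member_timeout_s: float,
                             internal_fraction: float,
                             single_scsi_disk: bool) -> float:
    """Estimate lock acquisition time in seconds.

    internal_fraction stands in for the percentage that Serviceguard
    calculates internally; its true value is not given in this paper.
    """
    base = member_timeout_s * internal_fraction  # option 1
    if single_scsi_disk:
        return base + SCSI_EXTRA_SECONDS         # option 2
    return base


# Placeholder inputs: a 14-second MEMBER_TIMEOUT and an assumed fraction of 0.5.
print(lock_acquisition_seconds(14.0, 0.5, single_scsi_disk=False))
print(lock_acquisition_seconds(14.0, 0.5, single_scsi_disk=True))
```

Whatever the internal percentage is, the single-SCSI-disk options always cost 5 seconds more than the option 1 devices for the same MEMBER_TIMEOUT.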