Optimizing Serviceguard Failover Time, Version A.11.19 and later, April 2009

If your application takes a long time to recover and restart, set the MEMBER_TIMEOUT conservatively.
If your database takes several minutes to recover, it isn’t worth risking an unnecessary failover to
shave a few seconds off the Serviceguard failover time.
If your application restarts quickly, you can afford to set the MEMBER_TIMEOUT more aggressively.
When your application only takes a few seconds to restart, there is little benefit in waiting a few
seconds for an interruption to recover. You can afford to try a short timeout when you have short
recovery and restart times.
Small, lightly loaded systems are likely to have fewer interruptions, and they are likely to recover more
quickly. Highly loaded systems with a large number of disks are likely to have more frequent
interruptions, and they are likely to take longer to recover. Try to spread the load and avoid spikes
in activity. Set the MEMBER_TIMEOUT to allow recovery time for interruptions when the load is
heaviest.
Virtual partitions may have different latency characteristics than independent nodes due to hardware
and firmware sharing. If you will be using virtual partitions in your cluster, there may be additional
considerations when testing the MEMBER_TIMEOUT value for your configuration. For more
information, see the white paper “Serviceguard Cluster Configuration for Partitioned Systems”,
available from the HP Technical Documentation site at www.docs.hp.com/hpux/ha.
Testing
To fine-tune the parameters, it is important to test the cluster in an environment that imitates the actual
production environment. Test the cluster, running all of its packages, with the heaviest expected loads
on networks, CPU, and I/O.
To time failover, force each package to fail over to another node. One way to force a failover and
cluster reformation is to power off the node where the package is running. Read the logs of the
failover, noting the time stamps.
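As a minimal sketch of timing a failover from log time stamps, the following Python computes the elapsed seconds between two syslog-style lines. The helper names, the sample lines, and the node name are hypothetical, and the message text is a placeholder rather than real Serviceguard output. It assumes the classic "Mon DD HH:MM:SS" time stamp, which carries no year, so the year is supplied separately.

```python
from datetime import datetime


def parse_syslog_time(line: str, year: int) -> datetime:
    """Parse the leading 'Mon DD HH:MM:SS' stamp of a syslog-style line."""
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def failover_seconds(start_line: str, end_line: str, year: int = 2009) -> float:
    """Elapsed seconds between two log lines written in the same year."""
    start = parse_syslog_time(start_line, year)
    end = parse_syslog_time(end_line, year)
    return (end - start).total_seconds()


# Hypothetical lines: substitute the first and last entries you noted
# in your own failover logs.
node_lost = "Apr 14 10:32:07 node2 example: <first message after power-off>"
pkg_up = "Apr 14 10:32:41 node2 example: <package reported running>"

elapsed = failover_seconds(node_lost, pkg_up)
print(f"Failover took {elapsed:.0f} seconds")
```

Recording this figure for each package and each parameter change makes it easier to compare test runs.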
Change parameters in small increments, then re-evaluate and re-test. Check the system log. If there
are indications of interruptions or transient problems, try to determine the recovery time. In
particular, look for system log messages like: “Warning: cmcld process was unable to run
for the last <xxx> seconds.”
If you see this message, you have reached the lower limit of MEMBER_TIMEOUT.
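One way to check for this warning after a test run is to scan the system log for it. The sketch below is illustrative only: the function name and sample lines are hypothetical, and it assumes the <xxx> field is a plain integer or decimal number of seconds. Only the warning text itself is taken from the message quoted above.

```python
import re

# Matches the warning text and captures the reported number of seconds.
# Assumption: the value is written as an integer or a decimal number.
STALL_PATTERN = re.compile(
    r"cmcld process was unable to run for the last (\d+(?:\.\d+)?) seconds"
)


def find_stalls(log_lines):
    """Return the stall durations, in seconds, reported in the given lines."""
    stalls = []
    for line in log_lines:
        match = STALL_PATTERN.search(line)
        if match:
            stalls.append(float(match.group(1)))
    return stalls


# Hypothetical sample; in practice, pass the lines of your system log file.
sample_log = [
    "Apr 14 10:31:02 node1 cmcld[1234]: Warning: cmcld process was unable "
    "to run for the last 3 seconds.",
    "Apr 14 10:31:40 node1 example: unrelated message",
    "Apr 14 10:35:15 node1 cmcld[1234]: Warning: cmcld process was unable "
    "to run for the last 5.5 seconds.",
]

stalls = find_stalls(sample_log)
if stalls:
    print(f"{len(stalls)} warnings, longest stall {max(stalls)} seconds")
else:
    print("No cmcld stall warnings found")
```

An empty result after a run under the heaviest expected load suggests that the tested value has not yet reached the lower limit.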
Try different settings for the MEMBER_TIMEOUT until you find the optimal one. You want a value that
results in the shortest failover time without any unnecessary failovers from recoverable temporary
problems.
Allow a margin of safety for the tested MEMBER_TIMEOUT value. How much time to allow depends
on how closely your test environment reflects your actual environment at its busiest.
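As a rough illustration of applying a safety margin, the sketch below scales a tested value by a chosen fraction. The function name, the 10-second tested value, and the 25% margin are assumptions for the example, not recommendations from this paper. Choose the margin according to how closely your tests matched production.

```python
def with_safety_margin(tested_timeout_s: float, margin_fraction: float) -> float:
    """Return the tested timeout increased by the given fractional margin."""
    if tested_timeout_s <= 0:
        raise ValueError("tested_timeout_s must be positive")
    if margin_fraction < 0:
        raise ValueError("margin_fraction must not be negative")
    return tested_timeout_s * (1.0 + margin_fraction)


# Hypothetical figures: a value that tested cleanly at 10 seconds,
# padded by 25% because the test load only approximated production.
configured = with_safety_margin(10.0, 0.25)
print(f"Candidate MEMBER_TIMEOUT: {configured} seconds")
```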
Re-test and re-evaluate your settings periodically, especially when new disks, new networks, or new
applications are added to the cluster. Monitor traffic on heartbeat networks; watch for increases in
traffic, especially ones that could cause temporary spikes.
Lock acquisition (cluster lock, also called tie-breaker or arbitrator)
Your choice of lock device may help you to optimize failover time. The lock acquisition time is part of
failover time, so choosing the device with the lowest lock acquisition time reduces the failover time.
The lock acquisition times for different devices are:
1. Quorum Server, LVM volume group lock on Fibre Channel (FC) disk, or LOCK LUN on FC
disk: an internally calculated percentage of MEMBER_TIMEOUT
2. LVM volume group lock on single SCSI disk, or LOCK LUN on single SCSI disk: 5 seconds +
time calculated in option 1 above
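The relationship between the two cases above can be sketched as follows. Serviceguard calculates the percentage internally and this paper does not state it, so the `fraction` argument and the 0.5 used in the example are placeholders, not real values. Only the additional 5 seconds for a single SCSI disk comes from the list above.

```python
SCSI_EXTRA_SECONDS = 5.0  # additional time for a single SCSI disk (item 2)


def lock_acquisition_seconds(
    member_timeout_s: float, fraction: float, single_scsi_disk: bool = False
) -> float:
    """Estimate lock acquisition time for the two cases listed above.

    `fraction` stands in for the internally calculated percentage of
    MEMBER_TIMEOUT; its real value is not given in this paper.
    """
    base = member_timeout_s * fraction
    return base + SCSI_EXTRA_SECONDS if single_scsi_disk else base


# Placeholder inputs: a 14-second MEMBER_TIMEOUT and a made-up fraction of 0.5.
fc_or_quorum = lock_acquisition_seconds(14.0, 0.5)
single_scsi = lock_acquisition_seconds(14.0, 0.5, single_scsi_disk=True)
print(f"Item 1: {fc_or_quorum} s, item 2: {single_scsi} s")
```

Whatever the actual percentage is, the single SCSI disk case is always 5 seconds slower for the same MEMBER_TIMEOUT.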