Optimizing Serviceguard Failover Time, Version A.11.19 and later, April 2009

3. Dual LVM volume group locks on FC disks: 13 seconds + time calculated in option 1 above
Quorum server considerations
If you choose a quorum server, be sure the network between it and the nodes is highly available and
reliable. If you encounter delays reaching the quorum server, you can configure a
QS_TIMEOUT_EXTENSION, but the extension time adds directly to the lock acquisition time and thus
to failover time.
Serviceguard calculates the time for the actual lock acquisition. You cannot change it directly, but you
can configure a faster lock device to reduce the lock acquisition time.
Heartbeat subnet
If you have one heartbeat configured, with the required standby LAN, you must set MEMBER_TIMEOUT
to at least 14 seconds (21 seconds for an IPoIB network interface). For a lower
MEMBER_TIMEOUT value, you must configure multiple heartbeats. Since heartbeat messages are sent
over all heartbeat subnets concurrently, there will be no wait for network switching if a primary LAN
fails. To avoid delays from busy networks, configure at least one private dedicated network for
heartbeat.
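The MEMBER_TIMEOUT rule above can be expressed as a small check. The following Python sketch is illustrative only: the function name check_member_timeout and its parameters are hypothetical and not part of Serviceguard. It works in seconds, so convert the value if your cluster configuration file expresses it in another unit.

```python
# Minimum MEMBER_TIMEOUT values in seconds for a cluster with a single
# heartbeat (plus the required standby LAN), as stated in the text above.
MIN_SINGLE_HEARTBEAT_S = 14
MIN_SINGLE_HEARTBEAT_IPOIB_S = 21


def check_member_timeout(member_timeout_s, heartbeat_count, ipoib=False):
    """Return a list of warnings for a proposed MEMBER_TIMEOUT.

    Hypothetical helper, not a Serviceguard API. It encodes only the
    single-heartbeat rule from this paper; with multiple heartbeats it
    reports nothing, because the paper gives no lower bound for that case.
    """
    if heartbeat_count < 1:
        raise ValueError("a cluster needs at least one heartbeat")

    warnings = []
    if heartbeat_count == 1:
        minimum = MIN_SINGLE_HEARTBEAT_IPOIB_S if ipoib else MIN_SINGLE_HEARTBEAT_S
        if member_timeout_s < minimum:
            warnings.append(
                f"MEMBER_TIMEOUT of {member_timeout_s} s is below the "
                f"{minimum} s minimum for a single heartbeat; "
                "configure multiple heartbeats or raise the value"
            )
    return warnings
```

For example, check_member_timeout(10, 1) returns one warning, while check_member_timeout(10, 2) returns an empty list.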
Network failure detection
The NETWORK_POLLING_INTERVAL specifies how often Serviceguard checks its configured
networks. In general, the default works best.
Number of nodes and number of packages
The number of nodes and the number of packages can affect the cluster re-formation time.
A Serviceguard cluster consisting of two nodes has a shorter cluster re-formation time than a cluster
with more than two nodes. There is no election of cluster membership in a two-node cluster.
The number of packages has a slight effect. During resource recovery, the Package Manager has
two tasks. First, it checks the packages to determine which ones failed. The more packages there
are, the more time this could take. Then it needs to determine which nodes should adopt the
packages and run them after re-formation. The more packages each node has, the more time this
could take.
EMS resources
There are two factors to consider about your EMS resources:
EMS resource monitor detection time: This depends entirely on the EMS resource monitor and how
it works. Look at the monitor’s documentation; usually you can set this time.
The time for the EMS message to get to Serviceguard: At most, this takes as long as the time set for
RESOURCE_POLLING_INTERVAL. In the package configuration file, you want to set the interval low
enough to discover failure quickly. However, if you set it too low, frequent polling just makes the
network and the system busier.
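Taken together, the two factors bound how long an EMS-detected failure can take to reach Serviceguard. The following is a minimal Python sketch of that arithmetic, assuming both values are expressed in seconds. The function name and the sample numbers are hypothetical and chosen only for illustration.

```python
def worst_case_ems_notification_s(monitor_detection_s, resource_polling_interval_s):
    """Upper bound, in seconds, from a resource failure to Serviceguard seeing it.

    monitor_detection_s: time the EMS resource monitor needs to detect the
        failure (monitor specific; see the monitor's documentation).
    resource_polling_interval_s: the package's RESOURCE_POLLING_INTERVAL; the
        EMS message reaches Serviceguard within at most this long.
    """
    if monitor_detection_s < 0 or resource_polling_interval_s <= 0:
        raise ValueError("detection time must be >= 0 and polling interval > 0")
    return monitor_detection_s + resource_polling_interval_s


# Illustrative values only: a monitor that detects a failure in 30 s,
# polled every 60 s and then every 10 s.
print(worst_case_ems_notification_s(30, 60))  # 90
print(worst_case_ems_notification_s(30, 10))  # 40
```

Lowering the interval shrinks the bound, at the cost of the extra polling load on the network and the system.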
Package configuration
When many storage units are involved, you might be able to reduce resource recovery time to help
optimize failover. Refer to the section “Optimizing for Large Numbers of Storage Units” in chapter 6
of the Managing Serviceguard manual for your version of Serviceguard. Manuals are available from
www.docs.hp.com/hpux/ha → Serviceguard.
The choice of file system can greatly reduce the time it takes for file consistency checks. For packages on
HP-UX 11i, VERITAS File System (VxFS) from Symantec is faster than HFS, and CFS is even faster. On