Optimizing Serviceguard Failover Time, Version A.11.19 and later, April 2009

Some help in estimating time for failover
The following table can help you estimate the total failover time for your Serviceguard cluster.
Table 1. Estimating total failover time for a Serviceguard cluster
| Failover component name | Time estimate |
|---|---|
| Resource failure detection: network failure detection | 10 seconds |
| Resource failure detection: EMS resource failure detection | RESOURCE_POLLING_INTERVAL |
| Resource failure detection: service failure detection | Immediately |
| Node failure detection | MEMBER_TIMEOUT |
| Cluster membership re-formation time (Serviceguard component of failover time in case of node failure) | This depends mostly on MEMBER_TIMEOUT. It is also affected by the use of a lock device and the number of cluster nodes (two vs. more than two). Check the time with the cmviewcl command: after configuring a cluster, issue the command cmviewcl -v -f line and observe the output. The value next to max_reformation_duration is the cluster membership re-formation time. |
| Cluster component recovery | This depends on the number of EMS resources, packages, nodes, etc. For environments using LVM or CVM 3.5, this time is usually less than a second. For environments using VERITAS CVM 4.1 or later, or VERITAS CFS, the recovery time for the VERITAS components can range, depending on the failure type, from 5 seconds to an additional cluster re-formation time. |
| Resource recovery | This depends on the number of volume groups, IP addresses, services, etc. It usually ranges from less than 1 second to several minutes. |
| Application startup and recovery time | This is totally dependent on the application. |
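As a rough illustration of how the rows of Table 1 combine, the sketch below adds the components for a single failure scenario. It assumes the components occur one after another and are simply additive, which the table implies but does not state. The class name, field names, and all sample values are hypothetical; substitute values measured on your own cluster.

```python
from dataclasses import dataclass


@dataclass
class FailoverEstimate:
    """One failure scenario, with every component expressed in seconds."""
    failure_detection_s: float       # e.g. MEMBER_TIMEOUT for a node failure
    reformation_s: float             # e.g. max_reformation_duration from cmviewcl
    component_recovery_s: float      # cluster component recovery
    resource_recovery_s: float       # volume groups, IP addresses, services
    application_startup_s: float     # entirely application dependent

    def total(self) -> float:
        # Assumption: the components are sequential, so the estimate is their sum.
        return (self.failure_detection_s
                + self.reformation_s
                + self.component_recovery_s
                + self.resource_recovery_s
                + self.application_startup_s)


# Hypothetical node-failure scenario; none of these numbers come from a real cluster.
node_failure = FailoverEstimate(
    failure_detection_s=14.0,
    reformation_s=20.0,
    component_recovery_s=0.5,
    resource_recovery_s=3.0,
    application_startup_s=30.0,
)
print(f"Estimated total failover time: {node_failure.total():.1f} s")
```

Keeping each component as a separate field makes it easy to see which one dominates the total before you start tuning.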
MEMBER_TIMEOUT value
To help optimize failover time, first consider fine-tuning the setting for MEMBER_TIMEOUT in your
cluster configuration file. Changing this value will probably make the greatest difference in the
Serviceguard component of cluster failover time. When a node times out, Serviceguard declares the
node failed and begins cluster re-formation.
For Serviceguard, the range of supported values of MEMBER_TIMEOUT is 3 to 300 seconds. With a
single heartbeat network, the minimum supported value of MEMBER_TIMEOUT is 14 seconds. For
most installations, a 10 to 25-second MEMBER_TIMEOUT is suitable.
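The limits above can be captured in a small pre-flight check. This is a minimal sketch, not a Serviceguard tool: the function name and its parameters are hypothetical, and it works in seconds, so convert first if your configuration file expresses the value in a different unit.

```python
MIN_TIMEOUT_S = 3              # lowest supported value stated in the text
MAX_TIMEOUT_S = 300            # highest supported value stated in the text
MIN_SINGLE_HEARTBEAT_S = 14    # minimum with a single heartbeat network
TYPICAL_RANGE_S = (10, 25)     # range described as suitable for most installations


def check_member_timeout(timeout_s, heartbeat_networks):
    """Return a list of problems found; an empty list means no limit is violated."""
    if heartbeat_networks < 1:
        raise ValueError("a cluster needs at least one heartbeat network")

    problems = []
    minimum = MIN_SINGLE_HEARTBEAT_S if heartbeat_networks == 1 else MIN_TIMEOUT_S
    if timeout_s < minimum:
        problems.append(
            f"{timeout_s} s is below the supported minimum of {minimum} s "
            f"for {heartbeat_networks} heartbeat network(s)"
        )
    if timeout_s > MAX_TIMEOUT_S:
        problems.append(f"{timeout_s} s is above the supported maximum of {MAX_TIMEOUT_S} s")
    return problems


def is_typical(timeout_s):
    """True when the value falls in the range suggested for most installations."""
    low, high = TYPICAL_RANGE_S
    return low <= timeout_s <= high


print(check_member_timeout(10, heartbeat_networks=1))  # reports one problem
print(check_member_timeout(10, heartbeat_networks=2))  # reports no problems
```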
Reducing the MEMBER_TIMEOUT decreases the time to detect node failures, which can decrease the
total failover time. However, a small MEMBER_TIMEOUT value also introduces a risk. If there are
temporary interruptions and you set the timeout value so low that the node cannot recover
communication, the node might fail unnecessarily. During cluster re-formation, for a short duration a
more aggressive response is required (1/10th of MEMBER_TIMEOUT). The member timeout value
should be large enough to allow response requirements during re-formation to be met.
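To make the 1/10th requirement concrete, the sketch below computes the response window that applies during re-formation for a given MEMBER_TIMEOUT and compares it with the longest interruption you expect a node to experience. Both function names and the worst-case stall figure are hypothetical; the only rule taken from the text is the division by ten.

```python
def reformation_response_window(member_timeout_s):
    """Response time required during re-formation: 1/10th of MEMBER_TIMEOUT."""
    return member_timeout_s / 10


def tolerates_stall(member_timeout_s, worst_stall_s):
    """True if a stall of worst_stall_s still fits inside the re-formation window."""
    return worst_stall_s < reformation_response_window(member_timeout_s)


# Hypothetical worst-case interruption of 2 seconds on a node.
for timeout_s in (3, 14, 25):
    window = reformation_response_window(timeout_s)
    verdict = "ok" if tolerates_stall(timeout_s, worst_stall_s=2.0) else "too low"
    print(f"MEMBER_TIMEOUT={timeout_s} s -> window {window} s ({verdict})")
```

With these assumed numbers, only the 25-second setting leaves a window wider than the 2-second stall, which illustrates why a very small MEMBER_TIMEOUT is risky.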
Setting the parameters too low may cause failovers that you could avoid. If you set your parameters
so low that failure is detected before an unreachable node can recover from a temporary interruption,
the node will be forcibly rebooted. Any packages running on it will be failed over to another node.
The following syslog message is one of the indications that your MEMBER_TIMEOUT value is set too
low: “Warning: cmcld process was unable to run for the last <xxx> seconds.”
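One way to watch for this warning is to scan the syslog for the message text. The sketch below is an assumption-laden example: the function name is hypothetical, the sample log lines are invented for illustration, and the value reported in place of <xxx> is captured as raw text because its exact format is not specified here.

```python
import re

# Matches the warning text quoted above; the value after "last" is kept as a string.
STALL_PATTERN = re.compile(
    r"cmcld process was unable to run for the last (\S+) seconds"
)


def find_cmcld_stalls(lines):
    """Return the reported stall value from every line containing the warning."""
    stalls = []
    for line in lines:
        match = STALL_PATTERN.search(line)
        if match:
            stalls.append(match.group(1))
    return stalls


# Invented sample lines; real syslog entries will carry different prefixes.
sample_log = [
    "Apr  1 10:00:00 node1 cmcld[1234]: Warning: cmcld process was unable to run for the last 4 seconds.",
    "Apr  1 10:00:05 node1 cmcld[1234]: Some unrelated message",
]
print(find_cmcld_stalls(sample_log))
```

To scan a real log, pass an open file object as `lines`; the location of the syslog file depends on your system.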