Optimizing Serviceguard Failover Time, Version A.11.19 and later, April 2009

Some help in estimating time for failover

The following table can help you estimate the total failover time for your Serviceguard cluster.

Table 1. Estimating total failover time for a Serviceguard cluster

Failover component name: Time estimate

Resource failure detection
    Network failure detection: 10 seconds
    EMS resource failure detection: RESOURCE_POLLING_INTERVAL
    Service failure detection: Immediately

Node failure detection: MEMBER_TIMEOUT

Cluster membership re-formation time (Serviceguard component of failover time in case of node failure): This depends mostly on MEMBER_TIMEOUT. It is also affected by the use of a lock device and the number of cluster nodes (two vs. more than two). Check the time with the cmviewcl command. After configuring a cluster, issue the command cmviewcl -v -f line and observe the output. The value next to max_reformation_duration is the cluster membership re-formation time.

Cluster component recovery: This depends on the number of EMS resources, packages, nodes, etc. For environments using LVM or CVM 3.5, this time is usually less than a second. For environments using VERITAS CVM 4.1 or later, or VERITAS CFS, VERITAS components, depending on the failure type, this time can range from 5 seconds to an additional cluster re-formation time.

Resource recovery: This depends on the number of volume groups, IP addresses, services, etc. It usually ranges from less than 1 second to several minutes.

Application startup and recovery time: This is totally dependent on the application.
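
As a rough illustration of how the rows of Table 1 combine for a node failure, the following Python sketch adds the per-component estimates. It assumes the components run one after another and simply add up. The function name and every sample value are hypothetical; substitute the figures measured on your own cluster (for example, the re-formation time reported by cmviewcl).

```python
def estimate_node_failover_seconds(member_timeout, reformation,
                                   component_recovery, resource_recovery,
                                   application_startup):
    """Add up the Table 1 components for a node failure (all in seconds)."""
    parts = [member_timeout, reformation, component_recovery,
             resource_recovery, application_startup]
    if any(p < 0 for p in parts):
        raise ValueError("time estimates must be non-negative")
    return sum(parts)


# Hypothetical sample values, not measurements from a real cluster.
total = estimate_node_failover_seconds(
    member_timeout=14,        # node failure detection
    reformation=20,           # cluster membership re-formation
    component_recovery=1,     # cluster component recovery
    resource_recovery=5,      # volume groups, IP addresses, services
    application_startup=30,   # depends entirely on the application
)
print(total)
```

For a resource failure rather than a node failure, the first term would instead be the relevant detection time from the top of the table.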

MEMBER_TIMEOUT value

To help optimize failover time, first consider fine-tuning the MEMBER_TIMEOUT setting in your cluster configuration file. Changing this value will probably make the greatest difference in the Serviceguard component of cluster failover time. When a node times out, Serviceguard declares the node failed and begins cluster re-formation.

For Serviceguard, the supported range of MEMBER_TIMEOUT values is 3 to 300 seconds. With a single heartbeat network, the minimum supported value of MEMBER_TIMEOUT is 14 seconds. For most installations, a MEMBER_TIMEOUT of 10 to 25 seconds is suitable.
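
The limits stated above can be expressed as a small check. This is a minimal Python sketch, not a Serviceguard tool: the function and constant names are hypothetical, and the value is taken in seconds as in the text.

```python
MIN_TIMEOUT_S = 3                 # lowest supported value
MAX_TIMEOUT_S = 300               # highest supported value
MIN_SINGLE_HEARTBEAT_S = 14       # minimum with a single heartbeat network


def check_member_timeout(seconds, single_heartbeat=False):
    """Return a list of problems; an empty list means the value is supported."""
    problems = []
    low = MIN_SINGLE_HEARTBEAT_S if single_heartbeat else MIN_TIMEOUT_S
    if seconds < low:
        problems.append(f"{seconds}s is below the supported minimum of {low}s")
    if seconds > MAX_TIMEOUT_S:
        problems.append(f"{seconds}s is above the supported maximum of {MAX_TIMEOUT_S}s")
    return problems


print(check_member_timeout(10, single_heartbeat=True))
print(check_member_timeout(14, single_heartbeat=True))
```

The first call reports a problem because 10 seconds is below the single-heartbeat minimum; the second returns an empty list.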

Reducing MEMBER_TIMEOUT decreases the time to detect node failures, which can decrease the total failover time. However, a small MEMBER_TIMEOUT value also introduces a risk. If there are temporary interruptions and you set the timeout value so low that the node cannot recover communication in time, the node might fail unnecessarily. During cluster re-formation, a more aggressive response (1/10 of MEMBER_TIMEOUT) is required for a short duration. The MEMBER_TIMEOUT value should be large enough to allow the response requirements during re-formation to be met.
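
To make the 1/10 requirement concrete, the sketch below computes the response window during re-formation and, working backwards, the smallest MEMBER_TIMEOUT that would tolerate a given stall. It assumes the requirement means a node must respond within one tenth of MEMBER_TIMEOUT during re-formation; both function names and the sample stall value are hypothetical.

```python
def reformation_response_window(member_timeout):
    """Response window (seconds) during re-formation: 1/10 of MEMBER_TIMEOUT."""
    return member_timeout / 10


def smallest_timeout_for_stall(worst_stall_seconds):
    """Smallest MEMBER_TIMEOUT whose 1/10 window still covers the given stall."""
    return worst_stall_seconds * 10


print(reformation_response_window(14))   # window for a 14-second timeout
print(smallest_timeout_for_stall(2))     # hypothetical 2-second worst-case stall
```

Under this reading, a 14-second MEMBER_TIMEOUT leaves a 1.4-second window during re-formation, and a node that can stall for 2 seconds would need a MEMBER_TIMEOUT of at least 20 seconds.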

Setting the parameters too low may cause failovers that you could otherwise avoid. If you set your parameters so low that failure is detected before an unreachable node can recover from a temporary interruption, the node will be forcibly rebooted, and any packages running on it will be failed over to another node.

The following syslog message is one indication that your MEMBER_TIMEOUT value is set too low: “Warning: cmcld process was unable to run for the last <xxx> seconds.”
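
One way to watch for this warning is to scan syslog lines for the message text. The Python sketch below is illustrative only: the function name is hypothetical, the sample lines are made up, and it assumes the <xxx> placeholder is filled with a plain number (integer or decimal).

```python
import re

# Message text taken from the warning above; the numeric format is an assumption.
CMCLD_STALL = re.compile(
    r"cmcld process was unable to run for the last (\d+(?:\.\d+)?) seconds"
)


def find_cmcld_stalls(lines):
    """Return the stall durations (seconds) reported in the given log lines."""
    stalls = []
    for line in lines:
        match = CMCLD_STALL.search(line)
        if match:
            stalls.append(float(match.group(1)))
    return stalls


# Hypothetical sample lines, not real log output.
sample = [
    "Warning: cmcld process was unable to run for the last 4 seconds.",
    "an unrelated log line",
]
print(find_cmcld_stalls(sample))
```

To scan a real log, pass an open file object instead of the sample list; the syslog path depends on the system and is not specified here.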