Optimizing Serviceguard Failover Time, Version A.11.19 and later, April 2009

The time required for resource recovery depends on the number and types of resources.
Serviceguard implementation: application startup
Serviceguard completes the application-dependent part of failover by running the commands for the
recovery and restart of package applications, as defined in the package configuration. Some packages
and some applications may also do their own recovery before startup.
The time required for application startup depends on each application and how it is configured.
Serviceguard with Serviceguard Extension for RAC: group membership reconfiguration
RAC group membership reconfiguration is the same whether failover is caused by a node failure or
a package failure. To start group membership reconfiguration, Serviceguard Extension for RAC
communicates the group membership to Oracle RAC. If there is a change in membership, RAC will
do the reconfiguration.
The time required for group membership reconfiguration depends on the RAC configuration.
Serviceguard with Serviceguard Extension for RAC: RAC reconfiguration and database recovery
After Oracle RAC is notified of a cluster membership change, it starts its own reconfiguration and
recovery.
The time required for RAC reconfiguration and database recovery depends on the RAC configuration.
How you can optimize failover time
There are ways you can optimize the failover process for your environment to reduce the time a
package is unavailable. If the failover process takes longer than necessary, you are not getting
maximum availability. If the failover starts and completes too quickly, however, you may get
unnecessary failovers that reduce performance of a cluster, possibly reducing availability instead
of improving it. It is important to find a balance between the extremes.
The optimal failover time would be long enough to allow for recoverable interruptions, but no longer
than that.
The time required for the Serviceguard portion of failover depends largely on the MEMBER_TIMEOUT
value, but it also depends on the type of lock device and whether you are using standby heartbeat
interfaces. There are ways to fine-tune these factors to help optimize failover.
To set the Serviceguard parameters, you need to determine the likelihood of transient interruptions
and the amount of time it takes them to recover and continue. If your cluster is in a busy environment,
you need to tolerate interruptions or you will get unnecessary, and possibly repeated, failovers. If,
however, your networks and systems do not get overloaded, you can set your failover parameters
more aggressively.
Try to tune the environment so there are fewer interruptions and it takes less time to recover from them.
Consider ways to distribute the workload across the cluster. Consider adding a node to the cluster.
Because your timeout value has to allow for the longest recovery times, you can reduce the timeout
value if you can smooth out the peaks.
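
The relationship between peak interruptions and the timeout value can be illustrated with a small
sketch. It is not a Serviceguard tool: the function name suggest_timeout, the safety margin, and the
sample measurements are all hypothetical, and the result is only a starting point for choosing a
value in your own environment.

```python
def suggest_timeout(interruption_seconds, safety_margin=1.25):
    """Return a timeout in seconds that tolerates the longest observed interruption.

    interruption_seconds: durations of transient interruptions you measured (hypothetical data).
    safety_margin: multiplier applied to the longest interruption (an assumed value).
    """
    if not interruption_seconds:
        raise ValueError("need at least one observed interruption")
    if safety_margin < 1.0:
        raise ValueError("safety_margin must be at least 1.0")
    # The timeout has to cover the worst case, so only the peak matters.
    return max(interruption_seconds) * safety_margin


# Hypothetical measurements before and after smoothing out workload peaks.
busy_cluster = [0.5, 1.0, 2.0, 8.0]
smoothed_cluster = [0.5, 1.0, 2.0, 3.0]

print(suggest_timeout(busy_cluster))
print(suggest_timeout(smoothed_cluster))
```

Because only the longest interruption drives the result, removing a single peak lowers the suggested
timeout even when the typical interruptions are unchanged.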
Consider the time it takes for the application-dependent component of failover. If your applications
recover and restart quickly, a quick reaction to failures is an advantage, and you can afford to set
your failover parameters more aggressively. However, if it takes a long time for your applications to
restart or for your databases to recover, set the timeout value more conservatively. Before you start
a lengthy failover, it is an advantage to wait a bit for a transient problem to recover on its own.
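
The trade-off can be made concrete with a deliberately simplified model. It assumes the outage is
either the length of the interruption (if the interruption ends before the timeout expires) or the
timeout plus the application restart time (if a failover is triggered). The function outage_seconds
and all of the numbers are hypothetical, and the model ignores the other Serviceguard steps of
failover.

```python
def outage_seconds(interruption, timeout, restart):
    """Estimate the outage in seconds under a simplified two-case model.

    interruption: how long the problem lasts (float("inf") for a permanent failure).
    timeout: how long the cluster waits before starting failover (assumed value).
    restart: application recovery and restart time after failover (assumed value).
    """
    if interruption < timeout:
        # The problem clears on its own before failover starts.
        return interruption
    # Failover is triggered: wait out the timeout, then restart the application.
    return timeout + restart


# Slow-restarting application (300 s) hit by a 10 s transient interruption.
print(outage_seconds(10, timeout=5, restart=300))   # aggressive timeout
print(outage_seconds(10, timeout=15, restart=300))  # conservative timeout

# Fast-restarting application (5 s) hit by a permanent failure.
permanent = float("inf")
print(outage_seconds(permanent, timeout=5, restart=5))   # aggressive timeout
print(outage_seconds(permanent, timeout=15, restart=5))  # conservative timeout
```

In this model the conservative timeout avoids a lengthy failover for the slow application, while the
aggressive timeout shortens the outage for the fast application when the failure is real.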