Optimizing Serviceguard Failover Time, Version A.11.19 and later, April 2009

The time required for resource recovery depends on the number and types of resources.

Serviceguard implementation: application startup

Serviceguard completes the application-dependent part of failover by following commands for the

recovery and restart of package applications as defined in the package configuration. Some packages

and some applications may also do their own recovery before startup.

The time required for application startup depends on each application and how it is configured.

Serviceguard with Serviceguard Extension for RAC: group membership reconfiguration

RAC group membership reconfiguration is the same whether failover is caused by a node failure or

a package failure. To start group membership reconfiguration, Serviceguard Extension for RAC

communicates the group membership to Oracle RAC. If there is a change in membership, RAC will

do the reconfiguration.

The time required for group membership reconfiguration depends on the RAC configuration.

Serviceguard with Serviceguard Extension for RAC: RAC reconfiguration and database recovery

After Oracle RAC is notified of a cluster membership change, it starts its own reconfiguration and

recovery.

The time required for RAC reconfiguration and database recovery depends on the RAC configuration.

How you can optimize failover time

There are ways you can optimize the failover process for your environment to reduce the time a

package is unavailable. If the failover process takes longer than necessary, you are not getting

maximum availability. If the failover starts and completes too quickly, however, you may get

unnecessary failovers that reduce performance of a cluster—possibly reducing availability instead

of improving it. It is important to find a balance between the extremes.

The optimal failover time would be long enough to allow for recoverable interruptions, but no longer

than that.

The time required for the Serviceguard portion of failover depends largely on the MEMBER_TIMEOUT

value, but it also depends on the type of lock device and whether you are using standby heartbeat

interfaces. There are ways to fine-tune these factors to help optimize failover.

To set the Serviceguard parameters, you need to determine the likelihood of transient interruptions

and the amount of time it takes them to recover and continue. If your cluster is in a busy environment,

you need to tolerate interruptions or you will get unnecessary—and possibly repeated—failovers. If,

however, your networks and systems do not get overloaded, you can set your failover parameters

more aggressively.

Try to tune the environment so there are fewer interruptions and it takes less time to recover from them.

Consider ways to distribute the workload across the cluster. Consider adding a node to the cluster.

Because your timeout value has to allow for the longest recovery times, you can reduce the timeout

value if you can smooth out the peaks.
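The relationship between peak recovery times and the timeout can be sketched in a few lines of Python. This is only an illustration, not a Serviceguard tool: the function name suggest_timeout, the 1.5 safety margin, and the sample recovery times (in seconds) are all assumptions made for the example.

```python
def suggest_timeout(recovery_times, margin=1.5):
    """Return a candidate timeout in seconds: the longest observed
    recovery time from a transient interruption, scaled by a safety margin.

    Both the margin and the idea of sizing from observed samples are
    illustrative assumptions, not a documented Serviceguard formula.
    """
    if not recovery_times:
        raise ValueError("need at least one observed recovery time")
    return max(recovery_times) * margin


# Hypothetical recovery times (seconds) of transient interruptions.
peaky = [1.0, 1.5, 2.0, 10.0]      # one heavy load spike
smoothed = [1.0, 1.5, 2.0, 4.0]    # same cluster after rebalancing workload

print(suggest_timeout(peaky))      # 15.0
print(suggest_timeout(smoothed))   # 6.0
```

Only the longest recovery time matters in this model, so removing a single peak (for example by redistributing workload or adding a node) lowers the candidate timeout even though the typical interruptions are unchanged.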

Consider the time it takes for the application-dependent component of failover. If your applications

need a short time for recovery and restart, you can afford to set your failover parameters more

aggressively. If your application recovery and restart are quick, it is an advantage to have a quick

reaction to failures. However, if it takes a long time for your applications to restart or for your

databases to recover, set the timeout value more conservatively. Before you start a lengthy failover,

it is an advantage to wait a bit for a transient problem to recover on its own.
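The trade-off described above can be made concrete with a deliberately simplified model, sketched here in Python. It assumes that an interruption shorter than the timeout costs only its own duration, and that anything longer costs the timeout plus the application restart time. The function names, the event durations, and the timeout and restart values are hypothetical, and the model ignores the other components of failover such as cluster reformation and resource recovery.

```python
import math


def outage_seconds(interruption, timeout, restart_time):
    """Unavailable time for one event under the simplified model."""
    if interruption <= timeout:
        # The transient problem clears on its own; no failover happens.
        return interruption
    # The timeout expires, failover starts, and the application restarts.
    return timeout + restart_time


def mean_outage(interruptions, timeout, restart_time):
    """Average unavailable time per event for a given timeout."""
    total = sum(outage_seconds(d, timeout, restart_time) for d in interruptions)
    return total / len(interruptions)


# Hypothetical events in seconds; math.inf stands for a real node failure
# that never recovers on its own.
events = [2, 3, 8, 12, math.inf]

for restart_time in (5, 300):          # quick vs. lengthy application restart
    for timeout in (5, 15):            # aggressive vs. conservative setting
        print(restart_time, timeout, mean_outage(events, timeout, restart_time))
# 5 5 7.0
# 5 15 9.0
# 300 5 184.0
# 300 15 68.0
```

With these sample numbers, the aggressive timeout gives the lower average outage when restart is quick, and the conservative timeout gives the lower average when restart is lengthy, which matches the guidance above. Real measurements from your own cluster would be needed before drawing any conclusion about actual settings.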