Optimizing Serviceguard Failover Time, Version A.11.19 and later, April 2009

Figure 1. Steps in a failover caused by a failed node (Serviceguard implementation)
Note: Diagram is not to scale. Application failover time varies by application.
The Serviceguard component of the total failover time when it is caused by node failure (not a
package failure) is composed of: node failure detection, election, lock acquisition, quiescence, and
cluster component recovery.
Node failure detection: Serviceguard notices that a cluster node is not in communication with the
other cluster nodes. It begins to re-form the cluster without the failed node.
Election: The cluster nodes decide which nodes will be in the re-formed cluster.
Lock acquisition: If more than one group of nodes wants to re-form the cluster and no group has
a clear majority of members, the first group to reach the cluster lock re-forms the cluster.
Quiescence: During this quiet waiting time, non-members of the newly formed cluster are rebooted.
Cluster component recovery: Serviceguard does miscellaneous tasks, such as cluster information
synchronization and package determination, before the cluster resumes work.
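The lock acquisition rule described above can be illustrated with a small sketch. This is not Serviceguard code: the Group type, the lock_arrival_s field, and the function name are hypothetical, and "clear majority" is assumed here to mean more than half of the nodes in the previous cluster.

```python
from dataclasses import dataclass


@dataclass
class Group:
    """A set of surviving nodes that wants to re-form the cluster (hypothetical model)."""
    name: str
    members: int
    lock_arrival_s: float  # assumed: seconds until this group reaches the cluster lock


def group_that_reforms(groups, previous_cluster_size):
    """Pick the group that re-forms the cluster under the simplified rule.

    Assumption: a "clear majority" is more than half of the previous
    cluster's nodes. If no group has one, the first group to reach the
    cluster lock wins.
    """
    for group in groups:
        if 2 * group.members > previous_cluster_size:
            return group
    # No clear majority: the cluster lock acts as the tie-breaker.
    return min(groups, key=lambda g: g.lock_arrival_s)


# A four-node cluster split evenly: neither half has a majority,
# so the half that reaches the lock first re-forms the cluster.
split = [Group("A", 2, 1.5), Group("B", 2, 0.9)]
print(group_that_reforms(split, 4).name)  # prints: B
```

The sketch only models the decision. It does not model what happens to the losing group, which the quiescence step covers.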
During the application-dependent phase of the failover time, Serviceguard starts the package as
defined by the user. In Serviceguard implementations, there are two steps, as shown in Figure 1.
Resource recovery: The package’s resources are made available.
Application recovery: If applications or processes were moved to a new node, they are restarted.
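As a rough way to reason about the steps listed so far, the sketch below adds up per-step durations into the Serviceguard component and the application-dependent phase. It assumes the steps run one after another and simply add up. The function name and the duration values in the usage example are placeholders, not measurements or Serviceguard defaults.

```python
# Step names taken from the text; the grouping follows the two phases it describes.
SERVICEGUARD_STEPS = [
    "node failure detection",
    "election",
    "lock acquisition",
    "quiescence",
    "cluster component recovery",
]
APPLICATION_STEPS = [
    "resource recovery",
    "application recovery",
]


def failover_breakdown(durations):
    """Return (serviceguard, application, total) in the units of `durations`.

    `durations` maps each step name to a duration. Assumption: the steps
    are strictly sequential, so the total is a plain sum.
    """
    serviceguard = sum(durations[step] for step in SERVICEGUARD_STEPS)
    application = sum(durations[step] for step in APPLICATION_STEPS)
    return serviceguard, application, serviceguard + application


# Placeholder values in seconds, for illustration only.
example = {
    "node failure detection": 10,
    "election": 2,
    "lock acquisition": 0,
    "quiescence": 13,
    "cluster component recovery": 1,
    "resource recovery": 5,
    "application recovery": 30,
}
sg, app, total = failover_breakdown(example)
print(f"Serviceguard: {sg}s, application-dependent: {app}s, total: {total}s")
```

Keeping the two sums separate makes it easier to see which part of a slow failover depends on cluster configuration and which part depends on the application.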
As shown in Figure 2, the application-dependent steps are slightly different for an Oracle® Real
Application Clusters (RAC) package in a Serviceguard cluster with Serviceguard Extension for RAC.
[Figure 1 labels: node failure detection, election, lock acquisition, quiescence, cluster reformation, cluster component recovery, resource recovery (VG, FS, IP), application recovery, Serviceguard failover time, application failover time.]