Optimizing Serviceguard Failover Time, Version A.11.19 and later, April 2009
Figure 1. Steps in a failover caused by a failed node: Serviceguard implementation
Note: Diagram is not to scale. Application failover time varies by application.
When a failover is caused by a node failure (not a package failure), the Serviceguard component of
the total failover time consists of five steps: node failure detection, election, lock acquisition,
quiescence, and cluster component recovery.
• Node failure detection—Serviceguard notices that a cluster node is not in communication with the
other cluster nodes. It begins to re-form the cluster without the failed node.
• Election—The cluster nodes decide which nodes will be in the re-formed cluster.
• Lock acquisition—If more than one group of nodes wants to re-form the cluster and no group has
a clear majority of members, the first group to reach the cluster lock re-forms the cluster.
• Quiescence—During this quiet waiting time, non-members of the newly formed cluster are rebooted.
• Cluster component recovery—Serviceguard does miscellaneous tasks, such as cluster information
synchronization and package determination, before the cluster resumes work.
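
The election and lock acquisition steps above can be sketched as a small decision function. This is only an illustration of the rule as stated in this paper, not Serviceguard's actual implementation: the function name, its parameters, and the reading of "clear majority" as more than half of the previous cluster membership are all assumptions made for the example.

```python
from typing import Optional, Sequence


def winning_group(
    group_sizes: Sequence[int],
    previous_cluster_size: int,
    lock_arrival_order: Sequence[int],
) -> Optional[int]:
    """Return the index of the group that re-forms the cluster, or None.

    All names here are hypothetical, chosen for illustration only.
    group_sizes: number of nodes in each group that wants to re-form.
    previous_cluster_size: number of nodes in the cluster before the failure.
    lock_arrival_order: group indices in the order they reach the cluster lock.
    """
    # Assumption: "clear majority" means strictly more than half of the
    # previous membership. Such a group re-forms without using the lock.
    for index, size in enumerate(group_sizes):
        if 2 * size > previous_cluster_size:
            return index
    # No clear majority: the first group to reach the cluster lock wins.
    if lock_arrival_order:
        return lock_arrival_order[0]
    # No group reached the lock, so no group re-forms the cluster.
    return None
```

For example, under these assumptions a four-node cluster split 3/1 is resolved by majority alone, while a 2/2 split is resolved by whichever group reaches the cluster lock first.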
During the application-dependent phase of the failover time, Serviceguard starts the package as
defined by the user. In Serviceguard implementations, there are two steps, as shown in Figure 1.
• Resource recovery—The package’s resources are made available.
• Application recovery—If applications or processes were moved to a new node, they are restarted.
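
As a rough way to reason about the phases listed above, the total failover time can be modeled as the sum of the Serviceguard steps and the application-dependent steps. The sketch below assumes the steps run strictly one after another with no overlap, which matches the way Figure 1 is described but is a simplification. The class and field names are hypothetical, and any durations must come from your own measurements.

```python
from dataclasses import dataclass


@dataclass
class FailoverTimeline:
    """Durations of each failover step, in seconds (caller-supplied)."""

    # Serviceguard component of the failover time
    node_failure_detection: float
    election: float
    lock_acquisition: float
    quiescence: float
    cluster_component_recovery: float
    # Application-dependent component of the failover time
    resource_recovery: float
    application_recovery: float

    def serviceguard_time(self) -> float:
        # Sum of the five Serviceguard steps, assumed to be sequential.
        return (
            self.node_failure_detection
            + self.election
            + self.lock_acquisition
            + self.quiescence
            + self.cluster_component_recovery
        )

    def application_time(self) -> float:
        # Sum of the two application-dependent steps.
        return self.resource_recovery + self.application_recovery

    def total_time(self) -> float:
        # Total failover time as seen by the application's users.
        return self.serviceguard_time() + self.application_time()
```

Keeping the two components separate makes it easier to see whether tuning effort belongs in the cluster configuration or in the application's own startup and recovery.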
As shown in Figure 2, the application-dependent steps differ slightly for an Oracle® Real
Application Clusters (RAC) package in a Serviceguard cluster with Serviceguard Extension for RAC.
[Figure 1 diagram. Timeline segments: node failure detection; election; lock acquisition; quiescence; cluster component recovery; resource recovery (VG, FS, IP); application recovery. Span labels: cluster reformation; Serviceguard failover time; application failover time.]