Optimizing Serviceguard Failover Time, Version A.11.19 and later, April 2009

Figure 1. Steps in a failover caused by a failed node: Serviceguard implementation

Note: Diagram is not to scale. Application failover time varies by application.

When a failover is caused by a node failure (not a package failure), the Serviceguard component of the total failover time consists of five steps: node failure detection, election, lock acquisition, quiescence, and cluster component recovery.

• Node failure detection: Serviceguard notices that a cluster node is not in communication with the other cluster nodes. It begins to re-form the cluster without the failed node.

• Election: The cluster nodes decide which nodes will be in the re-formed cluster.

• Lock acquisition: If more than one group of nodes wants to re-form the cluster and no group has a clear majority of members, the first group to reach the cluster lock re-forms the cluster.

• Quiescence: During this quiet waiting time, non-members of the newly formed cluster are rebooted.

• Cluster component recovery: Serviceguard performs miscellaneous tasks, such as cluster information synchronization and package determination, before the cluster resumes work.
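The lock acquisition rule described above can be sketched in a few lines of Python. This is only an illustrative model, not Serviceguard code: the function name, its parameters, and the representation of groups as plain integers are all assumptions made for this example. It also simplifies by letting any listed group contend for the lock.

```python
from typing import Optional, Sequence


def choose_reforming_group(
    group_sizes: Sequence[int],
    previous_cluster_size: int,
    lock_arrival_order: Sequence[int],
) -> Optional[int]:
    """Return the index of the group that re-forms the cluster, or None.

    All names here are hypothetical and exist only for this sketch:
    group_sizes[i] is the number of nodes in surviving group i,
    previous_cluster_size is the node count before the failure, and
    lock_arrival_order lists group indices in the order they reach the lock.
    """
    # A group holding a clear (strict) majority re-forms without the lock.
    for index, size in enumerate(group_sizes):
        if 2 * size > previous_cluster_size:
            return index

    # No clear majority: the first group to reach the cluster lock wins.
    if lock_arrival_order:
        return lock_arrival_order[0]

    # No group reached the lock, so no group re-forms in this model.
    return None


if __name__ == "__main__":
    # Two-node cluster, one node failed: the survivor is exactly half of
    # the previous membership, so it needs the lock to re-form.
    print(choose_reforming_group([1], 2, [0]))
    # Four-node cluster split 2/2: group 1 reaches the lock first.
    print(choose_reforming_group([2, 2], 4, [1, 0]))
```

In this model the lock is consulted only when no group has a strict majority, which mirrors the wording of the lock acquisition step.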

During the application-dependent phase of the failover time, Serviceguard starts the package as defined by the user. In Serviceguard implementations, this phase has two steps, as shown in Figure 1.

• Resource recovery: The package’s resources are made available.

• Application recovery: If applications or processes were moved to a new node, they are restarted.
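To make the breakdown concrete, the total failover time can be treated as the sum of the Serviceguard steps and the application-dependent steps. The Python sketch below does only that arithmetic. The class name, field names, and the durations in the example are hypothetical and are not measurements from any real cluster.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailoverPhases:
    """Durations in seconds for each step (hypothetical field names)."""

    node_failure_detection: float
    election: float
    lock_acquisition: float
    quiescence: float
    cluster_component_recovery: float
    resource_recovery: float
    application_recovery: float

    def serviceguard_time(self) -> float:
        # Sum of the five steps attributed to Serviceguard.
        return (
            self.node_failure_detection
            + self.election
            + self.lock_acquisition
            + self.quiescence
            + self.cluster_component_recovery
        )

    def application_time(self) -> float:
        # Sum of the two application-dependent steps.
        return self.resource_recovery + self.application_recovery

    def total_time(self) -> float:
        return self.serviceguard_time() + self.application_time()


if __name__ == "__main__":
    # Made-up durations, chosen only to illustrate the sums.
    example = FailoverPhases(
        node_failure_detection=10.0,
        election=1.0,
        lock_acquisition=0.0,
        quiescence=5.0,
        cluster_component_recovery=2.0,
        resource_recovery=8.0,
        application_recovery=30.0,
    )
    print(example.serviceguard_time(), example.application_time(), example.total_time())
```

Keeping the two subtotals separate reflects the split in Figure 1: the Serviceguard portion depends on the cluster, while the application portion varies by application.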

As shown in Figure 2, the application-dependent steps are slightly different for an Oracle® Real Application Clusters (RAC) package in a Serviceguard cluster with Serviceguard Extension for RAC.

[Figure 1 diagram labels. Serviceguard failover time: node failure detection, election, lock acquisition, quiescence, cluster component recovery. Application failover time: resource recovery (VG, FS, IP), application recovery. The diagram also carries a "Cluster reformation" label.]