Optimizing Serviceguard Failover Time, Version A.11.19 and later, April 2009

Figure 2. Steps in a failover caused by a failed node: Serviceguard with Serviceguard Extension for RAC implementation
Note: Diagram is not to scale.
The two application-dependent steps for a RAC implementation are:
Group membership reconfiguration: If there is a change in membership, RAC starts the
reconfiguration.
RAC reconfiguration and database recovery: After a cluster membership change, RAC reassigns
the database locks that were on failed nodes and restarts the databases.
Node failure detection
A node may become unreachable for many reasons. There may be a transient interruption from which
the node can recover automatically in a short time, such as a spike in network activity, I/O, or CPU,
or a temporary kernel hang. Or there may be a failure from which the node cannot recover
automatically or quickly, such as hardware or power supply failure, or an operating system crash.
No matter what the cause, if a node does not get a heartbeat message from another node during the
time specified for MEMBER_TIMEOUT, it will declare the other node unreachable and begin the
process of re-forming the cluster without the failed node. For more information about the process, see
“What Happens When a Node Times Out” in chapter 3 of Managing Serviceguard. (You can find
the latest version of that manual at the address given at the end of this document.)
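The timeout rule described above can be illustrated with a minimal sketch. This is not Serviceguard code: the class name PeerMonitor, its methods, and the use of seconds as the unit are assumptions made purely for illustration, and the 14-second value in the usage example is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PeerMonitor:
    """Tracks the last heartbeat seen from one peer node (illustrative only)."""

    member_timeout_s: float  # stands in for MEMBER_TIMEOUT, assumed here to be in seconds
    last_heartbeat_s: float  # timestamp of the most recent heartbeat from the peer

    def record_heartbeat(self, now_s: float) -> None:
        # Each heartbeat received restarts the timeout window.
        self.last_heartbeat_s = now_s

    def is_unreachable(self, now_s: float) -> bool:
        # The peer is declared unreachable once no heartbeat has arrived for
        # the whole timeout interval, regardless of the underlying cause.
        return (now_s - self.last_heartbeat_s) >= self.member_timeout_s


# Usage with a hypothetical 14-second timeout.
monitor = PeerMonitor(member_timeout_s=14.0, last_heartbeat_s=0.0)
monitor.record_heartbeat(now_s=10.0)
print(monitor.is_unreachable(now_s=20.0))  # 10 s of silence: still within the window
print(monitor.is_unreachable(now_s=24.0))  # 14 s of silence: declared unreachable
```

The sketch makes one point only: the detecting node cannot tell a transient stall from a permanent failure. Any silence that lasts the full timeout is treated the same way.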
Cluster reformation time
Cluster reformation time includes three components:
1. Election of Cluster Membership
2. Lock Acquisition
3. Quiescence
Each of these is discussed in the sections that follow.
The cluster reformation is done aggressively so that new membership can be formed quickly. During
this time heartbeats are also exchanged more frequently. During the first two steps, temporary failures
(such as those due to a network spike or system hang) that are greater than one-tenth of
MEMBER_TIMEOUT can result in node failure. This should be an important consideration in deciding
the MEMBER_TIMEOUT value; see “MEMBER_TIMEOUT value”.
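As a rough worked example of the one-tenth rule, the sketch below computes how long a transient interruption could last during the election and lock-acquisition steps before it risks being treated as a node failure. The function names and the choice of seconds as the unit are assumptions for illustration only, and the timeout values are hypothetical.

```python
def reformation_tolerance(member_timeout_s: float) -> float:
    """Longest interruption tolerated during the first two reformation steps.

    Assumes the limit is exactly one-tenth of MEMBER_TIMEOUT, as stated in
    the text; the unit (seconds) is an illustrative choice.
    """
    return member_timeout_s / 10.0


def at_risk_during_reformation(interruption_s: float, member_timeout_s: float) -> bool:
    # Interruptions longer than the tolerance can result in node failure.
    return interruption_s > reformation_tolerance(member_timeout_s)


# Hypothetical values: compare a 14-second and a 30-second timeout
# against a 2-second network spike.
for timeout_s in (14.0, 30.0):
    tolerance_s = reformation_tolerance(timeout_s)
    risky = at_risk_during_reformation(2.0, timeout_s)
    print(f"timeout={timeout_s}s tolerance={tolerance_s}s 2s-spike-at-risk={risky}")
```

Under these assumed numbers, a 2-second spike exceeds the tolerance for the shorter timeout but not for the longer one. This is the trade-off to weigh when choosing MEMBER_TIMEOUT.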
[Figure 2 diagram labels. Steps: Node failure detection; Cluster reformation (Election, Lock acquisition, Quiescence); Cluster component recovery; Group membership reconfiguration; RAC reconfiguration and database recovery. Spans: Serviceguard failover time; Application failover time.]