Optimizing Serviceguard Failover Time, Version A.11.19 and later, April 2009


Figure 2. Steps in a failover caused by a failed node: Serviceguard with Serviceguard Extension for RAC implementation

Note: Diagram is not to scale.

The two application-dependent steps for a RAC implementation are:

• Group membership reconfiguration: If there is a change in membership, RAC starts the reconfiguration.

• RAC reconfiguration and database recovery: After a cluster membership change, RAC reassigns the database locks that were on failed nodes and restarts the databases.

Node failure detection

A node may become unreachable for many reasons. There may be a transient interruption from which the node can recover automatically in a short time, such as a spike in network activity, I/O, or CPU, or a temporary kernel hang. Or there may be a failure from which the node cannot recover automatically or quickly, such as hardware or power supply failure, or an operating system crash.

No matter what the cause, if a node does not get a heartbeat message from another node during the time specified for MEMBER_TIMEOUT, it will declare the other node unreachable and begin the process of re-forming the cluster without the failed node. For more information about the process, see “What Happens When a Node Times Out” in chapter 3 of Managing Serviceguard. (You can find the latest version of that manual at the address given at the end of this document.)
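The timeout rule described above can be illustrated with a minimal Python sketch. This is not Serviceguard code: the class `HeartbeatMonitor`, its method names, the node names, and the 14-second timeout are all hypothetical, chosen only to show how a node might track the last heartbeat from each peer and flag peers that have been silent for the MEMBER_TIMEOUT interval. Treating "silent for exactly the timeout" as unreachable is also an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Hypothetical model of heartbeat tracking on one cluster node."""

    member_timeout_s: float  # illustrative stand-in for MEMBER_TIMEOUT, in seconds
    last_seen: dict = field(default_factory=dict)  # peer name -> time of last heartbeat

    def record_heartbeat(self, node: str, now: float) -> None:
        """Remember when the most recent heartbeat from `node` arrived."""
        self.last_seen[node] = now

    def unreachable_nodes(self, now: float) -> list:
        """Return peers that have sent no heartbeat for member_timeout_s or longer."""
        return sorted(
            node
            for node, seen in self.last_seen.items()
            if now - seen >= self.member_timeout_s
        )


# Example with made-up timestamps (seconds): node "b" keeps sending, node "a" stops.
monitor = HeartbeatMonitor(member_timeout_s=14.0)
monitor.record_heartbeat("a", now=0.0)
monitor.record_heartbeat("b", now=10.0)
silent = monitor.unreachable_nodes(now=15.0)  # only "a" has been silent long enough
```

In this model, a non-empty result is the point at which the surviving nodes would begin re-forming the cluster without the silent peer.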

Cluster reformation time

Cluster reformation time includes three components:

1. Election of Cluster Membership

2. Lock Acquisition

3. Quiescence

Each of these is discussed in the sections that follow.

Cluster reformation is done aggressively so that the new membership can be formed quickly. During this time heartbeats are also exchanged more frequently. During the first two steps (election and lock acquisition), temporary failures (such as those due to a network spike or system hang) that last longer than one tenth of MEMBER_TIMEOUT can result in node failure. This should be an important consideration in deciding the MEMBER_TIMEOUT value; see “MEMBER_TIMEOUT value”.
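To make the one-tenth rule concrete, the following Python sketch compares the length of a transient interruption with the applicable limit. The function names and the sample values (a 14-second timeout, a 2-second interruption) are hypothetical and serve only as a worked example of the arithmetic; they do not describe how Serviceguard itself is implemented.

```python
def reformation_tolerance_s(member_timeout_s: float) -> float:
    """Longest interruption tolerated during election and lock acquisition,
    taken here as one tenth of the MEMBER_TIMEOUT value (in seconds)."""
    return member_timeout_s / 10.0


def interruption_risks_node_failure(
    interruption_s: float, member_timeout_s: float, during_reformation: bool
) -> bool:
    """Return True if an interruption of this length can result in node failure."""
    if during_reformation:
        limit = reformation_tolerance_s(member_timeout_s)
    else:
        limit = member_timeout_s  # normal operation: the full timeout applies
    return interruption_s > limit


# Illustrative values only: a 2-second network spike with a 14-second timeout.
risky_in_reformation = interruption_risks_node_failure(2.0, 14.0, during_reformation=True)
risky_in_normal_run = interruption_risks_node_failure(2.0, 14.0, during_reformation=False)
```

With these sample numbers the same 2-second spike is harmless in normal operation but exceeds the 1.4-second limit during reformation, which is why a small MEMBER_TIMEOUT leaves little margin for transient delays.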

[Figure 2 labels: Node failure detection; Cluster reformation (Election, Lock acquisition, Quiescence); Cluster component recovery; Group membership reconfiguration; RAC reconfiguration and database recovery. Spans shown: Serviceguard failover time; Application failover time.]