Managing HP Serviceguard for Linux Ninth Edition, April 2009

SystemA; volume group vg02 is exclusively activated on SystemB. Package IP

addresses are assigned to SystemA and SystemB respectively.

Failure. Only one LAN has been configured for both heartbeat and data traffic. During

the course of operations, heavy application traffic monopolizes the bandwidth of the

network, preventing heartbeat packets from getting through.

Since SystemA does not receive heartbeat messages from SystemB, SystemA attempts

to re-form as a one-node cluster. Likewise, since SystemB does not receive heartbeat

messages from SystemA, SystemB also attempts to re-form as a one-node cluster.

During the election protocol, each node votes for itself, giving both nodes 50 percent

of the vote. Because both nodes have 50 percent of the vote, both nodes now vie for the

cluster lock. Only one node will get the lock.
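The tie-break described above can be illustrated with a minimal sketch. It is not Serviceguard code: the ClusterLock class and the resolve_partition function are hypothetical names introduced here for illustration. The rule that a partition holding more than half of the votes re-forms without needing the lock is an assumption beyond what this passage states. The passage itself covers only the 50 percent case.

```python
import threading


class ClusterLock:
    """Hypothetical stand-in for the cluster lock: the first claimant wins."""

    def __init__(self):
        self._mutex = threading.Lock()
        self.owner = None

    def try_acquire(self, node):
        # Only one node can ever become the owner.
        with self._mutex:
            if self.owner is None:
                self.owner = node
                return True
            return False


def resolve_partition(visible_nodes, cluster_size, lock, node):
    """Return the action a node takes after losing heartbeat with its peers."""
    votes = len(visible_nodes)
    if 2 * votes > cluster_size:
        # Assumed rule: a clear majority re-forms without the lock.
        return "re-form"
    if 2 * votes == cluster_size and lock.try_acquire(node):
        # Exactly 50 percent: the node that wins the lock re-forms.
        return "re-form"
    # Lost the race for the lock, or holds a minority of the votes.
    return "halt"


lock = ClusterLock()
# Each node sees only itself, so each holds 1 of 2 votes.
outcome_a = resolve_partition({"SystemA"}, 2, lock, "SystemA")
outcome_b = resolve_partition({"SystemB"}, 2, lock, "SystemB")
print(outcome_a, outcome_b)
```

In this sketch SystemA happens to claim the lock first. In a real cluster, which node wins depends on timing.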

Outcome. Assume SystemA gets the cluster lock. SystemA re-forms as a one-node

cluster. After re-formation, SystemA will make sure all applications configured to run

on an existing clustered node are running. When SystemA discovers Package2 is not

running in the cluster, it will try to start Package2 if Package2 is configured to run

on SystemA.

SystemB recognizes that it has failed to get the cluster lock and so cannot re-form the

cluster. To release all resources related to Package2 (such as exclusive access to volume

group vg02 and the Package2 IP address) as quickly as possible, SystemB halts

(system reset).

NOTE: If AUTOSTART_CMCLD in /etc/rc.config.d/cmcluster

($SGAUTOSTART) is set to zero, the node will not attempt to join the cluster when it

comes back up.

For more information on cluster failover, see the white paper Optimizing Failover Time

in a Serviceguard Environment (version A.11.19 and later) at http://www.docs.hp.com

-> High Availability -> Serviceguard -> White Papers. For

troubleshooting information, see “Cluster Re-formations Caused by

MEMBER_TIMEOUT Being Set too Low” (page 282).

Responses to Hardware Failures

If a serious system problem occurs, such as a system panic or physical disruption of

the SPU's circuits, Serviceguard recognizes a node failure and transfers the packages

currently running on that node to an adoptive node elsewhere in the cluster. (System

multi-node and multi-node packages do not fail over.)

The new location for each package is determined by that package's configuration file,

which lists primary and alternate nodes for the package. Transfer of a package to

another node does not transfer the program counter. Processes in a transferred package

will restart from the beginning. In order for an application to be expeditiously restarted

after a failure, it must be “crash-tolerant”; that is, all processes in the package must be
