Managing HP Serviceguard for Linux Ninth Edition, April 2009

SystemA; volume group vg02 is exclusively activated on SystemB. Package IP
addresses are assigned to SystemA and SystemB respectively.
Failure. Only one LAN has been configured for both heartbeat and data traffic. During
the course of operations, heavy application traffic monopolizes the bandwidth of the
network, preventing heartbeat packets from getting through.
Since SystemA does not receive heartbeat messages from SystemB, SystemA attempts
to re-form as a one-node cluster. Likewise, since SystemB does not receive heartbeat
messages from SystemA, SystemB also attempts to re-form as a one-node cluster.
During the election protocol, each node votes for itself, giving both nodes 50 percent
of the vote. Because both nodes have 50 percent of the vote, both nodes now vie for the
cluster lock. Only one node will get the lock.
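The tie-breaking step can be illustrated with a minimal Python sketch. It is not Serviceguard code: the names quorum_decision, ClusterLock, and resolve are hypothetical, and the rule that a strict majority re-forms without the lock while a minority cannot re-form is an assumption of the sketch. The text above states only that a 50 percent share sends both nodes to the cluster lock and that one node gets it.

```python
from enum import Enum
from typing import Optional


class Decision(Enum):
    FORM = "re-form cluster"
    NEED_LOCK = "tie: must acquire cluster lock"
    HALT = "cannot re-form"


def quorum_decision(votes: int, previous_members: int) -> Decision:
    """Classify a re-formation attempt by its share of the previous membership.

    Assumption of this sketch: more than half re-forms outright, exactly
    half must win the cluster lock, less than half cannot re-form.
    """
    if 2 * votes > previous_members:
        return Decision.FORM
    if 2 * votes == previous_members:
        return Decision.NEED_LOCK
    return Decision.HALT


class ClusterLock:
    """Hypothetical stand-in for the cluster lock: first claimant wins."""

    def __init__(self) -> None:
        self.owner: Optional[str] = None

    def try_acquire(self, node: str) -> bool:
        if self.owner is None:
            self.owner = node
            return True
        return False


def resolve(node: str, votes: int, previous_members: int,
            lock: ClusterLock) -> Decision:
    """Return FORM or HALT for one node, using the lock to break a tie."""
    decision = quorum_decision(votes, previous_members)
    if decision is Decision.NEED_LOCK:
        return Decision.FORM if lock.try_acquire(node) else Decision.HALT
    return decision


# Two-node scenario from the text: each node votes only for itself.
lock = ClusterLock()
result_a = resolve("SystemA", votes=1, previous_members=2, lock=lock)
result_b = resolve("SystemB", votes=1, previous_members=2, lock=lock)
```

In this run SystemA reaches the lock first, so result_a is Decision.FORM and result_b is Decision.HALT. Which node wins in a real cluster depends on timing.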
Outcome. Assume SystemA gets the cluster lock. SystemA re-forms as a one-node
cluster. After re-formation, SystemA makes sure that all applications configured to run
on an existing cluster node are running. When SystemA discovers that Package2 is not
running in the cluster, it tries to start Package2, provided that Package2 is configured
to run on SystemA.
SystemB recognizes that it has failed to get the cluster lock and so cannot re-form the
cluster. To release all resources related to Package2 (such as exclusive access to volume
group vg02 and the Package2 IP address) as quickly as possible, SystemB halts
(system reset).
NOTE: If AUTOSTART_CMCLD in /etc/rc.config.d/cmcluster
($SGAUTOSTART) is set to zero, the node will not attempt to join the cluster when it
comes back up.
For more information on cluster failover, see the white paper Optimizing Failover Time
in a Serviceguard Environment (version A.11.19 and later) at http://www.docs.hp.com
-> High Availability -> Serviceguard -> White Papers. For
troubleshooting information, see “Cluster Re-formations Caused by
MEMBER_TIMEOUT Being Set too Low” (page 282).
Responses to Hardware Failures
If a serious system problem occurs, such as a system panic or physical disruption of
the SPU's circuits, Serviceguard recognizes a node failure and transfers the packages
currently running on that node to an adoptive node elsewhere in the cluster. (System
multi-node and multi-node packages do not fail over.)
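As a rough illustration of choosing an adoptive node, the Python sketch below assumes that each failover package carries an ordered list of candidate nodes and that the first candidate still up is chosen. The Package class, its fields, and choose_adoptive_node are hypothetical names for this example, not Serviceguard interfaces, and the actual selection may depend on configured policies that the sketch does not model.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Package:
    name: str
    node_list: List[str]    # assumed order: primary first, then alternates
    failover: bool = True   # False models system multi-node / multi-node packages


def choose_adoptive_node(pkg: Package, failed_node: str,
                         up_nodes: Set[str]) -> Optional[str]:
    """Pick the first listed node that is up and is not the failed node.

    Returns None if the package does not fail over or if no candidate
    node is available.
    """
    if not pkg.failover:
        return None
    for node in pkg.node_list:
        if node != failed_node and node in up_nodes:
            return node
    return None


# Example: Package2 normally runs on SystemB; SystemB has failed.
package2 = Package("Package2", ["SystemB", "SystemA"])
target = choose_adoptive_node(package2, failed_node="SystemB",
                              up_nodes={"SystemA"})
```

Here target is "SystemA". A package created with failover=False returns None, which mirrors the statement that system multi-node and multi-node packages do not fail over.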
The new location for each package is determined by that package's configuration file,
which lists primary and alternate nodes for the package. Transfer of a package to
another node does not transfer the program counter. Processes in a transferred package
will restart from the beginning. In order for an application to be expeditiously restarted
after a failure, it must be “crash-tolerant”; that is, all processes in the package must be