Managing HP Serviceguard for Linux, Tenth Edition, September 2012

A reboot is also initiated by Serviceguard itself under specific circumstances; see
“Responses to Package and Service Failures” (page 89).
What Happens when a Node Times Out
Each node sends a heartbeat message to all other nodes at an interval equal to one-fourth
of the configured MEMBER_TIMEOUT value or 1 second, whichever is less. You
configure MEMBER_TIMEOUT in the cluster configuration file; see “Cluster Configuration
Parameters” (page 103). The heartbeat interval is not directly configurable. If a node
fails to send a heartbeat message within the time set by MEMBER_TIMEOUT, the cluster
re-forms without that node.
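The interval rule above can be illustrated with a short calculation. The following is a minimal sketch, not Serviceguard code: the function name is hypothetical, and it assumes MEMBER_TIMEOUT is expressed in microseconds, as in the description of failure detection in this section.

```python
ONE_SECOND_US = 1_000_000  # 1 second expressed in microseconds


def heartbeat_interval_us(member_timeout_us: int) -> int:
    """Return the heartbeat interval implied by a MEMBER_TIMEOUT value.

    Hypothetical helper for illustration only: the interval is one-fourth
    of MEMBER_TIMEOUT or 1 second, whichever is less.
    """
    return min(member_timeout_us // 4, ONE_SECOND_US)


# A 14-second timeout is capped at the 1-second ceiling.
long_timeout_interval = heartbeat_interval_us(14_000_000)

# A 2-second timeout yields a heartbeat every half second.
short_timeout_interval = heartbeat_interval_us(2_000_000)
```

With these inputs, the first call returns 1,000,000 microseconds and the second returns 500,000 microseconds.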
When a node detects that another node has failed (that is, no heartbeat message has
arrived within MEMBER_TIMEOUT microseconds), the following sequence of events occurs:
1. The node contacts the other nodes and tries to re-form the cluster without the failed
node.
2. If the remaining nodes are a majority or can obtain the cluster lock, they form a new
cluster without the failed node.
3. If the remaining nodes are not a majority or cannot get the cluster lock, they halt
(system reset).
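The decision in steps 2 and 3 can be sketched as a small function. This is an illustrative model only, not the Serviceguard implementation: the function and parameter names are hypothetical, and "majority" is assumed to mean more than half of the nodes in the previous cluster membership.

```python
def reformation_outcome(remaining_nodes: int,
                        previous_cluster_size: int,
                        got_cluster_lock: bool) -> str:
    """Model the outcome for a group of surviving nodes.

    Hypothetical helper: the group re-forms the cluster if it is a
    majority or has obtained the cluster lock; otherwise it halts.
    """
    # Assumption: a majority is strictly more than half of the old cluster.
    has_majority = 2 * remaining_nodes > previous_cluster_size
    if has_majority or got_cluster_lock:
        return "re-form"
    return "halt"


# Two of three nodes survive: a majority, so no lock is needed.
three_node_case = reformation_outcome(2, 3, got_cluster_lock=False)

# One of two nodes survives without the lock: not a majority, so it halts.
two_node_no_lock = reformation_outcome(1, 2, got_cluster_lock=False)
```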
Example
Situation. Assume a two-node cluster, with Package1 running on SystemA and
Package2 running on SystemB. Volume group vg01 is exclusively activated on
SystemA; volume group vg02 is exclusively activated on SystemB. Package IP addresses
are assigned to SystemA and SystemB respectively.
Failure. Only one LAN has been configured for both heartbeat and data traffic. During
the course of operations, heavy application traffic monopolizes the bandwidth of the
network, preventing heartbeat packets from getting through.
Since SystemA does not receive heartbeat messages from SystemB, SystemA attempts
to re-form as a one-node cluster. Likewise, since SystemB does not receive heartbeat
messages from SystemA, SystemB also attempts to re-form as a one-node cluster. During
the election protocol, each node votes for itself, giving both nodes 50 percent of the
vote. Because both nodes have 50 percent of the vote, both nodes now vie for the cluster
lock. Only one node will get the lock.
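The tie-break in this scenario can be modeled with a toy lock that grants ownership to the first node that requests it. This is a simplified sketch under stated assumptions: the ClusterLock class and resolve_split function are hypothetical, and the order in which nodes reach the lock is supplied by the caller rather than determined by real timing.

```python
class ClusterLock:
    """Toy stand-in for a cluster lock: only the first claimant succeeds."""

    def __init__(self) -> None:
        self.owner = None

    def try_acquire(self, node: str) -> bool:
        # The first node to ask becomes the owner; later requests fail.
        if self.owner is None:
            self.owner = node
        return self.owner == node


def resolve_split(nodes_in_request_order: list) -> dict:
    """Map each node to 're-form' or 'halt' after an even split.

    Assumption: each node holds exactly 50 percent of the vote, so the
    lock alone decides which node survives.
    """
    lock = ClusterLock()
    return {
        node: "re-form" if lock.try_acquire(node) else "halt"
        for node in nodes_in_request_order
    }


# SystemA reaches the lock first in this run of the model.
outcome = resolve_split(["SystemA", "SystemB"])
```

In this model, whichever node is listed first obtains the lock and re-forms the cluster; the other node halts.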
Outcome. Assume SystemA gets the cluster lock. SystemA re-forms as a one-node
cluster. After re-formation, SystemA will make sure all applications configured to run
on an existing clustered node are running. When SystemA discovers that Package2 is not
running in the cluster, it will try to start Package2, provided Package2 is configured to
run on SystemA.
SystemB recognizes that it has failed to get the cluster lock and so cannot re-form the
cluster. To release all resources related to Package2 (such as exclusive access to volume