Managing HP Serviceguard for Linux Ninth Edition, April 2009

Responses to Failures
Serviceguard responds to different kinds of failures in specific ways. For most hardware
failures, the response is not user-configurable, but for package and service failures,
you can choose the system’s response, within limits.
Reboot When a Node Fails
The most dramatic response to a failure in a Serviceguard cluster is a system reboot.
This allows packages to move quickly to another node, protecting the integrity of the
data.
A reboot is done if a cluster node cannot communicate with the majority of cluster
members for the pre-determined time, or under other circumstances such as a kernel
hang or failure of the cluster daemon (cmcld). When this happens, you may see the
following message on the console:
DEADMAN: Time expired, initiating system restart.
This case is covered in more detail under “What Happens when a Node Times Out”.
See also “Cluster Daemon: cmcld” (page 39).
A reboot is also initiated by Serviceguard itself under specific circumstances; see
“Responses to Package and Service Failures” (page 90).
What Happens when a Node Times Out
Each node sends a heartbeat message to all other nodes at an interval equal to one-fourth
of the value of the configured MEMBER_TIMEOUT or 1 second, whichever is less. You
configure MEMBER_TIMEOUT in the cluster configuration file; see “Cluster
Configuration Parameters” (page 100). The heartbeat interval is not directly configurable.
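The rule for the heartbeat interval can be illustrated with a short calculation. The following is a minimal sketch in Python, not Serviceguard code: the function name heartbeat_interval_us is hypothetical, and it assumes MEMBER_TIMEOUT is expressed in microseconds.

```python
ONE_SECOND_US = 1_000_000  # one second, expressed in microseconds


def heartbeat_interval_us(member_timeout_us: int) -> float:
    """Return the heartbeat interval implied by MEMBER_TIMEOUT.

    Hypothetical helper for illustration only. It applies the stated rule:
    one-fourth of MEMBER_TIMEOUT or 1 second, whichever is less.
    """
    if member_timeout_us <= 0:
        raise ValueError("MEMBER_TIMEOUT must be a positive number of microseconds")
    return min(member_timeout_us / 4, ONE_SECOND_US)


# A 14-second timeout is capped at a 1-second heartbeat interval.
print(heartbeat_interval_us(14_000_000))
# A 2-second timeout gives a heartbeat interval of one-fourth of that value.
print(heartbeat_interval_us(2_000_000))
```

Under this rule, any MEMBER_TIMEOUT of 4 seconds or more gives the same 1-second interval; only shorter timeouts shorten the interval.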
If a node fails to send a heartbeat message within the time set by MEMBER_TIMEOUT,
the cluster is re-formed without the node that is no longer sending heartbeat messages.
When a node detects that another node has failed (that is, no heartbeat message has
arrived within MEMBER_TIMEOUT microseconds), the following sequence of events
occurs:
1. The node contacts the other nodes and tries to re-form the cluster without the
failed node.
2. If the remaining nodes are a majority or can obtain the cluster lock, they form a
new cluster without the failed node.
3. If the remaining nodes are not a majority or cannot get the cluster lock, they halt
(system reset).
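The decision in steps 2 and 3 can be modeled as a small function. This is a simplified sketch of the rule as stated above, not the actual cmcld logic; the function name, parameter names, and return strings are hypothetical.

```python
def reformation_outcome(remaining_nodes: int,
                        previous_cluster_size: int,
                        got_cluster_lock: bool) -> str:
    """Model what a group of surviving nodes does after a node times out.

    Illustrative only: the group forms a new cluster if it is a majority
    of the previous cluster or can obtain the cluster lock; otherwise it
    halts (system reset).
    """
    if remaining_nodes < 0 or remaining_nodes > previous_cluster_size:
        raise ValueError("remaining_nodes must be between 0 and previous_cluster_size")
    is_majority = remaining_nodes * 2 > previous_cluster_size
    if is_majority or got_cluster_lock:
        return "form_new_cluster"
    return "halt"


# Two of three nodes remain: a majority, so no cluster lock is needed.
print(reformation_outcome(2, 3, got_cluster_lock=False))
# One of two nodes remains: not a majority, so the cluster lock decides.
print(reformation_outcome(1, 2, got_cluster_lock=True))
print(reformation_outcome(1, 2, got_cluster_lock=False))
```

In a two-node cluster, a single surviving node is never a majority, so the outcome there depends entirely on whether it obtains the cluster lock.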
Example
Situation. Assume a two-node cluster, with Package1 running on SystemA and
Package2 running on SystemB. Volume group vg01 is exclusively activated on