Managing HP Serviceguard for Linux, Seventh Edition, July 2007

Understanding Serviceguard Software Components
Responses to Failures
HP Serviceguard responds to different kinds of failures in specific ways.
For most hardware failures, the response is not user-configurable, but for
package and service failures, you can choose the system’s response,
within limits.
Reboot When a Node Fails
The most dramatic response to a failure in a Serviceguard cluster is a
system reboot. This allows packages to move quickly to another node,
protecting the integrity of the data.
A reboot occurs if a cluster node cannot communicate with the majority
of cluster members for a predetermined time, or under other
circumstances such as a kernel hang or failure of the cluster daemon
(cmcld). When this happens, you may see the following message on the
console:
DEADMAN: Time expired, initiating system restart.
This case is covered in more detail under “What Happens when a Node
Times Out”. See also “Cluster Daemon: cmcld” on page 35.
A reboot is also initiated by Serviceguard itself under specific
circumstances; see “Responses to Package and Service Failures” on
page 87.
What Happens when a Node Times Out
Each node sends a heartbeat message to the cluster coordinator every
HEARTBEAT_INTERVAL microseconds (as specified in the cluster
configuration file). The cluster coordinator looks for this message
from each node, and if it does not receive it within NODE_TIMEOUT
microseconds, the cluster re-forms without the node that is no longer
sending heartbeat messages. (See the HEARTBEAT_INTERVAL and NODE_TIMEOUT
entries under “Cluster Configuration Parameters” on page 106 for advice
about configuring these parameters.)
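The coordinator's timeout check described above can be sketched in a few
lines of Python. This is a conceptual illustration only, not Serviceguard
code: the Coordinator class, its method names, and the numeric values are
hypothetical and chosen for the example. Only the parameter names
HEARTBEAT_INTERVAL and NODE_TIMEOUT, and the fact that they are given in
microseconds, come from the text.

```python
from dataclasses import dataclass, field

# Illustrative values in microseconds (assumed for this sketch); the real
# settings come from the cluster configuration file.
HEARTBEAT_INTERVAL_US = 1_000_000
NODE_TIMEOUT_US = 2_000_000


@dataclass
class Coordinator:
    """Hypothetical model of the coordinator's heartbeat bookkeeping."""
    node_timeout_us: int
    last_heartbeat_us: dict = field(default_factory=dict)

    def record_heartbeat(self, node: str, now_us: int) -> None:
        # Remember when each node was last heard from.
        self.last_heartbeat_us[node] = now_us

    def timed_out_nodes(self, now_us: int) -> list:
        # A node has timed out if no heartbeat arrived within node_timeout_us.
        return sorted(
            node
            for node, seen in self.last_heartbeat_us.items()
            if now_us - seen > self.node_timeout_us
        )

    def reform(self, now_us: int) -> list:
        # Re-form the membership without the nodes that stopped sending.
        for node in self.timed_out_nodes(now_us):
            del self.last_heartbeat_us[node]
        return sorted(self.last_heartbeat_us)


coordinator = Coordinator(node_timeout_us=NODE_TIMEOUT_US)
coordinator.record_heartbeat("node1", 0)
coordinator.record_heartbeat("node2", 0)
# node1 keeps sending one interval later; node2 goes silent.
coordinator.record_heartbeat("node1", HEARTBEAT_INTERVAL_US)
members = coordinator.reform(now_us=2_500_000)
print(members)
```

In this sketch node2 is dropped because its last heartbeat is older than
the timeout, while node1 remains a member. The sketch does not model the
quorum and reboot behavior covered elsewhere in this section.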
On a node that is not the cluster coordinator, and on which a node
timeout occurs (that is, no heartbeat message has arrived within
NODE_TIMEOUT microseconds), the following sequence of events occurs: