Managing HP Serviceguard for Linux, Seventh Edition, July 2007

Understanding Serviceguard Software Components

Chapter 3

Responses to Failures

HP Serviceguard responds to different kinds of failures in specific ways.
For most hardware failures, the response is not user-configurable, but for
package and service failures, you can choose the system’s response,
within limits.

Reboot When a Node Fails

The most dramatic response to a failure in a Serviceguard cluster is a
system reboot. This allows packages to move quickly to another node,
protecting the integrity of the data.

A reboot occurs if a cluster node cannot communicate with the majority
of cluster members for the predetermined time, or under other
circumstances such as a kernel hang or failure of the cluster daemon
(cmcld). When this happens, you may see the following message on the
console:

DEADMAN: Time expired, initiating system restart.

This case is covered in more detail under “What Happens when a Node
Times Out”. See also “Cluster Daemon: cmcld” on page 35.

A reboot is also initiated by Serviceguard itself under specific
circumstances; see “Responses to Package and Service Failures” on
page 87.
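
The majority condition described above can be illustrated with a minimal
sketch. It is not Serviceguard's implementation: the function names, the
parameters, and the idea of passing the elapsed time as an argument are all
assumptions made for illustration, and tie-breaking when exactly half of the
nodes are reachable is not modeled.

```python
def has_majority(reachable_nodes: int, cluster_size: int) -> bool:
    """Return True if this node, counting itself, reaches more than half
    of the cluster. Hypothetical helper, not a Serviceguard API."""
    return 2 * reachable_nodes > cluster_size


def should_reboot(reachable_nodes: int, cluster_size: int,
                  isolated_for_us: int, limit_us: int) -> bool:
    """Return True once the node has lacked a majority for at least
    limit_us microseconds (the predetermined time; value is assumed)."""
    return (not has_majority(reachable_nodes, cluster_size)
            and isolated_for_us >= limit_us)
```

Under these assumptions, a node that can reach only itself in a four-node
cluster would reboot once the time limit has passed, while a node that still
reaches three of the four nodes would not.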

What Happens when a Node Times Out

Each node sends a heartbeat message to the cluster coordinator every
HEARTBEAT_INTERVAL microseconds (as specified in the cluster
configuration file). The cluster coordinator looks for this message
from each node, and if it does not receive it within NODE_TIMEOUT
microseconds, the cluster re-forms without the node that is no longer
sending heartbeat messages. (See the HEARTBEAT_INTERVAL and NODE_TIMEOUT
entries under “Cluster Configuration Parameters” on page 106 for advice
about configuring these parameters.)
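
The coordinator's timeout check described above can be sketched as follows.
This is a simplified model under stated assumptions, not the cmcld
implementation: the Coordinator class, its method names, and the two
microsecond values are hypothetical and chosen only for illustration. Times
are passed in explicitly so the logic can be followed without a real clock.

```python
from dataclasses import dataclass, field

# Illustrative values in microseconds; assumptions, not recommendations.
HEARTBEAT_INTERVAL_US = 1_000_000
NODE_TIMEOUT_US = 2_000_000


@dataclass
class Coordinator:
    """Hypothetical model of the coordinator's heartbeat bookkeeping."""
    node_timeout_us: int
    last_seen_us: dict = field(default_factory=dict)

    def record_heartbeat(self, node: str, now_us: int) -> None:
        # Remember when the most recent heartbeat from this node arrived.
        self.last_seen_us[node] = now_us

    def timed_out_nodes(self, now_us: int) -> list:
        # A node has timed out if more than node_timeout_us has elapsed
        # since its last heartbeat.
        return sorted(
            node for node, seen in self.last_seen_us.items()
            if now_us - seen > self.node_timeout_us
        )

    def reform(self, now_us: int) -> list:
        # Re-form the membership without the nodes that have timed out.
        for node in self.timed_out_nodes(now_us):
            del self.last_seen_us[node]
        return sorted(self.last_seen_us)
```

For example, if node1 and node2 both send a heartbeat at time 0, and only
node1 sends another at 1,000,000 microseconds, then at 2,500,000
microseconds timed_out_nodes reports node2, and reform leaves a membership
of node1 alone.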

On a node that is not the cluster coordinator, and on which a node
timeout occurs (that is, no heartbeat message has arrived within
NODE_TIMEOUT microseconds), the following sequence of events occurs: