Managing HP Serviceguard for Linux, Sixth Edition, August 2006

Understanding Serviceguard Software Components

Responses to Failures

HP Serviceguard responds to different kinds of failures in specific ways.
For most hardware failures, the response is not user-configurable, but for
package and service failures, you can choose the system’s response,
within limits.

Transfer of Control (TOC) When a Node Fails

The most dramatic response to a failure in a Serviceguard cluster is a
Linux TOC (Transfer of Control), which is an immediate halt of the SPU
without a graceful shutdown. The TOC is performed to protect the
integrity of your data.

A TOC occurs if a cluster node cannot communicate with the majority of
cluster members for a predetermined time, or if there is a kernel hang,
a kernel spin, a runaway real-time process, or a failure of the HP
Serviceguard cluster daemon, cmcld. During this event, the following
message is sent to the console:

DEADMAN: Time expired, initiating system restart.
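
The timer behavior behind this message can be pictured with a small model.
The sketch below is only an illustration, not Serviceguard code: the class
names SafetyTimer and FakeClock, the check() function, and the timeout
value are all hypothetical. It assumes that a healthy cluster daemon
periodically resets a timer, and that a hang or daemon failure shows up
as a missed reset.

```python
import time


class SafetyTimer:
    """Toy model of a deadman (safety) timer. Hypothetical, for illustration."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock
        self._deadline = clock() + timeout_s

    def reset(self):
        """Called by a healthy daemon to push the deadline further out."""
        self._deadline = self._clock() + self.timeout_s

    def expired(self):
        """True once the deadline has passed without a reset."""
        return self._clock() >= self._deadline


class FakeClock:
    """Manually advanced clock so the model can be exercised without waiting."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


def check(timer):
    """Return the action the modeled node takes at this instant."""
    if timer.expired():
        # Console message quoted from the text above.
        return "TOC: DEADMAN: Time expired, initiating system restart."
    return "OK"
```

With a FakeClock and a 10-second timeout, check() keeps returning "OK" as
long as reset() is called within each 10-second window. Once the resets
stop (modeling a kernel hang or a failed daemon) and the clock passes the
deadline, check() returns the TOC message.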

Serviceguard itself also initiates a TOC under specific circumstances. If
the service failfast parameter is enabled in the package configuration
file, the entire node fails with a TOC whenever that service fails. If
NODE_FAIL_FAST_ENABLED is set to YES in the package configuration file,
the entire node fails with a TOC whenever there is a timeout, or a
failure that causes the package control script to exit with a value
other than 0 or 1.

A node-level failure can also be caused by events independent of a
package and its services. Loss of the heartbeat, or loss of the cluster
daemon (cmcld) or another critical daemon, causes a node to fail even
when its packages and their services are functioning.
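
The failfast rules above can be summarized as a small decision function.
This is a sketch under stated assumptions, not how Serviceguard is
implemented: the PackageConfig class, both function names, and the
return labels are hypothetical. The label "PACKAGE_LEVEL_RESPONSE" is a
placeholder for whatever non-TOC handling applies, which this section
does not describe.

```python
from dataclasses import dataclass


@dataclass
class PackageConfig:
    """Hypothetical stand-in for two package configuration settings."""
    node_fail_fast_enabled: bool = False     # NODE_FAIL_FAST_ENABLED YES/NO
    service_fail_fast_enabled: bool = False  # the service failfast parameter


def response_to_service_failure(cfg):
    """A failed service takes the whole node down only if failfast is on."""
    if cfg.service_fail_fast_enabled:
        return "NODE_TOC"
    return "PACKAGE_LEVEL_RESPONSE"  # placeholder, not defined by the text


def response_to_control_script(cfg, exit_code, timed_out=False):
    """Map a control script result to the node's response."""
    if not timed_out and exit_code == 0:
        return "NONE"  # script succeeded, nothing to respond to
    # Per the text: a timeout, or an exit value other than 0 or 1.
    abnormal = timed_out or exit_code not in (0, 1)
    if abnormal and cfg.node_fail_fast_enabled:
        return "NODE_TOC"
    return "PACKAGE_LEVEL_RESPONSE"  # placeholder, not defined by the text
```

For example, with node_fail_fast_enabled set to True, an exit value of 2
or a timeout maps to "NODE_TOC", while an exit value of 1 does not.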

In some cases, an attempt is made to reboot the system before the TOC.
If the reboot completes before the safety timer expires, the TOC does
not take place. In either case, packages can move quickly to another
node.
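
The race between the reboot attempt and the safety timer can be stated
as a one-line comparison. The function below is a hypothetical
illustration: its name, its parameters, and the idea of measuring both
intervals in seconds are assumptions made for the sketch.

```python
def node_shutdown_outcome(reboot_seconds, timer_remaining_seconds):
    """Return (how the node went down, whether packages can fail over).

    Hypothetical model: the reboot wins only if it completes strictly
    before the safety timer expires; otherwise the TOC takes place.
    """
    if reboot_seconds < timer_remaining_seconds:
        how = "REBOOT"
    else:
        how = "TOC"
    # In either case, packages are able to move to another node.
    return how, True
```

The second element of the result is always True, reflecting the
statement that packages can move to another node whichever path is
taken.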