Managing HP Serviceguard for Linux, Sixth Edition, August 2006

Understanding Serviceguard Software Components
Responses to Failures
HP Serviceguard responds to different kinds of failures in specific ways.
For most hardware failures, the response is not user-configurable, but for
package and service failures, you can choose the system’s response,
within limits.
Transfer of Control (TOC) When a Node Fails
The most dramatic response to a failure in a Serviceguard cluster is a
Linux TOC (Transfer of Control), which is an immediate halt of the SPU
without a graceful shutdown. This TOC is done to protect the integrity of
your data.
A TOC occurs if a cluster node cannot communicate with a majority of
cluster members for a predetermined time, or if there is a kernel hang,
a kernel spin, a runaway real-time process, or a failure of the HP
Serviceguard cluster daemon, cmcld. When this happens, the following
message is sent to the console:
DEADMAN: Time expired, initiating system restart.
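The timer behavior described above can be pictured with a small model. This is only an illustrative sketch in Python, not Serviceguard code: the SafetyTimer class, the first_expiry function, and the timeout values are all hypothetical, and the sketch assumes that a healthy cluster daemon periodically re-arms a timer whose expiry triggers the TOC.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SafetyTimer:
    """Toy deadman timer (hypothetical; not the Serviceguard implementation)."""
    timeout: float          # seconds the node may run without a re-arm
    deadline: float = 0.0   # absolute time at which the timer expires

    def rearm(self, now: float) -> None:
        # A healthy daemon pushes the deadline forward each time it runs.
        self.deadline = now + self.timeout

    def expired(self, now: float) -> bool:
        return now >= self.deadline


def first_expiry(timeout: float, rearm_times: List[float], end: float) -> Optional[float]:
    """Return the time at which the timer expires, or None if it never
    expires up to and including `end`. The timer is armed at time 0."""
    timer = SafetyTimer(timeout)
    timer.rearm(0.0)
    for t in sorted(rearm_times):
        if t > end:
            break
        if timer.expired(t):
            # The re-arm came too late: the node has already been halted.
            return timer.deadline
        timer.rearm(t)
    return timer.deadline if timer.expired(end) else None


if __name__ == "__main__":
    # The daemon re-arms at t=2 and t=4, then hangs; the timeout is 5 seconds.
    when = first_expiry(5.0, [2.0, 4.0], end=20.0)
    if when is not None:
        print(f"t={when}: DEADMAN: Time expired, initiating system restart.")
```

In this model, a hung kernel or a failed daemon simply stops calling rearm, so the deadline passes without any further action from the node itself.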
A TOC is also initiated by Serviceguard itself under specific
circumstances. If the service failfast parameter is enabled in the package
configuration file, the entire node will fail with a TOC whenever there is
a failure of that specific service. If NODE_FAIL_FAST_ENABLED is set to
YES in the package configuration file, the entire node will fail with a TOC
whenever there is a timeout or a failure causing the package control
script to exit with a value other than 0 or 1. A node-level failure may
also be caused by events independent of a package and its services. Loss
of the heartbeat, or loss of the cluster daemon (cmcld) or other
critical daemons, will cause a node to fail even when its packages and
their services are functioning.
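The package-related conditions above can be summarized as a small decision function. This is a hedged sketch in Python for illustration only: the function name and its parameters are hypothetical and merely mirror the rules stated in the text; Serviceguard does not expose such a function.

```python
def package_event_causes_toc(
    *,
    service_failed: bool = False,
    service_failfast: bool = False,
    node_fail_fast_enabled: bool = False,
    control_script_exit: int = 0,
    control_script_timed_out: bool = False,
) -> bool:
    """Return True if, under the rules described in the text, a package
    event would cause the whole node to fail with a TOC.

    All parameter names are hypothetical stand-ins for the settings and
    events described in the text.
    """
    # Rule 1: a failed service whose failfast parameter is enabled.
    if service_failed and service_failfast:
        return True
    # Rule 2: NODE_FAIL_FAST_ENABLED is YES and the control script timed
    # out or exited with a value other than 0 or 1.
    if node_fail_fast_enabled and (
        control_script_timed_out or control_script_exit not in (0, 1)
    ):
        return True
    return False


if __name__ == "__main__":
    # Control script exits with 2 while NODE_FAIL_FAST_ENABLED is YES.
    print(package_event_causes_toc(node_fail_fast_enabled=True,
                                   control_script_exit=2))
```

The sketch covers only the package-driven cases; node-level causes such as loss of the heartbeat or of cmcld are independent of these settings and are not modelled here.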
In some cases, an attempt is first made to reboot the system prior to the
TOC. If the reboot is able to complete before the safety timer expires,
then the TOC will not take place. In either case, packages are able to
move quickly to another node.