Managing HP Serviceguard for Linux, Sixth Edition, August 2006

Understanding Serviceguard Software Components

Responses to Failures


Responses to Package and Service Failures

In the default case, the failure of a package, or of a service within a
package, causes Serviceguard to shut the package down by running the
control script with the 'stop' parameter and then to restart the package
on an alternate node.

You can modify this default behavior by specifying that the node should
crash (TOC) before the package is transferred. When this behavior is
specified, HP Serviceguard first attempts to reboot the system. If there
is enough time to flush the buffers in the buffer cache, the reboot
succeeds and no TOC takes place. Either way, the system is guaranteed to
come down within a predetermined number of seconds.

In cases where package shutdown might hang, leaving the node in an
unknown state, a failfast option can provide a quick failover, after
which the node is cleaned up on reboot. Remember, however, that when the
node crashes, all packages on the node are halted abruptly.

The settings of the node and service failfast parameters during package
configuration determine the exact behavior of the package and the node
in the event of failure. The section on “Package Configuration
Parameters” in the “Planning” chapter contains details on how to choose
an appropriate failover behavior.
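
The default and failfast responses described in this section can be
summarized as a small decision function. This is a minimal sketch for
illustration only: the names Action and plan_failure_response, and both
function arguments, are hypothetical and are not Serviceguard
configuration parameters or commands.

```python
from enum import Enum
from typing import List


class Action(Enum):
    RUN_CONTROL_SCRIPT_STOP = "run control script with 'stop'"
    RESTART_ON_ALTERNATE_NODE = "restart package on alternate node"
    REBOOT_NODE = "reboot node"
    TOC_NODE = "TOC node"


def plan_failure_response(failfast: bool,
                          flush_completes_in_time: bool = True) -> List[Action]:
    """Return the ordered actions for a package or service failure.

    Both arguments are hypothetical inputs for this sketch.
    """
    if not failfast:
        # Default case: orderly package halt, then failover.
        return [Action.RUN_CONTROL_SCRIPT_STOP,
                Action.RESTART_ON_ALTERNATE_NODE]
    # Failfast case: the node comes down first, by reboot if the buffer
    # cache can be flushed in time, otherwise by TOC.
    node_down = (Action.REBOOT_NODE if flush_completes_in_time
                 else Action.TOC_NODE)
    return [node_down, Action.RESTART_ON_ALTERNATE_NODE]
```

In the failfast branch the control script 'stop' step is absent, which
reflects the warning above that all packages on a crashing node are
halted abruptly.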

Service Restarts

You can allow a service to restart locally following a failure. To do
this, specify the number of restarts for each service in the package
control script. When a service starts, the variable RESTART_COUNT is set
in the service's environment. As it executes, the service can examine
this variable to see whether it has been restarted after a failure and,
if so, take appropriate action such as cleanup.
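
As an illustration, a service might check RESTART_COUNT when it starts.
The sketch below is one possible approach, not a documented interface.
It assumes that RESTART_COUNT holds a decimal integer and that 0 (or a
missing value) means a first start; the function names are hypothetical.

```python
import os


def restart_count(environ=os.environ) -> int:
    """Read RESTART_COUNT from the given environment mapping.

    Assumption for this sketch: a missing or non-numeric value is
    treated as 0, meaning the service has not been restarted.
    """
    raw = environ.get("RESTART_COUNT", "0")
    try:
        return int(raw)
    except ValueError:
        return 0


def clean_up_after_failure() -> None:
    # Hypothetical placeholder: remove stale lock files, temp data, etc.
    print("cleaning up after previous failure")


def start_service(environ=os.environ) -> bool:
    """Run cleanup if this is a restart; return True if cleanup ran."""
    restarted = restart_count(environ) > 0
    if restarted:
        clean_up_after_failure()
    # The service's real work would start here.
    return restarted
```

Passing the environment as an argument keeps the check easy to exercise
outside the cluster, for example start_service({"RESTART_COUNT": "1"}).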

Network Communication Failure

An important element in the cluster is the health of the network itself.
As it continuously monitors the cluster, each node listens for heartbeat
messages from the other nodes, which confirm that all nodes are able to
communicate with each other. If a node does not hear these messages
within the configured amount of time, a node timeout occurs. This
results in a cluster re-formation and, if heartbeat messages are still
not received, a later TOC.
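
The timeout check described above can be sketched as follows. This
illustrates the general heartbeat-timeout technique only; the class
HeartbeatMonitor, its methods, and the timeout value are hypothetical
and do not reflect Serviceguard internals or its configuration
parameters.

```python
import time
from typing import Dict, List, Optional


class HeartbeatMonitor:
    """Track when a heartbeat was last heard from each peer node."""

    def __init__(self, peers: List[str], timeout_s: float,
                 now: Optional[float] = None) -> None:
        start = time.monotonic() if now is None else now
        self.timeout_s = timeout_s
        # Every peer is treated as heard from at monitor start-up.
        self.last_heard: Dict[str, float] = {peer: start for peer in peers}

    def record_heartbeat(self, peer: str,
                         now: Optional[float] = None) -> None:
        """Note that a heartbeat from `peer` has just arrived."""
        self.last_heard[peer] = time.monotonic() if now is None else now

    def timed_out_peers(self, now: Optional[float] = None) -> List[str]:
        """Return peers silent for longer than the configured timeout."""
        current = time.monotonic() if now is None else now
        return sorted(peer for peer, heard in self.last_heard.items()
                      if current - heard > self.timeout_s)
```

A non-empty result from timed_out_peers corresponds to the node timeout
in the text; what happens next (cluster re-formation, then a TOC) is
decided by the cluster software and is not modeled here.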