Managing HP Serviceguard for Linux, Sixth Edition, August 2006

Understanding Serviceguard Software Components

Responses to Failures

Chapter 3

Responses to Hardware Failures

If a serious system problem occurs, such as a system panic or physical
disruption of the SPU's circuits, Serviceguard recognizes a node failure
and transfers the packages currently running on that node to an
adoptive node elsewhere in the cluster. The new location for each
package is determined by that package's configuration file, which lists
primary and alternate nodes for the package. Transferring a package to
another node does not transfer the program counter: processes in a
transferred package restart from the beginning. For an application to
restart quickly after a failure, it must be “crash-tolerant”; that is,
all processes in the package must be written so that they can detect
such a restart. This is the same application design required for
restart after a normal system crash.
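As an illustration of what "detecting a restart" can mean in practice, the following is a minimal sketch of a crash-tolerant worker, assuming the application records its progress in a checkpoint file on storage that the adoptive node can also reach. The file name `progress.json` and the functions `load_checkpoint`, `save_checkpoint`, and `run` are hypothetical and are not part of Serviceguard.

```python
import json
import os
import tempfile

CHECKPOINT = "progress.json"  # hypothetical checkpoint file on shared storage


def load_checkpoint(path):
    """Return the index of the next item to process, or 0 on a fresh start."""
    try:
        with open(path) as f:
            return json.load(f)["next_item"]
    except (FileNotFoundError, ValueError, KeyError, TypeError):
        # No usable checkpoint: treat this as a first start.
        return 0


def save_checkpoint(path, next_item):
    """Write the checkpoint to a temporary file, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"next_item": next_item}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # the old checkpoint stays valid until the rename


def run(items, path, handle, stop_after=None):
    """Process items from the checkpoint onward.

    Returns True if this run resumed after an earlier, interrupted run.
    stop_after is only for demonstration: it simulates a crash after
    that many items have been handled.
    """
    start = load_checkpoint(path)
    restarted = start > 0
    done = 0
    for i in range(start, len(items)):
        if stop_after is not None and done >= stop_after:
            break
        handle(items[i])
        save_checkpoint(path, i + 1)
        done += 1
    return restarted
```

Because the checkpoint is written after each item is handled, a crash between the two steps causes that item to be handled again on restart, so `handle` should be safe to repeat.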

In the event of a LAN interface failure, bonding provides a backup path
for IP messages. If a heartbeat LAN interface fails and no redundant
heartbeat is configured, the node fails with a TOC (Transfer of
Control). If a monitored data LAN interface fails and no standby is
configured, the node fails with a TOC only if NODE_FAILFAST_ENABLED
(described further in the “Planning” chapter under “Package
Configuration Planning”) is set to YES for the package.
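The LAN failure rules above can be read as a small decision table. The sketch below encodes only what this passage states. The `Outcome` names and the function `lan_failure_outcome` are hypothetical, and treating a bonded backup, a redundant heartbeat, and a standby interface as one `has_backup` flag is a simplification made for this example.

```python
from enum import Enum


class Outcome(Enum):
    CONTINUES = "a redundant path carries the traffic"
    NODE_TOC = "node fails with a TOC"
    NODE_STAYS_UP = "node does not TOC"


def lan_failure_outcome(carries_heartbeat, has_backup, node_failfast_enabled=False):
    """Return the node-level outcome of a single LAN interface failure.

    carries_heartbeat     -- True if the failed interface is a heartbeat LAN
    has_backup            -- True if a redundant heartbeat or standby exists
    node_failfast_enabled -- NODE_FAILFAST_ENABLED is YES for the package
                             (only relevant to monitored data LANs)
    """
    if has_backup:
        return Outcome.CONTINUES
    if carries_heartbeat:
        # No redundant heartbeat configured: the node fails.
        return Outcome.NODE_TOC
    if node_failfast_enabled:
        # Monitored data LAN without a standby, failfast set to YES.
        return Outcome.NODE_TOC
    # This passage does not say what happens to the package in this case.
    return Outcome.NODE_STAYS_UP
```

For example, `lan_failure_outcome(carries_heartbeat=False, has_backup=False, node_failfast_enabled=True)` returns `Outcome.NODE_TOC`, matching the last rule in the paragraph.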

Disk monitoring provides additional protection. You can configure
packages to be dependent on the health of disks, so that when a disk
monitor reports a problem, the package can fail over to another node.
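To make the idea of a monitor "reporting a problem" concrete, here is a generic sketch of a disk health check that signals failure through its exit code. It does not show how Serviceguard's own disk monitor works, which this passage does not describe. The functions `disk_is_readable`, `check_disks`, and `main` are hypothetical, and a plain read like this one may be satisfied from the operating system's cache rather than from the disk itself.

```python
import sys


def disk_is_readable(path, nbytes=512):
    """Return True if the first nbytes of path can be read without an error."""
    try:
        with open(path, "rb") as f:
            f.read(nbytes)
        return True
    except OSError:
        return False


def check_disks(paths):
    """Return the list of paths that failed the read check."""
    return [p for p in paths if not disk_is_readable(p)]


def main(argv):
    """Check each path in argv; return 1 if any check fails, else 0."""
    failed = check_disks(argv)
    if failed:
        print("disk check failed:", ", ".join(failed), file=sys.stderr)
        return 1
    return 0
```

A wrapper script could call `sys.exit(main(sys.argv[1:]))` so that whatever supervises the monitor sees a non-zero exit status when a disk cannot be read.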

Serviceguard does not respond directly to power failures. However, a
loss of power to an individual cluster component may appear to
Serviceguard as the failure of that component, and results in the
appropriate switching behavior. Power protection is provided by
HP-supported uninterruptible power supplies (UPS), such as HP
PowerTrust.