Managing HP Serviceguard for Linux, Sixth Edition, August 2006

Understanding Serviceguard Software Components
Responses to Failures
Chapter 3
Responses to Hardware Failures
If a serious system problem occurs, such as a system panic or physical
disruption of the SPU's circuits, Serviceguard recognizes a node failure
and transfers the packages currently running on that node to an
adoptive node elsewhere in the cluster. The new location for each
package is determined by that package's configuration file, which lists
primary and alternate nodes for the package. Transfer of a package to
another node does not transfer the program counter: processes in a
transferred package restart from the beginning. For an application to
restart quickly after a failure, it must be “crash-tolerant”; that is, all
processes in the package must be written so that they can detect such a
restart. This is the same application design required for restart after
a normal system crash.
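As an illustration of what "crash-tolerant" can mean in practice, the following is a minimal sketch, not a Serviceguard interface. It assumes the application keeps its state in a directory on storage that moves with the package, and it uses a marker file to detect that the previous instance never shut down cleanly. The names `start_service`, `stop_service`, `recover`, `running.marker`, and the `.partial` suffix are all hypothetical.

```python
import json
import os


def _marker_path(state_dir):
    # Hypothetical marker file; it exists only while the service is running.
    return os.path.join(state_dir, "running.marker")


def recover(state_dir):
    """Application-specific recovery; here, discard half-written files."""
    for name in os.listdir(state_dir):
        if name.endswith(".partial"):
            os.remove(os.path.join(state_dir, name))


def start_service(state_dir):
    """Start the service and report whether this is a restart after a crash."""
    marker = _marker_path(state_dir)
    # A leftover marker means the last instance did not stop cleanly,
    # for example because its node failed and the package restarted here.
    crashed = os.path.exists(marker)
    if crashed:
        recover(state_dir)
    with open(marker, "w") as f:
        json.dump({"pid": os.getpid()}, f)
    return "recovered" if crashed else "clean"


def stop_service(state_dir):
    """Clean shutdown: remove the marker so the next start is a normal one."""
    marker = _marker_path(state_dir)
    if os.path.exists(marker):
        os.remove(marker)
```

Because the process always restarts from the beginning, the check runs at the top of the startup path. The same check covers both a package failover and an ordinary reboot after a crash.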

In the event of a LAN interface failure, bonding provides a backup path
for IP messages. If a heartbeat LAN interface fails and no redundant
heartbeat is configured, the node fails with a TOC (Transfer of Control).
If a monitored data LAN interface fails and has no standby, the node fails
with a TOC only if NODE_FAILFAST_ENABLED (described further in the
“Planning” chapter under “Package Configuration Planning”) is set to YES
for the package.
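The LAN failure rules above can be summarized as a small decision function. This is a simplified model of the text only, not Serviceguard code. The function name and its parameters are hypothetical, and it treats each failed interface as either a heartbeat interface or a monitored data interface, never both.

```python
def node_fails_with_toc(kind, has_backup, node_failfast_enabled=False):
    """Return True if the described rules say the node fails with a TOC.

    kind: "heartbeat" or "data" (the role of the failed LAN interface).
    has_backup: a redundant heartbeat (for "heartbeat") or a standby
        path such as a bonded interface (for "data") is available.
    node_failfast_enabled: NODE_FAILFAST_ENABLED is YES for the package.
    """
    if kind not in ("heartbeat", "data"):
        raise ValueError(f"unknown interface kind: {kind!r}")
    if has_backup:
        # The backup path carries the traffic; the node keeps running.
        return False
    if kind == "heartbeat":
        # Losing the only heartbeat path fails the node.
        return True
    # Monitored data LAN without a standby: the package setting decides.
    return node_failfast_enabled
```

For example, `node_fails_with_toc("data", has_backup=False)` is false unless the fail-fast setting is passed as true. The model says nothing about what happens to the package in that case, since the text above does not cover it.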

Disk monitoring provides additional protection. You can configure
packages to depend on the health of disks, so that when a disk
monitor reports a problem, the package can fail over to another node.

Serviceguard does not respond directly to power failures. However, a loss
of power to an individual cluster component may appear to Serviceguard
as the failure of that component, and results in the appropriate
switching behavior. Power protection is provided by HP-supported
uninterruptible power supplies (UPS), such as HP PowerTrust.