Managing HP Serviceguard for Linux, Eighth Edition, March 2008

Understanding Serviceguard Software Components
Responses to Failures
Responses to Hardware Failures
If a serious system problem occurs, such as a system panic or physical
disruption of the SPU's circuits, Serviceguard recognizes a node failure
and transfers the packages currently running on that node to an
adoptive node elsewhere in the cluster. (System multi-node and
multi-node packages do not fail over.)
The new location for each package is determined by that package's
configuration file, which lists primary and alternate nodes for the
package. Transfer of a package to another node does not transfer the
program counter. Processes in a transferred package will restart from
the beginning. For an application to restart quickly after a failure, it
must be “crash-tolerant”; that is, all processes in the package must be
written so that they can detect such a restart. This is
the same application design required for restart after a normal system
crash.
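
As an illustration of crash-tolerant design, the following Python sketch shows one way a process might detect that it is being restarted and resume where the previous run stopped. It is not part of Serviceguard. The checkpoint file, the function names, and the item-counting workload are all hypothetical, and the sketch assumes the checkpoint is kept on storage that remains accessible to the package on the adoptive node.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so a crash never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_path, path)  # atomic rename within one file system


def load_checkpoint(path):
    """Return the last saved state, or None if no earlier run left one."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except FileNotFoundError:
        return None


def start(path, total_items):
    """Start from the beginning of the program, skipping completed work.

    Returns True if an earlier run was detected (a restart), else False.
    """
    state = load_checkpoint(path)
    restarted = state is not None
    if state is None:
        state = {"next_item": 0}
    for item in range(state["next_item"], total_items):
        # Hypothetical unit of work would be processed here.
        save_checkpoint(path, {"next_item": item + 1})
    return restarted
```

The process always begins execution at the top, as it would after failover, and relies only on the saved state to decide how much work remains. Writing to a temporary file and renaming it is one common way to keep the checkpoint consistent if the node fails during the write.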
In the event of a LAN interface failure, bonding provides a backup path
for IP messages. If a heartbeat LAN interface fails and no redundant
heartbeat is configured, the node fails with a reboot. If a monitored data
LAN interface fails, the node fails with a reboot only if
node_fail_fast_enabled (described further under “Configuring a
Package: Next Steps” starting on page 139) is set to yes for the package.
Otherwise any packages using that LAN interface will be halted and
moved to another node if possible (unless the LAN recovers immediately;
see “When a Service or Subnet Fails, or a Dependency is Not Met” on
page 68).
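
The LAN failure rules above can be summarized as a small decision function. The Python sketch below is only a model of the behavior described in this section, not Serviceguard code. The function name, argument names, and return strings are hypothetical. It assumes that bonding has no backup path left for the failed interface, and it infers that a node with a redundant heartbeat continues to run when one heartbeat interface fails.

```python
def lan_failure_response(interface_role,
                         redundant_heartbeat=False,
                         node_fail_fast_enabled=False,
                         recovers_immediately=False):
    """Model the response to a LAN interface failure (illustrative only).

    interface_role is "heartbeat" or "monitored_data" (hypothetical labels).
    """
    if interface_role == "heartbeat":
        if redundant_heartbeat:
            # Inferred: another heartbeat path remains, so the node stays up.
            return "continue"
        return "reboot_node"

    if interface_role == "monitored_data":
        if node_fail_fast_enabled:
            # node_fail_fast_enabled is set to yes for the package.
            return "reboot_node"
        if recovers_immediately:
            return "continue"
        # Packages using the interface are halted and moved if possible.
        return "halt_and_move_packages"

    raise ValueError("unknown interface role: %r" % (interface_role,))
```

For example, `lan_failure_response("monitored_data")` returns `"halt_and_move_packages"`, while the same call with `node_fail_fast_enabled=True` returns `"reboot_node"`.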
Disk monitoring provides additional protection. You can configure
packages to be dependent on the health of disks, so that when a disk
monitor reports a problem, the package can fail over to another node. See
“Creating a Disk Monitor Configuration” on page 239.
Serviceguard does not respond directly to power failures, although a loss
of power to an individual cluster component may appear to Serviceguard
like the failure of that component, and will result in the appropriate
switching behavior. Power protection is provided by HP-supported
uninterruptible power supplies (UPS).