Managing HP Serviceguard A.11.20.10 for Linux, December 2012

SystemB recognizes that it has failed to get the cluster lock and so cannot re-form the cluster. To

release all resources related to Package2 (such as exclusive access to volume group vg02 and

the Package2 IP address) as quickly as possible, SystemB halts (system reset).

NOTE: If AUTOSTART_CMCLD in /etc/rc.config.d/cmcluster ($SGAUTOSTART) is set

to zero, the node will not attempt to join the cluster when it comes back up.

For more information on cluster failover, see the white paper Optimizing Failover Time in a

Serviceguard Environment (version A.11.19 or later) at

http://www.hp.com/go/linux-serviceguard-docs (select “White Papers”). For troubleshooting information, see “Cluster

Re-formations Caused by MEMBER_TIMEOUT Being Set too Low” (page 250).

3.8.2 Responses to Hardware Failures

If a serious system problem occurs, such as a system panic or physical disruption of the SPU's

circuits, Serviceguard recognizes a node failure and transfers the packages currently running on

that node to an adoptive node elsewhere in the cluster. (System multi-node and multi-node packages

do not fail over.)

The new location for each package is determined by that package's configuration file, which lists

primary and alternate nodes for the package. Transfer of a package to another node does not

transfer the program counter. Processes in a transferred package will restart from the beginning.

In order for an application to be expeditiously restarted after a failure, it must be “crash-tolerant”;

that is, all processes in the package must be written so that they can detect such a restart. This is

the same application design required for restart after a normal system crash.
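
As an illustration of the crash-tolerant design described above, the following Python sketch shows one way a process might detect that it is being restarted after an unclean stop. It is not part of Serviceguard: the marker-file approach, the function names start and clean_shutdown, and the file name are all hypothetical, and the sketch assumes the marker is kept on storage that moves with the package.

```python
import os
import tempfile


def start(marker_path):
    """Record that the process is running and report whether the
    previous run ended without a clean shutdown (hypothetical helper)."""
    # A leftover marker means the last run never reached clean_shutdown().
    restarted_after_crash = os.path.exists(marker_path)
    with open(marker_path, "w") as marker:
        marker.write(str(os.getpid()))
    return restarted_after_crash


def clean_shutdown(marker_path):
    """Remove the marker so the next start counts as a normal start."""
    if os.path.exists(marker_path):
        os.remove(marker_path)


if __name__ == "__main__":
    # Demonstration in a throwaway directory; a real package would use
    # a path on its own shared storage (assumption).
    marker = os.path.join(tempfile.mkdtemp(), "app.running")
    print(start(marker))   # first start: no marker yet
    print(start(marker))   # started again without a clean shutdown
    clean_shutdown(marker)
```

A process that finds the marker at startup would run its recovery steps (for example, replaying a log) before accepting new work.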

In the event of a LAN interface failure, bonding provides a backup path for IP messages. If a

heartbeat LAN interface fails and no redundant heartbeat is configured, the node fails with a

reboot. If a monitored data LAN interface fails, the node fails with a reboot only if

node_fail_fast_enabled (described further under “Configuring a Package: Next Steps”

(page 127)) is set to yes for the package. Otherwise any packages using that LAN interface will

be halted and moved to another node if possible (unless the LAN recovers immediately; see “When

a Service or Subnet Fails or Generic Resource or a Dependency is Not Met” (page 56)).
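
The LAN failure rules above can be summarized as a small decision function. This Python sketch is only a model of the paragraph, not Serviceguard code: the function name, the role strings, and the returned labels are hypothetical, and it assumes that bonding has no remaining backup path and that a node with a redundant heartbeat keeps running when one heartbeat interface fails.

```python
def lan_failure_response(interface_role,
                         redundant_heartbeat=False,
                         node_fail_fast_enabled=False,
                         recovers_immediately=False):
    """Return a label for the outcome of a LAN interface failure.

    interface_role is "heartbeat" or "monitored_data" (hypothetical labels).
    """
    if interface_role == "heartbeat":
        # Assumption: a redundant heartbeat lets the node stay in the cluster.
        if redundant_heartbeat:
            return "continue"
        return "node_reboot"
    if interface_role == "monitored_data":
        # node_fail_fast_enabled = yes for the package forces a node reboot.
        if node_fail_fast_enabled:
            return "node_reboot"
        # Otherwise packages move, unless the LAN recovers immediately.
        if recovers_immediately:
            return "continue"
        return "halt_and_move_packages"
    raise ValueError("unknown interface role: %r" % (interface_role,))
```

For example, lan_failure_response("monitored_data") yields "halt_and_move_packages", while the same call with node_fail_fast_enabled=True yields "node_reboot".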

Disk monitoring provides additional protection. You can configure packages to be dependent on

the health of disks, so that when a disk monitor reports a problem, the package can fail over to

another node. See “Creating a Disk Monitor Configuration” (page 191).

Serviceguard does not respond directly to power failures, although a loss of power to an individual

cluster component may appear to Serviceguard like the failure of that component, and will result

in the appropriate switching behavior. Power protection is provided by HP-supported uninterruptible

power supplies (UPS).

3.8.3 Responses to Package and Service Failures

In the default case, the failure of a package, or of a generic resource or service within the

package, causes the package to shut down by running the control script with the stop

parameter; the package is then restarted on an alternate node. A package will also fail

if it is configured to have a dependency on another package and that package fails.

You can modify this default behavior by specifying that the node should halt (system reset) before

the transfer takes place. You do this by setting failfast parameters in the package configuration

file.

In cases in which package shutdown might hang, leaving the node in an unknown state, failfast

options can provide a quick failover, after which the node will be cleaned up on reboot. Remember,

however, that a system reset causes all packages on the node to halt abruptly.
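
The trade-off between the default behavior and failfast can be sketched as follows. This Python model is illustrative only: the function name, the failfast flag, and the dictionary keys are hypothetical and do not correspond to Serviceguard parameters or commands. It simply records which packages are halted in an orderly way and which are halted abruptly.

```python
def package_failure_response(failed_package, packages_on_node, failfast=False):
    """Model the effect of a package failure on the node it runs on."""
    if failfast:
        # The node is reset before the transfer, so every package on the
        # node halts abruptly, not only the one that failed.
        return {
            "node_reset": True,
            "orderly_halt": [],
            "abrupt_halt": list(packages_on_node),
            "transfer": [failed_package],
        }
    # Default: only the failed package is halted, by running the control
    # script with the stop parameter, and then restarted elsewhere.
    return {
        "node_reset": False,
        "orderly_halt": [failed_package],
        "abrupt_halt": [],
        "transfer": [failed_package],
    }
```

With failfast set, packages other than the failed one are also halted abruptly; what happens to them afterward follows the node-failure behavior and their own configuration, which this sketch does not model.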

The settings of the failfast parameters in the package configuration file determine the behavior of

the package and the node in the event of a package or resource failure:
