Managing HP Serviceguard A.11.20.20 for Linux, May 2013

3.8.1.1.1 Example
Situation. Assume a two-node cluster, with Package1 running on SystemA and Package2 running
on SystemB. Volume group vg01 is exclusively activated on SystemA; volume group vg02 is
exclusively activated on SystemB. Package IP addresses are assigned to SystemA and SystemB
respectively.
Failure. Only one LAN has been configured for both heartbeat and data traffic. During the course
of operations, heavy application traffic monopolizes the bandwidth of the network, preventing
heartbeat packets from getting through.
Since SystemA does not receive heartbeat messages from SystemB, SystemA attempts to re-form
as a one-node cluster. Likewise, since SystemB does not receive heartbeat messages from
SystemA, SystemB also attempts to re-form as a one-node cluster. During the election protocol,
each node votes for itself, giving both nodes 50 percent of the vote. Because both nodes have 50
percent of the vote, both nodes now vie for the cluster lock. Only one node will get the lock.
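The 50-percent deadlock described above can be pictured with a small shell sketch. This is an illustrative arithmetic model only, not Serviceguard code; the variable names and messages are assumptions:

```shell
# Hypothetical sketch of the election arithmetic in a two-node cluster:
# each node votes for itself, so each holds exactly 50% of the votes.
# A node may re-form the cluster on its own only with MORE than half
# the votes, so neither side has quorum and the cluster lock breaks the tie.
total_nodes=2
my_votes=1        # this node votes only for itself
if [ $((my_votes * 2)) -gt "$total_nodes" ]; then
    echo "quorum: re-form without the lock"
else
    echo "tie: contend for the cluster lock"
fi
```

With any even split of votes the comparison fails, which is why a two-node cluster requires a cluster lock as tie-breaker.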
Outcome. Assume SystemA gets the cluster lock. SystemA re-forms as a one-node cluster. After
re-formation, SystemA makes sure that all packages configured to run on a node in the cluster
are running. When SystemA discovers that Package2 is not running in the cluster, it tries to
start Package2 if Package2 is configured to run on SystemA.
SystemB recognizes that it has failed to get the cluster lock and so cannot re-form the cluster. To
release all resources related to Package2 (such as exclusive access to volume group vg02 and
the Package2 IP address) as quickly as possible, SystemB halts (system reset).
NOTE: If AUTOSTART_CMCLD in /etc/rc.config.d/cmcluster ($SGAUTOSTART) is set
to zero, the node will not attempt to join the cluster when it comes back up.
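The setting the NOTE refers to lives in the shell-sourced startup file named in the text. A minimal excerpt, assuming the file otherwise keeps its distributed contents:

```shell
# Excerpt from /etc/rc.config.d/cmcluster (path taken from the NOTE above).
# With AUTOSTART_CMCLD set to 0, this node does not try to rejoin the
# cluster at boot; an operator must start it manually later.
AUTOSTART_CMCLD=0
```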
For more information on cluster failover, see the white paper Optimizing Failover Time in a
Serviceguard Environment (version A.11.19 or later) at
http://www.hp.com/go/linux-serviceguard-docs (select “White Papers”). For troubleshooting information, see “Cluster
Re-formations Caused by MEMBER_TIMEOUT Being Set too Low” (page 258).
3.8.2 Responses to Hardware Failures
If a serious system problem occurs, such as a system panic or physical disruption of the SPU's
circuits, Serviceguard recognizes a node failure and transfers the packages currently running on
that node to an adoptive node elsewhere in the cluster. (System multi-node and multi-node packages
do not fail over.)
The new location for each package is determined by that package's configuration file, which lists
primary and alternate nodes for the package. Transfer of a package to another node does not
transfer the program counter. Processes in a transferred package will restart from the beginning.
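The primary/alternate node list mentioned above can be sketched as follows. The attribute names follow Serviceguard's modular package configuration format; the package and node names simply reuse the example earlier in this chapter:

```
# Sketch of the failover-related lines in a package configuration file.
# The first node_name listed is the primary node; later entries are
# adoptive (alternate) nodes, tried in order.
package_name    Package1
node_name       SystemA     # primary node
node_name       SystemB     # adoptive node
auto_run        yes         # start and fail over the package automatically
```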
In order for an application to be expeditiously restarted after a failure, it must be “crash-tolerant”;
that is, all processes in the package must be written so that they can detect such a restart. This is
the same application design required for restart after a normal system crash.
In the event of a LAN interface failure, bonding provides a backup path for IP messages. If a
heartbeat LAN interface fails and no redundant heartbeat is configured, the node fails with a
reboot. If a monitored data LAN interface fails, the node fails with a reboot only if
node_fail_fast_enabled (described further under “Configuring a Package: Next Steps”
(page 132)) is set to yes for the package. Otherwise any packages using that LAN interface will
be halted and moved to another node if possible (unless the LAN recovers immediately; see “When
a Service or Subnet Fails or Generic Resource or a Dependency is Not Met” (page 59)).
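As a hedged illustration of the bonding backup path mentioned above, a Red Hat-style interface file for an active-backup bond might look like the fragment below. The device name, address, and option values are assumptions for illustration, not values from this manual:

```shell
# Hypothetical /etc/sysconfig/network-scripts/ifcfg-bond0
# An active-backup bond keeps IP traffic flowing if one slave NIC fails,
# providing the redundant LAN path described in the text.
DEVICE=bond0
BONDING_OPTS="mode=active-backup miimon=100"
IPADDR=192.168.1.10
NETMASK=255.255.255.0
ONBOOT=yes
```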
Disk monitoring provides additional protection. You can configure packages to be dependent on
the health of disks, so that when a disk monitor reports a problem, the package can fail over to
another node. See “Creating a Disk Monitor Configuration” (page 198).
Serviceguard does not respond directly to power failures, although a loss of power to an individual
cluster component may appear to Serviceguard like the failure of that component, and will result
76 Understanding Serviceguard Software Components