In the event of a failure in a concurrent deployment, the following sequence of events occurs:
1. Serviceguard detects the failure first, proceeds to resolve quorum, and removes the failed
node from the cluster (by resetting it) before the Red Hat cluster detects the failure.
2. While the failed node is booting up, the Red Hat cluster detects the failure, gains quorum,
and requests HP iLO to fence the failed node, which resets the node a second time (a
sample fencing declaration is sketched after this list). While the node is being fenced by
HP iLO, GFS on all the nodes in the cluster is suspended.
3. Application startup (via Serviceguard package failover) on an alternate node will hang, if
GFS attempts to acquire new locks as part of the application startup, until the node is
successfully fenced. Once HP iLO completes fencing, GFS/DLM performs recovery and
GFS operations resume.
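For illustration, the sketch below shows how HP iLO fencing is typically declared in /etc/cluster/cluster.conf on a RHEL5 cluster. The node name, iLO hostname, and credentials are placeholder assumptions, not values taken from this configuration.

<!-- Illustrative cluster.conf fragment: HP iLO fencing for one cluster node -->
<clusternode name="node1" nodeid="1" votes="1">
  <fence>
    <method name="1">
      <!-- references the fence_ilo device defined below -->
      <device name="node1_ilo"/>
    </method>
  </fence>
</clusternode>
<fencedevices>
  <!-- hostname, login, and passwd are placeholders for the node's iLO interface -->
  <fencedevice agent="fence_ilo" name="node1_ilo" hostname="node1-ilo.example.com"
      login="Administrator" passwd="password"/>
</fencedevices>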
Heartbeat timeout settings
For Serviceguard, it is recommended to use the default heartbeat and node timeout values. For the
Red Hat cluster, it is recommended to delay failure detection by increasing the totem token value
(a user-configurable parameter in /etc/cluster/cluster.conf) from the default 5 seconds to 28
seconds. In the concurrent deployment, Serviceguard detects the failure and removes the failed
node from the cluster (the node resets itself) even before the Red Hat cluster detects that the node
has failed.
The XML totem token tag is set to 28 seconds in /etc/cluster/cluster.conf as follows:
<totem token="28000"/> where the value is the number of milliseconds; the default is 5000 (5 seconds).
This timeout specifies how long, in milliseconds, the cluster waits without receiving a token before
declaring token loss. This is the time spent detecting the failure of a processor in the current configuration.
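For reference, a minimal sketch of where the totem tag sits within /etc/cluster/cluster.conf is shown below; the cluster name is an illustrative assumption and the remaining sections are elided.

<?xml version="1.0"?>
<cluster name="sg_gfs_cluster" config_version="1">
  <!-- Raise the Red Hat cluster failure-detection window from the 5000 ms default to 28 seconds -->
  <totem token="28000"/>
  <!-- cman, clusternodes, fencedevices, and rm sections follow -->
</cluster>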
GFS File System Freeze
Serviceguard package startup – either due to failover or a manual startup – during the period
between failure detection and the completion of fencing will hang if GFS attempts to acquire new
locks at the time of application startup (e.g., an application started in the package control script
attempts to open and write to the same file that was locked by the failed node).
To avoid a package startup timeout, either set run_script_timeout to a value greater than
100 seconds or leave the default setting of no timeout.
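As an illustration, the excerpt below sketches how this might look in a modular Serviceguard package configuration file; the package name and the 120-second value are assumptions chosen only to satisfy the greater-than-100-seconds guideline above.

# Illustrative Serviceguard package configuration excerpt
package_name          gfs_app_pkg
# Allow the run script to outlast the detection-to-fencing window (> 100 seconds)...
run_script_timeout    120
# ...or keep the default of no timeout:
# run_script_timeout  no_timeout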
Majority node failures
In the concurrent deployment of Serviceguard and Red Hat GFS, with or without a qdisk
configuration, a failure of more than half the cluster members cannot be sustained; it results in the
entire cluster going down. In Serviceguard for Linux, a failure of more than half the members causes
the remaining nodes to reset themselves, bringing down the complete cluster. Hence, Red Hat
cluster configurations with a quorum disk or “heavily weighted voting” nodes (asymmetric
configurations) cannot be used to survive the failure of more than half the nodes when deployed
with Serviceguard for Linux. It is important to note that such a scenario would require multiple
nodes failing “simultaneously,” which is very rare. Care should also be taken to ensure that the
configuration does not contain the potential for losing more than half of the membership from a
single failure.
Such configurations include those in which the majority of nodes are partitions within a single
hardware cabinet. This implies that when there are two cabinets, the partitions must be divided
symmetrically between the cabinets.