
Fencing protects data integrity by preventing the failed node from writing to shared storage.
Red Hat Cluster supports various mechanisms, but the only one supported in conjunction with
Serviceguard is Integrated Lights Out (iLO) fencing. Using that mechanism, a message is sent to
the iLO of a server to restart that server. Use of iLO is less costly and easier to manage than most
other methods.
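For illustration, the excerpt below sketches how an iLO fence device might be declared in /etc/cluster/cluster.conf for one node. The host name, login, and password are placeholders, and the exact attribute names should be verified against the documentation for the fence_ilo agent shipped with the installed release.

   <clusternode name="node1" nodeid="1" votes="1">
      <fence>
         <method name="1">
            <device name="node1_ilo"/>
         </method>
      </fence>
   </clusternode>
   ...
   <fencedevices>
      <fencedevice agent="fence_ilo" name="node1_ilo"
                   hostname="node1-ilo.example.com" login="admin" passwd="password"/>
   </fencedevices>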
In the event of an “unequal sized partition” (i.e., a network failure that creates partitions with
different numbers of members) in Red Hat Cluster, the partition holding the majority of votes has
quorum and forms a new cluster. The failed nodes (i.e., the partition that lost quorum) are
“fenced” by the quorate partition, i.e., removed from the cluster.
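As a hypothetical illustration, consider a five-node cluster in which each node carries one vote: the total is 5 votes, and quorum requires a majority, i.e., floor(5/2) + 1 = 3 votes. If a network failure splits the cluster into a three-node partition and a two-node partition, the three-node partition retains quorum, fences the two minority nodes, and continues cluster operations.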
2-node cluster
Red Hat Cluster allows creation of a two-node cluster through an exception to the quorum rule
(which otherwise requires a majority of votes): a single node is considered enough to establish
quorum. This exception is enabled for a 2-node cluster via the special two_node="1" setting in
the cluster configuration file. If a node fails, the surviving node fences the failed node and
proceeds to form a single-node cluster. In the event of a network partition, each node, having
quorum under this exception, will attempt to fence the other, and the quickest (first to fence the
other) wins.
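The relevant portion of /etc/cluster/cluster.conf for such a configuration might look like the following sketch; the cluster and node names are placeholders. Note that two_node="1" requires expected_votes="1".

   <cluster name="sgcluster" config_version="1">
      <cman two_node="1" expected_votes="1"/>
      <clusternodes>
         <clusternode name="node1" nodeid="1" votes="1"/>
         <clusternode name="node2" nodeid="2" votes="1"/>
      </clusternodes>
   </cluster>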
Note that with per-node power management such as HP iLO (i.e., where the fence device is not
shared between cluster nodes), it is possible for both nodes to fence each other simultaneously,
bringing down the entire cluster.
Equal sized partitions
In the event of an “equal sized partition” (where the partitions created have the same number of
members), neither partition gains quorum. In Red Hat Cluster, unless quorum is gained, neither
partition is allowed to fence the other from the cluster; this freezes all cluster operations (i.e.,
prevents application availability). In such cases operator intervention is required to manually reset
the nodes in both partitions and restart the cluster.
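In this state, the loss of quorum can be confirmed from any surviving node. The commands below are one way to inspect membership and vote counts on RHEL5 (output fields vary slightly between releases):

   # cman_tool status     (reports expected votes, total votes, and whether quorum is held)
   # cman_tool nodes      (lists the cluster members and their state as seen by this node)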
Majority node failures
Similarly, a “majority node failure” (i.e., losing enough nodes to break quorum) results in loss of
quorum and suspension of all cluster operations, preventing application startups. In Red Hat
Cluster, losing half or more of the members is referred to as a majority node failure and results in
loss of quorum. In such cases operator intervention is required to manually reset the surviving
nodes and restart the cluster.
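A typical manual recovery, sketched here under the assumption that the standard RHEL5 init scripts are used, is to reset the affected nodes and then restart the cluster services on each node in order:

   # service cman start
   # service qdiskd start    (only if a quorum disk is configured; see the next section)
   # service clvmd start
   # service gfs start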
Quorum disk to bolster quorum
The quorum disk (qdisk) was re-introduced in Red Hat Cluster to bolster the existing quorum
mechanism without resorting to asymmetric cluster configurations that use “heavily-weighted
voting” nodes.
The qdisk, with properly configured heuristics, addresses the following (a sample configuration
follows the list):
1. In the event of a network partition in a 2-node cluster, it is used to decide which member
wins, preventing the nodes from simultaneously fencing each other and bringing down the
entire cluster.
2. It allows cluster operations to continue even after a majority node failure, without the need
for manual intervention.
3. In an equal sized partition, it is used to decide which partition wins quorum, allowing
cluster operations to continue without the need for manual intervention.
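As an illustration, a quorum disk with a single ping-based heuristic might be declared in /etc/cluster/cluster.conf along the following lines. The label, intervals, scores, and router address are placeholder values and should be checked against the qdisk(5) man page for the installed release.

   <quorumd interval="1" tko="10" votes="1" label="sg_qdisk">
      <heuristic program="ping -c1 -w1 192.168.1.1" score="1" interval="2"/>
   </quorumd>

The number of votes assigned to the quorum disk determines how many node failures the cluster can tolerate: one vote is sufficient for the two-node case, while larger clusters commonly assign the qdisk enough votes that a single node plus the quorum disk still holds quorum.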