
Fencing protects data integrity by preventing the failed node from writing to shared storage.
Red Hat Cluster supports various mechanisms, but the only one supported in conjunction with
Serviceguard is Integrated Lights Out (iLO) fencing. Using that mechanism, a message is sent to
the iLO of a server to restart that server. Use of iLO is less costly and easier to manage than most
other methods.
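For illustration, the excerpt below sketches how an iLO fence device might be declared in /etc/cluster/cluster.conf for one node. The host name, login, and password are placeholders, and the exact attribute names should be verified against the documentation for the fence_ilo agent shipped with the installed release.

   <clusternode name="node1" nodeid="1" votes="1">
      <fence>
         <method name="1">
            <device name="node1_ilo"/>
         </method>
      </fence>
   </clusternode>
   ...
   <fencedevices>
      <fencedevice agent="fence_ilo" name="node1_ilo"
                   hostname="node1-ilo.example.com" login="admin" passwd="password"/>
   </fencedevices>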
In the event of an “unequal sized partition” (i.e., a network failure that creates partitions with
different numbers of members) in Red Hat Cluster, the partition holding the majority of votes has
quorum and forms a new cluster. The failed nodes (i.e., the partition that lost quorum) are
“fenced” by the quorate partition, i.e., removed from the cluster.
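As a hypothetical illustration, consider a five-node cluster in which each node carries one vote: the total is 5 votes, and quorum requires a majority, i.e., floor(5/2) + 1 = 3 votes. If a network failure splits the cluster into a three-node partition and a two-node partition, the three-node partition retains quorum, fences the two minority nodes, and continues cluster operations.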
2-node cluster
Red Hat Cluster allows creation of a two-node cluster through an exception to the quorum rule
(which otherwise requires a majority of votes): a single node is considered enough to establish
quorum. This exception is enabled for a 2-node cluster via the special two_node="1" setting in
the cluster configuration file. If a node fails, the surviving node fences the failed node and
proceeds to form a single-node cluster. In the event of a network partition, each node, having
quorum under this exception, will attempt to fence the other, and the quickest (first to fence the
other) wins.
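The relevant portion of /etc/cluster/cluster.conf for such a configuration might look like the following sketch; the cluster and node names are placeholders. Note that two_node="1" requires expected_votes="1".

   <cluster name="sgcluster" config_version="1">
      <cman two_node="1" expected_votes="1"/>
      <clusternodes>
         <clusternode name="node1" nodeid="1" votes="1"/>
         <clusternode name="node2" nodeid="2" votes="1"/>
      </clusternodes>
   </cluster>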
Note that with per-node power management such as HP iLO (i.e., where the fence device is not
shared between cluster nodes), it is possible for both nodes to fence each other simultaneously,
bringing down the entire cluster.
Equal sized partitions
In the event of an “equal sized partition” (where the partitions created have the same number of
members), neither partition gains quorum. In Red Hat Cluster, unless quorum is gained, neither
partition is allowed to fence the other from the cluster; this freezes all cluster operations (i.e.,
prevents application availability). In such cases operator intervention is required to manually reset
the nodes in both partitions and restart the cluster.
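In this state, the loss of quorum can be confirmed from any surviving node. The commands below are one way to inspect membership and vote counts on RHEL5 (output fields vary slightly between releases):

   # cman_tool status     (reports expected votes, total votes, and whether quorum is held)
   # cman_tool nodes      (lists the cluster members and their state as seen by this node)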
Majority node failures
Similarly, a “majority node failure” (i.e., losing enough nodes to break quorum) results in loss of
quorum and suspension of all cluster operations, preventing application startups. In Red Hat
Cluster, losing half or more of the members is referred to as a majority node failure and results in
loss of quorum. In such cases operator intervention is required to manually reset the surviving
nodes and restart the cluster.
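A typical manual recovery, sketched here under the assumption that the standard RHEL5 init scripts are used, is to reset the affected nodes and then restart the cluster services on each node in order:

   # service cman start
   # service qdiskd start    (only if a quorum disk is configured; see the next section)
   # service clvmd start
   # service gfs start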
Quorum disk to bolster quorum
The quorum disk (qdisk) was re-introduced in Red Hat Cluster to bolster the existing quorum
mechanism without resorting to asymmetric cluster configurations that use “heavily-weighted
voting” nodes.
The qdisk, with properly configured heuristics, addresses the following (a sample configuration
follows the list):
1. In the event of a network partition in a 2-node cluster, it is used to decide which member
wins, preventing the nodes from simultaneously fencing each other and bringing down the
entire cluster.
2. It allows cluster operations to continue even after a majority node failure, without the need
for manual intervention.
3. In an equal sized partition, it is used to decide which partition wins quorum, allowing
cluster operations to continue without the need for manual intervention.
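As an illustration, a quorum disk with a single ping-based heuristic might be declared in /etc/cluster/cluster.conf along the following lines. The label, intervals, scores, and router address are placeholder values and should be checked against the qdisk(5) man page for the installed release.

   <quorumd interval="1" tko="10" votes="1" label="sg_qdisk">
      <heuristic program="ping -c1 -w1 192.168.1.1" score="1" interval="2"/>
   </quorumd>

The number of votes assigned to the quorum disk determines how many node failures the cluster can tolerate: one vote is sufficient for the two-node case, while larger clusters commonly assign the qdisk enough votes that a single node plus the quorum disk still holds quorum.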