In the concurrent deployment of Serviceguard and Red Hat GFS, a failure of exactly half of the
cluster members can be sustained, provided the qdisk is configured. When exactly half of the
members fail, the qdisk vote breaks the tie, allowing the surviving members to gain quorum and
proceed to form the cluster. Without a qdisk, the Red Hat cluster will not gain quorum and will
therefore disallow all further GFS operations until the operator intervenes. It is recommended
that, in a concurrent deployment of Serviceguard for Linux and Red Hat GFS, the qdisk be
configured to act as a tie breaker in the event of a failure of exactly half the nodes. For instance,
this will allow a 4-node cluster to continue operating in the event of a 2-node failure.
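To make the vote arithmetic concrete, the following is a minimal sketch (a hypothetical illustration, not taken from the product documentation), assuming one vote per node, a single-vote qdisk, and the usual simple-majority quorum rule:

    def has_quorum(surviving_votes, expected_votes):
        # Quorum requires a simple majority of all expected votes.
        return surviving_votes > expected_votes // 2

    # 4-node cluster, one vote per node, plus one qdisk vote.
    expected = 4 + 1                      # 5 expected votes, quorum = 3

    # Two nodes fail: the two survivors alone hold only 2 votes ...
    print(has_quorum(2, expected))        # False without the qdisk vote
    # ... but the qdisk vote breaks the tie and quorum is retained.
    print(has_quorum(2 + 1, expected))    # True with the qdisk vote

If the qdisk were not configured, the expected votes would be 4 and the two survivors would sit exactly at the tie, which is the case that requires operator intervention.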
Network partition in a 2-node cluster
As discussed earlier, in the event of a network partition in a 2-node Red Hat cluster, a qdisk
would be required to prevent the nodes from fencing each other, a situation that is more likely to
occur with per-node power management based fencing methods such as HP iLO.
However, in a concurrent deployment of Serviceguard for Linux and Red Hat Cluster, the need for
a qdisk does not arise. This is because Red Hat Cluster detects the failure only after Serviceguard
has already reset one of the nodes (the one that failed to obtain the lock via the Quorum Server),
so the possibility of the nodes fencing each other does not arise. For existing 2-node Red Hat
cluster configurations that use a qdisk, it is recommended to reconfigure the cluster without the
qdisk when deploying it with Serviceguard for Linux.
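As a rough sketch of why a standalone 2-node Red Hat cluster is prone to this fence race (an illustration of the common two-node quorum behavior, not of any configuration described in this paper): with only two expected votes, each single-node partition is typically allowed to remain quorate on its own, so both sides attempt to fence each other.

    def is_quorate(partition_votes, expected_votes, two_node_mode=False):
        # In the special two-node mode a single vote is sufficient;
        # otherwise a simple majority of the expected votes is required.
        if two_node_mode:
            return partition_votes >= 1
        return partition_votes > expected_votes // 2

    # Network partition in a 2-node cluster: each side holds one vote.
    print(is_quorate(1, 2, two_node_mode=True))   # True for the left node
    print(is_quorate(1, 2, two_node_mode=True))   # True for the right node
    # Both partitions stay quorate and race to fence each other. A qdisk
    # normally breaks this symmetry; in the concurrent deployment the
    # Serviceguard Quorum Server has already reset the losing node.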
Unequal sized partitions
A network failure that creates “unequal sized partitions” in a concurrent deployment will result in
the same membership for both Red Hat Cluster and Serviceguard, provided that the members
contribute equally to the Red Hat Cluster quorum. The larger partition gains quorum and forms
the cluster, while the nodes of the smaller partition are removed from the cluster. As discussed
earlier, it is possible to have “heavily-weighted voting” nodes in Red Hat Cluster, where nodes do
not contribute equally to the quorum. Such asymmetric configurations are not supported in the
concurrent deployment, since they can result in different memberships for the Red Hat and
Serviceguard clusters, bringing down the entire cluster.
Similarly, qdisk-based heuristics can be used in Red Hat Cluster so that the smaller partition
gains quorum while the larger partition is removed from the cluster. Such configurations are
likewise not supported in the concurrent deployment, since they would result in different
memberships for the Red Hat and Serviceguard clusters, bringing down the entire cluster.
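As a hedged illustration of why such configurations diverge (the vote values below are hypothetical, not a recommended configuration), consider a 3-node cluster partitioned into {node1} and {node2, node3}, where node1 carries a heavily-weighted vote: Red Hat Cluster counts votes, whereas Serviceguard arbitrates by node count, with the Quorum Server breaking exact ties.

    # Hypothetical 3-node cluster in which node1 carries a heavy vote.
    votes = {"node1": 3, "node2": 1, "node3": 1}
    expected_votes = sum(votes.values())      # 5 votes, quorum = 3

    smaller = ["node1"]
    larger = ["node2", "node3"]

    def rhcs_quorate(partition):
        # Red Hat Cluster counts configured votes.
        return sum(votes[n] for n in partition) > expected_votes // 2

    def serviceguard_survivor(p1, p2):
        # Serviceguard (roughly) keeps the partition with more nodes;
        # an exact tie would be settled by the Quorum Server lock.
        return p1 if len(p1) > len(p2) else p2

    print(rhcs_quorate(smaller))                   # True:  node1 alone is quorate
    print(rhcs_quorate(larger))                    # False: node2 + node3 are not
    print(serviceguard_survivor(smaller, larger))  # ['node2', 'node3']
    # The two stacks keep different survivors, which is why asymmetric
    # vote configurations are not supported in the concurrent deployment.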
Equal sized partitions
To handle equal-sized partition failures in Red Hat Cluster, a qdisk is required to decide which
partition wins quorum, so that cluster operations can continue without manual intervention. In
Serviceguard, the partition that obtains the quorum lock (via the Quorum Server) wins quorum,
while the nodes of the other partition reset themselves. The same partition then also wins quorum
in the Red Hat cluster: in a concurrent deployment of Serviceguard for Linux and Red Hat Cluster
with a qdisk, the partition that wins quorum in the Serviceguard cluster will also gain quorum in
Red Hat Cluster.
If a qdisk is not configured, then in the event of an “equal sized partition” failure, neither partition
gains Red Hat Cluster quorum, leaving GFS in a suspended state until the operator intervenes. In
Serviceguard, the partition that obtains the cluster lock wins quorum and forms a cluster, and the
nodes in the partition that failed to get the cluster lock reset themselves. However, application
failover (via package failover) will hang, since GFS is in a suspended state that disallows all GFS
operations. In such situations, operator intervention is required to manually reset the partition
nodes and restart the cluster.
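The vote arithmetic behind both cases can be sketched as follows (assuming a 4-node cluster split into two 2-node partitions, one vote per node, and an optional single-vote qdisk; a hypothetical illustration rather than a reference configuration):

    def has_quorum(partition_votes, expected_votes):
        # Quorum requires a simple majority of all expected votes.
        return partition_votes > expected_votes // 2

    # 4-node cluster split into two equal partitions of 2 nodes each.

    # Without a qdisk: 4 expected votes, quorum = 3, neither side reaches it,
    # so GFS stays suspended until the operator intervenes.
    print(has_quorum(2, 4))        # False for both partitions

    # With a single-vote qdisk: 5 expected votes, quorum = 3. Only the
    # partition that holds the qdisk (the one Serviceguard kept alive,
    # the other partition's nodes having been reset) reaches quorum.
    print(has_quorum(2 + 1, 5))    # True  for the winning partition
    print(has_quorum(2, 5))        # False for the losing partition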