In the concurrent deployment of Serviceguard and Red Hat GFS, a failure of exactly half of the
cluster members can be sustained, provided the qdisk is configured. When exactly half of the
members fail, the qdisk vote breaks the tie, allowing the surviving members to gain quorum and
proceed to form the cluster. Without a qdisk, the Red Hat cluster will not gain quorum and will
therefore disallow all further GFS operations until the operator intervenes. It is recommended
that, in a concurrent deployment of Serviceguard for Linux and Red Hat GFS, the qdisk be
configured to act as a tie breaker in the event of a failure of exactly half the nodes. For instance,
this will allow a 4-node cluster to continue operating in the event of a 2-node failure.
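To make the vote arithmetic concrete, the following is a minimal sketch (a hypothetical illustration, not taken from the product documentation), assuming one vote per node, a single-vote qdisk, and the usual simple-majority quorum rule:

    def has_quorum(surviving_votes, expected_votes):
        # Quorum requires a simple majority of all expected votes.
        return surviving_votes > expected_votes // 2

    # 4-node cluster, one vote per node, plus one qdisk vote.
    expected = 4 + 1                      # 5 expected votes, quorum = 3

    # Two nodes fail: the two survivors alone hold only 2 votes ...
    print(has_quorum(2, expected))        # False without the qdisk vote
    # ... but the qdisk vote breaks the tie and quorum is retained.
    print(has_quorum(2 + 1, expected))    # True with the qdisk vote

If the qdisk were not configured, the expected votes would be 4 and the two survivors would sit exactly at the tie, which is the case that requires operator intervention.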
Network partition in a 2-node cluster
As discussed earlier, in the event of a network partition in a 2-node Red Hat cluster, a qdisk
would be required to prevent the nodes from fencing each other, a situation that is more likely to
occur with per-node power management based fencing methods such as HP iLO.
However, in a concurrent deployment of Serviceguard for Linux and Red Hat Cluster, the need for
a qdisk does not arise. This is because Red Hat Cluster detects the failure only after Serviceguard
has already reset one of the nodes (the one that failed to obtain the lock via the Quorum Server),
so the possibility of the nodes fencing each other does not arise. For existing 2-node Red Hat
cluster configurations that use a qdisk, it is recommended to reconfigure the cluster without the
qdisk when deploying it with Serviceguard for Linux.
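As a rough sketch of why a standalone 2-node Red Hat cluster is prone to this fence race (an illustration of the common two-node quorum behavior, not of any configuration described in this paper): with only two expected votes, each single-node partition is typically allowed to remain quorate on its own, so both sides attempt to fence each other.

    def is_quorate(partition_votes, expected_votes, two_node_mode=False):
        # In the special two-node mode a single vote is sufficient;
        # otherwise a simple majority of the expected votes is required.
        if two_node_mode:
            return partition_votes >= 1
        return partition_votes > expected_votes // 2

    # Network partition in a 2-node cluster: each side holds one vote.
    print(is_quorate(1, 2, two_node_mode=True))   # True for the left node
    print(is_quorate(1, 2, two_node_mode=True))   # True for the right node
    # Both partitions stay quorate and race to fence each other. A qdisk
    # normally breaks this symmetry; in the concurrent deployment the
    # Serviceguard Quorum Server has already reset the losing node.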
Unequal sized partitions
A network failure that creates “unequal sized partitions” in a concurrent deployment will result in
the same membership for both Red Hat Cluster and Serviceguard, provided that the members
contribute equally to the Red Hat Cluster quorum. The larger partition gains quorum and forms
the cluster, while the nodes of the smaller partition are removed from the cluster. As discussed
earlier, it is possible to have “heavily-weighted voting” nodes in Red Hat Cluster, where nodes do
not contribute equally to the quorum. Such asymmetric configurations are not supported in the
concurrent deployment, since they can result in different memberships for the Red Hat and
Serviceguard clusters, bringing down the entire cluster.
Similarly, qdisk-based heuristics can be used in Red Hat Cluster so that the smaller partition
gains quorum while the larger partition is removed from the cluster. Such configurations are
likewise not supported in the concurrent deployment, since they would result in different
memberships for the Red Hat and Serviceguard clusters, bringing down the entire cluster.
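As a hedged illustration of why such configurations diverge (the vote values below are hypothetical, not a recommended configuration), consider a 3-node cluster partitioned into {node1} and {node2, node3}, where node1 carries a heavily-weighted vote: Red Hat Cluster counts votes, whereas Serviceguard arbitrates by node count, with the Quorum Server breaking exact ties.

    # Hypothetical 3-node cluster in which node1 carries a heavy vote.
    votes = {"node1": 3, "node2": 1, "node3": 1}
    expected_votes = sum(votes.values())      # 5 votes, quorum = 3

    smaller = ["node1"]
    larger = ["node2", "node3"]

    def rhcs_quorate(partition):
        # Red Hat Cluster counts configured votes.
        return sum(votes[n] for n in partition) > expected_votes // 2

    def serviceguard_survivor(p1, p2):
        # Serviceguard (roughly) keeps the partition with more nodes;
        # an exact tie would be settled by the Quorum Server lock.
        return p1 if len(p1) > len(p2) else p2

    print(rhcs_quorate(smaller))                   # True:  node1 alone is quorate
    print(rhcs_quorate(larger))                    # False: node2 + node3 are not
    print(serviceguard_survivor(smaller, larger))  # ['node2', 'node3']
    # The two stacks keep different survivors, which is why asymmetric
    # vote configurations are not supported in the concurrent deployment.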
Equal sized partitions
To handle equal-sized partition failures in Red Hat Cluster, a qdisk is required to decide which
partition wins quorum, so that cluster operations can continue without manual intervention. In
Serviceguard, the partition that obtains the quorum lock (via the Quorum Server) wins quorum,
while the nodes of the other partition reset themselves. The same partition then also wins quorum
in the Red Hat cluster: in a concurrent deployment of Serviceguard for Linux and Red Hat Cluster
with a qdisk, the partition that wins quorum in the Serviceguard cluster will also gain quorum in
Red Hat Cluster.
If a qdisk is not configured, then in the event of an “equal sized partition” failure, neither partition
gains Red Hat Cluster quorum, leaving GFS in a suspended state until the operator intervenes. In
Serviceguard, the partition that obtains the cluster lock wins quorum and forms a cluster, and the
nodes in the partition that failed to get the cluster lock reset themselves. However, application
failover (via package failover) will hang, since GFS is in a suspended state that disallows all GFS
operations. In such situations, operator intervention is required to manually reset the partition
nodes and restart the cluster.
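The vote arithmetic behind both cases can be sketched as follows (assuming a 4-node cluster split into two 2-node partitions, one vote per node, and an optional single-vote qdisk; a hypothetical illustration rather than a reference configuration):

    def has_quorum(partition_votes, expected_votes):
        # Quorum requires a simple majority of all expected votes.
        return partition_votes > expected_votes // 2

    # 4-node cluster split into two equal partitions of 2 nodes each.

    # Without a qdisk: 4 expected votes, quorum = 3, neither side reaches it,
    # so GFS stays suspended until the operator intervenes.
    print(has_quorum(2, 4))        # False for both partitions

    # With a single-vote qdisk: 5 expected votes, quorum = 3. Only the
    # partition that holds the qdisk (the one Serviceguard kept alive,
    # the other partition's nodes having been reset) reaches quorum.
    print(has_quorum(2 + 1, 5))    # True  for the winning partition
    print(has_quorum(2, 5))        # False for the losing partition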