Optimizing Serviceguard Failover Time, Version A.11.19 and later, April 2009

Before Serviceguard A.11.19, a failed node was allowed to rejoin the cluster during cluster
re-formation. As of Serviceguard A.11.19, a failed node is not allowed to rejoin the cluster during
re-formation; it can rejoin only after it has rebooted.
Election of cluster membership
After a node believes another node has failed, it begins the cluster re-formation process. The cluster
nodes elect the nodes that will be members of a newly re-formed cluster.
Each healthy node tries to take over the work of the cluster. It tries to change cluster membership to
include the nodes it can communicate with and exclude the nodes it cannot reach.
If one node has failed but all the others can still communicate with each other, the others quickly
form a group that excludes that node. However, it could be that a group of healthy nodes cannot
communicate with some other healthy nodes. In this case, several groups could try to form a cluster.
The group that achieves quorum will become the new cluster. There are two ways for a group to
achieve quorum:
If the group includes more than half of the nodes that were active the last time the cluster was
formed, it has quorum because it has the majority.
If two groups each have exactly half of the nodes that were active the last time the cluster was
formed, the group that acquires the cluster lock achieves quorum. (See “Lock acquisition” for more
about the cluster lock.)
The group that achieves quorum takes over the work of the cluster. The excluded nodes are not
allowed to proceed in cluster re-formation, and they will be rebooted.
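The two quorum rules above can be sketched in a few lines of Python. This is only an illustration of the decision described in the text, not Serviceguard's implementation; the names `Quorum` and `quorum_status` are hypothetical and introduced here for the example.

```python
from enum import Enum


class Quorum(Enum):
    """Possible outcomes for one group of nodes (hypothetical names)."""
    MAJORITY = "majority"            # more than half: quorum without the lock
    NEEDS_LOCK = "needs cluster lock"  # exactly half: must acquire the cluster lock
    NONE = "no quorum"               # less than half: cannot form the cluster


def quorum_status(group_size: int, previous_cluster_size: int) -> Quorum:
    """Classify a group against the nodes active at the last cluster formation.

    group_size: number of nodes in this group that can reach each other.
    previous_cluster_size: number of nodes active the last time the
    cluster was formed.
    """
    if previous_cluster_size <= 0:
        raise ValueError("previous_cluster_size must be positive")
    if not 0 <= group_size <= previous_cluster_size:
        raise ValueError("group_size must be between 0 and previous_cluster_size")

    # Compare 2 * group_size with the total to avoid fractional arithmetic.
    if 2 * group_size > previous_cluster_size:
        return Quorum.MAJORITY
    if 2 * group_size == previous_cluster_size:
        return Quorum.NEEDS_LOCK
    return Quorum.NONE
```

For example, in a cluster that last formed with four nodes, a group of three is classified as `Quorum.MAJORITY`, while a two-two split leaves both groups at `Quorum.NEEDS_LOCK`, which is the tie the cluster lock resolves.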
The amount of time taken for election depends on the value of MEMBER_TIMEOUT.
One special case requires no election of cluster membership: when one node fails in a two-node
cluster.
Lock acquisition
If two equal-sized groups try to re-form the cluster, the cluster lock acts as arbitrator or tie-breaker.
Whichever group acquires the cluster lock will achieve quorum and form the new cluster membership.
Serviceguard uses three types of cluster locks:
Quorum server (HP-UX 11i and Linux®)
LVM lock disk (HP-UX 11i only)
Lock disk LUN (HP-UX 11i and Linux®)
A two-node cluster is required to have a cluster lock. In clusters of three or more nodes, a lock is
strongly recommended. The lock disks can be used for clusters with two, three, or four nodes. A
quorum server can be used on a cluster of any size.
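The sizing rules in the preceding paragraph can be expressed as a small check. The sketch below is illustrative only: the lock-type labels and the function `check_cluster_lock` are hypothetical, and it assumes that "lock disks" in the text covers both the LVM lock disk and the lock disk LUN. Single-node clusters are not discussed in the text, so the sketch rejects them rather than guess.

```python
from typing import Optional

# Hypothetical labels for the three lock types listed above.
QUORUM_SERVER = "quorum_server"
LVM_LOCK_DISK = "lvm_lock_disk"
LOCK_LUN = "lock_lun"

# Assumption: both disk-based locks share the two-to-four-node limit.
DISK_LOCKS = {LVM_LOCK_DISK, LOCK_LUN}
MAX_NODES_FOR_DISK_LOCK = 4


def check_cluster_lock(node_count: int, lock_type: Optional[str]) -> str:
    """Return "ok", or a message starting with "warning" or "error"."""
    if node_count < 2:
        raise ValueError("the text only covers clusters of two or more nodes")

    if lock_type is None:
        if node_count == 2:
            return "error: a two-node cluster must have a cluster lock"
        return "warning: a cluster lock is strongly recommended"

    if lock_type in DISK_LOCKS:
        if node_count > MAX_NODES_FOR_DISK_LOCK:
            return "error: lock disks support clusters of two to four nodes"
        return "ok"

    if lock_type == QUORUM_SERVER:
        # A quorum server can be used on a cluster of any size.
        return "ok"

    raise ValueError(f"unknown lock type: {lock_type!r}")
```

Under these assumptions, a five-node cluster configured with a lock disk is reported as an error, whereas the same cluster with a quorum server passes.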
Quiescence
Quiescence is a quiet waiting time after new cluster membership is determined. Nodes that are not in
the new membership are forcibly rebooted. The waiting time is a protection against data corruption.
Its purpose is to make sure that the reboot finishes so an excluded node is not trying to run a package
or issue any I/O.
Quiescence is important when some nodes in the cluster cannot communicate with the others but
could still run applications, particularly if the nodes have access to a common database.
Quiescence is calculated by Serviceguard, and the user cannot directly change it.