Serviceguard Extended Distance Cluster (EDC) with VxVM/CVM Mirroring on HP-UX, May 2008

Inter-site link failure

Extending a single Serviceguard cluster across multiple data centers to protect against data center
outages adds a new component to the cluster: the inter-site link. To keep the cluster from splitting
into halves, the inter-site link must be redundant. This applies to the heartbeat network as
well as to the storage link between the sites.

If, for any reason, the redundant inter-site link fails, the cluster arbitrator (a quorum service at a
third location) prevents a split-brain situation and allows only one of the two sites to reform the
cluster and run the applications. For the best possible recovery from such a complete inter-site link
failure, it is highly recommended to use a common redundant link for both the heartbeat and the storage
traffic. This ensures that if both the primary and the redundant inter-site links fail, the cluster
heartbeat traffic between the sites is interrupted together with the storage traffic, so that the
Serviceguard quorum mechanism reforms the cluster from the nodes of a single site.
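
As a rough illustration of the arbitration described above, the following Python sketch models which
site may reform the cluster after a complete inter-site link failure. It assumes the usual majority
rule (a partition holding more than half of the nodes survives) and treats the arbitrator's decision
as an input; the function name, the site labels "A" and "B", and the parameter arbitrator_choice are
hypothetical and not part of Serviceguard.

```python
def site_allowed_to_reform(nodes_site_a: int, nodes_site_b: int,
                           arbitrator_choice: str) -> str:
    """Return "A" or "B": the site that may reform the cluster.

    Simplified model (assumption): both sites are up but cannot reach
    each other, so each site forms one partition of the old cluster.
    """
    if nodes_site_a <= 0 or nodes_site_b <= 0:
        raise ValueError("each site needs at least one node")
    if arbitrator_choice not in ("A", "B"):
        raise ValueError("arbitrator_choice must be 'A' or 'B'")

    total = nodes_site_a + nodes_site_b
    # A partition with a strict majority reforms without arbitration.
    if 2 * nodes_site_a > total:
        return "A"
    if 2 * nodes_site_b > total:
        return "B"
    # Exactly half on each side: the arbitrator breaks the tie.
    return arbitrator_choice


if __name__ == "__main__":
    # Two nodes per site; the arbitrator happens to grant quorum to site B.
    print(site_allowed_to_reform(2, 2, arbitrator_choice="B"))
```

With an equal number of nodes at each site, neither partition has a majority on its own, which is
why the arbitrator at the third location decides the outcome.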

Without careful configuration, there is some risk that, after an inter-site link failure, a node that
is about to be ejected from the cluster detaches a plex (located at the other site) from a volume
while both sites are still up but have lost communication.

To prevent this, the I/O timeout must not expire before the cluster reformation has completed. This
can be achieved by setting the NODE_TIMEOUT cluster configuration parameter high enough to
allow the cluster reformation to complete before a potential I/O could time out. I/O timeout values for
VxVM/CVM volumes are configured at the DMP layer. The tunable pfto (powerfail timeout) can be set
for a set of VxVM disks with the vxpfto command.
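
The timing condition can be expressed as a simple check. The Python sketch below is illustrative only:
the function name, the safety margin, and the sample values are assumptions, and the cluster
reformation time would have to be measured or estimated for the actual cluster.

```python
def io_timeout_outlasts_reformation(io_timeout_s: float,
                                    reformation_s: float,
                                    margin_s: float = 0.0) -> bool:
    """Return True if the I/O timeout expires only after reformation is done.

    io_timeout_s  -- I/O timeout configured for the disks, in seconds
    reformation_s -- time from link failure until the cluster has reformed
    margin_s      -- extra safety margin (hypothetical, chosen by the admin)
    """
    if min(io_timeout_s, reformation_s, margin_s) < 0:
        raise ValueError("times must not be negative")
    return io_timeout_s > reformation_s + margin_s


if __name__ == "__main__":
    # Sample values only; they are not recommendations.
    ok = io_timeout_outlasts_reformation(io_timeout_s=60.0,
                                         reformation_s=40.0,
                                         margin_s=10.0)
    print("safe" if ok else "plex detach possible before reformation ends")
```

If the check fails, a node that is about to be ejected could still time out an I/O and detach the
remote plex before the cluster has finished reforming.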

Currently, DWDM, CWDM, SONET, and SDH technologies can be used for a common link for
heartbeat and storage traffic between the two sites of an EDC. Further support information can be
found in the “Understanding and Designing Serviceguard Disaster Tolerant Architectures” guide
referred to in the