Serviceguard Extended Distance Cluster (EDC) with VxVM/CVM Mirroring on HP-UX, May 2008

Applications that run on nodes of the failing site (nodes that were halted by Serviceguard) are
expected to fail over automatically to the remaining site.
The applications that run at the remaining site are expected to keep running without manual
intervention or interruption. A short pause or hang might be experienced.
If, for any reason, a volume has no active, enabled plex at the remaining site, I/O operations to
that volume will be blocked.
VxVM Monitor Tip:
A volume becoming unavailable is a typical use case for the VxVM
monitor. The VxVM monitor can be configured to fail a package if a
VxVM/CVM volume becomes unavailable. This monitor was introduced
with Serviceguard A.11.18 patch PHSS_36997 or PHSS_36998. For
details, see the HP Serviceguard Version A.11.18 Release Notes,
December 2007, referred to in the Related documents section.
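As a sketch of how such a monitor might be wired into a package (the service name, disk group "dgA", and volume "vol1" below are placeholders, and the exact syntax depends on your Serviceguard version; consult the release notes cited above), a legacy package control script could register the monitor daemon as a package service:

```shell
# Hypothetical fragment of a legacy Serviceguard package control script.
# Disk group "dgA" and volume "vol1" are illustrative placeholders.
# cmvxserviced is the VxVM volume monitor daemon delivered with the
# PHSS_36997/PHSS_36998 patches; it exits when the monitored volume
# becomes unavailable, causing the package service (and thus the
# package) to fail.
SERVICE_NAME[0]="vxvm_vol_monitor"
SERVICE_CMD[0]="/usr/sbin/cmvxserviced /dev/vx/dsk/dgA/vol1"
SERVICE_RESTART[0]=""
```

With this in place, loss of the last plex of the monitored volume fails the package rather than leaving its I/O blocked indefinitely.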
Restore the Inter-Site Link and restart the detached site
The ISL failure caused one site to be shut down in order to prevent the two sites from operating
independently of each other. The recovery steps are very similar to the procedure for restoring a
failed site, with one exception: the Inter-Site Link must be restored first. After the ISL is restored, the
same basic steps as described above must be followed:
Start up all cluster components from the detached site
Validate connectivity (network and Fibre Channel) within the cluster
Make the LUNs visible to the OS
Make the LUNs known to DMP
Re-attach the LUNs that show NODEVICE in the status field and resynchronize volumes
Have the failed nodes rejoin the cluster (cmrunnode)
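The steps above might be performed with a command sequence along the following lines (a sketch only: the disk group "dgA" and node names are placeholders, and each step should be verified against your environment and the product documentation before proceeding):

```shell
# 1. Rescan hardware paths and create device files for the returned LUNs
ioscan -fnC disk
insf -e

# 2. Make the LUNs known to DMP again
vxdctl enable

# 3. Identify disks still marked NODEVICE, re-attach them, and
#    resynchronize the affected volumes (resync runs in background)
vxdisk -o alldgs list | grep NODEVICE
vxreattach
vxrecover -b -g dgA

# 4. Have the failed nodes rejoin the cluster (node names illustrative)
cmrunnode node1 node2
```

Resynchronization of large mirrored volumes can take considerable time; `vxtask list` can be used to observe its progress.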
Special considerations for Inter-Site Link failures
In a partial ISL failure situation, shared volumes can become disabled cluster-wide. A situation like this
could occur if the redundant storage ISL fails completely and independently from the heartbeat ISL. In
this case, the heartbeat communication continues between the two sites (not causing Serviceguard to
halt the nodes at one of the sites), but each node loses connectivity to the storage at the remote site.
For example, if a node in DC1 issues the first I/O after a complete storage ISL failure to volume1, it
will cause the plex located in DC2 that is associated with volume1 to be detached from volume1.
Then, if a node from DC2 tries to do I/O to volume1, that I/O fails, since the node cannot access the
remaining plex of volume1 located in DC1. This causes the last remaining plex of volume1 to be
disabled, which in turn disables the entire volume, preventing further I/O to volume1 from any of the
cluster nodes. Since all nodes are expected to do I/O to all volumes in a CVM/CFS environment, it
takes only a short time for all volumes to become disabled.
A situation like this will require a manual recovery. If the storage ISL is restored prior to the recovery,
all nodes may participate in the cluster. However, if the storage ISL remains unavailable, only nodes
from one site can participate in the cluster. Due to the variety of states the volumes could be left in, it
is not possible to give a generic recovery procedure. While not exhaustive, the following important
points should be considered:
A high level of familiarity with VxVM/CVM is required to perform a manual recovery.
The syslog file should be scrutinized to identify the last complete mirror (plex) for each volume:
“VxVM vxio V-5-0-9 vol1-02 is last complete mirror”
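Extracting the last-complete-mirror messages from syslog can be scripted; the sketch below runs against a hypothetical sample file (the real messages live in /var/adm/syslog/syslog.log on HP-UX, and the volume/plex names are illustrative):

```shell
# Hypothetical syslog excerpt using the message format shown above;
# the file path and plex names are placeholders.
printf '%s\n' \
  'VxVM vxio V-5-0-9 vol1-02 is last complete mirror' \
  'VxVM vxio V-5-0-9 vol2-01 is last complete mirror' > /tmp/syslog.sample

# Print the name of the last complete plex reported for each volume
grep 'V-5-0-9 .* is last complete mirror' /tmp/syslog.sample | awk '{print $4}'
```

The plex names printed (here vol1-02 and vol2-01) identify, per volume, which mirror should be treated as authoritative during manual recovery.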