HP StorageWorks XP Cluster Extension Software Administrator Guide (T1656-96035, April 2010)
Failover errors
XP Cluster Extension will fail to bring an RHCS service or SLE HA resource group online on the local
system if a configuration error occurs. In this case, XP Cluster Extension returns a local error.
The RHCS service or SLE HA resource group will go into a failed state after a startup attempt on any
system in the same data center if the disk array status indicates that a problem experienced locally
would not be solved on another system connected to the same disk array. In this case, XP Cluster
Extension returns a data center error. This error could also occur if the ApplicationStartup object is
set to FASTFAILBACK.
If a disk array state that does not allow starting the RHCS service or SLE HA resource group on any
system in the cluster is discovered, a cluster error is reported and none of the systems will be allowed
to run the service or resource group. Such a state could be an SMPL state on both primary and
secondary disks, a suspended (PSUS/SSUS) state on either site, or a state mismatch in the device
group for this RHCS service or SLE HA resource group. None of these scenarios allows automatic
recovery because XP Cluster Extension cannot determine which copy of the data is the most current.
In these cases, a storage or cluster administrator must investigate the problem.
CAUTION:
Do not restart the failed RHCS service or SLE HA resource group without first investigating the
problem. When an RHCS service or SLE HA resource group using XP Cluster Extension fails, check the
status of the XP disk pair on each copy and decide whether it is safe to continue.
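The cluster error conditions described above can be illustrated with a minimal Python sketch. It is
not XP Cluster Extension's actual decision logic. The function name cluster_error_reason and its
inputs (lists of pair-state strings that an administrator has already collected for the primary and
secondary disks of one device group) are hypothetical, and "state mismatch" is assumed here to mean
that the disks on one side of the device group do not all report the same state.

```python
# Illustrative only: mirrors the three conditions named in the text, nothing more.
SUSPENDED_STATES = {"PSUS", "SSUS"}


def cluster_error_reason(primary_states, secondary_states):
    """Return a short reason string if the states match a cluster error
    condition from the text, or None if none of the three conditions applies.

    primary_states / secondary_states: pair states (for example "PAIR",
    "SMPL", "PSUS") of every disk in the device group on each side.
    """
    if not primary_states or not secondary_states:
        raise ValueError("both sides need at least one disk state")

    primary = {state.upper() for state in primary_states}
    secondary = {state.upper() for state in secondary_states}

    # Condition 1: SMPL on both primary and secondary disks.
    if primary == {"SMPL"} and secondary == {"SMPL"}:
        return "SMPL on both primary and secondary disks"

    # Condition 2: a suspended (PSUS/SSUS) state on either site.
    if (primary | secondary) & SUSPENDED_STATES:
        return "suspended (PSUS/SSUS) state on at least one site"

    # Condition 3 (assumed meaning): disks on one side disagree about state.
    if len(primary) > 1 or len(secondary) > 1:
        return "state mismatch in the device group"

    return None


if __name__ == "__main__":
    print(cluster_error_reason(["PAIR", "PAIR"], ["PAIR", "PAIR"]))
    print(cluster_error_reason(["PSUS", "PSUS"], ["PAIR", "PAIR"]))
```

A non-None result only signals that manual investigation is needed. Whether it is safe to continue
remains a decision for the storage or cluster administrator.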
The FC link is down (RHCS)
In RHCS, the detection of a storage outage due to failure of all paths to the storage depends on the
monitoring capability of resources configured in the RHCS service. For example, the LVM and filesystem
resource agents distributed with RHCS can detect the loss of storage and take appropriate actions.
The stop operation on a service might fail due to the inability to stop individual resources cleanly.
This may be caused by the loss of paths to the storage. When the stop operation on a service fails,
RHCS marks the service as failed and the service does not automatically fail over to another node.
To recover from this situation, use the following procedure:
1. Remove the node that lost access to the storage by shutting it down.
2. Follow the steps required to bring up a service in a failed state, as documented in the RHCS
administration guide. This process involves disabling the service, and then enabling it on the
node where the service is allowed to come online.
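Step 2 of the procedure above can be sketched as follows. This is a minimal illustration that only
builds and prints the command lines, assuming the RHCS clusvcadm utility is used with -d to disable
the service and -e with -m to enable it on a chosen member. The helper name recovery_commands and
the service and node names are hypothetical, and the RHCS administration guide remains the
authoritative reference for the procedure.

```python
import shlex


def recovery_commands(service, target_node):
    """Build the clusvcadm invocations that disable a failed service and
    then enable it on the node where it is allowed to come online.

    The commands are returned as argument lists and are not executed.
    """
    if not service or not target_node:
        raise ValueError("service and target_node must be non-empty")
    return [
        ["clusvcadm", "-d", service],                     # disable the failed service
        ["clusvcadm", "-e", service, "-m", target_node],  # enable it on the chosen node
    ]


if __name__ == "__main__":
    # "xp_service" and "node2" are placeholder names for illustration.
    for command in recovery_commands("xp_service", "node2"):
        print(" ".join(shlex.quote(part) for part in command))
```

The sketch deliberately leaves out step 1. Shutting down the node that lost access to the storage,
and checking the XP disk pair status, should be done before any of these commands are run.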