HP StorageWorks XP Cluster Extension Software Administrator Guide (T1656-96035, April 2010)

Failover errors

XP Cluster Extension will fail to bring an RHCS service or SLE HA resource group online on the local
system if a configuration error occurs. In this case, XP Cluster Extension returns a local error.

The RHCS service or SLE HA resource group will go into a failed state after a startup attempt on any
system in the same data center if the disk array status indicates that a problem experienced locally
would not be solved on another system connected to the same disk array. In this case, XP Cluster
Extension returns a data center error. This error could also occur if the ApplicationStartup object is
set to FASTFAILBACK.

If XP Cluster Extension discovers a disk array state that does not allow starting the RHCS service or
SLE HA resource group on any system in the cluster, it reports a cluster error and none of the systems
will be allowed to run the service or resource group. Such a state could be an SMPL state on both
primary and secondary disks, a suspended (PSUS/SSUS) state on either site, or a state mismatch in the
device group for this RHCS service or SLE HA resource group. None of these scenarios allows automatic
recovery because XP Cluster Extension cannot determine which copy of the data is the most current.
In these cases, a storage or cluster administrator must investigate the problem.
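
The three cluster-error conditions above can be sketched as a small check. This is only an illustrative
model, not the logic XP Cluster Extension actually implements. The function name, the input format (one
(primary, secondary) state tuple per disk pair), and the reading of "state mismatch" as "not all pairs in
the device group report the same state combination" are assumptions made for this example. PAIR is used
in the usage lines as an example of a healthy pair state.

```python
# Illustrative model only; names and input format are assumptions.
SUSPENDED_STATES = {"PSUS", "SSUS"}


def cluster_error_reason(pairs):
    """Return a reason string if the device group matches one of the
    cluster-error conditions described in the text, otherwise None.

    pairs: list of (primary_state, secondary_state) tuples, one per
    disk pair in the device group (hypothetical input format).
    """
    if not pairs:
        raise ValueError("device group has no disk pairs")

    # Assumed meaning of "state mismatch": the pairs in the device
    # group do not all report the same state combination.
    if len(set(pairs)) > 1:
        return "state mismatch in the device group"

    primary, secondary = pairs[0]
    if primary == "SMPL" and secondary == "SMPL":
        return "SMPL state on both primary and secondary disks"
    if primary in SUSPENDED_STATES or secondary in SUSPENDED_STATES:
        return "suspended (PSUS/SSUS) state"
    return None


if __name__ == "__main__":
    print(cluster_error_reason([("PAIR", "PAIR"), ("PAIR", "PAIR")]))
    print(cluster_error_reason([("PSUS", "SSUS"), ("PSUS", "SSUS")]))
```

A non-None result corresponds to the situation in which an administrator must investigate before any
system is allowed to run the service or resource group.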

CAUTION:

Do not restart a failed RHCS service or SLE HA resource group without first investigating the problem.
When an RHCS service or SLE HA resource group using XP Cluster Extension fails, check the status of
the XP disk pair on each copy and decide whether it is safe to continue.

The FC link is down (RHCS)

In RHCS, the detection of a storage outage due to failure of all paths to the storage depends on the
monitoring capability of resources configured in the RHCS service. For example, the LVM and filesystem
resource agents distributed with RHCS can detect the loss of storage and take appropriate actions.

The stop operation on a service might fail due to the inability to stop individual resources cleanly.
This may be caused by the loss of paths to the storage. When the stop operation on a service fails,
RHCS marks the service as failed and the service does not automatically fail over to another node.

To recover from this situation, use the following procedure:

1. Shut down the node that lost access to the storage to remove it from the cluster.

2. Follow the steps required to bring up a service in a failed state, as documented in the RHCS
   administration guide. This process involves disabling the service, and then enabling it on the
   node where the service is allowed to come online.
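
The disable-then-enable sequence in step 2 can be sketched as follows. This is a minimal example, not a
replacement for the procedure in the RHCS administration guide. It assumes the standard RHCS `clusvcadm`
utility with its `-d` (disable), `-e` (enable), and `-m` (target member) options. The helper names, the
service name, and the node name are hypothetical, and the sketch defaults to printing the commands
rather than running them.

```python
import subprocess


def recovery_commands(service, target_node):
    """Build the clusvcadm calls for a failed service: disable it,
    then enable it on the node where it is allowed to come online."""
    return [
        ["clusvcadm", "-d", service],
        ["clusvcadm", "-e", service, "-m", target_node],
    ]


def recover_failed_service(service, target_node, dry_run=True):
    """Print the commands (dry_run=True) or execute them in order.

    Run this only after the node that lost storage access has been
    shut down and the XP disk pair status has been investigated.
    """
    for cmd in recovery_commands(service, target_node):
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops the sequence if a command fails.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # "clx_service" and "node2" are placeholder names.
    recover_failed_service("clx_service", "node2")
```

Keeping the dry-run default lets an administrator review the exact commands before anything is run
against the cluster.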
