
the storage on a site, it sets the Site Safety Latch to a transient state, which is displayed as
INTERMEDIATE. When the Site Safety Latch is in the INTERMEDIATE state, the corresponding Site
Controller package can be restarted only after cleaning the site where it previously failed to start.
For more information on cleaning the Site Controller package, see “Cleaning the site to restart the
Site Controller package” (page 61).
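Before cleaning the site, you can review the status of the Site Controller package and of the workload packages it manages with the standard Serviceguard cmviewcl command. The package name sitectl_pkg below is only a placeholder for your Site Controller package name:

# cmviewcl -v -p sitectl_pkg
# cmviewcl -v

The first command shows the detailed status of the Site Controller package; the second shows the overall cluster, node, and package status across both sites.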
Node failure and rejoining the cluster
When a node in a cluster fails, all multi-node package (MNP) instances running on the failed node also fail. Failover packages fail over to the next available adoptive node. If no other adoptive node is configured and available in the cluster, the failover package fails and is halted.
When a node in the Metrocluster environment is restarted, the active complex-workload packages on the node are halted before the node restarts. After the node restarts and rejoins the cluster, the active complex-workload package instances on the site that have the auto_run flag set to yes start automatically. If the complex-workload packages have the auto_run flag set to no, these instances must be started manually on the restarted node.
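For example, assuming a complex-workload MNP named cw_db_pkg (a placeholder name) with auto_run set to no, its instance can be started manually on the restarted node, here node2, with the standard Serviceguard commands:

# cmrunpkg -n node2 cw_db_pkg
# cmmodpkg -e cw_db_pkg

cmrunpkg starts the package instance on the specified node, and cmmodpkg -e re-enables package switching afterwards.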
When the node on which the Site Controller package is running is restarted, the Site Controller package fails over to the next available adoptive node. Based on the site of the adoptive node on which the Site Controller package starts and the status of the active complex-workload packages, the Site Controller package performs a site failover, if necessary.
Network partitions across sites
A network partition across sites is similar to a site failure. The Serviceguard cluster nodes on both sites detect the failure and attempt to re-form the cluster using the Quorum Server. The nodes from only one of the sites receive the quorum and form the cluster. The nodes on the other site restart, which causes the active complex-workload packages running on them to fail.
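Quorum Server arbitration is defined by the quorum parameters in the cluster configuration file. The following sketch shows only those parameters; the host name and timing values (specified in microseconds) are examples and must be adjusted for your environment:

QS_HOST                 qs1.example.com
QS_POLLING_INTERVAL     300000000
QS_TIMEOUT_EXTENSION    2000000

After editing the cluster configuration file, verify and apply it with cmcheckconf -C and cmapplyconf -C.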
The Site Controller package that was running on the site that failed to form the cluster now fails over to the adoptive node on the site where the cluster is re-formed. When the Site Controller package starts on the adoptive node at the remote site, it detects that the active complex-workload packages have failed. Consequently, the Site Controller package performs a site failover and starts the corresponding complex-workload packages on the site where the cluster has re-formed.
Disk array and SAN failure
When a disk array or the host access SAN at a site fails, the active complex-workload database running on that site can hang or fail, depending on which component has failed. If the SAN failure causes the complex-workload database processes to fail, and consequently the complex-workload packages also fail, the Site Controller package initiates a site failover.
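When diagnosing an array or SAN failure, the Remote Copy links and groups and the physical disk state can be examined from the HP 3PAR CLI, for example:

cli% showrcopy links
cli% showrcopy groups
cli% showpd -failed -degraded

The first two commands report the state of the Remote Copy links and volume groups; the last lists failed or degraded physical disks. Command options can vary with the HP 3PAR OS version.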
Replication link failure
A failure of a replication link between sites stalls replication from the active complex-workload package configuration to the remote site. The impact of a replication link failure on the running complex-workload packages depends on the configured replication mode.
In synchronous replication mode with the fence level set to Data, the primary site disk array starts failing I/Os. This causes the active complex-workload configuration to fail. The Site Controller package then performs a site failover, if a complex-workload package is configured as a critical_package.
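The complex-workload packages that must trigger a site failover are identified in the Site Controller package configuration with the critical_package attribute; other workload packages are listed as managed packages. The following sketch shows only those lines, uses placeholder package names, and assumes the managed_package attribute name for the non-critical workload packages:

critical_package    cw_db_pkg
managed_package     cw_app_pkg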
If the fence level is set to Never, I/Os on the PVOL side are not failed, and the active complex workload continues to run successfully.
In asynchronous periodic replication mode, the complex-workload configuration is not affected and continues to run uninterrupted.
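The replication mode of a Remote Copy volume group is selected when the group is created on the 3PAR array. As a rough sketch with placeholder group and target names (the exact syntax can vary with the HP 3PAR OS version), a synchronous and a periodic group might be created as follows:

cli% creatercopygroup cw_sync_group target_siteB:sync
cli% creatercopygroup cw_periodic_group target_siteB:periodic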