
the storage on a site, it sets the Site Safety Latch to a transient state, which is displayed as
INTERMEDIATE. When the Site Safety Latch is in the INTERMEDIATE state, the corresponding Site
Controller package can be restarted only after cleaning the site where it previously failed to start.
For more information on cleaning the Site Controller package, see “Cleaning the site to restart the
Site Controller package” (page 61).
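Before cleaning the site, you can review the status of the Site Controller package and of the workload packages it manages with the standard Serviceguard cmviewcl command. The package name sitectl_pkg below is only a placeholder for your Site Controller package name:

# cmviewcl -v -p sitectl_pkg
# cmviewcl -v

The first command shows the detailed status of the Site Controller package; the second shows the overall cluster, node, and package status across both sites.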
Node failure and rejoining the cluster
When a node in a cluster fails, all multi-node package (MNP) instances running on the failed node also fail. Failover packages fail over to the next available adoptive node. If no other adoptive node is configured and available in the cluster, the failover package fails and is halted.
When a node in the Metrocluster environment is restarted, the active complex-workload packages on the node are halted before the node restarts. After the node restarts and rejoins the cluster, the active complex-workload package instances on the site that have the auto_run flag set to yes start automatically. If the complex-workload packages have the auto_run flag set to no, these instances must be started manually on the restarted node.
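For example, assuming a complex-workload MNP named cw_db_pkg (a placeholder name) with auto_run set to no, its instance can be started manually on the restarted node, here node2, with the standard Serviceguard commands:

# cmrunpkg -n node2 cw_db_pkg
# cmmodpkg -e cw_db_pkg

cmrunpkg starts the package instance on the specified node, and cmmodpkg -e re-enables package switching afterwards.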
When the node on which the Site Controller package is running is restarted, the Site Controller package fails over to the next available adoptive node. Based on the site of the adoptive node on which the Site Controller package starts and the status of the active complex-workload packages, the Site Controller package performs a site failover, if necessary.
Network partitions across sites
A network partition across sites is similar to a site failure. The Serviceguard cluster nodes on both sites detect the failure and attempt to re-form the cluster using the Quorum Server. The nodes from only one of the sites receive the quorum and form the cluster. The nodes on the other site restart, which causes the active complex-workload packages running on them to fail.
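Quorum Server arbitration is defined by the quorum parameters in the cluster configuration file. The following sketch shows only those parameters; the host name and timing values (specified in microseconds) are examples and must be adjusted for your environment:

QS_HOST                 qs1.example.com
QS_POLLING_INTERVAL     300000000
QS_TIMEOUT_EXTENSION    2000000

After editing the cluster configuration file, verify and apply it with cmcheckconf -C and cmapplyconf -C.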
The Site Controller package that was running on the site that failed to form the cluster now fails over to the adoptive node on the site where the cluster is re-formed. When the Site Controller package starts on the adoptive node at the remote site, it detects that the active complex-workload packages have failed. Consequently, the Site Controller package performs a site failover and starts the corresponding complex-workload packages on the site where the cluster has re-formed.
Disk array and SAN failure
When a disk array or the host access SAN at a site fails, the active complex-workload database running on that site can hang or fail, depending on which component has failed. If the SAN failure causes the complex-workload database processes to fail, and consequently the complex-workload packages also fail, the Site Controller package initiates a site failover.
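When diagnosing an array or SAN failure, the Remote Copy links and groups and the physical disk state can be examined from the HP 3PAR CLI, for example:

cli% showrcopy links
cli% showrcopy groups
cli% showpd -failed -degraded

The first two commands report the state of the Remote Copy links and volume groups; the last lists failed or degraded physical disks. Command options can vary with the HP 3PAR OS version.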
Replication link failure
A failure of a replication link between sites stalls replication from the active complex-workload package configuration to the remote site. The impact of a replication link failure on the running complex-workload packages depends on the configured replication mode.
In synchronous replication mode with the fence level set to Data, the primary site disk array starts failing I/Os. This causes the active complex-workload configuration to fail. The Site Controller package then performs a site failover, if a complex-workload package is configured as a critical_package.
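The complex-workload packages that must trigger a site failover are identified in the Site Controller package configuration with the critical_package attribute; other workload packages are listed as managed packages. The following sketch shows only those lines, uses placeholder package names, and assumes the managed_package attribute name for the non-critical workload packages:

critical_package    cw_db_pkg
managed_package     cw_app_pkg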
If the fence level is set to Never, I/Os on the PVOL side are not failed, and the active complex workload continues to run successfully.
In asynchronous periodic replication mode, the complex-workload configuration is not affected and continues to run uninterrupted.
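The replication mode of a Remote Copy volume group is selected when the group is created on the 3PAR array. As a rough sketch with placeholder group and target names (the exact syntax can vary with the HP 3PAR OS version), a synchronous and a periodic group might be created as follows:

cli% creatercopygroup cw_sync_group target_siteB:sync
cli% creatercopygroup cw_periodic_group target_siteB:periodic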