Building Disaster Recovery Serviceguard Solutions Using Metrocluster with Continuous Access EVA A.05.01

When the node on which the Site Controller package is running is restarted, the Site Controller
package fails over to the next available adoptive node. Depending on the site of the adoptive node
on which the Site Controller package starts and on the status of the active complex-workload
packages, the Site Controller package performs a site failover, if necessary.
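The decision described above can be sketched as follows. This is an illustrative model only, not the actual Serviceguard Site Controller code; the function and parameter names (`on_site_controller_start`, `adoptive_site`, `active_site`, `active_packages_up`) are assumptions introduced for this example.

```python
# Illustrative sketch (assumed names, not actual Serviceguard code) of the
# decision the Site Controller package makes when it starts on an adoptive
# node after its original node is restarted.

def on_site_controller_start(adoptive_site, active_site, active_packages_up):
    """Decide what to do after the Site Controller package starts on an
    adoptive node."""
    if adoptive_site == active_site and active_packages_up:
        # The active complex-workload packages are intact on this site:
        # resume monitoring without disruption.
        return "monitor"
    if not active_packages_up:
        # The active complex-workload packages have failed: start the
        # corresponding packages on the adoptive node's site.
        return "site_failover"
    # The active stack is still running on the other site: do not disrupt it.
    return "no_action"
```

For example, starting on the remote site while the active packages have failed yields a site failover, while starting on the active site with healthy packages simply resumes monitoring.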
Network partitions across sites
A network partition across sites is similar to a site failure. The Serviceguard cluster nodes on both
sites detect the failure and try to re-form the cluster using the Quorum Server. The nodes on only
one of the sites receive the quorum and form the cluster. The nodes on the other site restart,
deliberately failing the active complex-workload packages running on them.
The Site Controller package running on the nodes of the site that failed to form the cluster fails
over to the adoptive node on the site where the cluster is re-formed. When the Site Controller
package starts on the adoptive node at the remote site, it detects that the active complex-workload
packages have failed. Consequently, the Site Controller package performs a site failover and starts
the corresponding complex-workload packages on the site where the cluster has re-formed.
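The quorum arbitration above can be modeled as a simple tie-break: only one partition is granted the quorum, and the nodes in the losing partition restart. This is a minimal sketch under stated assumptions; `grant_quorum` and the partition dictionaries are illustrative names, and the "first partition to reach the Quorum Server wins" rule is a simplification of the real arbitration.

```python
# Illustrative model (assumed names) of Quorum Server arbitration during a
# cross-site network partition: one partition re-forms the cluster, the
# other's nodes restart and thereby fail their active packages.

def grant_quorum(partitions):
    """Return the one partition allowed to re-form the cluster.
    Simplification: the first partition to contact the Quorum Server wins."""
    winner = partitions[0]
    for p in partitions[1:]:
        # Nodes in losing partitions restart, failing the active
        # complex-workload packages running on them.
        p["action"] = "restart"
    winner["action"] = "reform_cluster"
    return winner
```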
Disk array and SAN failure
When a disk array or the host access SAN at a site fails, the active complex-workload database
running on the site might hang or fail, depending on which component has failed. If the SAN failure
causes the complex-workload database processes to fail, and consequently the complex-workload
packages also fail, the Site Controller package initiates a site failover.
Replication link failure
A failure in a replication link between sites stalls replication from the active complex-workload
package configuration to the remote site.
If failsafe mode is disabled when all Continuous Access links fail, the active complex workload
continues to run uninterrupted and writes new I/O to the source Vdisk. However, if failsafe mode
is enabled, the primary site disk array starts failing I/Os. This causes the active complex-workload
configuration to fail. The Site Controller package then performs a site failover, if a complex-workload
package is configured as a critical_package.
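The failsafe-mode behavior described above reduces to a small decision table. The sketch below is illustrative only; `on_replication_link_failure` and its parameters are assumed names, not part of the Metrocluster product.

```python
# Illustrative sketch (assumed names) of the outcome when all Continuous
# Access replication links between sites fail.

def on_replication_link_failure(failsafe_enabled, critical_package_configured):
    if not failsafe_enabled:
        # Writes continue to the source Vdisk; the workload runs
        # uninterrupted, though unreplicated, until the links recover.
        return "continue_unreplicated"
    # Failsafe mode: the primary site disk array starts failing I/Os,
    # so the active complex-workload configuration fails.
    if critical_package_configured:
        # A complex-workload package is configured as a critical_package:
        # the Site Controller package performs a site failover.
        return "site_failover"
    return "package_failed_no_failover"
```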
Site Controller package failure
The Site Controller package can fail for many reasons, such as a node crash, while the active
complex-workload package stack on the site is up and running. The Site Controller package fails
over to an adoptive node, which can be a node on the same site or a node on the remote site.
The Site Controller package behaves differently in each scenario so that complex-workload
availability is not disrupted.
NOTE: When the adoptive node is on the same site where the current active complex-workload
stack is running, this is considered a local failover for the Site Controller package.
On a Site Controller package local failover, the disaster-tolerant complex workload remains
uninterrupted on that site. From the new node, the Site Controller package continues to monitor
the managed packages or the critical packages on the site, as configured.
When the Site Controller package fails over across sites, it fails on startup if the active
complex-workload package stack is still running on the other site. The complex-workload
configuration continues to be available in the cluster. However, because the Site Controller package
has failed in the cluster, the complex-workload configuration can no longer automatically fail over
to the remote site.
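The local versus cross-site behavior can be summarized in one decision function. As before, this is an illustrative sketch; `on_site_controller_failover` and its parameters are names assumed for this example, not actual Serviceguard interfaces.

```python
# Illustrative sketch (assumed names) of Site Controller package behavior
# when it fails over while the active complex-workload stack state varies.

def on_site_controller_failover(adoptive_site, active_site, active_stack_up):
    if adoptive_site == active_site:
        # Local failover: the complex workload is undisturbed, and the
        # Site Controller package resumes monitoring from the new node.
        return "resume_monitoring"
    if active_stack_up:
        # Cross-site failover while the active stack still runs on the
        # other site: the Site Controller package fails rather than
        # disrupt the running workload. Automatic site failover is then
        # unavailable until the Site Controller package is restarted.
        return "site_controller_fails"
    # Cross-site failover with the active stack down: perform a site
    # failover on the adoptive node's site.
    return "site_failover"
```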
58 Understanding failover/failback scenarios