Building Disaster Recovery Serviceguard Solutions Using Metrocluster with 3PAR Remote Copy

Site failover
When the Site Controller package determines that the running package configuration of a disaster
tolerant complex workload has failed in the Metrocluster, or that the site hosting it has failed, the
Site Controller package fails over to a node at the remote site and initiates a site failover from that
node. The site failover starts the adoptive complex-workload package configuration by starting the
packages configured on the remote site.
The Site Controller package monitors the active complex-workload packages according to the
configuration, to detect a failure and initiate a site failover. When the complex-workload packages
are configured using the critical_package attribute, the Site Controller package detects the failure
and initiates a site failover even if only one of the critical packages fails. In a configuration where all the
packages in the complex workload are configured with the managed_package attribute, the Site
Controller package detects a failure and initiates site failover based on the cumulative status of all
the configured managed packages.
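The two monitoring modes described above can be sketched as follows. This is only an illustrative model, not Site Controller code: the MonitoredPackage class and the site_failover_needed function are hypothetical names, and the sketch assumes that "cumulative status" means every managed package has failed.

```python
from dataclasses import dataclass


@dataclass
class MonitoredPackage:
    name: str
    attribute: str  # "critical_package" or "managed_package"
    failed: bool


def site_failover_needed(packages):
    """Decide whether a site failover should be initiated (illustrative only)."""
    critical = [p for p in packages if p.attribute == "critical_package"]
    if critical:
        # A single failed critical package is enough to trigger a site failover.
        return any(p.failed for p in critical)
    managed = [p for p in packages if p.attribute == "managed_package"]
    # Assumption: the cumulative status is "failed" only when all managed
    # packages have failed.
    return bool(managed) and all(p.failed for p in managed)
```

For example, two managed packages of which only one has failed would not trigger a failover in this model, whereas a single failed critical package would.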
A complex-workload package that has failed or been halted displays a halted status in addition to
a down state. A special flag, package_halted, distinguishes the two cases: it is set to no when the
complex-workload package is down because it failed in the cluster, and to yes when the package is
down because it was manually halted. Serviceguard sets this flag to no only when the last surviving
instance of the complex-workload package is halted as a result of a failure. The flag is set to yes if
the last surviving instance is manually halted, even if other instances were halted earlier due to
failures.
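The rule for a multi-instance package can be illustrated with a small model. The function name and the "failure"/"manual" event labels are hypothetical, chosen only for this sketch; Serviceguard maintains the actual flag internally.

```python
def package_halted_flag(halt_causes):
    """Return the package_halted value for a package whose instances are all down.

    halt_causes lists, in the order the instances went down, why each one
    halted: "failure" or "manual". The last entry is the last surviving instance.
    """
    if not halt_causes:
        raise ValueError("the package has no halted instances")
    # Only the cause of the last surviving instance's halt decides the flag.
    return "yes" if halt_causes[-1] == "manual" else "no"
```

In this model, a package whose first instance failed and whose last instance was manually halted still reports yes, so it is not treated as a failure.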
The Site Controller package determines a failure by checking whether the package_halted flag is
set to no for all monitored packages that are in the down state. When the monitored packages have
failed rather than been manually halted, the Site Controller package fails over to a remote site node
to perform a site failover.
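A minimal sketch of this check, assuming package status is available as simple records with state and package_halted fields. The dictionary layout and the function name are assumptions made for illustration, as is the choice to report no failure when no monitored package is down.

```python
def failure_detected(monitored_packages):
    """Return True when every down package failed rather than being manually halted."""
    down = [p for p in monitored_packages if p["state"] == "down"]
    if not down:
        # Assumption: with no monitored package down, there is nothing to fail over.
        return False
    return all(p["package_halted"] == "no" for p in down)
```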
Before starting the complex-workload packages configured at the remote site, the Site Controller
package ensures that it is safe to do so. The failed complex-workload packages might not have
halted cleanly, leaving stray processes and resources. In such scenarios, it is not safe to start the
identical complex workload configuration on the remote site. As a result, when it starts on the
remote site node, the Site Controller package checks whether all instances of the failed active
packages have halted cleanly. The Site Controller package checks the last_halt_failed flag
for each instance of the workload packages. The flag is set to yes for an instance whose halt script
execution resulted in an error. If even one instance of any of the failed workload's packages did
not halt successfully, the Site Controller package aborts the site failover. In these circumstances, the
Site Controller package halts and its state is displayed as failed on the remote site node. Before the
Site Controller package and the complex workload configuration can be restarted, the nodes on
the site must be cleaned up manually.
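The clean-halt check can be sketched as follows. The record layout and the function name are hypothetical; only the last_halt_failed flag and its yes/no values come from the text above.

```python
def unclean_instances(instances):
    """Return the instances whose halt script reported an error.

    The site failover is safe to continue only when the returned list is empty.
    """
    return [i for i in instances if i["last_halt_failed"] == "yes"]


# Example: one instance on node2 did not halt cleanly, so the failover would be aborted.
instances = [
    {"package": "db_pkg", "node": "node1", "last_halt_failed": "no"},
    {"package": "db_pkg", "node": "node2", "last_halt_failed": "yes"},
]
blocked = unclean_instances(instances)
abort_failover = bool(blocked)
```

Returning the offending instances, rather than a bare yes/no answer, shows the administrator which nodes need manual cleanup.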
After ensuring a clean halt for all instances of the failed complex-workload packages, the Site
Controller package performs the following steps to activate the corresponding passive complex
workload configured in its current site:
1. Closes the Site Safety Latch for the failed complex-workload package nodes.
2. Waits for all packages configured as part of the failed complex workload to halt
successfully.
3. Deports the CVM disk groups used by the database on the failed site.
4. Prepares the replicated data storage on the current site using the Metrocluster environment
file on the node where it is starting.
5. Imports the CVM disk groups used by the database in the current site.
6. Opens the Site Safety Latch in the current site.
7. Starts the complex-workload packages configured for the database in the current site.
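The ordering of the steps above can be sketched as a simple sequential runner. The step names and the run_site_failover function are hypothetical, and stopping at the first failing step is an assumption of this sketch; the actual Site Controller behavior on a failed step is not modeled here.

```python
# Hypothetical step names mirroring the numbered list above, in order.
FAILOVER_STEPS = [
    "close_site_safety_latch_on_failed_site",
    "wait_for_failed_workload_packages_to_halt",
    "deport_cvm_disk_groups_on_failed_site",
    "prepare_replicated_storage_on_current_site",
    "import_cvm_disk_groups_on_current_site",
    "open_site_safety_latch_on_current_site",
    "start_workload_packages_on_current_site",
]


def run_site_failover(actions):
    """Run each step in order; actions maps a step name to a callable returning bool.

    Returns (completed_steps, failed_step), where failed_step is None on success.
    """
    completed = []
    for step in FAILOVER_STEPS:
        if not actions[step]():
            # Assumption: later steps are not attempted once a step fails.
            return completed, step
        completed.append(step)
    return completed, None
```

The latch is opened only after the storage is prepared and the disk groups are imported, which is why the order of the list matters in this sketch.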
For the Site Controller package to successfully start the remote complex-workload package
configuration, the packages in the remote configuration must have node switching enabled on
their configured nodes. When the Site Controller package fails to start after successfully preparing