Designing Disaster Recovery Clusters using Metroclusters and Continentalclusters, Reprinted October 2011 (5900-1881)

the data transfer to the SVOL depends on the bandwidth of the Continuous Access links and
amount of outstanding data in the PVOL journal volume.
3. Failback - When systems recover from previous failures, a package can be failed back within
the Metrocluster data centers by manually issuing a Serviceguard command. When a package
failback is triggered, the software must ensure application data integrity: if necessary, it executes
storage preparation actions on the P9000 or XP array before the package starts up.
NOTE: Do not use Serviceguard to fail back from DC3. You must take manual steps to replicate
data back from DC3. See “Failback Scenarios” (page 348).
In a three data center configuration, whenever a package tries to start a RAID Manager instance
on a host, that host communicates with the RAID Manager instances in the other data centers. If
the RAID Manager instances in the other data centers are down, the host waits for the timeout
value configured in the RAID Manager configuration file for each data center. To reduce package
startup time in this scenario, set the instance timeout value under the HORCM_MON section of the
RAID Manager instance configuration file to a low, but safe, value.
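As an illustration, the HORCM_MON section of a RAID Manager instance configuration file
(horcm*.conf) might look like the following. The hostname, service name, and timing values here
are examples only; note that poll and timeout are specified in units of 10 ms, so a value of 3000
corresponds to 30 seconds:

```
HORCM_MON
#ip_address          service   poll(10ms)   timeout(10ms)
node1.example.com    horcm0    1000         3000
```

Choose a timeout long enough to tolerate normal inter-site latency, but short enough that a down
remote instance does not unduly delay package startup.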
Bandwidth for Continuous Access and Application Recovery Time
When a disaster event in the entire Metrocluster causes an application package to be manually
failed over to the recovery site (the third data center), the Continentalclusters and storage software
perform the following actions:
•   Performs a takeover by issuing a command to the third data center P9000 or XP array via RAID
    Manager. This changes the P9000 or XP disk devices that are used by the application from
    Read Only to Read/Write mode. If the PVOL site P9000 or XP array is still up, it flushes all
    of the outstanding data in its journal volumes to the local (recovery site) P9000 or XP array
    as part of the takeover. Depending on the bandwidth of the Continuous Access links and the
    amount of outstanding data, the takeover operation may take some time. This time value is
    referred to as TakeOverTime.
•   Activates the volume group(s). The time for this is minimal; normally within a few seconds per
    volume group.
•   Checks and mounts any file systems, if file systems are used. If Continuous Access data replication
    has not failed, checking the file system should not take much time. If Continuous Access data
    replication did fail, additional time is required to repair any file systems. This time value is
    referred to as CheckandRepairTime.
•   Adds any package IP addresses. The time for this is minimal; normally within a second.
•   Starts the package application(s). If the application requires a database recovery, it may take
    time before the application(s) is finally up and running. This time value is referred to as
    AppRecoveryTime.
The total application recovery time is equal to TakeOverTime + CheckandRepairTime +
AppRecoveryTime.
During the planning phase for the cluster, the sizing of the link bandwidth for Continuous Access
should take the time value for TakeOverTime into consideration.
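As a rough planning aid, the worst-case TakeOverTime can be estimated from the journal volume
capacity and the effective Continuous Access link bandwidth. The sketch below is a simplified
back-of-the-envelope model, not a formula from this guide; the journal size, link speed, and
efficiency factor are illustrative assumptions that must be replaced with your own measured values:

```python
# Rough estimate of worst-case TakeOverTime: the time to flush a full set
# of journal volumes over the Continuous Access link. All numbers here are
# illustrative assumptions, not values from this guide.

def takeover_time_seconds(journal_gb: float, link_mbit_per_s: float,
                          efficiency: float = 0.7) -> float:
    """Estimate seconds to drain journal_gb of outstanding data.

    efficiency is an assumed factor for protocol overhead and link
    contention; tune it from your own measurements.
    """
    journal_bits = journal_gb * 8 * 1000**3            # decimal GB -> bits
    effective_bps = link_mbit_per_s * 1000**2 * efficiency
    return journal_bits / effective_bps

# Example: 50 GB of outstanding journal data over a 622 Mbit/s (OC-12) link
# takes roughly 15 minutes to drain at 70% link efficiency.
t = takeover_time_seconds(50, 622)
```

Estimates like this help size the link during planning, but only the implementation-phase
measurement described below should be used to configure timeout values.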
During the implementation phase for the cluster, tests should be executed to measure TakeOverTime:
the total time it takes to fail over, including flushing a full set of journal volumes from the PVOL
site P9000 or XP array to the SVOL site P9000 or XP array. The HORCTIMEOUT environment
variable in the package's environment file should be configured to be greater than or equal to this
measured time value. The RAID Manager takeover command uses the HORCTIMEOUT value to
determine the maximum amount of time to allow for the takeover to complete.
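For example, the corresponding line in the package's environment file might look like the following;
the value of 900 seconds is purely illustrative and must be derived from your own measured
worst-case takeover time:

```
# Maximum time (in seconds) allowed for the RAID Manager takeover to
# complete; set this >= the measured worst-case TakeOverTime.
HORCTIMEOUT=900
```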