Designing Disaster Recovery Clusters using Metroclusters and Continentalclusters, Reprinted October 2011 (5900-1881)

the data transfer to the SVOL depends on the bandwidth of the Continuous Access links and
amount of outstanding data in the PVOL journal volume.
3. Failback - When systems recover from previous failures, a package can be failed back within
the Metrocluster data centers by manually issuing a Serviceguard command. When a package
failback is triggered, the software must ensure application data integrity: if necessary, it executes
storage preparation actions on the P9000 or XP array before the package starts up.
NOTE: Do not use Serviceguard to fail back from DC3. You must take manual steps to replicate
data back from DC3. See “Failback Scenarios” (page 348).
In a three data center configuration, whenever a package tries to start a RAID Manager instance
on a host, that host communicates with the RAID Manager instances in the other data centers. If
the RAID Manager instances in the other data centers are down, the host waits for the timeout
value configured in the RAID Manager configuration file for each data center. To reduce package
startup time in this scenario, set the instance timeout value under the HORCM_MON section of the
RAID Manager instance configuration file to a low, but safe, value.
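As an illustration, the HORCM_MON section of a RAID Manager instance configuration file
(horcm*.conf) might look like the following. The hostname, service name, and timing values here
are examples only; note that poll and timeout are specified in units of 10 ms, so a value of 3000
corresponds to 30 seconds:

```
HORCM_MON
#ip_address          service   poll(10ms)   timeout(10ms)
node1.example.com    horcm0    1000         3000
```

Choose a timeout long enough to tolerate normal inter-site latency, but short enough that a down
remote instance does not unduly delay package startup.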
Bandwidth for Continuous Access and Application Recovery Time
When a disaster event in the entire Metrocluster causes an application package to be manually
failed over to the recovery site (the third data center), the Continentalclusters and storage software
perform the following actions:
•   Performs a takeover by issuing a command to the third data center P9000 or XP array via RAID
    Manager. This changes the P9000 or XP disk devices that are used by the application from
    Read Only to Read/Write mode. If the PVOL site P9000 or XP array is still up, it flushes all
    of the outstanding data in its journal volumes to the local (recovery site) P9000 or XP array
    as part of the takeover. Depending on the bandwidth of the Continuous Access links and the
    amount of outstanding data, the takeover operation may take some time. This time value is
    referred to as TakeOverTime.
•   Activates the volume group(s). The time for this is minimal; normally within a few seconds per
    volume group.
•   Checks and mounts any file systems, if file systems are used. If Continuous Access data replication
    has not failed, checking the file system should not take much time. If Continuous Access data
    replication did fail, additional time is required to repair any file systems. This time value is
    referred to as CheckandRepairTime.
•   Adds any package IP addresses. The time for this is minimal; normally within a second.
•   Starts the package application(s). If the application requires a database recovery, it may take
    time before the application(s) is finally up and running. This time value is referred to as
    AppRecoveryTime.
The total application recovery time is equal to TakeOverTime + CheckandRepairTime +
AppRecoveryTime.
During the planning phase for the cluster, the sizing of the link bandwidth for Continuous Access
should take the time value for TakeOverTime into consideration.
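As a rough planning aid, the worst-case TakeOverTime can be estimated from the journal volume
capacity and the effective Continuous Access link bandwidth. The sketch below is a simplified
back-of-the-envelope model, not a formula from this guide; the journal size, link speed, and
efficiency factor are illustrative assumptions that must be replaced with your own measured values:

```python
# Rough estimate of worst-case TakeOverTime: the time to flush a full set
# of journal volumes over the Continuous Access link. All numbers here are
# illustrative assumptions, not values from this guide.

def takeover_time_seconds(journal_gb: float, link_mbit_per_s: float,
                          efficiency: float = 0.7) -> float:
    """Estimate seconds to drain journal_gb of outstanding data.

    efficiency is an assumed factor for protocol overhead and link
    contention; tune it from your own measurements.
    """
    journal_bits = journal_gb * 8 * 1000**3            # decimal GB -> bits
    effective_bps = link_mbit_per_s * 1000**2 * efficiency
    return journal_bits / effective_bps

# Example: 50 GB of outstanding journal data over a 622 Mbit/s (OC-12) link
# takes roughly 15 minutes to drain at 70% link efficiency.
t = takeover_time_seconds(50, 622)
```

Estimates like this help size the link during planning, but only the implementation-phase
measurement described below should be used to configure timeout values.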
During the implementation phase for the cluster, tests should be executed to measure TakeOverTime:
the total time it takes to fail over, including flushing a full set of journal volumes from the PVOL
site P9000 or XP array to the SVOL site P9000 or XP array. The HORCTIMEOUT environment
variable in the package's environment file should be configured to be greater than or equal to this
measured time value. The RAID Manager takeover command uses the HORCTIMEOUT value to
determine the maximum amount of time to allow for the takeover to complete.
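For example, the corresponding line in the package's environment file might look like the following;
the value of 900 seconds is purely illustrative and must be derived from your own measured
worst-case takeover time:

```
# Maximum time (in seconds) allowed for the RAID Manager takeover to
# complete; set this >= the measured worst-case TakeOverTime.
HORCTIMEOUT=900
```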