Understanding and Designing Serviceguard Disaster Recovery Architectures

Disaster Recovery Cluster Limitations

Disaster recovery clusters have limitations, some of which can be mitigated by good planning.

Some examples of multiple points of failure (MPOF) that may not be covered by disaster recovery configurations:

• Failure of all networks among all data centers: this can be mitigated by routing the network
cables for each link along a different physical path.

• Loss of power in more than one data center: this can be mitigated by ensuring that data
centers are on different power circuits and that redundant power supplies are on different circuits.
If power outages are frequent in your area and downtime is expensive, consider investing in a
backup generator.

• Loss of all copies of the online data: this can be mitigated by replicating data offline (frequent
backups). It can also be mitigated by taking snapshots of consistent data and storing them online.
Business Copy XP and EMC Symmetrix BCV (Business Continuance Volumes) provide this
functionality, with the additional benefit of quick recovery if both copies of the online data
are corrupted.

• Rolling disasters: a rolling disaster is a disaster that occurs before the cluster is able to
recover from a non-disastrous failure. For example, a data replication link fails; then, while the
link is being restored and data is being resynchronized, a disaster occurs that causes the entire
data center to fail. The effects of rolling disasters can be mitigated by ensuring that a copy of
the data is stored either offline or on a separate disk that can be quickly mounted. If this copy
is used while restoring the data center, the data may not be current.
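The rolling disaster case can be sketched as a small decision function. This is a minimal
illustration only, not part of Serviceguard: the names LinkState, SurvivingSite, and
choose_recovery_source are hypothetical, and the sketch assumes that a replica caught
partway through resynchronization cannot be trusted, while a replica whose link is simply
down is old but still consistent.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class LinkState(Enum):
    IN_SYNC = "in_sync"        # replica matches the primary
    LINK_DOWN = "link_down"    # replication stopped; replica is stale but unchanged
    RESYNCING = "resyncing"    # link restored; replica is partway through catch-up


@dataclass
class SurvivingSite:
    link_state: LinkState      # state of replication when the other site failed
    has_separate_copy: bool    # offline backup or quickly mountable separate disk


def choose_recovery_source(site: SurvivingSite) -> Tuple[str, bool]:
    """Return (source, is_current) for restarting after the other data center fails."""
    if site.link_state is LinkState.IN_SYNC:
        return ("replica", True)
    if site.link_state is LinkState.LINK_DOWN:
        # Assumption: a replica that stopped receiving updates is old but consistent.
        return ("replica", False)
    # RESYNCING: assume the replica may be inconsistent until resync completes,
    # so fall back to the separately stored copy if one exists.
    if site.has_separate_copy:
        return ("separate_copy", False)
    return ("none", False)


# Rolling disaster: the data center fails while resynchronization is in progress.
site = SurvivingSite(LinkState.RESYNCING, has_separate_copy=True)
print(choose_recovery_source(site))  # ('separate_copy', False)
```

The second element of the result reflects the caveat above: a separately stored copy allows
recovery, but the data it holds may not be current.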

Managing a Disaster Recovery Environment

In addition to the hardware and software changes needed to create a disaster recovery
architecture, there are also changes in the way you manage the environment. The configuration of a
disaster recovery architecture needs to be carefully planned, implemented, and maintained.
Maintaining a disaster recovery architecture requires additional resources and additional
decisions:

• Manage it in-house, or hire a service?

Hiring a service can remove the burden of maintaining the capital equipment needed to
recover from a disaster. Most disaster recovery services provide their own off-site equipment,
which reduces maintenance costs. Often the disaster recovery site and equipment are shared
by many companies, further reducing cost.

Managing disaster recovery in-house gives complete control over the type of redundant
equipment used and the methods used to recover from disaster.

• Implement automated or manual recovery?

Implementing manual recovery methods is more cost-effective and gives more flexibility
in making decisions while recovering from a disaster. Evaluating the data and making decisions
can add to recovery time, but this is justified in some situations, for example, if applications
compete for resources following a disaster and one of them has to be halted.

Automated recovery reduces the time needed to recover from a disaster and, in most cases,
eliminates the need for human intervention. You may want to automate recovery for the following
reasons:

◦ Automated recovery is usually faster.

◦ Staff may not be available for manual recovery, as is the case with “lights-out” data
centers.

◦ Reducing human intervention also reduces human error. Disasters do not occur
frequently, so lack of practice may increase the potential for human error. The skyrocketing
