Understanding and Designing Serviceguard Disaster Recovery Architectures

Disaster Recovery Cluster Limitations
Disaster recovery clusters have limitations, some of which can be mitigated by good planning.
Some examples of multiple points of failure (MPOF) that may not be covered by disaster
recovery configurations:
Failure of all networks among all data centers: This can be mitigated by routing each
network cable along a physically different path.
Loss of power in more than one data center: This can be mitigated by ensuring that data
centers are on different power circuits, and that redundant power supplies are on different
circuits. If power outages are frequent in your area, and downtime is expensive, you can
invest in a backup generator.
Loss of all copies of the online data: This can be mitigated by replicating data offline
(frequent backups). It can also be mitigated by taking snapshots of consistent data and
storing them online; Business Copy XP and EMC Symmetrix BCV (Business Continuance Volumes)
provide this functionality, with the additional benefit of quick recovery if both copies of
the online data are corrupted.
A rolling disaster is a disaster that occurs before the cluster is able to recover from a
non-disastrous failure. For example, a data replication link fails; then, while the link is
being restored and the data is being resynchronized, a disaster causes the entire data
center to fail. The effects of such rolling disasters can be mitigated by ensuring that a
copy of the data is stored either offline or on a separate disk that can be quickly mounted.
If this copy is used while restoring the data center, the data may not be current.
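The rolling disaster scenario can be illustrated with a minimal sketch. It is not Serviceguard code; the names CopyState, DataCopy, and pick_recovery_copy are hypothetical, and the model assumes that a replica is unusable while it is being resynchronized. It shows why a separate consistent copy, even a stale one, is what makes recovery possible.

```python
from dataclasses import dataclass
from enum import Enum


class CopyState(Enum):
    CONSISTENT = "consistent"  # usable as-is
    RESYNCING = "resyncing"    # partially overwritten; assumed unusable
    LOST = "lost"              # data center or device destroyed


@dataclass
class DataCopy:
    name: str
    state: CopyState
    age_minutes: int  # how far behind the primary this copy is; 0 = current


def pick_recovery_copy(copies):
    """Return the most current consistent copy, or None if none survives."""
    usable = [c for c in copies if c.state is CopyState.CONSISTENT]
    return min(usable, key=lambda c: c.age_minutes, default=None)


# Rolling disaster: the replication link failed, resynchronization started,
# and then the primary data center was lost before resync finished.
without_snapshot = [
    DataCopy("primary", CopyState.LOST, 0),
    DataCopy("remote replica", CopyState.RESYNCING, 0),
]

# Same scenario, but a snapshot was kept on a separate disk before the
# resynchronization began (90 minutes is an illustrative value).
with_snapshot = without_snapshot + [
    DataCopy("snapshot", CopyState.CONSISTENT, 90),
]

no_copy = pick_recovery_copy(without_snapshot)
fallback = pick_recovery_copy(with_snapshot)
```

In this model, no_copy is None because neither the primary nor the replica is usable, while fallback is the snapshot. Its nonzero age reflects the caveat above: the recovered data may not be current.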
Managing a Disaster Recovery Environment
In addition to the changes applied in hardware and software to create a disaster recovery
architecture, there are also changes in the way you manage the environment. Configuration of a
disaster recovery architecture needs to be carefully planned, implemented, and maintained.
Maintaining a disaster recovery architecture requires additional resources and additional
decisions:
Manage it in-house, or hire a service?
Hiring a service can remove the burden of maintaining the capital equipment needed to
recover from a disaster. Most disaster recovery services provide their own off-site equipment,
which reduces maintenance costs. Often the disaster recovery site and equipment are shared
by many companies, further reducing cost.
Managing disaster recovery in-house gives complete control over the type of redundant
equipment used and the methods used to recover from disaster.
Implement automated or manual recovery?
Manual recovery methods are more cost-effective and give more flexibility
in making decisions while recovering from a disaster. Evaluating the data and making decisions
can add to recovery time, but it is justified in some situations, for example, if applications
compete for resources following a disaster and one of them has to be halted.
Automated recovery reduces recovery time and in most cases eliminates the human intervention
needed to recover from a disaster. You may want to automate recovery for the following
reasons:
Automated recovery is usually faster.
Staff may not be available for manual recovery, as is the case with “lights-out” data
centers.
Reduction in human intervention is also a reduction in human error. Disasters do not occur
frequently, so lack of practice may increase the potential for human error. The skyrocketing