
Disaster Tolerance and Recovery in an MC/ServiceGuard Cluster
Managing a Disaster Tolerant Environment

Even if recovery is automated, you may choose to, or need to, recover
from some types of disasters manually. A rolling disaster, which is a
disaster that occurs before the cluster has recovered from a previous
disaster, is an example of when you may want to switch over manually.
If the data link failed and then, while it was coming back up and
resynchronizing data, a data center failed, you would want human
intervention to make a judgment call about which site had the most
current and consistent data before failing over.

Who manages the nodes in the cluster and how are they trained?

Putting a disaster tolerant architecture in place without planning for
the people aspects is a waste of money. Training and documentation
are more complex because the cluster is in multiple data centers.
Each data center often has its own operations staff with their own
processes and ways of working. These operations people will now be
required to communicate with each other and coordinate
maintenance and failover rehearsals, as well as work together to
recover from an actual disaster. If the remote nodes are placed in a
lights-out data center, the operations staff may want to put
additional processes or monitoring software in place to maintain the
nodes in the remote location.

Rehearsals of failover scenarios are important to keep everyone
prepared. A written plan should outline what to do in each disaster
scenario, and rehearsals should be held at least once every 6 months,
ideally once every 3 months.

How is the cluster maintained?

Planned downtime and maintenance, such as backups or upgrades,
must be more carefully thought out because they may leave the
cluster vulnerable to another failure. For example, in the
MC/ServiceGuard configurations discussed in Chapter 2, nodes need
to be brought down for maintenance in pairs: one node at each site,
so that quorum calculations do not prevent automated recovery if a
disaster occurs during planned maintenance.
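
To illustrate why maintenance in pairs matters, here is a minimal
sketch of the quorum arithmetic, assuming the usual ServiceGuard rule
that a group of surviving nodes can re-form the cluster only if it
holds more than 50% of the nodes that were running, or exactly 50%
when it wins the tie-breaker (cluster lock disk or quorum server).
The function name and node counts are illustrative only, not part of
any product interface.

    # Illustrative quorum arithmetic for a four-node cluster split
    # across two data centers (two nodes at each site).
    def can_reform(surviving, running_before, has_tie_breaker=True):
        """True if the surviving nodes can re-form the cluster."""
        share = surviving / running_before
        if share > 0.5:
            return True
        # Exactly 50% requires winning the tie-breaker (cluster lock
        # disk or quorum server).
        return share == 0.5 and has_tie_breaker

    # Maintenance in pairs: one node down at EACH site, two running.
    # Site B then fails; Site A holds 1 of the 2 running nodes (50%).
    print(can_reform(surviving=1, running_before=2))   # True

    # Maintenance at one site only: one node down, three running.
    # Site B (two nodes) then fails; Site A holds 1 of 3 (under 50%).
    print(can_reform(surviving=1, running_before=3))   # False

In the second scenario the surviving node cannot obtain quorum on its
own, so recovery would require manual intervention; bringing nodes
down one per site keeps the split symmetric and leaves automated
recovery possible.
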
Rapid detection of failures and rapid repair of hardware are essential
so that the cluster is not vulnerable to additional failures.

Testing is more complex and requires personnel in each of the data
centers. Site failure testing should be added to the current cluster
testing plans.