
stress levels of the site administrator to restore the data center within a short time frame
can increase the possibility of human error in the restoration process.
Automated recovery procedures and processes can be transparent to the clients.
Even if recovery is automated, you may choose to, or need to, recover from some types of
disasters manually. A rolling disaster, which is a disaster that occurs before the cluster has
recovered from a previous disaster, is an example of a case where you may want to switch
over manually. Assume that a data link failure occurred initially; while the link was coming
back up and data was resynchronizing, a subsequent failure caused the data center to shut
down. In such cases it is easier for a person to make the judgment call about which site had
the most current and consistent data before failing over.
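The sketch below illustrates that judgment call in code. It is purely a hypothetical decision aid, not any Serviceguard interface: the SiteState fields and site names are placeholders, and the only point it makes is that a replica still in resynchronization should not be chosen automatically, even if its data appears newer.

    # Hypothetical decision aid only; not a Serviceguard API.
    # It models the operator's judgment during a rolling disaster: prefer the
    # site with the most current copy of the data that is also *consistent*.
    from dataclasses import dataclass
    from datetime import datetime

    @dataclass
    class SiteState:
        name: str
        last_consistent_sync: datetime   # when the replica was last known consistent
        resync_in_progress: bool         # a partially resynced replica may be unusable

    def preferred_site(a: SiteState, b: SiteState) -> str:
        """Suggest the site holding the most current consistent copy of the data."""
        candidates = [s for s in (a, b) if not s.resync_in_progress]
        if not candidates:
            return "neither site is safe to fail over; recover manually from backups"
        return max(candidates, key=lambda s: s.last_consistent_sync).name

    print(preferred_site(
        SiteState("datacenter_A", datetime(2024, 5, 1, 9, 0), resync_in_progress=True),
        SiteState("datacenter_B", datetime(2024, 5, 1, 8, 30), resync_in_progress=False),
    ))   # datacenter_B: its data is older, but it is consistent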
Who manages the nodes in the cluster and how are they trained?
Setting up a disaster recovery architecture without planning for the people aspects is not cost
effective. Training and documentation are more complex because the cluster is in multiple
data centers.
Each data center often has its own workforce and its own sets of rules and policies. The people
working in these data centers must communicate and coordinate with each other to manage
failovers and to recover from an actual disaster. If the remote nodes are placed in a
“lights-out” data center, the operations staff may need to add processes or monitoring
software to maintain the nodes in the remote location.
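As one illustration of such monitoring, the following sketch polls remote nodes for basic reachability and raises an alert for the operations staff. It is not Serviceguard software; the host names are hypothetical, the alert is only a printed message, and a real deployment would use whatever monitoring and paging tools the site already runs.

    # Illustrative reachability check for nodes in a remote "lights-out" site.
    # Host names are hypothetical; this is not part of Serviceguard.
    import subprocess

    REMOTE_NODES = ["dr-node1.example.com", "dr-node2.example.com"]

    def is_reachable(host: str) -> bool:
        """Send a single ICMP echo request and report whether the host answered."""
        result = subprocess.run(
            ["ping", "-c", "1", "-W", "2", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return result.returncode == 0

    def check_remote_site() -> None:
        for node in REMOTE_NODES:
            if not is_reachable(node):
                # In practice this would page the operations staff or open a ticket.
                print(f"ALERT: {node} is not responding")

    if __name__ == "__main__":
        check_remote_site()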
Rehearsing failover scenarios is important so that staff remain alert and prepared for an
unforeseen disaster. A written plan must outline what to do in each disaster scenario, and
rehearsals must be scheduled at least once every six months, ideally once every three months.
How is the cluster maintained?
Planned downtime for maintenance, such as backups or upgrades, must be scheduled carefully
because it may leave the cluster vulnerable to another failure. For example, in the Serviceguard
configurations discussed in Table 1, nodes must be brought down for maintenance in pairs,
one node at each site, so that quorum calculations do not prevent automated recovery if a
disaster occurs during planned maintenance.
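The following sketch shows why taking nodes down in pairs matters. It is a simplified model, assuming the usual Serviceguard re-formation rule that a surviving group needs more than 50% of the previously running nodes, and that exactly 50% can win only through an arbitration device such as a cluster lock or quorum server; it is not an actual Serviceguard interface.

    # Simplified model of cluster re-formation quorum (not a Serviceguard API).
    def can_reform(surviving: int, previously_running: int, has_tie_breaker: bool) -> bool:
        """Return True if the surviving group of nodes can re-form the cluster."""
        if surviving * 2 > previously_running:      # strict majority always wins
            return True
        if surviving * 2 == previously_running:     # exactly 50% needs arbitration
            return has_tie_breaker
        return False

    # Four-node cluster, two nodes per site, arbitration device configured.
    # Maintenance in pairs: one node down at each site leaves 2 running (1 per site).
    # If one site then fails, the other site holds 1 of 2 running nodes (50%)
    # and can still re-form through the tie-breaker.
    print(can_reform(surviving=1, previously_running=2, has_tie_breaker=True))   # True

    # Asymmetric maintenance: one node down at a single site leaves 3 running
    # (2 at site A, 1 at site B). If site A then fails, site B holds only
    # 1 of 3 running nodes (33%), so automated recovery is blocked.
    print(can_reform(surviving=1, previously_running=3, has_tie_breaker=True))   # False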
Rapid detection of failures and rapid repair of hardware are essential so that the cluster is not
vulnerable to additional failures.
Testing is more complex and requires dedicated personnel in each of the data centers. Site
failure testing must be added to the existing cluster testing plans.