
stress levels of the site administrator to restore the data center within a short time frame
can increase the possibility of human error in the restoration process.
Automated recovery procedures and processes can be transparent to the clients.
Even if recovery is automated, you may choose to, or need to, recover from some types of
disasters manually. A rolling disaster, which is a disaster that occurs before the cluster has
recovered from a previous disaster, is an example of a case where you may want to switch
over manually. Assume that a data link failure occurred initially; while the link was coming
back up and data was resynchronizing, a subsequent failure caused the data center to shut
down. In such cases it is easier for a person to make the judgment call about which site had
the most current and consistent data before failing over.
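The sketch below illustrates that judgment call in code. It is purely a hypothetical decision aid, not any Serviceguard interface: the SiteState fields and site names are placeholders, and the only point it makes is that a replica still in resynchronization should not be chosen automatically, even if its data appears newer.

    # Hypothetical decision aid only; not a Serviceguard API.
    # It models the operator's judgment during a rolling disaster: prefer the
    # site with the most current copy of the data that is also *consistent*.
    from dataclasses import dataclass
    from datetime import datetime

    @dataclass
    class SiteState:
        name: str
        last_consistent_sync: datetime   # when the replica was last known consistent
        resync_in_progress: bool         # a partially resynced replica may be unusable

    def preferred_site(a: SiteState, b: SiteState) -> str:
        """Suggest the site holding the most current consistent copy of the data."""
        candidates = [s for s in (a, b) if not s.resync_in_progress]
        if not candidates:
            return "neither site is safe to fail over; recover manually from backups"
        return max(candidates, key=lambda s: s.last_consistent_sync).name

    print(preferred_site(
        SiteState("datacenter_A", datetime(2024, 5, 1, 9, 0), resync_in_progress=True),
        SiteState("datacenter_B", datetime(2024, 5, 1, 8, 30), resync_in_progress=False),
    ))   # datacenter_B: its data is older, but it is consistent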
Who manages the nodes in the cluster and how are they trained?
Setting up a disaster recovery architecture without planning for the people aspects is not cost
effective. Training and documentation are more complex because the cluster is in multiple
data centers.
Each data center often has its own workforce and its own sets of rules and policies. The people
working in these data centers must communicate and coordinate with each other to manage
failovers and to recover from an actual disaster. If the remote nodes are placed in a
“lights-out” data center, the operations staff may need to add processes or monitoring
software to maintain the nodes in the remote location.
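As one illustration of such monitoring, the following sketch polls remote nodes for basic reachability and raises an alert for the operations staff. It is not Serviceguard software; the host names are hypothetical, the alert is only a printed message, and a real deployment would use whatever monitoring and paging tools the site already runs.

    # Illustrative reachability check for nodes in a remote "lights-out" site.
    # Host names are hypothetical; this is not part of Serviceguard.
    import subprocess

    REMOTE_NODES = ["dr-node1.example.com", "dr-node2.example.com"]

    def is_reachable(host: str) -> bool:
        """Send a single ICMP echo request and report whether the host answered."""
        result = subprocess.run(
            ["ping", "-c", "1", "-W", "2", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return result.returncode == 0

    def check_remote_site() -> None:
        for node in REMOTE_NODES:
            if not is_reachable(node):
                # In practice this would page the operations staff or open a ticket.
                print(f"ALERT: {node} is not responding")

    if __name__ == "__main__":
        check_remote_site()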
Rehearsing failover scenarios is important so that staff remain alert and prepared for an
unforeseen disaster. A written plan must outline what to do in each disaster scenario, and
rehearsals must be scheduled at least once every six months, ideally once every three months.
How is the cluster maintained?
Planned downtime for maintenance, such as backups or upgrades, must be scheduled carefully
because it may leave the cluster vulnerable to another failure. For example, in the Serviceguard
configurations discussed in Table 1, nodes must be brought down for maintenance in pairs,
one node at each site, so that quorum calculations do not prevent automated recovery if a
disaster occurs during planned maintenance.
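The following sketch shows why taking nodes down in pairs matters. It is a simplified model, assuming the usual Serviceguard re-formation rule that a surviving group needs more than 50% of the previously running nodes, and that exactly 50% can win only through an arbitration device such as a cluster lock or quorum server; it is not an actual Serviceguard interface.

    # Simplified model of cluster re-formation quorum (not a Serviceguard API).
    def can_reform(surviving: int, previously_running: int, has_tie_breaker: bool) -> bool:
        """Return True if the surviving group of nodes can re-form the cluster."""
        if surviving * 2 > previously_running:      # strict majority always wins
            return True
        if surviving * 2 == previously_running:     # exactly 50% needs arbitration
            return has_tie_breaker
        return False

    # Four-node cluster, two nodes per site, arbitration device configured.
    # Maintenance in pairs: one node down at each site leaves 2 running (1 per site).
    # If one site then fails, the other site holds 1 of 2 running nodes (50%)
    # and can still re-form through the tie-breaker.
    print(can_reform(surviving=1, previously_running=2, has_tie_breaker=True))   # True

    # Asymmetric maintenance: one node down at a single site leaves 3 running
    # (2 at site A, 1 at site B). If site A then fails, site B holds only
    # 1 of 3 running nodes (33%), so automated recovery is blocked.
    print(can_reform(surviving=1, previously_running=3, has_tie_breaker=True))   # False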
Rapid detection of failures and rapid repair of hardware are essential so that the cluster is not
vulnerable to additional failures.
Testing is more complex and requires dedicated personnel in each of the data centers. Site
failure testing must be added to the existing cluster testing plans.