Designing Disaster Tolerant High Availability Clusters, 10th Edition, March 2003 (B7660-90013)

Building a Continental Cluster

Understanding Continental Cluster Concepts

The expected process in dealing with alerts is to continue watching for

additional notifications and to contact individuals at the site of the

monitored cluster to see whether problems exist.

Alarms

Alarms are intended to indicate that a cluster failure might have taken

place. The most common example of an alarm is the following:

• Notification that a cluster has been in an unreachable state for a

length of time that you specify.

The expected process in dealing with cluster events that persist at the

alarm level is to obtain as much information as possible, including

authorization to recover, if your business practices require this, and then

to issue the recovery command.

Creating Notifications for Failure Events

For events that might indicate cluster failure, you can show the

escalation of your concern over cluster health by defining alerts followed

by one or more alarms. A typical sequence is to issue cluster alerts at 5

minutes and 10 minutes, followed by a cluster alarm at 15 minutes. This

could be accomplished by entering two CLUSTER_ALERT lines in the

configuration file, and one CLUSTER_ALARM line. A detailed example is

provided in the comments in the ASCII configuration file template,

shown in “Editing Section 3—Monitoring Definitions” on page 226.
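
The escalation sequence described above can be modeled in a few lines of Python. This is only an illustrative sketch of the timing logic, not the product's configuration syntax or its monitor implementation. The names Notification, SCHEDULE, due_notifications, and highest_level are hypothetical; only the CLUSTER_ALERT and CLUSTER_ALARM keywords and the 5, 10, and 15 minute thresholds come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Notification:
    """One notification threshold (hypothetical model, not product syntax)."""
    level: str          # "CLUSTER_ALERT" or "CLUSTER_ALARM"
    after_minutes: int  # minutes the cluster must remain unreachable


# The typical sequence from the text: alerts at 5 and 10 minutes,
# followed by an alarm at 15 minutes.
SCHEDULE = [
    Notification("CLUSTER_ALERT", 5),
    Notification("CLUSTER_ALERT", 10),
    Notification("CLUSTER_ALARM", 15),
]


def due_notifications(schedule, minutes_unreachable):
    """Return every notification whose threshold has been reached."""
    return [n for n in schedule if n.after_minutes <= minutes_unreachable]


def highest_level(schedule, minutes_unreachable):
    """Return the most severe level reached so far, or None if none is due."""
    levels = {n.level for n in due_notifications(schedule, minutes_unreachable)}
    if "CLUSTER_ALARM" in levels:
        return "CLUSTER_ALARM"
    if "CLUSTER_ALERT" in levels:
        return "CLUSTER_ALERT"
    return None
```

With this model, a cluster that has been unreachable for 12 minutes has reached both alert thresholds but not the alarm, so highest_level(SCHEDULE, 12) returns "CLUSTER_ALERT"; at 15 minutes it returns "CLUSTER_ALARM".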

Creating Notifications for Events that Indicate a Return of Service

For those events that indicate that the cluster is back online or that

communication with the monitor has been restored, use cluster alerts to

show the de-escalation of concern. In this case, use a CLUSTER_ALERT line

in the configuration file with a time of zero (0), so that notifications are

sent as soon as the return to service is detected.
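
The effect of a zero time value can be illustrated with a small Python sketch. It assumes a simple rule in which a notification is due once the event has persisted for at least the configured number of minutes. The function is_due and the two constants are hypothetical names for illustration, not part of the product.

```python
def is_due(delay_minutes, minutes_since_detected):
    """True once the event has persisted for at least delay_minutes."""
    return minutes_since_detected >= delay_minutes


# A return-of-service alert uses a time of zero, so it is due at the
# moment the return to service is detected.
RETURN_ALERT_DELAY = 0

# For contrast, an alert that waits 5 minutes is not yet due at detection.
FAILURE_ALERT_DELAY = 5
```

Under this rule, is_due(RETURN_ALERT_DELAY, 0) is true at the moment of detection, while is_due(FAILURE_ALERT_DELAY, 0) is false until five minutes have elapsed.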