Designing Disaster Tolerant High Availability Clusters, 10th Edition, March 2003 (B7660-90013)

Building a Continental Cluster
Understanding Continental Cluster Concepts
The expected process in dealing with alerts is to continue watching for
additional notifications and to contact individuals at the site of the
monitored cluster to see whether problems exist.
Alarms
Alarms are intended to indicate that a cluster failure might have taken
place. The most common example of an alarm is the following:
Notification that a cluster has been unreachable for a significant
length of time, which you specify.
The expected process in dealing with cluster events that persist at the
alarm level is to obtain as much information as possible, including
authorization to recover, if your business practices require this, and then
to issue the recovery command.
Creating Notifications for Failure Events
For events that might indicate cluster failure, you can show the
escalation of your concern over cluster health by defining alerts followed
by one or more alarms. A typical sequence is to issue a cluster alert at 5
minutes and again at 10 minutes, followed by a cluster alarm at 15 minutes.
You can accomplish this by entering two CLUSTER_ALERT lines and one
CLUSTER_ALARM line in the configuration file. A detailed example is
provided in the comments in the ASCII configuration file template,
shown in "Editing Section 3: Monitoring Definitions" on page 226.
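
The escalation sequence described above can be modeled in a few lines of
Python. This is only an illustrative sketch, not the configuration file
syntax and not how the monitor is implemented: the names Notification,
SCHEDULE, due_notifications, and highest_level are hypothetical, and the
5, 10, and 15 minute thresholds are the typical values mentioned in the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Notification:
    """One escalation step (hypothetical model, not the real file format)."""
    level: str          # "ALERT" or "ALARM"
    after_minutes: int  # minutes the cluster must have been unreachable


# Assumed stand-in for two CLUSTER_ALERT lines and one CLUSTER_ALARM line.
SCHEDULE = [
    Notification("ALERT", 5),
    Notification("ALERT", 10),
    Notification("ALARM", 15),
]


def due_notifications(minutes_unreachable: int,
                      schedule: List[Notification] = SCHEDULE) -> List[Notification]:
    """Return every step whose time threshold has been reached."""
    return [n for n in schedule if minutes_unreachable >= n.after_minutes]


def highest_level(minutes_unreachable: int,
                  schedule: List[Notification] = SCHEDULE) -> Optional[str]:
    """Return "ALARM", "ALERT", or None for the current level of concern."""
    due = due_notifications(minutes_unreachable, schedule)
    if not due:
        return None
    return "ALARM" if any(n.level == "ALARM" for n in due) else "ALERT"
```

In this model, a cluster that has been unreachable for 12 minutes has
reached both alert thresholds but not yet the alarm threshold, so the level
of concern is still "ALERT".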
Creating Notifications for Events that Indicate a Return of Service
For those events that indicate that the cluster is back online or that
communication with the monitor has been restored, use cluster alerts to
show the de-escalation of concern. In this case, use a CLUSTER_ALERT line
in the configuration file with a time of zero (0), so that notifications are
sent as soon as the return to service is detected.
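
The effect of a time of zero can be illustrated with a small Python sketch.
It assumes, for illustration only, that a notification becomes due once the
time spent in a state reaches the configured threshold; the function and
constant names are hypothetical and do not come from the product.

```python
def notification_due(minutes_in_state: int, threshold_minutes: int) -> bool:
    """True once the cluster has been in the state for the threshold time."""
    return minutes_in_state >= threshold_minutes


# Assumed example thresholds, in minutes.
UNREACHABLE_ALERT_MINUTES = 5        # escalation: wait before notifying
RETURN_TO_SERVICE_ALERT_MINUTES = 0  # de-escalation: notify immediately

# At the moment a state change is detected, zero minutes have elapsed.
just_detected = 0
unreachable_due = notification_due(just_detected, UNREACHABLE_ALERT_MINUTES)
return_due = notification_due(just_detected, RETURN_TO_SERVICE_ALERT_MINUTES)
```

With a threshold of zero, the return-of-service notification is due at the
moment of detection, whereas the unreachable alert is held back until its
5-minute threshold passes.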