Designing Disaster Recovery Clusters using Metroclusters and Continentalclusters, Reprinted October 2011 (5900-1881)

Monitoring over a Network
A monitor package running on one cluster tracks the health of another cluster in the recovery pair
and sends notifications to configured destinations if the state of the monitored cluster changes. (If
a cluster contains any packages to be recovered, it must be monitored.) The monitor software polls
the monitored cluster at the MONITOR_INTERVAL defined in an ASCII configuration file,
which also specifies when and where to send messages if there is a state change.
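The polling behavior described above can be illustrated with a minimal Python sketch. It is not the Continentalclusters monitor itself: the function monitor(), its poll and notify callbacks, and the max_polls limit are hypothetical names introduced only for illustration, and interval_seconds merely stands in for MONITOR_INTERVAL.

```python
import time
from enum import Enum
from typing import Callable, List, Optional, Tuple


class ClusterState(Enum):
    """The four states the monitor reports for a monitored cluster."""
    UP = "Up"
    DOWN = "Down"
    UNREACHABLE = "Unreachable"
    ERROR = "Error"


Event = Tuple[ClusterState, ClusterState]  # (old state, new state)


def monitor(
    poll: Callable[[], ClusterState],
    notify: Callable[[ClusterState, ClusterState], None],
    interval_seconds: float,
    max_polls: int,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Event]:
    """Poll a cluster a fixed number of times and report state changes.

    poll and notify are hypothetical callbacks: poll returns the state
    observed over the network, notify delivers a message to a destination.
    """
    events: List[Event] = []
    previous: Optional[ClusterState] = None
    for i in range(max_polls):
        current = poll()
        # A cluster event is a change of state between consecutive polls.
        if previous is not None and current != previous:
            notify(previous, current)
            events.append((previous, current))
        previous = current
        if i < max_polls - 1:
            sleep(interval_seconds)  # stand-in for MONITOR_INTERVAL
    return events
```

In this sketch the first poll only establishes a baseline state, so no notification is sent until a later poll returns a different state.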
Because the clusters are physically separated, they communicate over a Local or Wide
Area Network (LAN or WAN). Since polling takes place across the network, an interruption of
network service cannot always be distinguished from a cluster failure. If the
network is unreliable, the monitor will often report an Unreachable state for
the monitored cluster when the real cause is an interruption of network service.
Because monitoring is indeterminate in some instances, information from independent sources
must be gathered to decide whether to proceed with recovery. For this reason,
cluster recovery is not automatic; it must be initiated by a root user. Once initiated, however,
cluster recovery is automated to reduce the chance of human error that manual steps
would introduce. Continentalclusters provides a system of cluster events and notifications so
that events can be easily tracked and users know when to seek additional information before
initiating recovery.
Cluster Events
A cluster event is a change of state in a monitored cluster. The four cluster states reported by the
monitor are Unreachable, Down, Up, and Error. Table 6 summarizes possible causes for cluster
events with regard to both the monitored cluster and the network. In many cases, however, the
cause of a cluster event cannot be determined without additional information that is not available to
the software.
Table 6 Monitored States and Possible Causes
Cluster Event             Cluster-related Causes                    Network-related Causes
(Old state -> New state)
------------------------  ----------------------------------------  ------------------------------------
Up -> Unreachable         Cluster went down; no nodes are           Network failure
                          responding to network inquiries
Down -> Unreachable       Cluster was down and nodes are no longer  Network failure
                          responding
Error -> Unreachable      Error resolved but cluster down and       Network failure
                          nodes not responding
Up -> Down                Cluster has been halted, but at least     No network problems
                          one node is still responding to network
                          inquiries
Error -> Down             Error resolved, cluster is down           Network problem was fixed, cluster
                                                                    is down
Unreachable -> Down       Cluster nodes were rebooted but the       Network came up but the cluster was
                          cluster was not started                   not running
Up -> Error               Serviceguard version or security file     Network is misconfigured, or DNS
                          mismatch, software error                  server crashed or set up incorrectly
Down -> Error             Serviceguard version or security file     Network is misconfigured, or DNS
                          mismatch, software error                  server crashed or set up incorrectly
Unreachable -> Error      Serviceguard version or security file     Network problem was fixed, but the
                          mismatch, software error                  error condition still exists
Down -> Up                Cluster started                           No network problems
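For operators who script around notifications, the transitions listed in Table 6 can be held in a simple lookup keyed by (old state, new state). The following Python sketch is an illustration only: the names Causes, POSSIBLE_CAUSES, and possible_causes are hypothetical and are not part of Continentalclusters. The cause text is copied from the table, and transitions the table does not list return None.

```python
from typing import Dict, NamedTuple, Optional, Tuple


class Causes(NamedTuple):
    """Possible causes of one cluster event, as listed in Table 6."""
    cluster: str
    network: str


# Cause text that Table 6 repeats for several transitions.
_MISMATCH = "Serviceguard version or security file mismatch, software error"
_MISCONFIGURED = "Network is misconfigured, or DNS server crashed or set up incorrectly"

# Hypothetical lookup keyed by (old state, new state).
POSSIBLE_CAUSES: Dict[Tuple[str, str], Causes] = {
    ("Up", "Unreachable"): Causes(
        "Cluster went down; no nodes are responding to network inquiries",
        "Network failure"),
    ("Down", "Unreachable"): Causes(
        "Cluster was down and nodes are no longer responding",
        "Network failure"),
    ("Error", "Unreachable"): Causes(
        "Error resolved but cluster down and nodes not responding",
        "Network failure"),
    ("Up", "Down"): Causes(
        "Cluster has been halted, but at least one node is still "
        "responding to network inquiries",
        "No network problems"),
    ("Error", "Down"): Causes(
        "Error resolved, cluster is down",
        "Network problem was fixed, cluster is down"),
    ("Unreachable", "Down"): Causes(
        "Cluster nodes were rebooted but the cluster was not started",
        "Network came up but the cluster was not running"),
    ("Up", "Error"): Causes(_MISMATCH, _MISCONFIGURED),
    ("Down", "Error"): Causes(_MISMATCH, _MISCONFIGURED),
    ("Unreachable", "Error"): Causes(
        _MISMATCH,
        "Network problem was fixed, but the error condition still exists"),
    ("Down", "Up"): Causes("Cluster started", "No network problems"),
}


def possible_causes(old_state: str, new_state: str) -> Optional[Causes]:
    """Return the listed causes for a transition, or None if not listed."""
    return POSSIBLE_CAUSES.get((old_state, new_state))
```

Because each event has both a cluster-related and a network-related explanation, a lookup like this can only suggest where to look; it cannot decide which cause actually applies.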