Designing Disaster Recovery Clusters using Metroclusters and Continentalclusters, Reprinted October 2011 (5900-1881)


Monitoring over a Network

A monitor package running on one cluster tracks the health of another cluster in the recovery pair and sends notifications to configured destinations if the state of the monitored cluster changes. (If a cluster contains any packages to be recovered, it must be monitored.) The monitor software polls the monitored cluster at a specific MONITOR_INTERVAL defined in an ASCII configuration file, which also indicates when and where to send messages if there is a state change.
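
The polling behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Continentalclusters monitor itself: the names run_monitor, probe, notify, and max_polls are hypothetical, and only MONITOR_INTERVAL and the state names come from this manual.

```python
import time
from typing import Callable


def run_monitor(probe: Callable[[], str],
                notify: Callable[[str, str], None],
                monitor_interval: float,
                max_polls: int,
                sleep: Callable[[float], None] = time.sleep) -> None:
    """Poll a cluster and report each change of state.

    All names are illustrative assumptions; monitor_interval plays the
    role of MONITOR_INTERVAL from the configuration file.
    """
    last_state = None
    for poll in range(max_polls):
        state = probe()
        # The first poll only sets a baseline; later polls are compared to it.
        if last_state is not None and state != last_state:
            notify(last_state, state)
        last_state = state
        if poll < max_polls - 1:
            sleep(monitor_interval)


# Example: replay a fixed series of poll results without real waiting.
observed = iter(["Up", "Up", "Unreachable", "Down", "Up"])
events = []
run_monitor(probe=lambda: next(observed),
            notify=lambda old, new: events.append((old, new)),
            monitor_interval=60,
            max_polls=5,
            sleep=lambda seconds: None)
print(events)
```

Note that the sketch reports only changes of state, so two consecutive polls that both return Up produce no notification.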

The physical separation between clusters requires communication by way of a Local or Wide Area Network (LAN or WAN). Because the polling takes place across the network, interruptions of network service cannot always be differentiated from cluster failure states. This means that if the network is unreliable, the monitoring facility will often report an unreachable state for the monitored cluster when the actual cause is an interruption of the network service.

Because the monitoring is indeterminate in some instances, information from independent sources must be gathered to determine whether to proceed with the recovery process. For these reasons, cluster recovery is not automatic, but must be initiated by a root user. Once initiated, however, the cluster recovery is automated to reduce the chance of human error that might occur if manual steps were needed. In Continentalclusters, a system of cluster events and notifications is provided so that events can be easily tracked, and users will know when to seek additional information before initiating recovery.

Cluster Events

A cluster event is a change of state in a monitored cluster. The four cluster states reported by the monitor are Unreachable, Down, Up, and Error. Table 6 summarizes possible causes for the cluster events with regard to both the monitored cluster and the network. However, in many cases the causes of cluster events are indeterminate without additional information that is not available to the software.

Table 6 Monitored States and Possible Causes

Cluster Event             Cluster-related Causes                    Network-related Causes
(Old state -> New state)
------------------------  ----------------------------------------  ----------------------------------------
Up -> Unreachable         Cluster went down; no nodes are           Network failure
                          responding to network inquiries

Down -> Unreachable       Cluster was down and nodes are no longer  Network failure
                          responding

Error -> Unreachable      Error resolved but cluster down and       Network failure
                          nodes not responding

Up -> Down                Cluster has been halted, but at least     No network problems
                          one node is still responding to network
                          inquiries

Error -> Down             Error resolved, cluster is down           Network problem was fixed, cluster is
                                                                    down

Unreachable -> Down       Cluster nodes were rebooted but the       Network came up but the cluster was not
                          cluster was not started                   running

Up -> Error               Serviceguard version or security file     Network is misconfigured, or DNS server
                          mismatch, software error                  crashed or set up incorrectly

Down -> Error             Serviceguard version or security file     Network is misconfigured, or DNS server
                          mismatch, software error                  crashed or set up incorrectly

Unreachable -> Error      Serviceguard version or security file     Network problem was fixed, but the error
                          mismatch, software error                  condition still exists

Down -> Up                Cluster started                           No network problems
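
The rows of Table 6 shown in this section can also be read as a lookup keyed by the old and new state. The Python sketch below encodes them that way so that a notification handler could print both candidate causes for an event. The names CAUSES and possible_causes are hypothetical and are not part of Continentalclusters; the cause strings are taken from the table.

```python
from typing import Dict, Tuple

_MISMATCH = "Serviceguard version or security file mismatch, software error"
_DNS = "Network is misconfigured, or DNS server crashed or set up incorrectly"

# (old state, new state) -> (cluster-related cause, network-related cause)
CAUSES: Dict[Tuple[str, str], Tuple[str, str]] = {
    ("Up", "Unreachable"): (
        "Cluster went down; no nodes are responding to network inquiries",
        "Network failure"),
    ("Down", "Unreachable"): (
        "Cluster was down and nodes are no longer responding",
        "Network failure"),
    ("Error", "Unreachable"): (
        "Error resolved but cluster down and nodes not responding",
        "Network failure"),
    ("Up", "Down"): (
        "Cluster has been halted, but at least one node is still "
        "responding to network inquiries",
        "No network problems"),
    ("Error", "Down"): (
        "Error resolved, cluster is down",
        "Network problem was fixed, cluster is down"),
    ("Unreachable", "Down"): (
        "Cluster nodes were rebooted but the cluster was not started",
        "Network came up but the cluster was not running"),
    ("Up", "Error"): (_MISMATCH, _DNS),
    ("Down", "Error"): (_MISMATCH, _DNS),
    ("Unreachable", "Error"): (
        _MISMATCH,
        "Network problem was fixed, but the error condition still exists"),
    ("Down", "Up"): ("Cluster started", "No network problems"),
}


def possible_causes(old: str, new: str) -> Tuple[str, str]:
    """Return both candidate causes; the monitor cannot tell which applies."""
    try:
        return CAUSES[(old, new)]
    except KeyError:
        raise ValueError(f"no causes listed here for {old} -> {new}") from None


cluster_cause, network_cause = possible_causes("Up", "Unreachable")
print("Cluster-related:", cluster_cause)
print("Network-related:", network_cause)
```

Returning both causes together reflects the point made in the text: for an event such as Up -> Unreachable, the software alone cannot distinguish a cluster failure from a network failure.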
