3.2.2 Heartbeat Messages
Central to the operation of the cluster manager is the sending and receiving of heartbeat messages
among the nodes in the cluster. Each node in the cluster exchanges UDP heartbeat messages with
every other node over each IP network configured as a heartbeat device.
If a cluster node does not receive heartbeat messages from all other cluster nodes within the
prescribed time, a cluster re-formation is initiated; see “What Happens when a Node Times Out”
(page 75). At the end of the re-formation, if a new set of nodes form a cluster, that information is
passed to the package coordinator (described later in this chapter, under “How the Package
Manager Works” (page 43)). Failover packages that were running on nodes that are no longer
in the new cluster are transferred to their adoptive nodes.
If heartbeat and data are sent over the same LAN subnet, data congestion may cause Serviceguard
to miss heartbeats and initiate a cluster re-formation that would not otherwise have been needed.
For this reason, HP recommends that you dedicate a LAN to the heartbeat, in addition to
configuring heartbeat over the data network.
Each node sends its heartbeat message at a rate calculated by Serviceguard on the basis of the
value of the MEMBER_TIMEOUT parameter, which is set in the cluster configuration file that you
create as part of cluster configuration.
IMPORTANT: When multiple heartbeats are configured, heartbeats are sent in parallel;
Serviceguard must receive at least one heartbeat to establish the health of a node. HP recommends
that you configure all subnets that interconnect cluster nodes as heartbeat networks; this increases
protection against multiple faults at no additional cost.
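For example, the relevant entries in a cluster configuration file for a two-node cluster with two
heartbeat networks might look like the following sketch. (The cluster name, node names, interface
names, addresses, and timeout value are illustrative only; see the parameter descriptions under
“Cluster Configuration Parameters” (page 91) for valid values, defaults, and ranges.)

    CLUSTER_NAME        cluster1
    # MEMBER_TIMEOUT is expressed in microseconds; 14000000 is 14 seconds.
    MEMBER_TIMEOUT      14000000
    NODE_NAME           node1
      NETWORK_INTERFACE eth0
      HEARTBEAT_IP      192.168.1.1
      NETWORK_INTERFACE eth1
      HEARTBEAT_IP      10.10.1.1
    NODE_NAME           node2
      NETWORK_INTERFACE eth0
      HEARTBEAT_IP      192.168.1.2
      NETWORK_INTERFACE eth1
      HEARTBEAT_IP      10.10.1.2

With both subnets configured as heartbeat networks, Serviceguard sends heartbeats over each in
parallel, so the loss of a single heartbeat network does not by itself cause heartbeats to be missed.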
Heartbeat IP addresses must be on the same subnet on each node, but it is possible to configure
a cluster that spans subnets; see “Cross-Subnet Configurations” (page 27). See HEARTBEAT_IP,
under “Cluster Configuration Parameters” (page 91), for more information about heartbeat
requirements. For timeout requirements and recommendations, see the MEMBER_TIMEOUT parameter
description in the same section. For troubleshooting information, see “Cluster Re-formations Caused
by MEMBER_TIMEOUT Being Set too Low” (page 258). See also “Cluster Daemon: cmcld” (page 34).
3.2.3 Manual Startup of Entire Cluster
A manual startup forms a cluster out of all the nodes in the cluster configuration. Manual startup
is normally done the first time you bring up the cluster, after cluster-wide maintenance or upgrade,
or after reconfiguration.
Before startup, the same binary cluster configuration file must exist on all nodes in the cluster. The
system administrator starts the cluster with the cmruncl command issued from one node. The
cmruncl command can only be used when the cluster is not running, that is, when none of the
nodes is running the cmcld daemon.
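For example, issued from any configured node (the node names here are illustrative):

    cmruncl -v                      # form the cluster from all configured nodes
    cmruncl -v -n node1 -n node2    # or form it from only the specified nodes

After startup completes, you can verify the cluster and node status with cmviewcl -v.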
During startup, the cluster manager software checks to see if all nodes specified in the startup
command are valid members of the cluster, are up and running, are attempting to form a cluster,
and can communicate with each other. If they can, then the cluster manager forms the cluster.
3.2.4 Automatic Cluster Startup
An automatic cluster startup occurs any time a node reboots and joins the cluster. This can follow
the reboot of an individual node, or it can occur after all nodes in a cluster have failed, as when
an extended power failure brought down all SPUs.
Automatic cluster startup will take place if the flag AUTOSTART_CMCLD is set to 1 in the
$SGCONF/cmcluster.rc file. When any node reboots with this parameter set to 1, it will rejoin
an existing cluster or, if none exists, attempt to form a new cluster.
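For example, the relevant line in the $SGCONF/cmcluster.rc file (the rest of the file is omitted):

    # AUTOSTART_CMCLD=1 causes this node to join (or form) the cluster at boot;
    # AUTOSTART_CMCLD=0 requires a manual cluster startup with cmruncl.
    AUTOSTART_CMCLD=1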