Managing HP Serviceguard A.11.20.20 for Linux, May 2013

Unable to halt the detached package <package_name> on node <node_name>

as the node is not reachable. Retry once the node is reachable.

In that case, power up the node and make sure it is accessible, then rerun the cmhaltpkg command.

8.8.3 Cluster Re-formations Caused by Temporary Conditions

You may see Serviceguard error messages, such as the following, which indicate that a node is

having problems:

Member node_name seems unhealthy, not receiving heartbeats from it.

This may indicate a serious problem, such as a node failure, whose underlying cause is probably a MEMBER_TIMEOUT setting that is too aggressive (too low); see the next section, “Cluster Re-formations Caused by MEMBER_TIMEOUT Being Set too Low”. Alternatively, it may be a transitory problem, such as excessive network traffic or system load.

What to do: If you find that cluster nodes are failing because of temporary network or system-load problems (which in turn cause heartbeat messages to be delayed in the network or during processing), you should solve the networking or load problem if you can. Failing that, you can increase the value of MEMBER_TIMEOUT, as described in the next section.

8.8.4 Cluster Re-formations Caused by MEMBER_TIMEOUT Being Set too Low

If you have set the MEMBER_TIMEOUT parameter too low, the cluster daemon, cmcld, will write warnings to syslog that indicate the problem. There are three in particular that you should watch for:

1. Warning: cmcld was unable to run for the last <n.n> seconds. Consult

the Managing Serviceguard manual for guidance on setting

MEMBER_TIMEOUT, and information on cmcld.

This means that cmcld was unable to get access to a CPU for a significant amount of time.

If this occurred while the cluster was re-forming, one or more nodes could have failed. Some commands (such as cmhaltnode (1m), cmrunnode (1m), and cmapplyconf (1m)) cause the cluster to re-form, so there is a chance that running one of these commands could precipitate a node failure; the longer the hang, the greater that chance.

What to do: If this message appears once a month or more often, increase MEMBER_TIMEOUT

to more than 10 times the largest reported delay. For example, if the message that reports the

largest number says that cmcld was unable to run for the last 1.6 seconds, increase

MEMBER_TIMEOUT to more than 16 seconds.

2. This node is at risk of being evicted from the running cluster.

Increase MEMBER_TIMEOUT.

This means that the hang was long enough for other nodes to have noticed the delay in

receiving heartbeats and marked the node “unhealthy”. This is the beginning of the process

of evicting the node from the cluster; see “What Happens when a Node Times Out” (page 75)

for an explanation of that process.

What to do: In isolation, this could indicate a transitory problem, as described in the previous

section. If you have diagnosed and fixed such a problem and are confident that it won't recur,

you need take no further action; otherwise you should increase MEMBER_TIMEOUT as instructed

in item 1.

3. Member node_name seems unhealthy, not receiving heartbeats from it.

This is the message that indicates that the node has been found “unhealthy”, as described in item 2.

What to do: See item 2.
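The rule of thumb in item 1 (set MEMBER_TIMEOUT to more than 10 times the largest reported delay) can be worked out from the syslog warnings with a short script. The following is a minimal sketch, not a Serviceguard tool: the function names and the sample log lines are hypothetical, and it assumes that each warning appears on a single syslog line using the wording quoted in item 1. The result is in seconds; check the units your cluster configuration file expects before applying a new value.

```python
import re

# Matches the delay reported in the cmcld warning quoted in item 1.
# Assumption: the warning text appears on one syslog line.
DELAY_RE = re.compile(
    r"cmcld was unable to run for the last (\d+(?:\.\d+)?) seconds"
)


def largest_cmcld_delay(lines):
    """Return the largest reported delay in seconds, or None if none found."""
    delays = [
        float(match.group(1))
        for line in lines
        for match in DELAY_RE.finditer(line)
    ]
    return max(delays) if delays else None


def member_timeout_floor(largest_delay_seconds, factor=10):
    """Return the value (in seconds) that MEMBER_TIMEOUT should exceed."""
    return largest_delay_seconds * factor


# Hypothetical syslog excerpt, for illustration only.
sample_log = [
    "May  3 10:14:02 node1 cmcld[2211]: Warning: cmcld was unable to run "
    "for the last 0.9 seconds. Consult the Managing Serviceguard manual",
    "May 17 22:40:51 node1 cmcld[2211]: Warning: cmcld was unable to run "
    "for the last 1.6 seconds. Consult the Managing Serviceguard manual",
    "May 17 22:41:05 node1 cmcld[2211]: Member node2 seems unhealthy, "
    "not receiving heartbeats from it.",
]

largest = largest_cmcld_delay(sample_log)
if largest is not None:
    floor = member_timeout_floor(largest)
    print(f"Largest delay: {largest} s; set MEMBER_TIMEOUT above {floor} s")
```

With the sample lines, the largest delay is 1.6 seconds, which matches the example in item 1: MEMBER_TIMEOUT should be set to more than 16 seconds. To use the sketch on a real system, pass the lines of the node's syslog file in place of sample_log.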
