HP XC System Software Administration Guide Version 3.2

Nagios runs the ping command against the interconnect at regular intervals. Typically, this entry provides the status information output from that command and the interconnect's IP address.

A warning or critical message indicates that the specified node or system interconnect failed to respond to the ping command in the allotted time.

Determine if the node is powered on, enabled, and responsive.

Determine whether the interconnect is functional by running the ping command with the Nagios host name for the interconnect; for example, if the Nagios host name is necs1-1, enter the following command:

# ping necs1-1

If the interconnect responds but takes an excessive amount of time to respond to a ping command, the problem may be related to the system load. If there is no response to the ping command, determine whether the interconnect is configured improperly or is failing by running the corresponding system interconnect diagnostic tools; see “Using the System Interconnect Diagnostic Tools” (page 238) for more information.
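The decision described above (no response versus a slow response) can be sketched as a small shell helper. This is only an illustration, not part of the HP XC software: the function names classify_ping and avg_rtt and the 10 ms threshold are assumptions, and the parsing assumes the summary line printed by the Linux ping command.

```shell
#!/bin/sh
# Hypothetical helper: classify the result of a ping test.
#   $1 = exit status of the ping command
#   $2 = average round-trip time in milliseconds (may be empty)
#   $3 = threshold in milliseconds for a "slow" response (assumed default: 10)
classify_ping() {
    status=$1 avg_ms=$2 limit_ms=${3:-10}
    if [ "$status" -ne 0 ] || [ -z "$avg_ms" ]; then
        echo "NO_RESPONSE: run the system interconnect diagnostic tools"
    elif awk -v a="$avg_ms" -v l="$limit_ms" 'BEGIN { exit !(a > l) }'; then
        echo "SLOW: check the system load"
    else
        echo "OK"
    fi
}

# Hypothetical helper: read ping output on stdin and print the average
# round-trip time from the summary line, for example:
#   rtt min/avg/max/mdev = 0.045/0.050/0.055/0.004 ms
avg_rtt() {
    awk -F'/' '/^(rtt|round-trip)/ { print $5 }'
}

# Example use with the host name from the text:
#   out=$(ping -c 3 necs1-1); rc=$?
#   classify_ping "$rc" "$(printf '%s\n' "$out" | avg_rtt)"
```

A suitable threshold depends on the interconnect type and the normal load on the system, so treat the default value as a placeholder.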

Service: Resource Monitor

Status Information: Resource monitor activity status

Typically, this entry reports the output of the SLURM squeue command.

A warning or critical message indicates that the SLURM squeue command reported errors.

See the output of the squeue command for more details. SLURM on HP XC systems is described in Chapter 15 (page 169).
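To reproduce this kind of check by hand, one approach is to run squeue and examine its exit status. The following is a minimal sketch, not the plug-in that Nagios uses on HP XC systems; the function name check_squeue and the SQUEUE variable are assumptions made for illustration. The return values follow the usual Nagios plug-in convention (0 for OK, 2 for critical).

```shell
#!/bin/sh
# SQUEUE is a hypothetical override for the squeue binary to run.
SQUEUE=${SQUEUE:-squeue}

# Hypothetical helper: run squeue, capture its output, and report
# whether it completed without errors.
check_squeue() {
    if output=$("$SQUEUE" 2>&1); then
        echo "OK: squeue ran without errors"
        return 0
    fi
    echo "CRITICAL: squeue reported: $output"
    return 2
}

# Example use:
#   check_squeue
```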

Service: Root key synchronization

Status Information: Root SSH key synchronization status

This entry provides the status of the root key synchronization.

A warning or critical message indicates that the root ssh keys for one or more hosts are out of synchronization with the head node. The ssh and pdsh commands may not work for these nodes.

Verify that the imaging is correct on the affected nodes. The most common cause of this problem is a node that failed to reimage and booted a kernel with an older set of ssh keys (/root/.ssh/*).

If all of the nodes are out of synchronization, determine if the head node changed its root ssh keys.

See “Mismatched Secure Shell Keys” (page 246) for more information.
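One way to compare key sets by hand is to compute a single digest over the files in /root/.ssh on the head node and on an affected node (for example, from the node's console, because ssh may not work), then compare the two values. The sketch below is an assumption-laden illustration, not the check that Nagios performs: the function name key_digest is hypothetical, and it digests every regular file in the directory, which may include files that legitimately differ between nodes.

```shell
#!/bin/sh
# Hypothetical helper: print one MD5 digest that summarizes every
# regular file in a directory (default: /root/.ssh).
key_digest() {
    dir=${1:-/root/.ssh}
    for f in "$dir"/*; do
        # Skip subdirectories and unmatched globs
        [ -f "$f" ] && md5sum "$f"
    done | awk '{ print $1 }' | md5sum | awk '{ print $1 }'
}

# Example use: run on the head node and on the affected node,
# then compare the two values that are printed.
#   key_digest /root/.ssh
```

If the digests differ, narrow the comparison to individual files to find which key is out of date.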

Service: Supermon Metrics Monitor

Status Information: Supermon node metrics retrieval status

This entry reports the status of the Supermon service and the number of nodes from which it collected metrics data.

A warning or critical message indicates that one or more hosts were not accessible during metrics collection or that the Nagios service_check_timeout interval expired.

These messages can occur if metrics collection cannot be completed within the allotted time; examine the /opt/hptc/nagios/etc/nagios.cfg file for the value of the service_check_timeout parameter.

The default should be adequate for HP XC systems with fewer than 256 nodes. Increasing the value for the service_check_timeout parameter may solve the problem for systems with more nodes.
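The current value can be read directly from the configuration file. The following sketch assumes the parameter appears in the usual Nagios form, name=value, on a line of its own; the function name get_service_check_timeout and the NAGIOS_CFG variable are hypothetical.

```shell
#!/bin/sh
# Path taken from the text; NAGIOS_CFG is a hypothetical override.
NAGIOS_CFG=${NAGIOS_CFG:-/opt/hptc/nagios/etc/nagios.cfg}

# Hypothetical helper: print the numeric value of service_check_timeout.
# Commented-out lines (those starting with '#') do not match.
get_service_check_timeout() {
    sed -n 's/^[[:space:]]*service_check_timeout[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' "${1:-$NAGIOS_CFG}"
}

# Example use:
#   get_service_check_timeout
```

The value is expressed in seconds. After editing the file, Nagios typically has to be restarted before the new value takes effect.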

Also, verify that the supermond service is running by invoking the following command on the head node:

# service supermond status

Loss of this service, or time-outs in it, can cause per-node warnings for nodeinfo, load average, and system free space.

A non-timeout warning or critical message simply indicates that a number of monitored nodes are not responding; this is normal if the nodes are down or otherwise disabled.

Service: Syslog Alert Monitor

Status Information: Status of consolidated.log syslog monitoring

21.3 Messages Reported by Nagios 251