HP XC System Software Administration Guide Version 3.2

Nagios performs a ping command on the interconnect at regular intervals. Typically, this entry
provides the status information output from that command and the interconnect's IP address.
A warning or critical message indicates that the specified node or system interconnect failed to respond
to the ping command in the allotted time.
Determine if the node is powered on, enabled, and responsive.
Determine if the interconnect is functional by running the ping command with the Nagios host name
for the interconnect; for example, if the Nagios host name is necs1-1, enter the following command:
# ping necs1-1
If the interconnect responds but takes an excessive time to respond to the ping command, the problem
may be related to the system load. If there is no response to the ping command, determine whether the
interconnect is configured improperly or is failing by running the corresponding system
interconnect diagnostic tools; see “Using the System Interconnect Diagnostic Tools” (page 238) for
more information.
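The ping check described above can be wrapped in a small shell function for repeated use. This is a minimal sketch, not part of the HP XC or Nagios software: the function name check_interconnect and the PING_CMD variable are hypothetical, and the count of three echo requests is an arbitrary choice.

```shell
# Minimal sketch: report whether a host answers the ping command.
# check_interconnect and PING_CMD are hypothetical names used for illustration.
check_interconnect() {
    host="$1"
    # Send three echo requests; discard the ping output and use only the exit status.
    if ${PING_CMD:-ping} -c 3 "$host" >/dev/null 2>&1; then
        echo "$host: responding"
    else
        echo "$host: no response"
        return 1
    fi
}
```

For example, `check_interconnect necs1-1` reports on the interconnect used in the example above. The function does not measure response time, so a slow but responding interconnect still needs the plain ping command to see the round-trip times.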
Service: Resource Monitor
Status Information: Resource monitor activity status
Typically this entry reports the output of the SLURM squeue command.
A warning or critical message indicates that the SLURM squeue command reported errors.
See the output of the squeue command for more details. SLURM on HP XC systems is described in
Chapter 15 (page 169).
Service: Root key synchronization
Status Information: Root SSH key synchronization status
This entry provides the status of the root key synchronization.
A warning or critical message indicates that the root ssh keys for one or more hosts are out of
synchronization with the head node. The ssh and pdsh commands may not work for these nodes.
Verify that the imaging is correct on the affected nodes. The most common cause of this problem is
a node that failed to reimage and booted a kernel with an older set of ssh keys
(/root/.ssh/*).
If none of the nodes are synchronized, determine whether the head node changed its root ssh keys.
See “Mismatched Secure Shell Keys” (page 246) for more information.
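One way to confirm that a node's keys differ from those on the head node is to compare checksums of the key files. The following is a minimal sketch under stated assumptions: it assumes a copy of the node's /root/.ssh directory is already available at a local path (how that copy is obtained is not covered here), and the function names key_fingerprint and keys_match are hypothetical.

```shell
# Minimal sketch: compare the files in two key directories by checksum.
# key_fingerprint and keys_match are hypothetical helper names.

# Print a sorted "checksum size filename" listing for the files in a directory.
key_fingerprint() {
    ( cd "$1" && cksum * | sort )
}

# Succeed only if both directories hold the same file names with the same contents.
keys_match() {
    [ "$(key_fingerprint "$1")" = "$(key_fingerprint "$2")" ]
}
```

For example, `keys_match /root/.ssh /tmp/node-ssh-copy` (the second path is illustrative) returns a nonzero exit status when any key file differs or is missing.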
Service: Supermon Metrics Monitor
Status Information: Supermon node metrics retrieval status
This entry reports the status of the Supermon service and the number of nodes from which it collected
metrics data.
A warning or critical message indicates that one or more hosts were not accessible during metrics
collection or that the Nagios service_check_timeout interval expired.
These messages can occur if metrics collection cannot be completed in a reasonable time; examine the
/opt/hptc/nagios/etc/nagios.cfg file for the value of the service_check_timeout
parameter.
The default should be adequate for HP XC systems with fewer than 256 nodes.
Increasing the value for the service_check_timeout parameter may solve the problem for systems
with more nodes.
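To see the current setting before changing it, the value can be read directly from the configuration file. This is a minimal sketch: the function name get_service_check_timeout is hypothetical, and it assumes the parameter appears in the usual Nagios `name=value` form on an uncommented line.

```shell
# Minimal sketch: print the service_check_timeout value from a Nagios config file.
# get_service_check_timeout is a hypothetical helper name.
get_service_check_timeout() {
    cfg="${1:-/opt/hptc/nagios/etc/nagios.cfg}"
    # Match the uncommented parameter, tolerate spaces around "=", and print the number.
    # If the parameter appears more than once, this sketch prints the last occurrence.
    sed -n 's/^[[:space:]]*service_check_timeout[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$cfg" | tail -n 1
}
```

Called with no argument, the function reads the file named above; a different path can be passed as the first argument.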
Also, verify that the supermond service is running by invoking the following command on the head
node:
# service supermond status
Loss of this service, or time-outs in it, can cause per-node warnings for nodeinfo, load average, and
system free space.
A non-timeout warning or critical message simply indicates a number of monitored nodes are not
responding; this is normal if the nodes are down or otherwise disabled.
Service: Syslog Alert Monitor
Status Information: Status of consolidated.log syslog monitoring