HP XC System Software Administration Guide Version 3.2

Run 'sinfo' for more information.
n[3-7] [Slurm Status - Critical] sinfo reported problems
with partitions for this node
nh [Supermon Metrics Monitor - Critical] The metrics
monitor has returned a critical status indicating a
number of nodes have reported critical thresholds.
If the actual status is 'Service timed out', then
the monitor has taken too long to complete a single
iteration. To verify this, run the monitor manually
from the head node:
'time /opt/hptc/supermon/bin/storeMetrics'
and see if it takes more than about 2-3 minutes
(the maximum on a large cluster).
nh [Syslog Alert Monitor - NOOUTPUT] The
check_syslogalerts plug-in failed to return any
status. This could indicate a problem with the
consolidated log or resources needed to execute the
plug-in.
nh [System Event Log Monitor - NRPEUNABLETOREAD]
Indicates the remote command request to check the
SEL logs for a group of nodes has failed to return
any status. This may indicate a failure of the
check_selmon command. The System Event Log Monitor
must proxy to a console-connected node to collect
console-related data. NRPE is used to proxy these
requests to a console-connected node. These nodes
are identified as members of the 'console_network'
role. Verify that the check_selmon command can run
on those nodes, i.e., schedule_service --directive
check_selmon_for_mh_xxxxxx, where xxxxxx is the name
of the management hub reporting the failure. Look
at nagios.log and the consolidated.log to see if
there are any indications of failures for NRPE.
nh [System Event Log - IPMICONNECTFAIL] The check_sel
plug-in failed to connect to the console port for
this node. A common cause is that the console device,
cp-xxxxx, is not reachable. If this is the head node
and the head node is externally connected, you may
be able to define cp-xxxxx in /etc/hosts using the
external IP to allow connectivity. Sensor
collection may not be possible when using
externally connected console ports for head nodes
on platforms that use IPMI to gather sensor
information. If this is not the head node then it
may indicate a communication problem with the
associated console device 'cp-{nodename}'.
n[3-7] nh [System Free Space - ASSUMEDOK] Pending services
are normal; they indicate data has not yet been
received by the Nagios engine. Service *may* be
fine, but if it continues to pend for more than
about 30 minutes it may indicate data is not being
collected.
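
The Supermon Metrics Monitor entry in the output above suggests timing the monitor by hand. A minimal sketch of that check follows. The helper name check_duration and the 180-second limit are assumptions made for illustration (the limit is one reading of "about 2-3 minutes"). Only the storeMetrics path comes from the message itself.

```shell
#!/bin/sh
# Hypothetical helper: run a command, measure its wall-clock time in
# whole seconds, and report whether it exceeded the given limit.
check_duration() {
    limit=$1
    shift
    start=$(date +%s)
    # Only the elapsed time matters here, so the command's own output
    # and exit status are deliberately ignored.
    "$@" >/dev/null 2>&1 || true
    end=$(date +%s)
    elapsed=$((end - start))
    if [ "$elapsed" -gt "$limit" ]; then
        echo "SLOW: $* took ${elapsed}s (limit ${limit}s)"
        return 1
    fi
    echo "OK: $* took ${elapsed}s"
}

# Path taken from the monitor message; the limit is an assumed value.
STORE_METRICS=/opt/hptc/supermon/bin/storeMetrics
if [ -x "$STORE_METRICS" ]; then
    check_duration 180 "$STORE_METRICS"
fi
```

Run on the head node, a SLOW result would be consistent with the 'Service timed out' status described in that entry. The helper is only an illustration and is not part of the report utility whose output is shown above.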
This utility can generate:
  - A list of nodes according to their severity:
      - Critical
      - Warning