HP XC System Software Administration Guide Version 3.2

Example 8-1 The nrg Utility System State Analysis
# nrg --mode analyze
Nodelist               Description
---------------------- ---------------------------------------------------
n[3-7] nh              [Environment - NODATA] No sensor data is available
                       for reporting. Use 'shownode metrics sensors --
                       last 20m node xxxx' for each of these nodes to
                       verify if sensor data has been recently collected.
                       This status is drawn from the same source as the
                       shownode metrics sensors command. Look at the
                       status of the 'Sensor Collection Monitor' as that
                       plug-in causes the population of this data.
nh                     [Host Monitor - Critical] A significant percentage
                       of nodes are reported as down, you can run
                       check_node_status --list to see what Nagios
                       believes the state of the nodes are.
nh                     [LSF Failover Monitor - Critical] The LSF demon is
                       reporting as down. If failover is disabled (try
                       'controllsf show failover'), you can attempt to
                       restart LSF with 'controllsf start'. If failover
                       is enabled and you see this message, it is likely
                       that all of your nodes with the resource management
                       role are down, or there is a fatal LSF
                       configuration error (look at the LSF log files).
n[3-7] nh              [NodeInfo - ASSUMEDOK] Pending services are normal,
                       they indicate data has not yet been received by the
                       Nagios engine. Service *may* be fine, but if it
                       continues to pend for more then about 30 minutes it
                       may indicate data is not being collected.
n[3-7]                 [PING Interconnect - Critical] This typically
                       indicates a node is down, however, it could also
                       indicate a non-functioning interconnect if the
                       nodes is up and operational.
nh                     [Resource Monitor - NOOUTPUT] A service has failed
                       to return an output status. Typically this
                       indicates a plug-in failure. Run the plug-in
                       directly to observe any error conditions. In some
                       cases, this exact message is returned from
                       check_nrpe when a nrpe directive is failing to
                       execute a command. If you can determine which nrpe
                       command is being requested by the associated plug-
                       in (see /opt/hptc/nagios/etc/nrpe_local.cfg for a
                       list) you can test it using the 'check_nrpe -H
                       nodename -c command' plug-in.
nh                     [Sensor Collection Monitor - Critical] Many nodes
                       have returned warning or critical sensor status.
                       If message is 'Service Timeout', collection is
                       taking too long (>5 minutes or so). This could
                       indicate a problem on one of the nodes in the
                       console_network role (shownode roles
                       console_network) or a problem running ipmitool.
                       Try running the sensors command directly, time
                       /opt/hptc/supermon/bin/sensors
nh                     [Slurm Monitor - Critical] 'sinfo' reported
                       problems with nodes in some partitions,
                       specifically, some nodes may be marked with an '*'
                       which indicates they may be unresponsive to SLURM.
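The report in Example 8-1 follows a regular pattern: each entry starts on a line that carries a [Service - STATUS] tag, any text before the tag is the node list, and the lines that follow continue the description. The Python sketch below shows one way to turn such output into records for further scripting. It is not part of HP XC. The names parse_report and expand_nodes are hypothetical, the sample text is abridged from the example, and reading n[3-7] as the inclusive range n3 through n7 is an assumption based on the usual cluster host-list convention.

```python
import re

# A new entry begins on any line containing a "[Service - STATUS]" tag.
TAG = re.compile(r"\[(?P<service>[^\[\]]+?) - (?P<status>\w+)\]")


def expand_nodes(field):
    """Expand a node list such as 'n[3-7] nh' into individual names.

    Assumption: a bracketed part is a comma-separated list of inclusive
    numeric ranges, so 'n[3-7]' means n3, n4, n5, n6, n7.
    """
    names = []
    for prefix, ranges in re.findall(r"([^\s,\[\]]+)(?:\[([^\]]*)\])?", field):
        if not ranges:
            names.append(prefix)
            continue
        for part in ranges.split(","):
            lo, _, hi = part.partition("-")
            for i in range(int(lo), int(hi or lo) + 1):
                names.append(f"{prefix}{i}")
    return names


def parse_report(text):
    """Return a list of dicts, one per tagged entry in the report text."""
    entries = []
    for line in text.splitlines():
        match = TAG.search(line)
        if match:
            entries.append({
                "nodes": expand_nodes(line[:match.start()]),
                "service": match.group("service").strip(),
                "status": match.group("status"),
                "description": line[match.end():].strip(),
            })
        elif entries and line.strip():
            # Continuation line: append it to the current description.
            entries[-1]["description"] += " " + line.strip()
    return entries


# Abridged sample taken from the example output above.
SAMPLE = """\
Nodelist               Description
---------------------- ---------------------------------------------------
n[3-7] nh              [NodeInfo - ASSUMEDOK] Pending services are normal,
                       they indicate data has not yet been received by the
                       Nagios engine.
nh                     [Slurm Monitor - Critical] 'sinfo' reported
                       problems with nodes in some partitions,
                       specifically, some nodes may be marked with an '*'
                       which indicates they may be unresponsive to SLURM.
"""

if __name__ == "__main__":
    for entry in parse_report(SAMPLE):
        print(entry["status"], entry["service"], ",".join(entry["nodes"]))
```

Because the parser keys on the bracketed tag and not on column positions, it does not depend on the exact column widths of the report. Continuation lines are joined with a single space, so words or options that the report wraps across lines (for example, plug- in) are not rejoined.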
8.6 Nagios Report Generator Utility