HP XC System Software Administration Guide Version 3.2

Example 8-1 The nrg Utility System State Analysis

# nrg --mode analyze
Nodelist               Description
---------------------- ---------------------------------------------------
n[3-7] nh              [Environment - NODATA] No sensor data is available
                       for reporting. Use 'shownode metrics sensors
                       --last 20m node xxxx' for each of these nodes to
                       verify if sensor data has been recently collected.
                       This status is drawn from the same source as the
                       shownode metrics sensors command. Look at the
                       status of the 'Sensor Collection Monitor' as that
                       plug-in causes the population of this data.
nh                     [Host Monitor - Critical] A significant percentage
                       of nodes are reported as down, you can run
                       check_node_status --list to see what Nagios
                       believes the state of the nodes are.
nh                     [LSF Failover Monitor - Critical] The LSF demon is
                       reporting as down. If failover is disabled (try
                       'controllsf show failover'), you can attempt to
                       restart LSF with 'controllsf start'. If failover
                       is enabled and you see this message, it is likely
                       that all of your nodes with the resource management
                       role are down, or there is a fatal LSF
                       configuration error (look at the LSF log files).
n[3-7] nh              [NodeInfo - ASSUMEDOK] Pending services are normal,
                       they indicate data has not yet been received by the
                       Nagios engine. Service *may* be fine, but if it
                       continues to pend for more then about 30 minutes it
                       may indicate data is not being collected.
n[3-7]                 [PING Interconnect - Critical] This typically
                       indicates a node is down, however, it could also
                       indicate a non-functioning interconnect if the
                       nodes is up and operational.
nh                     [Resource Monitor - NOOUTPUT] A service has failed
                       to return an output status. Typically this
                       indicates a plug-in failure. Run the plug-in
                       directly to observe any error conditions. In some
                       cases, this exact message is returned from
                       check_nrpe when a nrpe directive is failing to
                       execute a command. If you can determine which nrpe
                       command is being requested by the associated
                       plug-in (see /opt/hptc/nagios/etc/nrpe_local.cfg
                       for a list) you can test it using the 'check_nrpe
                       -H nodename -c command' plug-in.
nh                     [Sensor Collection Monitor - Critical] Many nodes
                       have returned warning or critical sensor status.
                       If message is 'Service Timeout', collection is
                       taking too long (>5 minutes or so). This could
                       indicate a problem on one of the nodes in the
                       console_network role (shownode roles
                       console_network) or a problem running ipmitool.
                       Try running the sensors command directly, time
                       /opt/hptc/supermon/bin/sensors
nh                     [Slurm Monitor - Critical] 'sinfo' reported
                       problems with nodes in some partitions,
                       specifically, some nodes may be marked with an '*'
                       which indicates they may be unresponsive to SLURM.
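
The Environment entry in the report above suggests running 'shownode metrics sensors --last 20m node xxxx' once for each affected node. The following Python sketch shows one way to expand a nodelist such as n[3-7] into individual node names and build those command lines. It is an illustration only, not part of the XC software: the function names expand_nodelist and sensor_check_commands are hypothetical, and the nodelist syntax is inferred from the example output, so only a simple prefix[low-high] range and plain names such as nh are handled.

```python
import re

# Matches a simple bracketed range such as "n[3-7]". This syntax is inferred
# from the example report and is an assumption, not a documented grammar.
RANGE = re.compile(r"^(?P<prefix>[A-Za-z_]+)\[(?P<lo>\d+)-(?P<hi>\d+)\]$")


def expand_nodelist(nodelist):
    """Expand a nodelist string into a list of individual node names."""
    nodes = []
    # Tokens are assumed to be separated by whitespace or commas.
    for token in re.split(r"[\s,]+", nodelist.strip()):
        if not token:
            continue
        match = RANGE.match(token)
        if match:
            lo, hi = int(match["lo"]), int(match["hi"])
            nodes.extend(f"{match['prefix']}{i}" for i in range(lo, hi + 1))
        else:
            # Plain names (for example "nh") are passed through unchanged.
            nodes.append(token)
    return nodes


def sensor_check_commands(nodelist, window="20m"):
    """Build one 'shownode metrics sensors' command line per node."""
    return [
        f"shownode metrics sensors --last {window} node {node}"
        for node in expand_nodelist(nodelist)
    ]


if __name__ == "__main__":
    # Print the commands only; running them is left to the administrator.
    for command in sensor_check_commands("n[3-7]"):
        print(command)
```

The sketch only prints the command lines so that they can be reviewed before being run on the head node. Nodelists with more complex syntax than the one shown in the example would need additional handling.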

8.6 Nagios Report Generator Utility