HP XC System Software Administration Guide Version 3.2


Typically, this entry reports the number of new records processed in the

/hptc_cluster/adm/logs/consolidated.log file.

A warning or critical message is issued when there is insufficient time to process a large volume of messages before the Nagios service_check_timeout period expires.

Nagios examines the recent incoming consolidated log messages and issues a warning or critical message if the incoming rate since the last interval exceeds a configured number of records. The default thresholds are 2 records for a warning and 20 records for a critical message. See /opt/hptc/nagios/libexec/check_syslogalerts for details.
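The threshold comparison described above can be sketched as follows. This is an illustration only, not the code of check_syslogalerts: the classify_rate function is a hypothetical helper, the defaults of 2 and 20 come from the text, and the return values follow the usual Nagios plug-in convention (0 for OK, 1 for WARNING, 2 for CRITICAL).

```shell
#!/bin/sh
# Sketch only: classify_rate is a hypothetical helper, not part of
# check_syslogalerts. Defaults come from the text: a warning above
# 2 records and a critical message above 20 records per interval.

classify_rate() {
    rate=$1
    warn=${2:-2}
    crit=${3:-20}
    if [ "$rate" -gt "$crit" ]; then
        echo "CRITICAL"
        return 2    # Nagios plug-in exit code for CRITICAL
    elif [ "$rate" -gt "$warn" ]; then
        echo "WARNING"
        return 1    # Nagios plug-in exit code for WARNING
    fi
    echo "OK"
    return 0        # Nagios plug-in exit code for OK
}

classify_rate 1             # prints OK
classify_rate 5 || true     # prints WARNING
classify_rate 25 || true    # prints CRITICAL
```

Because the text says the rate must exceed the configured number, a rate equal to a threshold stays in the lower state in this sketch.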

No specific action is required unless the service times out. A timeout means that more syslog messages were collected across the system than the plug-in can process within the service_check_timeout period. See the /opt/hptc/nagios/etc/nagios.cfg file for the value of the service_check_timeout parameter. Running the following command on the node reporting the error solves the problem:

# /opt/hptc/nagios/libexec/check_syslogalerts –domain node:nagios_monitor –nsca

Otherwise, wait for the nightly log to roll over.
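To see how long the plug-in is allowed to run, you can read the service_check_timeout value from the Nagios configuration file. The following is a minimal sketch: get_check_timeout is a hypothetical helper, and it assumes the standard Nagios name=value layout with the value given in seconds.

```shell
#!/bin/sh
# Sketch only: get_check_timeout is a hypothetical helper.
# Assumes the standard Nagios "name=value" layout in nagios.cfg.

get_check_timeout() {
    file=${1:-/opt/hptc/nagios/etc/nagios.cfg}
    # Print the numeric value of the last uncommented
    # service_check_timeout line in the file.
    sed -n 's/^[[:space:]]*service_check_timeout[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$file" | tail -n 1
}

cfg=/opt/hptc/nagios/etc/nagios.cfg
if [ -r "$cfg" ]; then
    echo "service_check_timeout: $(get_check_timeout "$cfg") seconds"
fi
```

Commented-out lines are ignored because the pattern must match from the start of the line.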

Service: Syslog Alerts

Status Information: Node Syslog alerts information

Typically, this entry reports the number of alerts in a specified period of time and allows you to access

the most recent log.

A warning or critical message indicates that one or more rules defined in the /opt/hptc/nagios/etc/syslogAlertRules file match the specified node's consolidated log file.

Take the appropriate action based on the message.

Service: System Event Log

Status Information: Node Syslog alerts information

A warning or critical message indicates that one or more rules defined in the /opt/hptc/nagios/etc/selRules file match the specified node's firmware System Event Log.

Take the appropriate action based on the System Event Log message.

Service: System Free Space

Status Information: Node / and /var free space

This entry typically displays the status of the /, /var, and /hptc_cluster file systems on the node.

A warning or critical message indicates that the thresholds for the specific node were exceeded.

Clean up disk space.
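Before cleaning up, it can help to confirm which of the monitored file systems is nearly full. The following sketch uses the POSIX df -P output format. The helper names (fs_used_pct, usage_state, report_usage) are hypothetical, and the 80 and 90 percent thresholds are illustrative assumptions, not the thresholds that Nagios applies on an HP XC system.

```shell
#!/bin/sh
# Sketch only: all function names are hypothetical, and the 80/90 percent
# thresholds are illustrative, not the values configured in Nagios.

fs_used_pct() {
    # df -P prints a header plus one line per file system;
    # field 5 is the capacity in use, for example 42%.
    df -P "$1" 2>/dev/null | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

usage_state() {
    if [ "$1" -ge 90 ]; then
        echo "CRITICAL"
    elif [ "$1" -ge 80 ]; then
        echo "WARNING"
    else
        echo "OK"
    fi
}

report_usage() {
    for fs in "$@"; do
        [ -d "$fs" ] || continue          # skip paths that do not exist
        pct=$(fs_used_pct "$fs")
        case $pct in
            ''|*[!0-9]*) continue ;;      # skip unparsable df output
        esac
        echo "$fs ${pct}% used: $(usage_state "$pct")"
    done
}

report_usage / /var /hptc_cluster
```

On a file system reported as WARNING or CRITICAL, a command such as du -xk on the mount point, sorted by size, is one way to locate the largest directories.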

21.4 System Interconnect Troubleshooting

This section describes the troubleshooting steps for the following supported system interconnects:

• “Myrinet System Interconnect Troubleshooting” (page 252)

• “Quadrics System Interconnect Troubleshooting” (page 253)

• “InfiniBand System Interconnect Troubleshooting” (page 255)

• “OFED Troubleshooting Procedures” (page 257)

21.4.1 Myrinet System Interconnect Troubleshooting

The following troubleshooting information applies to the Myrinet system interconnect. Perform

these steps on any node on which you suspect a problem to determine if your HP XC system is

configured properly. If these tests pass but you are still experiencing difficulty, see Chapter 20:

Using Diagnostic Tools (page 231).

1. Run the gm_board_info test:

# /opt/gm/bin/gm_board_info

This command displays all the nodes in the HP XC system.
