
Healthy node is down
The most common reason for SLURM to list an apparently healthy node as down is that a
specified resource has dropped below the level defined for the node in the
/hptc_cluster/slurm/etc/slurm.conf file.
For example, if the temporary disk space specification is TmpDisk=4096 (the value is in
megabytes), but the available temporary disk space on the node falls below 4 GB, SLURM
marks the node as down.
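As a hedged illustration, a node definition in slurm.conf might resemble the following; the
node name and values are hypothetical:
NodeName=n15 Procs=2 RealMemory=2048 TmpDisk=4096 State=UNKNOWN
You can compare the free space in the temporary file system (/tmp by default) with the
configured value and, after freeing space, return the node to service:
# df -m /tmp
# scontrol update NodeName=n15 State=RESUME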
SLURM refuses to operate on some nodes
If SLURM refuses to operate on some or all nodes, and the log files in /var/slurm/log
report problems with credentials, execute the following command to confirm that all nodes
display the same time:
# cexec -a date
A difference of a few seconds is inconsequential, but SLURM is unable to recognize the
credentials of nodes within the HP XC system that are more than 5 minutes out of
synchronization.
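If the node clocks differ by more than a few minutes, resynchronize them before restarting
SLURM. A minimal sketch, assuming an NTP time source is reachable from every node under
the hypothetical host name timesrv:
# cexec -a 'ntpdate timesrv'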
Checking SLURM daemons
Use the following command to confirm that your control daemons are up and running:
# scontrol ping
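The scontrol ping command reports only on the slurmctld control daemons. As an additional
hedged check, you can confirm that a slurmd daemon is running on every compute node, for
example:
# cexec -a 'pgrep slurmd'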
Checking node and partition status
Use the following command to examine the status of your nodes and partitions:
# sinfo --all
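Depending on the SLURM release installed, the following hedged variations add detail: the
-R option lists the reason recorded for each down or drained node, and --Node gives a
per-node view:
# sinfo -R
# sinfo --all --Node --long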
Benign Updating SLURM Error
This note only applies to systems using a QsNet II interconnect.
The last few lines of output of the spconfig command might contain the following error:
Updating SLURM...
slurm_reconfigure error: Slurm backup controller in standby mode
SLURM Post-Configuration Done.
This error is intermittent and benign. It is the result of the spconfig command updating
SLURM with the compute node information too soon after it restarted SLURM to include
the elanhosts information. No corrective action is required.
21.7 LSF-HPC Troubleshooting
Take the following steps if you have trouble submitting jobs or controlling LSF-HPC with
SLURM:
Ensure that the number of nodes in the lsf partition is less than or equal to the number of
nodes reported in the XC.lic file. Sample entries follow:
INCREMENT XC-CPUS Compaq auth.number exp. date nodes ...
INCREMENT XC-PROCESSORS Compaq auth.number exp. date nodes ...
The value for nodes in the XC-CPUS or XC-PROCESSORS entry specifies the number of
licensed nodes for this system. If this value does not match the actual number of nodes, the
LSF service may fail to start LSF (see the license example after this list).
Use the lshosts command to determine the number of processors reported by LSF.
Ensure that the date is synchronized throughout the HP XC system.
Verify that the /hptc_cluster directory (file system) is properly mounted on all nodes;
SLURM relies on this file system (see the mount example after this list).
Ensure that SLURM is configured, up, and running properly.
Examine the SLURM log files in the /var/slurm/log/ directory on the SLURM master node
for any problems.
If the sinfo command reports that a node is down even though its daemons are running,
compare the number of processors actually available on the node with the Procs setting for
that node in the slurm.conf file (see the sketch after this list).
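To illustrate the license check at the start of this list, the following hedged sketch compares
the licensed node count with the size of the lsf partition; run the grep command in the
directory that contains the XC.lic file, and note that the sinfo format option %D prints a
node count:
# grep INCREMENT XC.lic
# sinfo --partition=lsf --noheader --format='%D'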
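To verify the /hptc_cluster mount on every node, one hedged approach is to list the mounted
file systems cluster-wide and look for the entry:
# cexec -a 'mount | grep hptc_cluster'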
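For the last check, a hedged way to compare what SLURM reports for a node with its Procs
setting (the node name n15 is hypothetical):
# scontrol show node n15
# grep 'NodeName=n15' /hptc_cluster/slurm/etc/slurm.conf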