
Healthy node is down
The most common reason for SLURM to list an apparently healthy node as down is that a
specified resource has dropped below the level defined for the node in the
/hptc_cluster/slurm/etc/slurm.conf file.
For example, if the temporary disk space specification is TmpDisk=4096 (the value is in
megabytes), but the available temporary disk space on the node falls below 4 GB, SLURM
marks the node as down.
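As a hedged illustration, a node definition in slurm.conf might resemble the following; the
node name and values are hypothetical:
NodeName=n15 Procs=2 RealMemory=2048 TmpDisk=4096 State=UNKNOWN
You can compare the free space in the temporary file system (/tmp by default) with the
configured value and, after freeing space, return the node to service:
# df -m /tmp
# scontrol update NodeName=n15 State=RESUME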
SLURM refuses to operate on some nodes
If SLURM refuses to operate on some or all nodes, and the log files in /var/slurm/log
report problems with credentials, execute the following command to confirm that all nodes
display the same time:
# cexec -a date
A difference of a few seconds is inconsequential, but SLURM is unable to recognize the
credentials of nodes within the HP XC system that are more than 5 minutes out of
synchronization.
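If the node clocks differ by more than a few minutes, resynchronize them before restarting
SLURM. A minimal sketch, assuming an NTP time source is reachable from every node under
the hypothetical host name timesrv:
# cexec -a 'ntpdate timesrv'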
Checking SLURM daemons
Use the following command to confirm that your control daemons are up and running:
# scontrol ping
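The scontrol ping command reports only on the slurmctld control daemons. As an additional
hedged check, you can confirm that a slurmd daemon is running on every compute node, for
example:
# cexec -a 'pgrep slurmd'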
Checking node and partition status
Use the following command to examine the status of your nodes and partitions:
# sinfo --all
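Depending on the SLURM release installed, the following hedged variations add detail: the
-R option lists the reason recorded for each down or drained node, and --Node gives a
per-node view:
# sinfo -R
# sinfo --all --Node --long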
Benign Updating SLURM Error
This note only applies to systems using a QsNet II interconnect.
The last few lines of output of the spconfig command might contain the following error:
Updating SLURM...
slurm_reconfigure error: Slurm backup controller in standby mode
SLURM Post-Configuration Done.
This error is intermittent and benign. It is the result of the spconfig command updating
SLURM with the compute node information too soon after it restarted SLURM to include
the elanhosts information. No corrective action is required.
21.7 LSF-HPC Troubleshooting
Take the following steps if you have trouble submitting jobs or controlling LSF-HPC with
SLURM:
Ensure that the number of nodes in the lsf partition is less than or equal to the number of
nodes reported in the XC.lic file. Sample entries follow:
INCREMENT XC-CPUS Compaq auth.number exp. date nodes ...
INCREMENT XC-PROCESSORS Compaq auth.number exp. date nodes ...
The value for nodes in the XC-CPUS or XC-PROCESSORS entry specifies the number of
licensed nodes for this system. If this value does not match the actual number of nodes, the
LSF service may fail to start LSF (see the license example after this list).
Use the lshosts command to determine the number of processors reported by LSF.
Ensure that the date is synchronized throughout the HP XC system.
Verify that the /hptc_cluster directory (file system) is properly mounted on all nodes;
SLURM relies on this file system (see the mount example after this list).
Ensure that SLURM is configured, up, and running properly.
Examine the SLURM log files in the /var/slurm/log/ directory on the SLURM master node
for any problems.
If the sinfo command reports that a node is down even though its daemons are running,
compare the number of processors actually available on the node with the Procs setting for
that node in the slurm.conf file (see the sketch after this list).
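To illustrate the license check at the start of this list, the following hedged sketch compares
the licensed node count with the size of the lsf partition; run the grep command in the
directory that contains the XC.lic file, and note that the sinfo format option %D prints a
node count:
# grep INCREMENT XC.lic
# sinfo --partition=lsf --noheader --format='%D'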
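To verify the /hptc_cluster mount on every node, one hedged approach is to list the mounted
file systems cluster-wide and look for the entry:
# cexec -a 'mount | grep hptc_cluster'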
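For the last check, a hedged way to compare what SLURM reports for a node with its Procs
setting (the node name n15 is hypothetical):
# scontrol show node n15
# grep 'NodeName=n15' /hptc_cluster/slurm/etc/slurm.conf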