HP XC System Software Installation Guide Version 4.0

Use the following command to determine why SLURM marked this node as being down:
# sinfo -R
REASON                               NODELIST
Low RealMemory [slurm@Mar 02 22:34]  n15
The most common reason reported by the sinfo command is Not Responding, which means
that something is wrong with the communication between the primary slurmctld daemon
and the slurmd daemon on the affected node or nodes. In that situation, log in to the affected
node or nodes and troubleshoot the slurmd daemon.
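On a system with many nodes, it can help to filter the sinfo -R listing by reason. The following is a minimal sketch and is not part of the HP XC System Software. It assumes the two-column layout shown in the example above, with the node list as the last field on each line. The function name nodes_with_reason is hypothetical.

```shell
# Hypothetical helper (not an HP XC or SLURM command).
# Reads `sinfo -R` output on standard input and prints the NODELIST
# field of every line whose text contains the reason given in $1.
# The match is case-insensitive; the header line is skipped.
nodes_with_reason() {
    awk -v reason="$1" '
        NR > 1 && index(tolower($0), tolower(reason)) > 0 { print $NF }
    '
}
```

For example, the following command line would list the nodes that SLURM reports as not responding:
# sinfo -R | nodes_with_reason "Not Responding"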
The sinfo example shown in this section illustrates the Low RealMemory reason. This reason is
more obscure and can be a side effect of the system configuration process. SLURM reports it when
the slurm.conf file is configured with a RealMemory value for the node that is higher than the
MemTotal value that the compute node reports in its /proc/meminfo file. SLURM
does not automatically restore a node that has failed for this reason.
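The comparison behind this error can be reproduced by hand on the affected node. The following is a minimal sketch under two assumptions: that RealMemory is expressed in megabytes and MemTotal in kB, and that truncating to whole megabytes is close enough to the value slurmd reports. The function names memtotal_mb and realmemory_fits are hypothetical and are not part of SLURM or the HP XC System Software.

```shell
# Hypothetical helpers (not SLURM or HP XC commands).

# Print the MemTotal value from a meminfo-style file, converted from
# kB to whole megabytes. Defaults to /proc/meminfo.
memtotal_mb() {
    awk '/^MemTotal:/ { print int($2 / 1024) }' "${1:-/proc/meminfo}"
}

# Succeed (exit status 0) if the configured RealMemory value in MB ($1)
# does not exceed the memory that the node reports; fail otherwise.
# $2 optionally names a meminfo-style file to read instead of /proc/meminfo.
realmemory_fits() {
    configured_mb=$1
    actual_mb=$(memtotal_mb "${2:-/proc/meminfo}")
    [ "$configured_mb" -le "$actual_mb" ]
}
```

Pass the RealMemory value that is configured for the node in the slurm.conf file as the first argument. A nonzero exit status corresponds to the condition that SLURM reports as Low RealMemory.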
Assuming that the memory hardware is functioning, follow this procedure to resolve the problem:
1. Ensure that the database has the correct total memory value for the affected node. In this
example, n15 is the affected node.
# pdsh -w n15 /opt/hptc/etc/nconfig.d/C50gather_data
2. Configure SLURM with the correct memory value for this node:
# spconfig
Configured nodes n[1-13] with 2 CPUs and 3017 MB of total memory...
Configured node n14 with 4 CPUs and 3522 MB of total memory...
Configured node n15 with 8 CPUs and 3648 MB of total memory...
Configured node n16 with 4 CPUs and 2008 MB of total memory...
Updating SLURM...
SLURM Post-Configuration Done.
3. Return the affected node to operation:
# scontrol update NodeName=n15 State=idle
4. Verify that the LSF partition exists and all nodes are in the idle state:
# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
lsf       up    infinite     16 idle  n[1-16]
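The four steps of the procedure above can be collected into one shell function. The following is a minimal sketch, not a supported tool. It uses only the commands shown in the procedure. The function names run and restore_node and the DRY_RUN variable are hypothetical. With DRY_RUN=1, the commands are printed instead of executed, which makes it possible to review them before running them as root.

```shell
# Hypothetical wrapper around the procedure above (not an HP XC command).

# Print the command when DRY_RUN=1; otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Run the four steps for the node named in $1, stopping at the first
# command that fails.
restore_node() {
    node=$1
    run pdsh -w "$node" /opt/hptc/etc/nconfig.d/C50gather_data &&  # step 1: refresh the database
    run spconfig &&                                                # step 2: reconfigure SLURM
    run scontrol update "NodeName=$node" State=idle &&             # step 3: return node to service
    run sinfo                                                      # step 4: verify partition state
}
```

For example, after setting DRY_RUN=1, the call restore_node n15 prints the four command lines for node n15 without running them.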
14.6.1 SLURM Reconfiguration Errors
This note applies only to systems using a QsNet II interconnect.
The last few lines of output of the spconfig command might contain the following error:
Updating SLURM...
slurm_reconfigure error: Slurm backup controller in standby mode
SLURM Post-Configuration Done.
This error is intermittent and benign. It occurs when the spconfig command updates SLURM
with the compute node information too soon after it restarted SLURM to include the elanhosts
information. No corrective action is required.
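When spconfig output is captured by a script, the benign message can be filtered out so that only other error lines draw attention. The following is a minimal sketch. It assumes that other failures contain the word "error" in their output lines, which this guide does not state. The variable BENIGN_SPCONFIG_ERROR and the function unexpected_spconfig_errors are hypothetical names.

```shell
# The benign message described above, copied verbatim from the example.
BENIGN_SPCONFIG_ERROR='slurm_reconfigure error: Slurm backup controller in standby mode'

# Hypothetical helper (not an HP XC command).
# Reads spconfig output on standard input and prints every line that
# contains "error" (case-insensitive, an assumption about how failures
# appear) except the known benign message.
unexpected_spconfig_errors() {
    awk -v benign="$BENIGN_SPCONFIG_ERROR" '
        tolower($0) ~ /error/ && index($0, benign) == 0 { print }
    '
}
```

For example, the command line spconfig 2>&1 | unexpected_spconfig_errors prints nothing if the benign message is the only error line in the output.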
14.7 Troubleshooting the Software Upgrade Procedure
The following list provides suggestions for troubleshooting problems you might encounter when
upgrading the HP XC System Software from a previous release to this release:
Look at the upgrade log files to determine if there were any upgrade failures. Table 14-2
(page 185) lists the log files that are generated during a software upgrade.