HP XC System Software Administration Guide Version 3.2

lsf up infinite 1 down n17

swaptest up infinite 4 idle n[1-4]

In this example, node n17 is down.
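As a rough sketch, output in this form can be filtered with awk to list the nodes in a given state. The function name nodes_in_state is hypothetical, and the sketch assumes the six-column layout of the sample lines above (partition, availability, time limit, node count, state, node list); adjust the field numbers if your sinfo output is formatted differently.

```shell
# Hypothetical helper (not part of SLURM): read sinfo-style text on standard
# input and print the node list of every line whose state column matches $1.
# Assumes the state is field 5 and the node list is field 6, as in the sample.
nodes_in_state() {
    awk -v state="$1" '$5 == state { print $6 }'
}

# Try it on the sample output from this section (no cluster needed).
sample='lsf up infinite 1 down n17
swaptest up infinite 4 idle n[1-4]'

printf '%s\n' "$sample" | nodes_in_state down    # prints: n17
```

On a live system the same function could read from the command directly, for example sinfo | nodes_in_state down.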

The squeue utility reports the state of jobs currently running under SLURM's control. For

more information about the squeue utility, see squeue(1).

The SLURM log files on each node in /var/slurm/log are helpful for diagnosing specific

problems. The slurmctld.log and slurmd.log files record entries from their respective

daemons. Both log files use the following format:

[date and time stamp] Log Entry
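Because every entry starts with a bracketed time stamp, the log files can be searched on the entry text alone. The following is a minimal sketch, assuming only the format shown above; the function name entries_matching is hypothetical, and the sample lines are placeholders rather than real daemon output.

```shell
# Hypothetical helper: print the log lines whose entry text (the part after
# the leading "[time stamp]") contains $1, ignoring case. The time stamp is
# excluded from the match but kept in the output.
entries_matching() {
    awk -v pat="$1" '
        {
            entry = $0
            sub(/^\[[^]]*\][ ]*/, "", entry)    # drop the leading [time stamp]
            if (index(tolower(entry), tolower(pat)) > 0) print $0
        }'
}

# Placeholder lines in the documented format (not real slurmd output).
sample_log='[stamp-1] placeholder entry about node n17
[stamp-2] another placeholder entry'

printf '%s\n' "$sample_log" | entries_matching n17
```

On a node, the function could be applied to a real log file, for example entries_matching error < /var/slurm/log/slurmd.log.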

15.6 Draining Nodes

Use the SLURM scontrol command to change a node's state. SLURM provides DRAIN and

DOWN states for taking nodes out of service. Draining a node means that the current job is allowed

to finish on that node while no other jobs are scheduled for that node.

A node may need to be drained for a variety of reasons. For example, you might want exclusive

use of a node to run diagnostics on it, or you might need to replace it.

To drain one or more nodes, use the scontrol command as follows:

# scontrol update NodeName=nodelist State=drain Reason="describe reason here"

See “The nodelist Parameter” (page 36) for a discussion on the use of the nodelist parameter.

The reason that you provide for the node draining is displayed by the sinfo command. Be brief

but descriptive.

Here, node n17 is drained so that it can be removed from service for maintenance:

# scontrol update nodename=n17 state=drain reason="maintenance"

After the node has drained, use the scontrol command to remove the node from service. The

following command removes the drained node in the example, n17:

# scontrol update nodename=n17 state=down

The scontrol command also returns nodes to the IDLE state so that they can be reused. The following

command places n17 in the IDLE state to return it to service:

# scontrol update NodeName=n17 State=resume
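The drain, down, and resume steps described in this section can be collected in a small wrapper that prints each command for review before anything is run. This is only a sketch: the function name node_state_cmd is hypothetical, it handles just the three states used in this section, and its refusal to drain without a reason is a choice made here for the sake of the sinfo display, not a documented requirement.

```shell
# Hypothetical helper: print (do not run) the scontrol command for one step.
#   $1 = nodelist, $2 = drain | down | resume, $3 = reason (used for drain)
node_state_cmd() {
    nodelist=${1:-}
    state=${2:-}
    reason=${3:-}
    case $state in
        drain)
            # The reason is shown by sinfo, so this sketch insists on one.
            [ -n "$reason" ] || { echo "drain needs a reason" >&2; return 1; }
            printf 'scontrol update NodeName=%s State=drain Reason="%s"\n' \
                "$nodelist" "$reason"
            ;;
        down|resume)
            printf 'scontrol update NodeName=%s State=%s\n' "$nodelist" "$state"
            ;;
        *)
            echo "unknown state: $state" >&2
            return 1
            ;;
    esac
}

# The three steps for node n17, in the order used in this section.
node_state_cmd n17 drain maintenance
node_state_cmd n17 down
node_state_cmd n17 resume
```

Printing the commands rather than executing them leaves the decision of when to run each step, in particular waiting for the node to finish draining, with the administrator.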

When removing a node from service, HP recommends that you set the state to DRAIN, even if no

jobs are currently running. This has two advantages:

• It is easier to recognize nodes that are down unexpectedly when skimming the output of

the sinfo command.

• If the node is rebooted accidentally or as part of the maintenance procedure, the DRAIN

state persists. The DOWN state may or may not persist, depending on the setting of the

NodeName/State parameter in the slurm.conf file.

Table 15-4 shows the corresponding meaning of the output of the sinfo command for various

transitions:

Table 15-4 Output of the sinfo command for Various Transitions

Transition Cause:              sinfo shows:   Meaning:
Transient Network Congestion   alloc          The node is running a job
                               alloc*         The slurmctld daemon has lost contact with the node
                               alloc          Contact between the node and the slurmctld daemon has been restored
