
15 Managing SLURM
The HP XC system uses the Simple Linux Utility for Resource Management (SLURM). This
chapter addresses the following topics:
“Overview of SLURM” (page 169)
“Configuring SLURM” (page 170)
“Restricting User Access to Nodes” (page 178)
“Job Accounting” (page 178)
“Monitoring SLURM” (page 182)
“Draining Nodes” (page 183)
“Configuring the SLURM Epilog Script” (page 184)
“Maintaining the SLURM Daemon Log” (page 185)
“Enabling SLURM to Recognize a New Node” (page 186)
“Removing SLURM” (page 187)
For your convenience, the HP XC Documentation CD contains the SLURM Reference Manual,
which is also available from the following web address:
http://www.llnl.gov/LCdocs/slurm/
IMPORTANT: If SLURM was not configured during the installation of the HP XC System
Software and you want to configure it now, you must rerun the cluster_config utility. For
more information, see the HP XC System Software Installation Guide.
15.1 Overview of SLURM
SLURM provides a simple, lightweight, scalable infrastructure for managing the computing
resources of the HP XC system. SLURM contains a job launcher, srun, that offers much flexibility
in requesting resources and dispatching serial or parallel applications. SLURM also features a
Pluggable Authentication Module that, when enabled, can provide more control over access to
the computing resources.
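For example, the following srun commands are illustrative sketches only (the application name my_app and the node and task counts are arbitrary); the -N and -n options specify the number of nodes and the number of tasks, respectively:

$ srun -N 2 hostname
$ srun -n 8 ./my_app

The first command runs hostname on two nodes; the second dispatches a parallel application as eight tasks.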
SLURM uses two daemons on the HP XC system:
slurmd
This daemon runs on each compute node in the HP XC system and is responsible
for the following:
Starting each job on its node
Monitoring the job's resource use
Enforcing limits (for example, memory size)
Freeing up resources when the job completes
The slurmd daemon runs as root so that it can start and manage user jobs.
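As an illustrative check (the compute node name n15 is hypothetical), you can confirm that the slurmd daemon is running as root on a compute node:

# ssh n15 ps -C slurmd -o user,pid,args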
slurmctld
This SLURM controller daemon runs as the central controller on the node with
the resource manager role. It is responsible for the following:
Monitoring the availability of the compute nodes
Managing node characteristics and node partitions
Managing jobs, that is, queuing, scheduling, and maintaining the state
of jobs
Primary and backup slurmctld daemons run on separate resource manager
nodes.
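For example, the standard SLURM client commands sinfo and squeue query the slurmctld daemon for node availability and for job state, respectively; the following invocations are illustrative:

$ sinfo
$ squeue -l

Monitoring is discussed further in “Monitoring SLURM” (page 182).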
SLURM also enables you to configure a backup slurmctld daemon. If configured, this backup
daemon monitors the state of the primary slurmctld daemon. If the backup daemon detects
that the primary slurmctld daemon has failed, it assumes the responsibilities of the
primary daemon.
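As a sketch only, the primary and backup controllers correspond to the ControlMachine and BackupController parameters in the slurm.conf file used by SLURM releases of this era (the node names shown are hypothetical, and on an HP XC system this file is normally set up through the cluster configuration process rather than edited by hand):

ControlMachine=n16
BackupController=n15

You can use the scontrol ping command to verify that the primary and backup slurmctld daemons are responding.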