
15 Managing SLURM
The HP XC system uses the Simple Linux Utility for Resource Management (SLURM). This
chapter addresses the following topics:
“Overview of SLURM” (page 169)
“Configuring SLURM” (page 170)
“Restricting User Access to Nodes” (page 178)
“Job Accounting” (page 178)
“Monitoring SLURM” (page 182)
“Draining Nodes” (page 183)
“Configuring the SLURM Epilog Script” (page 184)
“Maintaining the SLURM Daemon Log” (page 185)
“Enabling SLURM to Recognize a New Node” (page 186)
“Removing SLURM” (page 187)
For your convenience, the HP XC Documentation CD contains the SLURM Reference Manual,
which is also available from the following web address:
http://www.llnl.gov/LCdocs/slurm/
IMPORTANT: If SLURM was not configured during the installation of the HP XC System
Software and you want to configure it now, you must rerun the cluster_config utility. For
more information, see the HP XC System Software Installation Guide.
15.1 Overview of SLURM
SLURM provides a simple, lightweight, scalable infrastructure for managing the computing
resources of the HP XC system. SLURM contains a job launcher, srun, that offers much flexibility
in requesting resources and dispatching serial or parallel applications. SLURM also features a
Pluggable Authentication Module that, when enabled, can provide more control over access to
the computing resources.
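For example, the following srun commands are illustrative sketches only (the application name my_app and the node and task counts are arbitrary); the -N and -n options specify the number of nodes and the number of tasks, respectively:

$ srun -N 2 hostname
$ srun -n 8 ./my_app

The first command runs hostname on two nodes; the second dispatches a parallel application as eight tasks.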
SLURM uses two daemons on the HP XC system:
slurmd
This daemon runs on each compute node in the HP XC system and is responsible
for the following:
Starting each job on its node
Monitoring the job's resource use
Enforcing limits (for example, memory size)
Freeing up resources when the job completes
The slurmd daemon runs as root so that it can start and manage user jobs.
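As an illustrative check (the compute node name n15 is hypothetical), you can confirm that the slurmd daemon is running as root on a compute node:

# ssh n15 ps -C slurmd -o user,pid,args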
slurmctld
This SLURM controller daemon runs as the central controller on the node with
the resource manager role. It is responsible for the following:
Monitoring the availability of the compute nodes
Managing node characteristics and node partitions
Managing jobs, that is, queuing, scheduling, and maintaining the state
of jobs
Primary and backup slurmctld daemons run on separate resource manager
nodes.
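For example, the standard SLURM client commands sinfo and squeue query the slurmctld daemon for node availability and for job state, respectively; the following invocations are illustrative:

$ sinfo
$ squeue -l

Monitoring is discussed further in “Monitoring SLURM” (page 182).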
SLURM also enables you to configure a backup slurmctld daemon. If configured, this backup
daemon monitors the state of the primary slurmctld daemon. If the backup daemon detects
that the primary slurmctld daemon has failed, it assumes the responsibilities of the
primary daemon.
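As a sketch only, the primary and backup controllers correspond to the ControlMachine and BackupController parameters in the slurm.conf file used by SLURM releases of this era (the node names shown are hypothetical, and on an HP XC system this file is normally set up through the cluster configuration process rather than edited by hand):

ControlMachine=n16
BackupController=n15

You can use the scontrol ping command to verify that the primary and backup slurmctld daemons are responding.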