SLURM Reference Manual for HP XC System Software

Introduction
SLURM is LC's locally developed C-language Simple Linux Utility for Resource Management. SLURM
is a job- and compute-resource manager that can run reliably and efficiently on Linux (CHAOS) clusters
as large as several thousand nodes. Its features suit it to large-scale, high-performance computing
environments, and its design avoids known weaknesses (such as inflexibility or fault intolerance) in available
commercial resource management products for supercomputers.
This manual summarizes the specific service goals that SLURM was developed to meet, and explains
the roles that it plays (relative to the Livermore Computing Resource Management (LCRM/DPCS) system,
for example) on LC production machines. Key to SLURM's operation are two software daemons: one
(SLURMCTLD) controls the job queue and resource allocations, while the other (SLURMD) shepherds
executing jobs on each compute node. Sections below explain the features and subsystems of each SLURM
daemon. Additional sections tell how the use of "plugin modules" makes SLURM easily adaptable to many
hardware situations, and introduce the five utility programs that give SLURM its direct user interface.
SRUN is the SLURM utility central to launching, assigning resources to, and guiding the execution of
parallel jobs managed by SLURM, both interactively and through batch queues. Hence, the five ways to
use SRUN (its "modes"), SRUN's complex I/O redirection support, and the often-elaborate interaction
among the many SRUN options receive careful attention in several subsections devoted to that tool. SRUN
also interacts with a set of special SLURM environment variables (like those used for job management by
IBM's POE), explained in another subsection. Detailed and customizable monitoring of SRUN-submitted
jobs is provided by SQUEUE, whose options we also compare and illustrate with annotated output cases.
Likewise, to plan SRUN use you can monitor SLURM-managed nodes by executing or customizing a
separate SLURM tool called SINFO, with its own section below. Checkpoint support using SCONTROL
is introduced as well.
SLURM development is part of LC's larger CHAOS open-source operating system project, as explained
in the separate CHAOS Reference Manual. (URL: http://www.llnl.gov/LCdocs/chaos) For a summary of
known, significant differences between LC's Linux machines and those running AIX or Tru64 UNIX, see
the Linux Differences (URL: http://www.llnl.gov/LCdocs/linux) guide. And for general advice on managing
(batch) jobs on LC production machines, consult the examples and comparisons in the basic
EZJOBCONTROL (URL: http://www.llnl.gov/LCdocs/ezjobcontrol) guide.