SLURM Reference Manual for HP XC System Software
Table Of Contents
- Preface
- Introduction
- SLURM Goals and Roles
- SLURM Features
- SLURM Operation
- SLURM Utilities
- SRUN (Submit Jobs)
- SQUEUE (List Jobs)
- SINFO (List Nodes)
- SMAP (Show Job Geometry)
- SCONTROL (Manage Configurations)
- Disclaimer
- Keyword Index
- Alphabetical List of Keywords
- Date and Revisions
SLURM Goals and Roles
SLURM Goals
SLURM was developed specifically to meet locally important criteria for a helpful, efficient way to
manage compute resources on large (Linux/CHAOS) clusters. The primary threefold purpose of a cluster
resource manager (such as LoadLeveler on LC's IBM ASC machines or the Resource Management System
(RMS) from Quadrics) is to:
• Allocate nodes--
  give users access (perhaps even exclusive access) to compute nodes for some specified time range
  so their job(s) can run.
• Control job execution--
  provide the underlying mechanisms to start, run, cancel, and monitor the state of parallel (or serial)
  jobs on the nodes allocated.
• Manage contention--
  reconcile competing requests for limited resources, usually by managing a queue of pending jobs.
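The three purposes above can be pictured with a toy model. The following Python sketch is purely illustrative and is not SLURM's design or interface: the names Job, ToyResourceManager, submit, and finish are hypothetical, and the strict first-come, first-served queue is an assumption chosen for brevity.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    nodes_needed: int
    state: str = "PENDING"          # PENDING -> RUNNING -> DONE or CANCELLED
    nodes: list = field(default_factory=list)


class ToyResourceManager:
    """Hypothetical illustration only; not SLURM's actual design or API."""

    def __init__(self, node_names):
        self.free = list(node_names)    # idle nodes
        self.pending = deque()          # FIFO queue of waiting jobs
        self.jobs = {}

    def submit(self, name, nodes_needed):
        job = Job(name, nodes_needed)
        self.jobs[name] = job
        self.pending.append(job)
        self._schedule()
        return job

    def _schedule(self):
        # Manage contention: start queued jobs strictly in arrival order.
        while self.pending and self.pending[0].nodes_needed <= len(self.free):
            job = self.pending.popleft()
            # Allocate nodes: give the job exclusive use of some free nodes.
            job.nodes = [self.free.pop(0) for _ in range(job.nodes_needed)]
            job.state = "RUNNING"

    def finish(self, name, state="DONE"):
        # Control job execution: end (or cancel) a job and release its nodes.
        job = self.jobs[name]
        if job.state == "PENDING":
            self.pending.remove(job)
        self.free.extend(job.nodes)
        job.nodes = []
        job.state = state
        self._schedule()


rm = ToyResourceManager(["n0", "n1", "n2", "n3"])
a = rm.submit("a", 3)
b = rm.submit("b", 2)      # only one node is free, so "b" waits
print(a.state, b.state)    # RUNNING PENDING
rm.finish("a")
print(a.state, b.state, b.nodes)   # DONE RUNNING ['n3', 'n0']
```

In this sketch a job that cannot be satisfied simply waits at the head of the queue until enough nodes are released; a real resource manager would add time limits, priorities, and node health monitoring.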
At LC, an adequate cluster resource manager needs to meet two general requirements:
• Scalable--
  It must operate well on clusters with as many as several thousand nodes, including cases where the
  nodes are heterogeneous (with different hardware or configuration features).
• Portable--
  It must ultimately support jobs on clusters that have different operating systems or versions, different
  architectures, different vendors, and different interconnect networks. Linux/CHAOS is, however,
  the intended first home for this software.
Any LC resource manager must also meet two additional, locally important, requirements:
• Compatible with LCRM (DPCS)--
  Since a resource manager is neither a complex scheduler nor a complete batch system with across-cluster
  accounting and reporting features, it must support and work well within such a larger, more
  comprehensive job-control framework. At LC, the Livermore Computing Resource Management
  system (formerly called DPCS (URL: http://www.llnl.gov/LCdocs/dpcs)) provides that framework
  (see also the next section (page 8)).
• Compatible with QsNet--
  Since LC's Linux Project has already refined QsNet as its preferred high-speed interconnect for
  Linux/CHAOS clusters, an adequate resource manager must also allocate Quadrics QsNet resources
  along with compute nodes. Conversely, interconnect independence and the ability to easily support
  other brands of interconnect (such as Myrinet) are important too. Such independence allows great
  flexibility in pursuing new hardware configurations in future clusters.
Finally, to fit well into LC's emerging CHAOS environment, a resource manager should ideally have
these three very beneficial extra properties as well: