SLURM Reference Manual for HP XC System Software

SLURM Goals and Roles
SLURM Goals
SLURM was developed specifically to meet locally important criteria for a helpful, efficient way to
manage compute resources on large (Linux/CHAOS) clusters. The primary threefold purpose of a cluster
resource manager (such as LoadLeveler on LC's IBM ASC machines or the Resource Management System
(RMS) from Quadrics) is to:
Allocate nodes--
give users access (perhaps even exclusive access) to compute nodes for some specified time range
so their job(s) can run.
Control job execution--
provide the underlying mechanisms to start, run, cancel, and monitor the state of parallel (or serial)
jobs on the nodes allocated.
Manage contention--
reconcile competing requests for limited resources, usually by managing a queue of pending jobs.
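
The three roles above can be illustrated with a small toy model. This is only a sketch for orientation, not SLURM's actual design or interface: the class name ToyResourceManager, its methods, and the strict first-in-first-out policy are all assumptions made for this example.

```python
from collections import deque


class ToyResourceManager:
    """Toy model of the three roles; hypothetical, not SLURM's real design."""

    def __init__(self, node_names):
        self.free_nodes = set(node_names)  # nodes not allocated to any job
        self.running = {}                  # job id -> set of allocated nodes
        self.pending = deque()             # FIFO queue of (job id, node count)

    def submit(self, job_id, node_count):
        """Manage contention: queue the request, then start whatever fits."""
        self.pending.append((job_id, node_count))
        self._schedule()

    def _schedule(self):
        # Strict FIFO (an assumption): stop at the first job that does not fit.
        while self.pending and self.pending[0][1] <= len(self.free_nodes):
            job_id, node_count = self.pending.popleft()
            # Allocate nodes: pick the first node_count free nodes by name.
            nodes = set(sorted(self.free_nodes)[:node_count])
            self.free_nodes -= nodes
            # Control job execution: record the job as running on those nodes.
            self.running[job_id] = nodes

    def finish(self, job_id):
        """Complete or cancel a running job and release its nodes."""
        self.free_nodes |= self.running.pop(job_id)
        self._schedule()

    def state(self, job_id):
        """Monitor a job: 'running', 'pending', or 'unknown'."""
        if job_id in self.running:
            return "running"
        if any(pending_id == job_id for pending_id, _ in self.pending):
            return "pending"
        return "unknown"
```

For example, on a four-node toy cluster, a three-node job starts at once, a following two-node job waits in the queue, and the waiting job starts as soon as the first job finishes and releases its nodes. A real resource manager must also handle time limits, node failures, and requests larger than the whole cluster, all of which this sketch ignores.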
At LC, an adequate cluster resource manager needs to meet two general requirements:
Scalable--
It must operate well on clusters with as many as several thousand nodes, including cases where the
nodes are heterogeneous (with different hardware or configuration features).
Portable--
It must ultimately support jobs on clusters that have different operating systems or versions, different
architectures, different vendors, and different interconnect networks. Linux/CHAOS is, however,
the intended first home for this software.
Any LC resource manager must also meet two additional, locally important, requirements:
Compatible with LCRM (DPCS)--
Since a resource manager is not a complex scheduler nor a complete batch system with across-cluster
accounting and reporting features, it must support and work well within such a larger, more
comprehensive job-control framework. At LC, the Livermore Computing Resource Management
system (formerly called DPCS (URL: http://www.llnl.gov/LCdocs/dpcs)) provides that framework
(see also the next section (page 8)).
Compatible with QsNet--
Since LC's Linux Project has already refined QsNet as its preferred high-speed interconnect for
Linux/CHAOS clusters, an adequate resource manager must also allocate Quadrics QsNet resources
along with compute nodes. But conversely, interconnect independence and the ability to easily support
other brands of interconnect (such as Myrinet) are important too. Such independence allows great
flexibility in pursuing new hardware configurations in future clusters.
Finally, to fit well into LC's emerging CHAOS environment, a resource manager should ideally have
these three very beneficial extra properties as well: