SLURM Reference Manual for HP XC System Software


SLURM Goals and Roles

SLURM Goals

SLURM was developed specifically to meet locally important criteria for a helpful, efficient way to manage compute resources on large (Linux/CHAOS) clusters. The primary threefold purpose of a cluster resource manager (such as LoadLeveler on LC's IBM ASC machines or the Resource Management System (RMS) from Quadrics) is to:

• Allocate nodes-- give users access (perhaps even exclusive access) to compute nodes for some specified time range so their job(s) can run.

• Control job execution-- provide the underlying mechanisms to start, run, cancel, and monitor the state of parallel (or serial) jobs on the nodes allocated.

• Manage contention-- reconcile competing requests for limited resources, usually by managing a queue of pending jobs.
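
As a rough illustration of these three roles, the following Python sketch models a toy resource manager. It is not SLURM's implementation or API: the class ToyResourceManager, its method names, and the strict first-in, first-out policy are all hypothetical, chosen only to make the roles concrete.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Job:
    job_id: int
    nodes_needed: int
    state: str = "PENDING"  # PENDING -> RUNNING -> DONE or CANCELLED
    nodes: list = field(default_factory=list)


class ToyResourceManager:
    """Hypothetical illustration only; not SLURM's design or API."""

    def __init__(self, node_names):
        self.free = list(node_names)  # idle compute nodes
        self.pending = deque()        # queue of jobs waiting for nodes
        self.jobs = {}                # job_id -> Job, for state monitoring
        self._next_id = 1

    def submit(self, nodes_needed):
        """Queue a request for nodes and return its job id."""
        job = Job(self._next_id, nodes_needed)
        self._next_id += 1
        self.jobs[job.job_id] = job
        self.pending.append(job)
        self._schedule()
        return job.job_id

    def _schedule(self):
        # Manage contention: strict FIFO, so the job at the head of the
        # queue blocks every job behind it until enough nodes are free.
        while self.pending and self.pending[0].nodes_needed <= len(self.free):
            job = self.pending.popleft()
            # Allocate nodes: hand idle nodes to the job.
            job.nodes = [self.free.pop(0) for _ in range(job.nodes_needed)]
            # Control job execution: mark the job as started.
            job.state = "RUNNING"

    def finish(self, job_id, state="DONE"):
        """End a job, release its nodes, and try to start waiting jobs."""
        job = self.jobs[job_id]
        if job.state == "RUNNING":
            self.free.extend(job.nodes)
            job.nodes = []
        elif job.state == "PENDING":
            self.pending.remove(job)
        else:
            return  # already finished or cancelled; nothing to do
        job.state = state
        self._schedule()

    def cancel(self, job_id):
        """Cancel a pending or running job."""
        self.finish(job_id, state="CANCELLED")

    def state(self, job_id):
        """Monitor a job: report its current state."""
        return self.jobs[job_id].state
```

In this sketch, a request for more nodes than the cluster has would wait forever at the head of the queue; a real resource manager would reject or otherwise handle such a request.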

At LC, an adequate cluster resource manager needs to meet two general requirements:

• Scalable-- It must operate well on clusters with as many as several thousand nodes, including cases where the nodes are heterogeneous (with different hardware or configuration features).

• Portable-- It must ultimately support jobs on clusters that have different operating systems or versions, different architectures, different vendors, and different interconnect networks. Linux/CHAOS is, however, the intended first home for this software.

Any LC resource manager must also meet two additional, locally important, requirements:

• Compatible with LCRM (DPCS)-- Since a resource manager is neither a complex scheduler nor a complete batch system with across-cluster accounting and reporting features, it must support and work well within such a larger, more comprehensive job-control framework. At LC, the Livermore Computing Resource Management system (formerly called DPCS (URL: http://www.llnl.gov/LCdocs/dpcs)) provides that framework (see also the next section (page 8)).

• Compatible with QsNet-- Since LC's Linux Project has already refined QsNet as its preferred high-speed interconnect for Linux/CHAOS clusters, an adequate resource manager must also allocate Quadrics QsNet resources along with compute nodes. Conversely, interconnect independence and the ability to easily support other brands of interconnect (such as Myrinet) are important too. Such independence allows great flexibility in pursuing new hardware configurations in future clusters.

Finally, to fit well into LC's emerging CHAOS environment, a resource manager should ideally have these three very beneficial extra properties as well:
