SLURM Reference Manual for HP XC System Software


SLURM Roles

SLURM fills a crucial but mostly hidden role in running large parallel programs on large clusters.

Most users who run batch jobs at LC use job-control utilities (such as PSUB or PALTER) that talk to the Livermore Computing Resource Management system (LCRM, formerly called DPCS), LC's locally designed metabatch system. LCRM:

• Provides a common user interface for batch-job submittal across all LC machines and clusters.

• Monitors resource use across machines and clusters.

• Implements a bank-based fair-share scheduling policy, again across all LC production machines.

To carry out its scheduling decisions, LCRM relies on the native resource manager on each machine or cluster where it assigns batch jobs to run. The basic duties of such a native resource manager are to:

• Get and share information on resource (chiefly node) availability.

• Allocate compute resources (chiefly nodes or processors).

• Shepherd jobs as their tasks execute.
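The three duties above can be pictured with a toy model. The sketch below is illustrative only: the class ToyResourceManager, its methods, and the node names are hypothetical and are not part of SLURM, RMS, TBS, or LCRM. It assumes nodes are interchangeable and reduces shepherding to its final step, reclaiming nodes when a job ends.

```python
from dataclasses import dataclass, field


@dataclass
class ToyResourceManager:
    """Hypothetical stand-in for a native resource manager (not a real API)."""
    free_nodes: set
    running: dict = field(default_factory=dict)  # job id -> set of node names

    def available(self):
        # Duty 1: report which nodes are currently free.
        return sorted(self.free_nodes)

    def allocate(self, job_id, node_count):
        # Duty 2: grant nodes to a job, or refuse (None) if too few are free.
        if job_id in self.running:
            raise ValueError(f"{job_id} already holds an allocation")
        if node_count > len(self.free_nodes):
            return None
        granted = set(sorted(self.free_nodes)[:node_count])
        self.free_nodes -= granted
        self.running[job_id] = granted
        return sorted(granted)

    def finish(self, job_id):
        # Duty 3 (simplified): when a job's tasks end, reclaim its nodes.
        self.free_nodes |= self.running.pop(job_id)


rm = ToyResourceManager(free_nodes={"n1", "n2", "n3", "n4"})
first = rm.allocate("job-a", 3)    # three of the four nodes are granted
second = rm.allocate("job-b", 2)   # only one node is left, so this is refused
rm.finish("job-a")                 # job-a ends and its nodes are returned
third = rm.allocate("job-b", 2)    # the retry now succeeds
```

In this model a metabatch layer such as LCRM would be the caller: it decides which job should run next, and the resource manager only answers whether the nodes can be granted.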

On IBM AIX machines, LoadLeveler traditionally served as the native resource manager. On LC's non-AIX machines, LCRM has relied on one of three other native resource managers to provide low-level job control:

• RMS (Resource Management System), used on "capability" clusters (devoted to one or two users at a time).

• TBS (Trivial Batch System, an LC-developed replacement for the formerly widespread Network Queueing System or NQS).

• SLURM (introduced here for managing Linux clusters and still evolving to meet specific LC needs).

The key differences among these alternatives appear in this table:

                  SLURM                     TBS                       RMS

Proprietary?      No, open source           No, open source           Yes

Used on:          Interconnect independent  Interconnect independent  Machines with QsNet interconnect

Suited for:       Either with CHAOS         Capacity clusters         Capability clusters

Node allocation:  Either possible           Multiple jobs per node    Whole nodes allocated to jobs
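The node-allocation row is the difference most visible to users, and a small sketch may make it concrete. The function place_jobs below is hypothetical and is not taken from any of the three resource managers. It assumes every job fits on a single node and uses a simple first-fit rule, which real schedulers need not follow.

```python
def place_jobs(jobs, node_count, cpus_per_node, share_nodes):
    """Place jobs (name -> CPUs needed) on nodes; return name -> node index.

    share_nodes=False mimics whole-node allocation: one job per node.
    share_nodes=True mimics multiple jobs per node: jobs pack onto spare CPUs.
    Jobs that do not fit anywhere are left out of the result.
    """
    free = [cpus_per_node] * node_count  # spare CPUs on each node
    placement = {}
    for name, cpus in jobs.items():
        for node, spare in enumerate(free):
            node_is_empty = spare == cpus_per_node
            if spare >= cpus and (share_nodes or node_is_empty):
                # A whole-node allocation consumes the entire node.
                free[node] = spare - cpus if share_nodes else 0
                placement[name] = node
                break
    return placement


jobs = {"a": 1, "b": 1, "c": 2}
whole = place_jobs(jobs, node_count=2, cpus_per_node=2, share_nodes=False)
shared = place_jobs(jobs, node_count=2, cpus_per_node=2, share_nodes=True)
```

With two 2-CPU nodes, whole-node allocation places only jobs a and b (one per node) and leaves c waiting, while shared allocation packs a and b onto node 0 and runs c on node 1.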
