SLURM Reference Manual for HP XC System Software

SLURM Roles
SLURM fills a crucial but mostly hidden role in running large parallel programs on large clusters.
Most users who run batch jobs at LC use job-control utilities (such as PSUB or PALTER) that talk to
the Livermore Computing Resource Management system (LCRM, formerly called DPCS), LC's locally
designed metabatch system. LCRM:
Provides a common user interface for batch-job submittal across all LC machines and clusters.
Monitors resource use across machines and clusters.
Implements a bank-based fair-share scheduling policy, again across all LC production machines.
To carry out its scheduling decisions, LCRM relies on the native resource manager on each machine
or cluster where it assigns batch jobs to run. The basic duties of such a native resource manager are to:
Get and share information on resource (chiefly node) availability.
Allocate compute resources (chiefly nodes or processors).
Shepherd jobs as their tasks execute.
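The three duties listed above can be pictured with a small sketch. This is only an illustration under simplifying assumptions (whole-node allocation, no queueing, no real task launch); the class `ToyResourceManager` and its method names are hypothetical and do not correspond to any SLURM, RMS, TBS, or LoadLeveler interface.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyResourceManager:
    """Hypothetical native resource manager that tracks whole nodes only."""

    free_nodes: set[str]
    allocations: dict[str, set[str]] = field(default_factory=dict)

    def available(self) -> list[str]:
        # Duty 1: report which nodes are currently available.
        return sorted(self.free_nodes)

    def allocate(self, job_id: str, count: int) -> list[str]:
        # Duty 2: grant nodes to a job, or refuse if too few are free.
        if count > len(self.free_nodes):
            raise RuntimeError(f"only {len(self.free_nodes)} nodes free")
        granted = sorted(self.free_nodes)[:count]
        self.free_nodes -= set(granted)
        self.allocations[job_id] = set(granted)
        return granted

    def finish(self, job_id: str) -> None:
        # Duty 3, reduced to cleanup: once a job's tasks end, reclaim its nodes.
        self.free_nodes |= self.allocations.pop(job_id)


if __name__ == "__main__":
    rm = ToyResourceManager(free_nodes={"n1", "n2", "n3"})
    print(rm.allocate("job1", 2))
    rm.finish("job1")
    print(rm.available())
```

In this picture, a metabatch layer such as LCRM would decide which job runs next and then call something like `allocate` on the chosen machine's resource manager; the scheduling policy itself stays outside the sketch.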
On IBM AIX machines, LoadLeveler traditionally served as the native resource manager. On LC's
non-AIX machines, LCRM has relied on one of three other native resource managers to provide low-level
job control:
RMS (Resource Management System), used on "capability" clusters (devoted to one or two users at a time).
TBS (Trivial Batch System, an LC-developed replacement for the formerly widespread Network Queueing System or NQS).
SLURM (introduced here for managing Linux clusters and still evolving to meet specific LC needs).
The key differences among these alternatives appear in this table:
                  RMS                               TBS                       SLURM
Proprietary?      Yes                               No, open source           No, open source
Used on:          Machines with QsNet interconnect  Interconnect independent  Interconnect independent
Suited for:       Capability clusters               Capacity clusters         Either with CHAOS
Node allocation:  Whole nodes allocated to jobs     Multiple jobs per node    Either possible