HP XC System Software Administration Guide Version 3.2

MaxSendRetryDelay    The maximum number of seconds to pause before sending an accounting message. The actual delay is a random value between 1 and this value. The default value is 5 seconds.

StaggerSlotSize    Generally, the increment of time a process pauses before sending its message. For n tasks, n staggered time slots are defined in increments of (StaggerSlotSize * 0.001) seconds. The first task sends its message immediately; the second task pauses one increment before sending its message; the third task pauses two increments before sending its message; and so on. The default value of this parameter is 1.
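The stagger arithmetic described for StaggerSlotSize can be sketched as follows. This is only an illustration of the formula in the text, not SLURM's own code; the function name stagger_delay_ms is hypothetical. Because one increment is (StaggerSlotSize * 0.001) seconds, the sketch works in whole milliseconds.

```shell
#!/bin/sh
# Sketch only: stagger_delay_ms is a hypothetical helper, not part of SLURM.
# Prints the pause, in milliseconds, before task number $1 (counted from 1)
# sends its accounting message, for a StaggerSlotSize of $2 (default 1).
# One increment is StaggerSlotSize * 0.001 seconds = StaggerSlotSize ms.
stagger_delay_ms() {
    task=$1
    slot=${2:-1}
    # The first task waits 0 increments, the second waits 1, and so on.
    echo $(( (task - 1) * slot ))
}

# Show the pause for the first four tasks with StaggerSlotSize=2.
for t in 1 2 3 4; do
    echo "task $t: $(stagger_delay_ms "$t" 2) ms"
done
```

With StaggerSlotSize=2, the pauses grow by 2 milliseconds per task, starting from zero for the first task.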

If you change the values of any of these parameters, assign them in a comma-separated horizontal list in quotation marks, as shown here:

JobAcctParameters="Frequency=10,MaxSendRetries=5,StaggerSlotSize=2"
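If the parameter values are kept in a script, the quoted, comma-separated form can be assembled mechanically. The following is a minimal sketch; the helper name build_jobacct_params is hypothetical, and it only formats the line without checking that the parameter names are valid.

```shell
#!/bin/sh
# Sketch only: build_jobacct_params is a hypothetical helper.
# Joins its name=value arguments with commas and wraps the result in
# quotation marks, producing a line suitable for slurm.conf.
build_jobacct_params() {
    old_ifs=$IFS
    IFS=,
    joined="$*"        # "$*" joins arguments with the first character of IFS
    IFS=$old_ifs
    printf 'JobAcctParameters="%s"\n' "$joined"
}

build_jobacct_params Frequency=10 MaxSendRetries=5 StaggerSlotSize=2
```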

f. Verify that this portion of the slurm.conf file resembles the following (the changes are shown in bold):

# o Define the job accounting mechanism
JobAcctType=jobacct/log

# o Define the location where job accounting logs are to
# be written. For
# - jobacct/none - this parameter is ignored
# - jobacct/log - the fully-qualified file name
# for the data file
JobAcctLoc=/hptc_cluster/slurm/job/jobacct.log
JobAcctParameters="Frequency=10"

g. Save the file.

5. Restart the slurmctld and slurmd daemons:

# cexec -a "service slurm restart"
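To double-check the settings verified in step f, the active (uncommented) job accounting lines can be listed from the configuration file. This is a minimal sketch; the function name show_jobacct_settings is hypothetical, and the path to slurm.conf is passed as an argument because its location is not stated in this section.

```shell
#!/bin/sh
# Sketch only: show_jobacct_settings is a hypothetical helper.
# Prints the uncommented JobAcctType, JobAcctLoc, and JobAcctParameters
# lines from the slurm.conf file given as $1. Comment lines start with
# "#" and therefore do not match the pattern.
show_jobacct_settings() {
    grep -E '^[[:space:]]*JobAcct(Type|Loc|Parameters)=' "$1"
}
```

For example, running show_jobacct_settings with the path to your slurm.conf file should print the three JobAcct lines shown in step f and nothing else.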

15.5 Monitoring SLURM

The SLURM squeue, sinfo, and scontrol commands and the Nagios system monitoring utility provide the means for monitoring and controlling SLURM on your HP XC system.

For status at a glance, the Nagios system monitor provides a global view of your system and includes details about the state of SLURM. Chapter 8 (page 105) provides information about Nagios on the HP XC system.

You can run the scontrol utility to confirm that your control daemons are active. In the following example, node n5, which runs the primary slurmctld, and node n8, which runs the backup, are both up.

# scontrol ping

Slurmctld(primary/backup) at n5/n8 are UP/UP
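For scripted checks, the status line can be tested rather than read by eye. The following is a minimal sketch that assumes the one-line output format shown in the example; the function name slurmctld_all_up is hypothetical, and any state other than UP is treated as a failure.

```shell
#!/bin/sh
# Sketch only: slurmctld_all_up is a hypothetical helper.
# Takes one "scontrol ping" output line as $1 and succeeds only if every
# "/"-separated state in the last field is UP.
slurmctld_all_up() {
    states=${1##* }            # last space-separated field, e.g. "UP/UP"
    old_ifs=$IFS
    IFS=/
    set -- $states             # split the states on "/"
    IFS=$old_ifs
    [ "$#" -gt 0 ] || return 1 # no states found: treat as failure
    for s in "$@"; do
        [ "$s" = "UP" ] || return 1
    done
    return 0
}
```

A typical use would be: slurmctld_all_up "$(scontrol ping)" || echo "a slurmctld daemon is not up"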

The sinfo command reports the status of both nodes and partitions. Consider this example:

# sinfo --all
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
lsf       up    infinite    122 idle  n[5-16,18-127]
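On a system with several partitions or node states, the NODES column can be totaled per state. This is a minimal sketch that assumes the six-column layout shown in the example, with the header on the first line; the function name nodes_by_state is hypothetical.

```shell
#!/bin/sh
# Sketch only: nodes_by_state is a hypothetical helper.
# Reads sinfo-style output on standard input, skips the header line, and
# prints each STATE (field 5) with the sum of its NODES values (field 4).
# The order of the output lines is not specified by awk.
nodes_by_state() {
    awk 'NR > 1 && NF >= 6 { total[$5] += $4 }
         END { for (s in total) print s, total[s] }'
}
```

A typical use would be: sinfo --all | nodes_by_state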
