HP XC System Software Administration Guide Version 3.2

MaxSendRetryDelay
    The maximum number of seconds to pause before sending an accounting
    message. The actual delay is a random value between 1 and this value.
    The default value is 5 seconds.
StaggerSlotSize
    Generally, the increment of time a process pauses before sending its
    message. For n tasks, n staggered time slots are defined in increments
    of (StaggerSlotSize * 0.001) seconds. The first task sends its message
    immediately; the second task pauses one increment before sending its
    message; the third task pauses two increments before sending its
    message; and so on. The default value of this parameter is 1.
If you change the value of any of these parameters, assign them as a comma-separated
list on a single line, enclosed in quotation marks, as shown here:
JobAcctParameters="Frequency=10,MaxSendRetries=5,StaggerSlotSize=2"
f. Verify that this portion of the slurm.conf file resembles the following (the changes
are shown in bold):
.
.
.
#
# o Define the job accounting mechanism
#
JobAcctType=jobacct/log
#
# o Define the location where job accounting logs are to
# be written. For
# - jobacct/none - this parameter is ignored
# - jobacct/log - the fully-qualified file name
# for the data file
#
JobAcctLoc=/hptc_cluster/slurm/job/jobacct.log
JobAcctParameters="Frequency=10"
.
.
.
g. Save the file.
5. Restart the slurmctld and slurmd daemons:
# cexec -a "service slurm restart"
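The StaggerSlotSize timing and the quoted, comma-separated JobAcctParameters format described in step 4 can be illustrated with a short Python sketch. It is not part of SLURM or the HP XC software; the function names stagger_delays and format_job_acct_parameters are hypothetical, and the sketch assumes only the rules stated above (task k pauses k increments of StaggerSlotSize * 0.001 seconds).

```python
def stagger_delays(num_tasks, stagger_slot_size=1):
    """Return the pause, in seconds, before each task sends its message.

    Hypothetical helper: follows the rule that the first task sends
    immediately and each later task waits one more increment of
    (StaggerSlotSize * 0.001) seconds.
    """
    # Multiply as integers first, then convert to seconds.
    return [task * stagger_slot_size / 1000 for task in range(num_tasks)]


def format_job_acct_parameters(params):
    """Build the quoted, comma-separated JobAcctParameters line."""
    pairs = ",".join(f"{name}={value}" for name, value in params.items())
    return f'JobAcctParameters="{pairs}"'


if __name__ == "__main__":
    # Four tasks with StaggerSlotSize=2: pauses grow by 0.002 s per task.
    print(stagger_delays(4, stagger_slot_size=2))
    print(format_job_acct_parameters(
        {"Frequency": 10, "MaxSendRetries": 5, "StaggerSlotSize": 2}))
```

A helper like the second one could be used to generate the line before pasting it into slurm.conf; it does not validate parameter names or values.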
15.5 Monitoring SLURM
The SLURM squeue, sinfo, and scontrol commands and the Nagios system monitoring
utility provide the means for monitoring and controlling SLURM on your HP XC system.
For status at a glance, the Nagios system monitor provides a global view of your system and
includes details about the state of SLURM. Chapter 8 (page 105) provides information about
Nagios on the HP XC system.
You can run the scontrol utility to confirm that your control daemons are active. In the following
example, node n5, which runs the primary slurmctld, and node n8, which runs the backup,
are both up.
# scontrol ping
Slurmctld(primary/backup) at n5/n8 are UP/UP
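If you want to check this status from a script, the output line can be parsed. The following Python sketch is illustrative only: the names parse_scontrol_ping and controllers_up are hypothetical, and the pattern assumes the output has exactly the form shown in the example above. On a live system you would capture the line from scontrol ping rather than use the sample string.

```python
import re

# Sample line copied from the example above.
SAMPLE = "Slurmctld(primary/backup) at n5/n8 are UP/UP"

# Assumes the single-line primary/backup format shown in the example.
PING_RE = re.compile(
    r"Slurmctld\(primary/backup\) at ([^/\s]+)/(\S+) are (\w+)/(\w+)")


def parse_scontrol_ping(line):
    """Map each controller role to its (node, state) pair."""
    match = PING_RE.search(line)
    if match is None:
        raise ValueError(f"unrecognized scontrol ping output: {line!r}")
    primary_node, backup_node, primary_state, backup_state = match.groups()
    return {"primary": (primary_node, primary_state),
            "backup": (backup_node, backup_state)}


def controllers_up(status):
    """Return True only if every controller reports UP."""
    return all(state == "UP" for _node, state in status.values())


if __name__ == "__main__":
    status = parse_scontrol_ping(SAMPLE)
    print(status, controllers_up(status))
```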
The sinfo command reports the status of both nodes and partitions. Consider this example:
# sinfo --all
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
lsf up infinite 122 idle n[5-16,18-127]
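The NODELIST column uses a compact bracketed range notation. As a rough illustration of how to read it, the following Python sketch expands the list from the example and compares the result with the NODES column. The function names expand_nodelist and parse_sinfo_row are hypothetical, and the sketch assumes the simple single-bracket form shown here, without zero-padded numbers or multiple bracket groups.

```python
import re

# Header and row copied from the sinfo example above.
HEADER = "PARTITION AVAIL TIMELIMIT NODES STATE NODELIST"
ROW = "lsf up infinite 122 idle n[5-16,18-127]"


def expand_nodelist(nodelist):
    """Expand a list such as n[5-16,18-127] into individual node names."""
    match = re.fullmatch(r"([^\[\]]+)\[([0-9,-]+)\]", nodelist)
    if match is None:
        return [nodelist]  # treated as a single node name, such as n5
    prefix, ranges = match.groups()
    nodes = []
    for part in ranges.split(","):
        first, _, last = part.partition("-")
        last = last or first  # a lone number is a one-node range
        nodes.extend(f"{prefix}{i}" for i in range(int(first), int(last) + 1))
    return nodes


def parse_sinfo_row(header, row):
    """Pair each whitespace-separated column name with its value."""
    return dict(zip(header.split(), row.split()))


if __name__ == "__main__":
    info = parse_sinfo_row(HEADER, ROW)
    nodes = expand_nodelist(info["NODELIST"])
    # The expanded list should agree with the NODES column.
    print(info["PARTITION"], info["STATE"], len(nodes) == int(info["NODES"]))
```

Note that the range skips n17, which is why the two sub-ranges 5-16 and 18-127 add up to the 122 nodes reported.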