HP XC System Software Administration Guide Version 3.2

/hptc_cluster/slurm/etc/slurm.conf; it is not obtained directly from the nodes. See the
SLURM documentation for more information on configuring the slurm.conf file.
16.13 LSF-HPC with SLURM Monitoring
LSF-HPC with SLURM is monitored and controlled by Nagios using the check_lsf plug-in.
When LSF-HPC with SLURM is down, the response of the check_lsf plug-in depends on
whether LSF-HPC with SLURM failover is enabled or disabled:
When LSF-HPC with SLURM failover is disabled
    The check_lsf plug-in returns an immediate failure notification to Nagios.
When LSF-HPC with SLURM failover is enabled
    The check_lsf plug-in determines whether LSF-HPC with SLURM should be running.
    If so, it obtains a list of resource management nodes and tries to restart
    LSF-HPC with SLURM on each of those nodes in turn, until a restart succeeds
    or the list is exhausted.
    If a restart succeeds, the check_lsf plug-in returns an LSF OK - restarted message.
    If every restart attempt fails, the check_lsf plug-in returns a failure notification.
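The decision flow described above can be illustrated with a minimal Python sketch. This is not the check_lsf source code. The function name, its parameters, the try_restart callback, and the text of the final failure message are hypothetical, and the choice of "LSF OK - currently shut down" for the case where LSF-HPC with SLURM is not supposed to be running is an assumption based on Table 16-2.

```python
from typing import Callable, Iterable


def check_lsf_sketch(
    lsf_is_up: bool,
    failover_enabled: bool,
    should_be_running: bool,
    candidate_nodes: Iterable[str],
    try_restart: Callable[[str], bool],
) -> str:
    """Return a status message following the flow described in the text.

    All parameters are hypothetical stand-ins for checks that the real
    plug-in performs internally.
    """
    if lsf_is_up:
        return "LSF OK - up"
    if not failover_enabled:
        # Failover disabled: report the failure immediately.
        return "LSF CRITICAL - down"
    if not should_be_running:
        # Assumption: an intentionally stopped LSF is not an error.
        return "LSF OK - currently shut down"
    # Failover enabled: try each resource management node in turn
    # and stop at the first successful restart.
    for node in candidate_nodes:
        if try_restart(node):
            return "LSF OK - restarted"
    # List exhausted; the message text after the dash is hypothetical.
    return "LSF CRITICAL - restart failed on all candidate nodes"
```

In this sketch the restart loop stops at the first node that succeeds, so later nodes in the list are never contacted.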
LSF Execution Host Failure
If the node hosting LSF-HPC with SLURM becomes unresponsive, the Nagios check_lsf plug-in
takes action.
Table 16-2 lists the Nagios messages for LSF failover monitor status:
Table 16-2 Nagios Messages for LSF-HPC with SLURM Failover Monitor Status
Message                        Meaning
LSF OK - up                    The LSF-HPC with SLURM environment appears to be up and
                               operational on the HP XC system.
LSF OK - currently shut down   The LSF-HPC with SLURM environment has not been started
                               on the HP XC system.
LSF CRITICAL - down            LSF-HPC with SLURM is not running, and LSF-HPC with
                               SLURM failover is disabled.
LSF warning - restarted        The LSF-HPC with SLURM environment was not running, and
                               should have been; it is being restarted. The message
                               changes to LSF OK - up the next time Nagios is updated.
LSF CRITICAL - {message}       An abnormal problem occurred. The {message} text
                               provides useful diagnostic information.
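When post-processing these messages (for example, in a site-specific report script), the severity word can be mapped to the conventional Nagios plug-in states (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The following Python sketch shows one way to do that. The function name nagios_state is hypothetical, and it is an assumption that the severity word in each message matches the state that check_lsf actually reports to Nagios.

```python
# Conventional Nagios plug-in return codes.
NAGIOS_OK, NAGIOS_WARNING, NAGIOS_CRITICAL, NAGIOS_UNKNOWN = 0, 1, 2, 3

_SEVERITY = {
    "ok": NAGIOS_OK,
    "warning": NAGIOS_WARNING,
    "critical": NAGIOS_CRITICAL,
}


def nagios_state(message: str) -> int:
    """Map a message such as 'LSF CRITICAL - down' to a Nagios state code."""
    # The part before ' - ' is expected to be 'LSF <severity>'.
    head = message.split(" - ", 1)[0].split()
    if len(head) != 2 or head[0] != "LSF":
        return NAGIOS_UNKNOWN
    # Compare case-insensitively, because the table mixes 'OK' and 'warning'.
    return _SEVERITY.get(head[1].lower(), NAGIOS_UNKNOWN)
```

Messages that do not follow the "LSF severity - text" pattern are treated as UNKNOWN rather than guessed at.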
16.14 LSF-HPC with SLURM Failover
This section discusses aspects of the LSF-HPC with SLURM failover mechanism.
16.14.1 Overview of LSF-HPC with SLURM Monitoring and Failover Support
LSF-HPC with SLURM failover is disabled by default. You can enable or disable LSF-HPC with
SLURM failover at any time with the controllsf command. For more information, see
controllsf(8).