HP XC System Software Administration Guide Version 3.2

Table 16-1 LSF-HPC with SLURM Interpretation of SLURM Node States (continued)
Node         Description
In Use       A node in any of the following states:
             ALLOCATED   The node is allocated to a job.
             COMPLETING  The node is allocated to a job that is in the
                         process of completing. The node state is removed
                         when all the job processes have ended and the
                         SLURM epilog program (if any) has ended.
             DRAINING    The node is currently running a job but will not
                         be allocated to additional jobs. The node state
                         changes to state DRAINED when the last job on it
                         completes.
Unavailable  A node that is not available for use; its status is one of
             the following:
             DOWNED      The node is not available for use.
             DRAINED     The node is not available for use per system
                         administrator request.
             UNKNOWN     The SLURM controller has just started and the
                         node state is not yet determined.
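The grouping in this part of Table 16-1 can be expressed as a simple lookup. The following Python sketch is illustrative only: the names SLURM_TO_LSF_CATEGORY and lsf_category are hypothetical and are not part of LSF-HPC or SLURM, and the mapping covers only the states listed on this page, not the states from the earlier part of the table.

```python
# Hypothetical lookup built only from the rows of Table 16-1 shown on this page.
# States from the earlier (not shown) part of the table are deliberately absent.
SLURM_TO_LSF_CATEGORY = {
    "ALLOCATED": "In Use",
    "COMPLETING": "In Use",
    "DRAINING": "In Use",
    "DOWNED": "Unavailable",
    "DRAINED": "Unavailable",
    "UNKNOWN": "Unavailable",
}


def lsf_category(slurm_state):
    """Return the table's category for a SLURM node state.

    Returns None when the state is not listed in this part of the table.
    """
    return SLURM_TO_LSF_CATEGORY.get(slurm_state.strip().upper())


if __name__ == "__main__":
    for state in ("ALLOCATED", "draining", "DRAINED"):
        print(state, "->", lsf_category(state))
```

Note that DRAINING is reported as In Use while DRAINED is reported as Unavailable, which matches the table: a draining node is still running a job.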
16.2.1.4 LSF-HPC with SLURM Failover
The failover of the LSF component of the integrated LSF-HPC with SLURM product is of critical
concern because only one node in the HP XC system runs the LSF-HPC with SLURM daemons.
During installation, you select the primary LSF execution host from the nodes on the HP XC
system that have the resource management role. That node can also be a compute node, but
this configuration is not recommended. Other nodes that also have the resource management
role are designated as potential LSF execution host backups.
To address this concern, LSF-HPC with SLURM is configured on HP XC with a virtual host name
(vhost) and a virtual IP (vIP). The virtual IP and host name are used because they can be moved
from one node to another while maintaining a consistent LSF interface. By default, the virtual
IP is an internal IP on the HP XC administration network, and the virtual host name is
lsfhost.localdomain. The LSF execution host is first configured to host the vIP, and then the
LSF-HPC with SLURM daemons are started on that node.
The Nagios infrastructure contains a module that monitors the LSF-HPC with SLURM virtual
IP. If it detects a problem with the virtual IP (for example, the inability to ping it), the monitoring
code assumes the node is down and chooses a new LSF execution host from the backup candidate
nodes on which to set up the virtual IP and restart LSF-HPC with SLURM.
See “LSF-HPC with SLURM Failover” (page 203) for more information.
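The monitoring decision described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Nagios module itself: the function name check_and_failover and its parameters are hypothetical, the reachability checks are passed in as callables rather than performed with a real ping, and choosing the first usable backup candidate is an assumption, because the selection policy is not described in this section.

```python
from typing import Callable, Optional, Sequence


def check_and_failover(
    current_host: str,
    backup_candidates: Sequence[str],
    vip_reachable: Callable[[], bool],
    node_usable: Callable[[str], bool],
) -> Optional[str]:
    """Return the node that should host the virtual IP after one monitoring pass.

    vip_reachable stands in for the monitor's check of the virtual IP
    (for example, a ping). node_usable stands in for whatever test decides
    that a backup candidate can take over. Both are hypothetical hooks.
    """
    if vip_reachable():
        # No problem detected: the current LSF execution host keeps the vIP.
        return current_host

    # The vIP did not respond, so the current host is assumed to be down.
    # Assumption: take the first usable backup candidate, in list order.
    for node in backup_candidates:
        if node != current_host and node_usable(node):
            return node

    # No candidate is available; the caller must report the failure.
    return None


if __name__ == "__main__":
    new_host = check_and_failover(
        "n1", ["n2", "n3"],
        vip_reachable=lambda: False,
        node_usable=lambda n: n == "n3",
    )
    print(new_host)
```

In a real deployment, the caller would then set up the virtual IP on the returned node and restart the LSF-HPC with SLURM daemons there; those steps are outside this sketch.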
16.3 Switching the Type of LSF Installed
The HP XC system installation process offers a choice of two different types of LSF. The default
choice is LSF-HPC with SLURM. This choice requires that SLURM is installed and configured
when you run the cluster_config utility. Standard LSF-HPC is the second type of LSF that
is available to install, and it does not interact with SLURM.
If you made the wrong LSF selection while running the cluster_config utility, perform the
following procedure to remove the current type of LSF installed and install the other type of
LSF:
1. Log in as superuser (root) on the head node.