
Note:
At least two nodes must have the resource management role to enable LSF-HPC with SLURM
failover. One is selected as the master (primary LSF execution host), and the others are considered
backup nodes. At any time, the LSF-HPC with SLURM daemons start and run only on the master
node.
The Nagios LSF failover module monitors the virtual IP address associated with the primary LSF
execution host. When LSF-HPC with SLURM failover is enabled on the HP XC system and the
primary LSF execution host fails, the Nagios LSF failover module detects that the node is
unresponsive and initiates failover:
1. The Nagios module attempts to contact the node hosting the virtual IP to ensure that
   LSF-HPC with SLURM is shut down and that virtual IP hosting is disabled.
2. A new primary LSF execution host is selected from the backup nodes.
3. The Nagios module re-establishes the virtual IP on the new node.
4. The LSF-HPC with SLURM daemons are restarted on the new primary LSF execution host.
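After a failover completes, you can confirm which node hosts the LSF execution environment.
The lsid and bhosts commands shown below are standard LSF queries; the last line assumes the
controllsf utility used on HP XC systems to manage LSF failover, and its exact subcommand
syntax should be verified against your installation before use.

    # lsid               Reports the LSF cluster name and the current master host
                         (the primary LSF execution host).
    # bhosts             Shows whether the HP XC virtual host is open to jobs.
    # controllsf show    Assumed syntax: reports the current failover setting and the
                         node hosting the LSF daemons.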
LSF-HPC with SLURM monitoring and failover are implemented on the HP XC system as tools
that prepare the environment for the LSF execution host daemons on a given node, start the
daemons, then watch the node to ensure that it remains active.
After a standard installation, the HP XC system is initially configured as follows:
• LSF-HPC with SLURM is started on the head node.
• LSF-HPC with SLURM failover is disabled.
• The Nagios application reports whether LSF-HPC with SLURM is up, down, or "currently
  shut down," but takes no action in any case.
The only direct interaction between LSF-HPC with SLURM and the LSF monitoring and failover
tools occurs at LSF-HPC with SLURM startup, when the daemons are started in the virtual
environment, and at failover, when the existing daemons are shut down cleanly before the virtual
environment is moved to a new host.
16.14.2 Interplay of LSF-HPC with SLURM
The LSF-HPC with SLURM product and SLURM are managed independently; one is not critically
affected if the other goes down.
SLURM has no dependency on LSF-HPC with SLURM.
The LSF-HPC with SLURM product needs SLURM to schedule jobs. If SLURM becomes
unresponsive, LSF-HPC with SLURM drops its processor count to 1 and closes the HP XC virtual
host. When SLURM is available again, LSF-HPC with SLURM adjusts its processor count
accordingly and reopens the host.
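Because the two subsystems are managed independently, you can check each one on its own.
The following are standard SLURM and LSF commands and require no HP XC-specific syntax.

    # scontrol ping      Reports whether the SLURM control daemon is responding.
    # sinfo              Lists SLURM partitions and node states.
    # bhosts             Shows whether the HP XC virtual host is open; a closed host
                         can indicate that SLURM is unresponsive.
    # lshosts            Shows the processor count (ncpus) reported for the virtual host.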
16.14.3 Assigning the Resource Management Nodes
You assign nodes to both SLURM and the LSF-HPC with SLURM product by assigning them
the resource management role; this role includes both the lsf and slurm_controller services.
By default, the HP XC resource management system attempts to place the SLURM controller
and the LSF execution host on the same node to minimize the use of system resources. If only
one node has the resource management role, the LSF-HPC with SLURM execution daemons and
the SLURM control daemon both run on that node.
If two nodes are assigned the resource management role, by default, the first node becomes the
primary resource management node, and the second node is the backup resource management
node.
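To verify which node actually provides each service after configuration, query the subsystems
directly. ControlMachine is the standard SLURM configuration parameter that names the node
running the SLURM control daemon, and lsid is a standard LSF command.

    # scontrol show config | grep ControlMachine
                         Reports the node running the SLURM control daemon.
    # lsid               Reports the current primary LSF execution host (the LSF master).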