
Note:
At least two nodes must have the resource management role to enable LSF-HPC with SLURM
failover. One is selected as the master (primary LSF execution host), and the others are considered
backup nodes. At any time, the LSF-HPC with SLURM daemons start and run only on the master
node.
The Nagios LSF failover module monitors the virtual IP address associated with the primary LSF
execution host. When LSF-HPC with SLURM failover is enabled on the HP XC system and the
primary LSF execution host fails, the Nagios LSF failover module detects that the node is
unresponsive and initiates failover:
1. The Nagios module attempts to contact the node hosting the virtual IP to ensure that
   LSF-HPC with SLURM is shut down and that virtual IP hosting is disabled.
2. A new primary LSF execution host is selected from the backup nodes.
3. The Nagios module re-establishes the virtual IP on the new node.
4. The LSF-HPC with SLURM daemons are restarted on the new primary LSF execution host.
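After a failover completes, you can confirm which node hosts the LSF execution environment.
The lsid and bhosts commands shown below are standard LSF queries; the last line assumes the
controllsf utility used on HP XC systems to manage LSF failover, and its exact subcommand
syntax should be verified against your installation before use.

    # lsid               Reports the LSF cluster name and the current master host
                         (the primary LSF execution host).
    # bhosts             Shows whether the HP XC virtual host is open to jobs.
    # controllsf show    Assumed syntax: reports the current failover setting and the
                         node hosting the LSF daemons.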
LSF-HPC with SLURM monitoring and failover are implemented on the HP XC system as tools
that prepare the environment for the LSF execution host daemons on a given node, start the
daemons, then watch the node to ensure that it remains active.
After a standard installation, the HP XC system is initially configured as follows:
• LSF-HPC with SLURM is started on the head node.
• LSF-HPC with SLURM failover is disabled.
• The Nagios application reports whether LSF-HPC with SLURM is up, down, or "currently
  shut down," but takes no action in any case.
The only direct interaction between LSF-HPC with SLURM and the LSF monitoring and failover
tools occurs at LSF-HPC with SLURM startup, when the daemons are started in the virtual
environment, and at failover, when the existing daemons are shut down cleanly before the virtual
environment is moved to a new host.
16.14.2 Interplay of LSF-HPC with SLURM
The LSF-HPC with SLURM product and SLURM are managed independently; one is not critically
affected if the other goes down.
SLURM has no dependency on LSF-HPC with SLURM.
The LSF-HPC with SLURM product needs SLURM to schedule jobs. If SLURM becomes
unresponsive, LSF-HPC with SLURM drops its processor count to 1 and closes the HP XC virtual
host. When SLURM is available again, LSF-HPC with SLURM adjusts its processor count
accordingly and reopens the host.
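Because the two subsystems are managed independently, you can check each one on its own.
The following are standard SLURM and LSF commands and require no HP XC-specific syntax.

    # scontrol ping      Reports whether the SLURM control daemon is responding.
    # sinfo              Lists SLURM partitions and node states.
    # bhosts             Shows whether the HP XC virtual host is open; a closed host
                         can indicate that SLURM is unresponsive.
    # lshosts            Shows the processor count (ncpus) reported for the virtual host.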
16.14.3 Assigning the Resource Management Nodes
You assign nodes to both SLURM and the LSF-HPC with SLURM product by assigning them
the resource management role; this role includes both the lsf and slurm_controller services.
By default, the HP XC resource management system attempts to place the SLURM controller
and the LSF execution host on the same node to minimize the use of system resources. If only
one node has the resource management role, the LSF-HPC with SLURM execution daemons and
the SLURM control daemon both run on that node.
If two nodes are assigned the resource management role, by default, the first node becomes the
primary resource management node, and the second node is the backup resource management
node.
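To verify which node actually provides each service after configuration, query the subsystems
directly. ControlMachine is the standard SLURM configuration parameter that names the node
running the SLURM control daemon, and lsid is a standard LSF command.

    # scontrol show config | grep ControlMachine
                         Reports the node running the SLURM control daemon.
    # lsid               Reports the current primary LSF execution host (the LSF master).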