
8 Load Sharing Facility and Job Management
This chapter addresses the following topics:
Load Sharing Facility (page 31)
Job Management (page 31)
8.1 Load Sharing Facility
This section contains notes about LSF with SLURM on HP XC and standard LSF.
8.1.1 Some Commands Hang When LSF Is Down
When LSF is down, commands such as df and lsof might hang.
The hangs occur because, after a job runs, the sbatchd daemon automatically mounts the
/net/lsfhost.localdomain/hptc_cluster directory. When LSF is down, the
lsfhost.localdomain VIP is also down, so the df command waits for the VIP to come back up.
These hangs are a side effect of the autofs-related features of RHEL.
To avoid this problem, append LSF_AM_OPTIONS=AMNEVER to the $LSF_ENVDIR/lsf.conf
file and run the badmin hrestart command to restart sbatchd. This prevents sbatchd from
mounting the /net/lsfhost.localdomain/hptc_cluster directory.
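For example, assuming the LSF environment is sourced so that LSF_ENVDIR is set, the
following commands (run as root) append the option and restart sbatchd:
echo "LSF_AM_OPTIONS=AMNEVER" >> $LSF_ENVDIR/lsf.conf
badmin hrestart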
If you still see instances of other directories being mounted through autofs, disable the
autofs/automount features by commenting out the following lines in the /etc/auto.master
file on all nodes, and then restart the autofs service:
/misc /etc/auto.misc
/net -hosts
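For example, the commented-out entries in /etc/auto.master look like the following; the
service is then restarted with the standard RHEL init script (repeat on every node):
#/misc /etc/auto.misc
#/net -hosts
service autofs restart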
8.2 Job Management
The notes in this section apply to job management and the Simple Linux Utility for Resource
Management (SLURM).
8.2.1 hptc_cluster_fs Package Failover Can Negatively Affect Running Jobs
In an HP XC cluster that is configured with LSF and SLURM, if jobs complete while SLURM
is stopped or restarted during an hptc_cluster_fs package failover operation, the bjobs
and squeue commands report those jobs as still running.
To work around this issue, you must manually cancel those jobs to move the nodes from the
alloc state back to the idle state.
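For example, the following sequence (where 1234 is a placeholder job ID) lists the jobs
that are incorrectly reported as running, cancels one with the standard SLURM scancel
command, and verifies that the nodes return to the idle state:
squeue
scancel 1234
sinfo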
In addition, you might need to restart the SLURM and LSF services on both nodes in the
Serviceguard cluster running on the head node.
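One possible restart sequence, run as root and assuming the SLURM init script is installed
as /etc/init.d/slurm, uses the standard service and LSF badmin commands:
service slurm restart
badmin hrestart
badmin mbdrestart
The badmin hrestart command restarts sbatchd, and badmin mbdrestart restarts mbatchd;
whether both restarts are required depends on the state of the failover.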