
8 Load Sharing Facility and Job Management
This chapter addresses the following topics:
Load Sharing Facility (page 31)
Job Management (page 31)
8.1 Load Sharing Facility
This section contains notes about LSF with SLURM on HP XC and standard LSF.
8.1.1 Some Commands Hang When LSF Is Down
When LSF is down, commands such as df and lsof might hang.
The hangs occur because, after a job runs, the sbatchd daemon automatically mounts the
/net/lsfhost.localdomain/hptc_cluster directory. When LSF is down, the
lsfhost.localdomain VIP is also down, so the df command waits for the VIP to come back up.
These hangs are a side effect of the autofs-related features of RHEL.
To avoid this problem, append LSF_AM_OPTIONS=AMNEVER to the $LSF_ENVDIR/lsf.conf
file and run the badmin hrestart command to restart sbatchd. This prevents sbatchd from
mounting the /net/lsfhost.localdomain/hptc_cluster directory.
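For example, assuming the LSF environment is sourced so that LSF_ENVDIR is set, the
following commands (run as root) append the option and restart sbatchd:
echo "LSF_AM_OPTIONS=AMNEVER" >> $LSF_ENVDIR/lsf.conf
badmin hrestart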
If you still see instances of other directories being mounted through autofs, disable the
autofs/automount features by commenting out the following lines in the /etc/auto.master
file on all nodes, and then restart the autofs service:
/misc /etc/auto.misc
/net -hosts
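For example, the commented-out entries in /etc/auto.master look like the following; the
service is then restarted with the standard RHEL init script (repeat on every node):
#/misc /etc/auto.misc
#/net -hosts
service autofs restart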
8.2 Job Management
The notes in this section apply to job management and the Simple Linux Utility for Resource
Management (SLURM).
8.2.1 hptc_cluster_fs Package Failover Can Negatively Affect Running Jobs
In an HP XC cluster that is configured with LSF and SLURM, if jobs complete while SLURM
is stopped or restarted during an hptc_cluster_fs package failover operation, the bjobs
and squeue commands report those jobs as still running.
To work around this issue, you must manually cancel those jobs to move the nodes from the
alloc state back to the idle state.
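For example, the following sequence (where 1234 is a placeholder job ID) lists the jobs
that are incorrectly reported as running, cancels one with the standard SLURM scancel
command, and verifies that the nodes return to the idle state:
squeue
scancel 1234
sinfo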
In addition, you might need to restart the SLURM and LSF services on both nodes in the
Serviceguard cluster running on the head node.
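One possible restart sequence, run as root and assuming the SLURM init script is installed
as /etc/init.d/slurm, uses the standard service and LSF badmin commands:
service slurm restart
badmin hrestart
badmin mbdrestart
The badmin hrestart command restarts sbatchd, and badmin mbdrestart restarts mbatchd;
whether both restarts are required depends on the state of the failover.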