
•   Ensure that the lsf partition is configured correctly.
•   Verify that the system licensing is operational with the lmstat -a command (an example follows this list).
•   Ensure that munge is running on all compute nodes.
•   If you are experiencing LSF communication problems, check for potential firewall issues.
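For example, you can work through these checks from the head node with the following commands; the pdsh fan-out in the last step is an assumption, so substitute whatever parallel shell your site provides:
# sinfo -p lsf
# lmstat -a
# pdsh -a 'service munge status'
The sinfo command confirms that the lsf partition exists and that its nodes are up, lmstat -a reports the state of the license server, and the pdsh line checks that the munge daemon is running on every node.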
When LSF-HPC with SLURM failover is disabled and the LSF execution host (which is not
the head node) goes down, issue the controllsf command to restart LSF-HPC with
SLURM on the HP XC system:
# controllsf start
When failover is enabled, you need to intervene only when the primary LSF execution host does not start during HP XC system startup (that is, when the startsys command is run). Use the controllsf command to restart LSF-HPC with SLURM:
# controllsf start
When starting LSF-HPC with SLURM after a partial system shutdown, LSF is started on the head node if:
•   LSF failover is enabled.
•   The head node has the resource management role and no other resource management node is up.
•   The head node has the resource management role and the enable headnode preferred subcommand is set (see the sketch after this list).
•   LSF-HPC with SLURM was not shut down cleanly, perhaps as a result of running startsys without running service lsf stop or controllsf stop on the head node.
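If you want the head node to be the preferred LSF execution host (the third condition in the list above), that preference is set with the controllsf subcommand named there. The following invocation is a sketch only; confirm the exact syntax for your release before relying on it:
# controllsf enable headnode preferred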
LSF-HPC with SLURM starts on the head node if the other resource management nodes are unavailable.
LSF-HPC with SLURM failover may select the node that it just released.
LSF-HPC with SLURM failover attempts to select a different node after it removes control from the current node. However, if all other options are exhausted, it tries the current node again before giving up.
To perform load balancing, log in to the primary LSF execution host and execute the controllsf start here command from that node.
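For example, assuming the primary LSF execution host is node n16 (a hypothetical node name):
# ssh n16
# controllsf start here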
Rebooting a node might result in inconclusive job termination.
If a node that is running a job under LSF-HPC with SLURM is rebooted (with the reboot command), SLURM might recognize the node as unresponsive and attempt to end the job. However, some remnants of the job could remain, which causes LSF to report the job as still running. This issue has occurred with large jobs that use more than 100 nodes.
If you turn off power to the node instead of rebooting it, however, LSF-HPC with SLURM reports the job status as EXIT, and the node is released back to the pool of idle nodes.
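If a job appears to be stuck in this state, one way to investigate is to compare what SLURM and LSF report and then clear the remnants by hand. The job ID 101 in the following sketch is illustrative:
# squeue
# bjobs -l 101
# bkill 101
# sinfo -p lsf
The squeue and bjobs -l output show whether SLURM and LSF agree about the state of the job, bkill asks LSF to terminate the job (use it only after confirming that the job is defunct), and sinfo -p lsf confirms that the node has returned to the pool of idle nodes.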
An LSF Queue RUN_WINDOW that is too short can suspend other jobs.
A job that does not complete within the RUN_WINDOW of its queue is suspended, and that suspension might prevent other jobs in other queues from running, even if those jobs were submitted to a higher-priority queue.
At the next instance of the queue's RUN_WINDOW, the job resumes execution and the other
jobs can be scheduled.
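For reference, a queue's run window is defined by the RUN_WINDOW parameter in the lsb.queues file. The following queue definition is only a sketch; the queue name, priority, and time window are illustrative values:
Begin Queue
QUEUE_NAME  = night
PRIORITY    = 30
RUN_WINDOW  = 20:00-06:00
DESCRIPTION = dispatches jobs only between 8:00 p.m. and 6:00 a.m.
End Queue
After editing lsb.queues, run the badmin reconfig command to apply the change.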
Consider this example:
1. Job #75 is scheduled on a queue named night.
2. The RUN_WINDOW opens for the night queue.
3. Job #75 runs on the night queue.