
•   Ensure that the lsf partition is configured correctly.
•   Verify that the system licensing is operational with the lmstat -a command (an example follows this list).
•   Ensure that munge is running on all compute nodes.
•   If you are experiencing LSF communication problems, check for potential firewall issues.
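For example, you can work through these checks from the head node with the following commands; the pdsh fan-out in the last step is an assumption, so substitute whatever parallel shell your site provides:
# sinfo -p lsf
# lmstat -a
# pdsh -a 'service munge status'
The sinfo command confirms that the lsf partition exists and that its nodes are up, lmstat -a reports the state of the license server, and the pdsh line checks that the munge daemon is running on every node.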
When LSF-HPC with SLURM failover is disabled and the LSF execution host (which is not
the head node) goes down, issue the controllsf command to restart LSF-HPC with
SLURM on the HP XC system:
# controllsf start
When failover is enabled, you need to intervene only when the primary LSF execution host does not start during HP XC system startup (that is, when the startsys command is run). Use the controllsf command to restart LSF-HPC with SLURM:
# controllsf start
When starting LSF-HPC with SLURM after a partial system shutdown, LSF is started on the head node if:
•   LSF failover is enabled.
•   The head node has the resource management role and no other resource management node is up.
•   The head node has the resource management role and the enable headnode preferred subcommand is set (see the sketch after this list).
•   LSF-HPC with SLURM was not shut down cleanly, perhaps as a result of running startsys without running service lsf stop or controllsf stop on the head node.
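If you want the head node to be the preferred LSF execution host (the third condition in the list above), that preference is set with the controllsf subcommand named there. The following invocation is a sketch only; confirm the exact syntax for your release before relying on it:
# controllsf enable headnode preferred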
LSF-HPC with SLURM starts on the head node if the other resource management nodes are unavailable.
LSF-HPC with SLURM failover may select the node that it just released.
LSF-HPC with SLURM failover attempts to select a different node after it removes control from the current node. However, if all other options are exhausted, it tries the current node again before giving up.
To perform load balancing, log in to the primary LSF execution host and execute the controllsf start here command from that node.
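For example, assuming the primary LSF execution host is node n16 (a hypothetical node name):
# ssh n16
# controllsf start here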
Rebooting a node might result in inconclusive job termination.
If a node that is running a job under LSF-HPC with SLURM is rebooted (with the reboot command), SLURM might recognize the node as unresponsive and attempt to end the job. However, some remnants of the job could remain, which causes LSF to report the job as still running. This issue has occurred with large jobs that use more than 100 nodes.
If you turn off power to the node instead of rebooting it, however, LSF-HPC with SLURM reports the job status as EXIT, and the node is released back to the pool of idle nodes.
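If a job appears to be stuck in this state, one way to investigate is to compare what SLURM and LSF report and then clear the remnants by hand. The job ID 101 in the following sketch is illustrative:
# squeue
# bjobs -l 101
# bkill 101
# sinfo -p lsf
The squeue and bjobs -l output show whether SLURM and LSF agree about the state of the job, bkill asks LSF to terminate the job (use it only after confirming that the job is defunct), and sinfo -p lsf confirms that the node has returned to the pool of idle nodes.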
An LSF Queue RUN_WINDOW that is too short can suspend other jobs.
A job that does not complete within the RUN_WINDOW of its queue is suspended, and that suspension might prevent other jobs in other queues from running, even if those jobs were submitted to a higher-priority queue.
At the next instance of the queue's RUN_WINDOW, the job resumes execution and the other
jobs can be scheduled.
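For reference, a queue's run window is defined by the RUN_WINDOW parameter in the lsb.queues file. The following queue definition is only a sketch; the queue name, priority, and time window are illustrative values:
Begin Queue
QUEUE_NAME  = night
PRIORITY    = 30
RUN_WINDOW  = 20:00-06:00
DESCRIPTION = dispatches jobs only between 8:00 p.m. and 6:00 a.m.
End Queue
After editing lsb.queues, run the badmin reconfig command to apply the change.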
Consider this example:
1. Job #75 is scheduled on a queue named night.
2. The RUN_WINDOW opens for the night queue.
3. Job #75 runs on the night queue.