Figure 37: An example TensorBoard output
5.3 Slurm Scheduler
Section 5.1 describes the data scientist portal, but the current portal version can allocate resources only within
one node. To use resources on multiple nodes, the user can use the Slurm job scheduler. The Slurm scheduler
manages resource allocation and job submission for all users in a cluster. To use Slurm, the user submits a
script on the cluster head node that specifies which resources are required and what job should be executed
on those resources once they are allocated.
Figure 38 shows an example Slurm script. In this example, the user asks for 2 nodes and 4 tasks per node,
resulting in 8 tasks in total. One task corresponds to one CPU process. Nodes that include four GPUs are
requested through the GPU resource option in the script. The job itself runs a sample test from the CUDA SDK.
Since running this job requires the CUDA toolkit, the user also needs to load the module files for CUDA that set
the path and environment variables needed to execute the test.
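Because the exact directives appear only in Figure 38, the following is a minimal sketch of what such a job script could look like. The option values, the CUDA module name (cuda), and the test binary (deviceQuery) are assumptions for illustration, not the exact contents of the figure.

    #!/bin/bash
    #SBATCH --job-name=sample          # Job name shown by squeue
    #SBATCH --nodes=2                  # Request 2 nodes
    #SBATCH --ntasks-per-node=4        # 4 tasks (CPU processes) per node, 8 tasks in total
    #SBATCH --gres=gpu:4               # Request 4 GPUs per node (assumed syntax)

    # Load the environment module that sets the CUDA paths (module name assumed)
    module load cuda

    # Run a sample test from the CUDA SDK (binary name assumed)
    ./deviceQuery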
Figure 39 shows an example of querying the cluster state before running sample.job. After checking that
enough resources are available, the user submits the script with the sbatch command and can use the squeue
command to query the status of running jobs. Figure 40 shows example output for running jobs. Note that the
output file name includes the Slurm job number that was visible in the squeue command output. An example
output file name for the script in Figure 38 is slurm-27325.out, where 27325 is an example job ID. Figure 41
shows example content of this file. The output file contains the output of the commands executed by the job.
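As a sketch of this workflow, the commands below use standard Slurm tools; the job ID shown is the example value from the text.

    sbatch sample.job        # Submit the job script; Slurm prints the assigned job ID
    squeue -u $USER          # Check the status of your running and pending jobs
    cat slurm-27325.out      # Inspect the job output once it completes (example job ID)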
Note that although the script in Figure 38 allocates 8 processes, it uses only one process in the execution
command. To use all processes, the user can use MPI or another multi-process programming model, as
sketched below.
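For example, inside the job script the single-process command could be replaced with a parallel launch. srun is the standard Slurm launcher and SLURM_NTASKS is a standard Slurm environment variable; the binary names are assumptions for illustration.

    # Launch one copy of the test per allocated task (8 copies in this example)
    srun ./deviceQuery

    # Or, for an MPI-enabled program with an MPI installation available:
    # mpirun -np $SLURM_NTASKS ./mpi_program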
Figure 38: An example Slurm job script named sample.job