Figure 37: An example TensorBoard output
5.3 Slurm Scheduler
Section 5.1 describes the data scientist portal, but the current portal version can allocate resources only within
one node. To use resources on multiple nodes, the user can use the Slurm job scheduler. The Slurm scheduler
manages resource allocation and job submission for all users in a cluster. To use Slurm, the user submits a
script on the cluster head node that specifies which resources are required and what job should be executed
on those resources once they are allocated.
Figure 38 shows an example Slurm script. In this example, the user asks for 2 nodes and 4 tasks per node,
resulting in 8 tasks in total. One task corresponds to one CPU process. Nodes that include four GPUs are
requested through the GPU resource option in the script. The job itself runs a sample test from the CUDA SDK.
Since running this job requires the CUDA toolkit, the user also needs to load the module files for CUDA that set
the path and environment variables needed to execute the test.
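Because the exact directives appear only in Figure 38, the following is a minimal sketch of what such a job script could look like. The option values, the CUDA module name (cuda), and the test binary (deviceQuery) are assumptions for illustration, not the exact contents of the figure.

    #!/bin/bash
    #SBATCH --job-name=sample          # Job name shown by squeue
    #SBATCH --nodes=2                  # Request 2 nodes
    #SBATCH --ntasks-per-node=4        # 4 tasks (CPU processes) per node, 8 tasks in total
    #SBATCH --gres=gpu:4               # Request 4 GPUs per node (assumed syntax)

    # Load the environment module that sets the CUDA paths (module name assumed)
    module load cuda

    # Run a sample test from the CUDA SDK (binary name assumed)
    ./deviceQuery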
Figure 39 shows an example of querying the cluster state before running sample.job. After checking that
enough resources are available, the user submits the script with the sbatch command and can use the squeue
command to query the status of running jobs. Figure 40 shows example output for running jobs. Note that the
output file name includes the Slurm job number that was visible in the squeue command output. An example
output file name for the script in Figure 38 is slurm-27325.out, where 27325 is an example job ID. Figure 41
shows example content of this file. The output file contains the output of the commands executed by the job.
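As a sketch of this workflow, the commands below use standard Slurm tools; the job ID shown is the example value from the text.

    sbatch sample.job        # Submit the job script; Slurm prints the assigned job ID
    squeue -u $USER          # Check the status of your running and pending jobs
    cat slurm-27325.out      # Inspect the job output once it completes (example job ID)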
Note that although the script in Figure 38 allocates 8 processes, it uses only one process in the execution
command. To use all processes, the user can use MPI or another multi-process programming model, as
sketched below.
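For example, inside the job script the single-process command could be replaced with a parallel launch. srun is the standard Slurm launcher and SLURM_NTASKS is a standard Slurm environment variable; the binary names are assumptions for illustration.

    # Launch one copy of the test per allocated task (8 copies in this example)
    srun ./deviceQuery

    # Or, for an MPI-enabled program with an MPI installation available:
    # mpirun -np $SLURM_NTASKS ./mpi_program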
Figure 38: An example Slurm job script named sample.job