4 Containers for Deep Learning
Deep Learning frameworks tend to be complex to install and build because of their myriad library dependencies. These
frameworks and their requisite libraries are under constant development, which makes test environment and
test result reproducibility a challenge for researchers. Another layer of complexity is that most enterprise data
centers use Red Hat Enterprise Linux (or its derivatives) whereas Ubuntu is the default target for most Deep
Learning frameworks.
Containerization technology has surged in popularity because it is a powerful tool for addressing the three issues just
described: a portable test environment, reduced dependency on the underlying operating system, and better test
result reproducibility. A container packages the entire environment and all libraries an application needs into a single image file,
and that container can be deployed without any additional changes. Containers also allow users to easily create,
distribute, and destroy container images. Compared to virtual machines, containers are lightweight with less
overhead. In this study, Singularity, a container platform designed specifically for HPC environments, is used to
containerize different Deep Learning frameworks. The results presented in this section demonstrate that the
containerized version can achieve the same performance as a bare-metal install while simplifying the build and
deployment of Deep Learning frameworks.
4.1 Singularity Containers
Singularity was developed at Lawrence Berkeley National Laboratory to provide container technology specifically
for HPC. It enables applications to be encapsulated in an isolated virtual environment to simplify application
deployment. Unlike a virtual machine, a container has neither a virtual hardware layer nor its own Linux
kernel running inside the host operating system (OS), so the overhead and performance loss are minimal.
A primary goal of containers is reproducibility: a container carries the complete environment and all libraries an application
needs to run, and it is portable, so other users can reproduce the results the container creator generated
for that application. To use a Singularity container, the user only needs to load the Singularity environment module
inside a Slurm script, as described in Section 5.3.
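As a minimal sketch, a Slurm batch script that loads a Singularity module and runs a containerized framework might look like the example below. The module name, image file name (tensorflow.simg), training script (train.py), and resource requests are placeholders chosen for illustration only; Section 5.3 describes the scripts used in this solution.

    #!/bin/bash
    #SBATCH --job-name=dl-train       # illustrative job name
    #SBATCH --nodes=1                 # single node for this sketch
    #SBATCH --gres=gpu:4              # GPU request; syntax depends on site configuration

    # Load the site's Singularity environment module (module name is site-specific)
    module load singularity

    # Run the training script inside the containerized framework image.
    # --nv binds the host NVIDIA driver libraries and GPU devices into the container.
    singularity exec --nv tensorflow.simg python train.py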
Many HPC applications, especially Deep Learning applications, have extensive library dependencies, and resolving
these dependencies and debugging build issues is time consuming. Most Deep Learning frameworks are
developed on Ubuntu, yet they often need to be deployed on Red Hat Enterprise Linux (RHEL). It is therefore beneficial
to build such applications once in a container and then deploy them anywhere. The most important goal of
Singularity is portability: once a Singularity container is created, it should be able to
run on any system. However, there may be kernel dependencies to consider if a user needs to leverage any
kernel-specific functionality (e.g. OFED). Typically, a user builds a container on a laptop, a server, a cluster,
or a cloud, and then deploys that container on a server, a cluster, or a cloud, as sketched below.
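The commands below are a hedged sketch of this build-once, deploy-anywhere workflow: pull an upstream framework image from a Docker registry, convert it to a Singularity image on a build machine, and copy the single image file to the cluster. The registry path, image name, and destination paths are placeholders, and the exact syntax depends on the Singularity version (this assumes a release that supports singularity build with Docker bootstrap).

    # On a build machine (e.g. a laptop or workstation), typically with root privileges:
    # pull an upstream framework image and convert it into a Singularity image file.
    sudo singularity build tensorflow.simg docker://tensorflow/tensorflow:latest-gpu

    # The image is a single portable file; copy it to the cluster.
    scp tensorflow.simg user@cluster:/home/user/containers/

    # On the cluster, run the application inside the container without rebuilding it.
    singularity exec --nv /home/user/containers/tensorflow.simg python train.py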
One challenge when building containers arises on GPU-based systems. If GPU drivers are installed inside
the container and the driver version does not match the host GPU driver, an error will occur. Hence the
container should always use the host GPU driver. The next step is therefore to bind the paths of the host GPU driver binaries
and libraries into the container so that these paths are visible inside it. However, if the container OS
is different than the host OS, such binding may cause problems. For instance, assume the container OS is
Ubuntu while the host OS is RHEL, and on the host the GPU driver binaries and libraries are installed in the
standard system binary and library directories. The container OS has those same directories;
therefore, if we bind those paths from the host into the container, the other binaries and libraries inside the
container may no longer work because they may not be compatible across different Linux distributions. One
workaround is to move all of the driver-related files to a new central directory that does not exist in the
container and then bind only that central location, as sketched below.
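A sketch of this workaround, assuming the host driver files have been consolidated under a hypothetical /opt/nvidia-driver directory and that the image is named tensorflow.simg, is shown below. Only that one directory is bound, so the container's own Ubuntu system directories are left untouched.

    # /opt/nvidia-driver is an illustrative central location holding copies of the
    # host's NVIDIA driver binaries and libraries; it does not exist in the container.
    # Bind only that directory rather than the host's system binary and library paths.
    # The bound location must also be added to the container's PATH and LD_LIBRARY_PATH
    # so the driver files are found at run time.
    singularity exec -B /opt/nvidia-driver:/opt/nvidia-driver \
        tensorflow.simg python train.py

Note that recent Singularity releases also provide the --nv option, which locates the host NVIDIA driver libraries and binds them into the container automatically, replacing this manual workaround in many cases.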