4 Containers for Deep Learning
Deep Learning frameworks tend to be complex to install and build because of their myriad library dependencies. These
frameworks and their requisite libraries are under constant development, which makes test environment and
test result reproducibility a challenge for researchers. Another layer of complexity is that most enterprise data
centers use Red Hat Enterprise Linux (or its derivatives) whereas Ubuntu is the default target for most Deep
Learning frameworks.
Containerization technology has surged in popularity because it is a powerful tool for addressing the three issues just
described: a portable test environment, reduced dependency on the underlying operating system, and better test
result reproducibility. A container packages the entire environment and all libraries an application needs into a single image file,
and that container can be deployed without any additional changes. Containers also allow users to easily create,
distribute, and destroy container images. Compared to virtual machines, containers are lightweight with less
overhead. In this study, Singularity, a container platform designed specifically for HPC environments, is used to
containerize different Deep Learning frameworks. The results presented in this section demonstrate that the
containerized version can achieve the same performance as a bare-metal install while simplifying the build and
deployment of Deep Learning frameworks.
4.1 Singularity Containers
Singularity was developed at Lawrence Berkeley National Laboratory to provide container technology specifically
for HPC. It enables applications to be encapsulated in an isolated virtual environment to simplify application
deployment. Unlike a virtual machine, a container has neither a virtual hardware layer nor its own Linux
kernel running inside the host operating system (OS), so the overhead and performance loss are minimal.
A primary goal of containers is reproducibility: a container carries the complete environment and all libraries an application
needs to run, and it is portable, so other users can reproduce the results the container creator generated
for that application. To use a Singularity container, the user only needs to load the Singularity environment module
inside a Slurm script, as described in Section 5.3.
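As a minimal sketch, a Slurm batch script that loads a Singularity module and runs a containerized framework might look like the example below. The module name, image file name (tensorflow.simg), training script (train.py), and resource requests are placeholders chosen for illustration only; Section 5.3 describes the scripts used in this solution.

    #!/bin/bash
    #SBATCH --job-name=dl-train       # illustrative job name
    #SBATCH --nodes=1                 # single node for this sketch
    #SBATCH --gres=gpu:4              # GPU request; syntax depends on site configuration

    # Load the site's Singularity environment module (module name is site-specific)
    module load singularity

    # Run the training script inside the containerized framework image.
    # --nv binds the host NVIDIA driver libraries and GPU devices into the container.
    singularity exec --nv tensorflow.simg python train.py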
Many HPC applications, especially Deep Learning applications, have extensive library dependencies, and resolving
these dependencies and debugging build issues is time consuming. Most Deep Learning frameworks are
developed on Ubuntu, yet they often need to be deployed on Red Hat Enterprise Linux (RHEL). It is therefore beneficial
to build such applications once in a container and then deploy them anywhere. The most important goal of
Singularity is portability: once a Singularity container is created, it should be able to
run on any system. However, there may be kernel dependencies to consider if a user needs to leverage any
kernel-specific functionality (e.g. OFED). Typically, a user builds a container on a laptop, a server, a cluster,
or a cloud, and then deploys that container on a server, a cluster, or a cloud, as sketched below.
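The commands below are a hedged sketch of this build-once, deploy-anywhere workflow: pull an upstream framework image from a Docker registry, convert it to a Singularity image on a build machine, and copy the single image file to the cluster. The registry path, image name, and destination paths are placeholders, and the exact syntax depends on the Singularity version (this assumes a release that supports singularity build with Docker bootstrap).

    # On a build machine (e.g. a laptop or workstation), typically with root privileges:
    # pull an upstream framework image and convert it into a Singularity image file.
    sudo singularity build tensorflow.simg docker://tensorflow/tensorflow:latest-gpu

    # The image is a single portable file; copy it to the cluster.
    scp tensorflow.simg user@cluster:/home/user/containers/

    # On the cluster, run the application inside the container without rebuilding it.
    singularity exec --nv /home/user/containers/tensorflow.simg python train.py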
One challenge when building containers arises on GPU-based systems. If GPU drivers are installed inside
the container and the driver version does not match the host GPU driver, an error will occur. Hence the
container should always use the host GPU driver. The next step is therefore to bind the paths of the host GPU driver binaries
and libraries into the container so that these paths are visible inside it. However, if the container OS
is different than the host OS, such binding may cause problems. For instance, assume the container OS is
Ubuntu while the host OS is RHEL, and on the host the GPU driver binaries and libraries are installed in the
standard system binary and library directories. The container OS has those same directories;
therefore, if we bind those paths from the host into the container, the other binaries and libraries inside the
container may no longer work because they may not be compatible across different Linux distributions. One
workaround is to move all of the driver-related files to a new central directory that does not exist in the
container and then bind only that central location, as sketched below.
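A sketch of this workaround, assuming the host driver files have been consolidated under a hypothetical /opt/nvidia-driver directory and that the image is named tensorflow.simg, is shown below. Only that one directory is bound, so the container's own Ubuntu system directories are left untouched.

    # /opt/nvidia-driver is an illustrative central location holding copies of the
    # host's NVIDIA driver binaries and libraries; it does not exist in the container.
    # Bind only that directory rather than the host's system binary and library paths.
    # The bound location must also be added to the container's PATH and LD_LIBRARY_PATH
    # so the driver files are found at run time.
    singularity exec -B /opt/nvidia-driver:/opt/nvidia-driver \
        tensorflow.simg python train.py

Note that recent Singularity releases also provide the --nv option, which locates the host NVIDIA driver libraries and binds them into the container automatically, replacing this manual workaround in many cases.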