The second solution is to implement the above workaround inside the container so that the container can use the driver-related files automatically. This feature has already been implemented in the development branch of the Singularity repository: a user simply needs to enable it when running the container (for example, with Singularity's --nv option). However, a cluster environment typically installs the GPU driver in a shared file system instead of the default local path on all nodes, and in this case Singularity is unable to find the GPU driver because it is not installed in the default or common system paths. Even if the container is able to find the GPU driver and the corresponding driver libraries and the container is built successfully, the host driver on the target system must be new enough to support the GPU libraries that were linked into the application when the container was built; otherwise an error will occur due to outdated and incompatible versions between the host system and the container. Given the backward compatibility of GPU drivers, the burden is on the cluster system administrators to keep GPU drivers up to date to ensure the cluster GPU libraries are equal to or newer than the versions of the GPU libraries used when building the container.
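As an illustration of this version requirement, the following minimal sketch (Python) checks whether the host driver is new enough for the GPU libraries inside a container. It assumes nvidia-smi is available on the host, and the minimum driver version shown is a hypothetical example rather than a value taken from this solution:

#!/usr/bin/env python3
"""Sketch: verify that the host NVIDIA driver is at least as new as the
minimum version required by the CUDA libraries baked into a container."""

import subprocess

# Hypothetical minimum driver version required by the container's GPU libraries;
# in practice this comes from the release notes of the CUDA toolkit used at build time.
REQUIRED_DRIVER = "384.81"

def host_driver_version() -> str:
    """Query the host driver version via nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        text=True,
    )
    # All GPUs on a node report the same driver, so the first line is enough.
    return out.splitlines()[0].strip()

def version_tuple(version: str):
    """Turn a version string such as '384.81' into a tuple for numeric comparison."""
    return tuple(int(part) for part in version.split("."))

if __name__ == "__main__":
    host = host_driver_version()
    if version_tuple(host) >= version_tuple(REQUIRED_DRIVER):
        print(f"Host driver {host} satisfies the required minimum {REQUIRED_DRIVER}.")
    else:
        print(f"Host driver {host} is older than {REQUIRED_DRIVER}; "
              "update the host driver before running the container.")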
Another challenge arises when using InfiniBand with containers, because the InfiniBand driver is kernel dependent. There should be no issues if the container OS and host OS are similar or compatible; for instance, RHEL and CentOS are compatible, as are Debian and Ubuntu. But if the two OSs are not compatible, library compatibility issues will arise if the container attempts to use the host InfiniBand driver and libraries. If the InfiniBand driver is instead installed inside the container, the drivers in the container and on the host might not be compatible, since the InfiniBand driver is kernel dependent and the container and the host share the same kernel; if the container and host have different InfiniBand drivers, a conflict will occur. The Singularity community is working to solve this InfiniBand issue. The current solution is to ensure that the container OS and host OS are compatible and to let the container reuse the InfiniBand driver and libraries on the host. These are only workarounds; the container community is still pushing hard to make containers easily portable across platforms.
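A minimal sketch of such a compatibility check, in Python, is shown below. It compares the OS families reported in the standard /etc/os-release files of the host and of the container image; the container's os-release path used here is a hypothetical example of a file copied out of the image, not a path defined by Singularity:

#!/usr/bin/env python3
"""Sketch: check whether the host OS and a container's OS belong to compatible
families (e.g. rhel/centos or debian/ubuntu) before reusing the host
InfiniBand driver and libraries inside the container."""

def os_family(os_release_path: str) -> set:
    """Return the set of OS identifiers (ID plus ID_LIKE) from an os-release file."""
    ids = set()
    with open(os_release_path) as f:
        for line in f:
            key, _, value = line.partition("=")
            if key.strip() in ("ID", "ID_LIKE"):
                ids.update(value.strip().strip('"').split())
    return ids

if __name__ == "__main__":
    host = os_family("/etc/os-release")
    # Hypothetical path: an os-release file extracted from the container image.
    container = os_family("./container-etc-os-release")
    if host & container:
        print("Host and container OS families overlap:", host & container)
        print("Reusing the host InfiniBand driver and libraries should be safe.")
    else:
        print("Host and container OS families differ:", host, "vs", container)
        print("Expect library compatibility issues with the host InfiniBand stack.")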
Figure 19 shows the performance comparison between runs inside Singularity and directly on bare metal for the Horovod+TensorFlow, MXNet and Caffe2 frameworks. All the hardware and software used in this benchmarking are the same as used in Section 3.1.4 and described in Table 5. The percentage number in the figure denotes the relative difference between Singularity and bare-metal performance, computed as (Singularity performance - bare-metal performance) / bare-metal performance. It can be seen that the maximum performance difference between Singularity and bare metal is within 1.9%, which falls within a reasonable run-to-run variation range.
(a) Horovod+TensorFlow