Dell EMC Ready Solutions for AI - Deep Learning with NVIDIA
Architecture Guide
Authors: Rengan Xu, Frank Han, Nishanth Dandapanthula

Abstract: There has been an explosion of interest in Deep Learning, and the plethora of choices makes designing a solution complex and time consuming.
Revisions
August 2018: Initial release (v1.0)

The information in this publication is provided "as is." Dell Inc. makes no representations or warranties of any kind with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose. Use, copying, and distribution of any software described in this publication requires an applicable software license. © August 2018 v1.0 Dell Inc. or its subsidiaries. All Rights Reserved. Dell, EMC, Dell EMC and other trademarks are trademarks of Dell Inc. or its subsidiaries.
Table of Contents
Revisions
Executive summary
1 Solution Overview
2 Solution Architecture
3 Deep Learning Training and Inference Performance and Analysis
4 Containers for Deep Learning
5 The Data Scientist Portal
6 Conclusions and Future Work
Executive summary
Deep Learning techniques have enabled great success in many fields, such as computer vision, natural language processing (NLP), gaming and autonomous driving, by enabling a model to learn from existing data and then make corresponding predictions. This success is due to a combination of improved algorithms, access to large datasets and increased computational power.
1 Solution Overview
Dell EMC has developed an architecture for Deep Learning that provides a complete, supported solution. This solution includes carefully selected technologies across all aspects of Deep Learning: processing capabilities, memory, storage and network technologies, as well as the software ecosystem. This complete solution is provided as the Dell EMC Ready Solutions for AI - Deep Learning with NVIDIA.
Section 2 describes each of these solution components in more detail, covering the compute, network, storage and software configurations. Extensive performance analysis on this solution was conducted in the HPC and AI Innovation Lab, and those results are presented in Section 3. These include tests with training and inference workloads, conducted on different types of GPUs, using different floating point and integer precision arithmetic, and with different storage sub-systems for Deep Learning workloads.
2 Solution Architecture
The hardware comprises a cluster with a master node, compute nodes, shared storage and networks. The master node (or head node) roles can include deploying the cluster of compute nodes, managing the compute nodes, handling user logins and access, providing a compilation environment, and submitting jobs to the compute nodes. The compute nodes are the workhorses and execute the submitted jobs.
Table 1: PowerEdge R740xd configuration
  Server Model: PowerEdge R740xd
  Processor: 2 x Intel Xeon Gold 6148 CPU @ 2.40GHz
  Memory: 24 x 16GB DDR4 2666MT/s DIMMs (384GB total)
  Disks: 12 x 12TB NL-SAS in RAID 50 (10+ drives recommended)
  I/O & Ports: Network daughter card with 2 x 10GbE + 2 x 1GbE
Figure 3: The topology of a compute node

Table 2: PowerEdge C4140 configuration
  Server Model: PowerEdge C4140
  Processor: 2 x Intel Xeon Gold 6148 CPU @ 2.40GHz
  Memory: 24 x 16GB DDR4 2666MT/s DIMMs (384GB total)
  Local Disks: 120GB SSD, 1.6TB NVMe
  I/O & Ports: Network daughter card with 2 x 10GbE + 2 x 1GbE
Table 3 (continued): V100-PCIe vs V100-SXM2
  Tensor Cores: 640 / 640
  Memory Bandwidth (GB/s): 900 / 900
  NVLink Bandwidth (GB/s, bi-directional): N/A / 300
  Deep Learning (Tensor TFLOPS): 112 / 120
  TDP (Watts): 250 / 300

The Tesla V100 product line includes two variants, V100-PCIe and V100-SXM2. The comparison of the two variants is shown in Table 3. In the V100-PCIe model, all GPUs communicate with each other over PCIe buses. In the V100-SXM2 model, all GPUs are connected by NVIDIA NVLink.
If a failed head node DIMM cannot be replaced immediately, a DIMM module from a compute node can temporarily be used to restore the head node until replacement modules arrive.

Figure 4: Relative memory bandwidth for different system capacities

2.5 Isilon Storage
Dell EMC Isilon is a proven scale-out network attached storage (NAS) solution that can handle the unstructured data prevalent in many different workflows. The Isilon storage architecture automatically aligns application needs with performance, capacity, and economics.
the Isilon, with applications installed on the local NFS share. The performance comparison between Isilon and the other storage solutions is shown in Section 3.1.6. The specifications of the Isilon F800 are listed in Table 4.
The third switch in the solution, called a gateway switch in Figure 2, connects the Isilon F800 to the head and compute nodes. The Isilon F800's external interfaces are 40 Gigabit Ethernet, so a switch that can serve as the gateway between the 40GbE Ethernet and InfiniBand networks is needed for connectivity to the head and compute nodes. The Mellanox SX6036 is used for this purpose. The gateway is connected to the InfiniBand EDR switch and the Isilon as shown in Figure 2.
3 Deep Learning Training and Inference Performance and Analysis
In this section, the performance of Deep Learning training as well as inference is measured using three open source Deep Learning frameworks: TensorFlow, MXNet and Caffe2. The experiments were conducted on an instance of the solution architecture described in Section 2. The test cluster used a PowerEdge R740xd head node, PowerEdge C4140 compute nodes, different storage sub-systems including Isilon, and an InfiniBand EDR network.
Table 5: The hardware and software in the testbed
  Head node: PowerEdge R740xd
    CPU: 2 x Intel Xeon 6148 @ 2.4GHz
    Memory: 384GB DDR4 @ 2667MT/s
    Disks: 12 x 12TB Near-Line SAS drives in a RAID 50 volume; 120TB volume formatted as XFS, exported via NFS
  Compute nodes: PowerEdge C4140
    Number of compute nodes: 8 nodes with V100-PCIe and 2 nodes with V100-SXM2
    CPU: 2 x Intel Xeon 6148 @ 2.4GHz
    Memory: 384GB DDR4 @ 2667MT/s
    Disks: 2 x M.2 SSD
training. This section compares the performance of FP16 training versus FP32. In experiments where training was executed in FP16 precision, the batch size was doubled, since an FP16 value consumes half the memory of an FP32 value. Doubling the batch size with FP16 ensures that GPU memory is utilized equally in both types of tests. The performance comparison of FP16 versus FP32 is shown in Figure 5 for all three frameworks used in this study.
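The memory reasoning above can be sketched in a few lines. This is an illustrative calculation, not code from the benchmark; the tensor shape and batch sizes are hypothetical examples:

```python
def batch_memory_bytes(batch_size, elements_per_sample, bytes_per_element):
    """Memory consumed by one batch of activations."""
    return batch_size * elements_per_sample * bytes_per_element

# A hypothetical 224x224x3 input image: 150,528 elements per sample.
elements = 224 * 224 * 3

fp32 = batch_memory_bytes(128, elements, 4)  # FP32: 4 bytes per element
fp16 = batch_memory_bytes(256, elements, 2)  # FP16: 2 bytes, batch doubled

# The doubled FP16 batch fits in exactly the same memory budget.
assert fp32 == fp16
```

This is why doubling the batch size is the natural way to keep GPU memory utilization comparable across the two precision modes.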
  GPU Max Clock rate (MHz): 1481 / 1530
  Tensor Cores: N/A / 640
  Memory Bandwidth (GB/s): 732 / 900
  NVLink Uni-direction Bandwidth (GB/s): 80 / 150
  Double Precision (TFLOPS): 5.1 / 7.5
  Deep Learning (Tensor TFLOPS): N/A / 120
  TDP (Watts): 300 / 300

Figure 6: Performance comparison between V100-SXM2 and P100-SXM2 for Resnet50 with ILSVRC2012 within one node (four GPUs)

3.1.3 V100-SXM2 vs V100-PCIe
V100-SXM2 GPUs are recommended over V100-PCIe GPUs in the Deep Learning solution described in this document.
The P2P memory access speedup with MXNet and Caffe2 was measured to be 3.7x and 3.2x, respectively, when using NVLink instead of PCIe across four GPUs in FP32 mode. However, because P2P memory accesses make up only a small portion of total application time, the overall application performance improvement with V100-SXM2 is a more modest 5-20% over V100-PCIe, as shown in Figure 7. In Figure 7, up to four GPUs are within one node, and 8 GPUs span two nodes.
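The gap between the 3.7x P2P speedup and the 5-20% end-to-end gain follows from Amdahl's law. A minimal sketch; the 5% P2P runtime fraction below is a hypothetical illustration, not a measured value:

```python
def overall_speedup(accelerated_fraction, local_speedup):
    """Amdahl's law: end-to-end speedup when only part of the runtime
    is accelerated."""
    return 1.0 / ((1.0 - accelerated_fraction)
                  + accelerated_fraction / local_speedup)

# If P2P transfers were 5% of runtime and NVLink made them 3.7x faster:
s = overall_speedup(0.05, 3.7)
# s is roughly 1.04, i.e. only about 4% end-to-end despite the 3.7x
# local speedup, consistent with the modest gains reported above.
```

The measured 5-20% improvements imply that P2P traffic is a correspondingly small slice of total training time for these models.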
(a) Horovod+TensorFlow across 8 GPUs using the ILSVRC2012 dataset with Resnet50
(b) MXNet across 8 GPUs using the ILSVRC2012 dataset with Resnet50
(c) Caffe2 across 8 GPUs using the ILSVRC2012 dataset with Resnet50

Figure 8: The scaling performance of Deep Learning training on V100-SXM2

To demonstrate scalability beyond two compute nodes, the same Deep Learning training benchmarks were executed on a solution with eight PowerEdge C4140 nodes with four V100-PCIe GPUs per server. The multi-node test bed available at the time of writing provided PCIe-based GPUs.
(a) Horovod+TensorFlow across 32 GPUs using the ILSVRC2012 dataset with Resnet50
(b) MXNet across 32 GPUs using the ILSVRC2012 dataset with Resnet50
(c) Caffe2 across 32 GPUs using the ILSVRC2012 dataset with Resnet50

Figure 9: The scaling performance of Deep Learning training on V100-PCIe

3.1.5 Storage Performance
The impact of different storage sub-system options for the Deep Learning solution was evaluated next.
The storage options compared were:
  Isilon F800
  NFS on Near-Line SAS drives hosted by the head node and exported via IPoIB
  NFS on SSD drives hosted by the head node and exported via IPoIB
  Local NVMe drives on the compute node

The Isilon F800 evaluated in this experiment has the same configuration as described in Section 2.5. Before each benchmark run, the cache was cleared on all four Isilon F800 nodes.
stress the I/O and demonstrate a notable difference in performance between the storage options. However, it turns out that performance with the TFRecords database is much better than with raw JPEG images. The performance delta between these two image formats is much larger in less complicated neural networks like AlexNet. For instance, the performance advantage of TFRecords over raw JPEG images is ~23%-40% for VGG16, but around 5x for AlexNet.
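Much of the TFRecords advantage comes from replacing thousands of small random file reads with large sequential reads of a few packed files. A stdlib-only sketch of the core idea (length-prefixed records; the real TFRecord format additionally stores CRC32 checksums per record):

```python
import struct

def pack_records(records):
    """Pack byte strings into one buffer as length-prefixed records,
    so a reader can stream them back with purely sequential reads."""
    out = bytearray()
    for rec in records:
        out += struct.pack("<Q", len(rec))  # 8-byte little-endian length
        out += rec
    return bytes(out)

def unpack_records(buf):
    """Stream the records back out of a packed buffer."""
    recs, offset = [], 0
    while offset < len(buf):
        (length,) = struct.unpack_from("<Q", buf, offset)
        offset += 8
        recs.append(buf[offset:offset + length])
        offset += length
    return recs

# Hypothetical stand-ins for encoded JPEG payloads:
images = [b"jpeg-bytes-1", b"jpeg-bytes-2", b"jpeg-bytes-3"]
assert unpack_records(pack_records(images)) == images
```

For simple networks like AlexNet, per-image compute is small, so the cost of opening and seeking into many small JPEG files dominates, which is consistent with the ~5x gap observed above.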
(c) VGG16

Figure 10: Neural network training performance with different storage systems and image database options. The batch size is 256, 256 and 128 for AlexNet, Resnet50 and VGG16, respectively. All training runs are in FP16 mode.

The training performance of the tested neural networks is affected by different flags in the benchmark. It was found that the training performance of Resnet50 improved with certain flag settings, while there was no obvious performance improvement for VGG16 and AlexNet.
Figure 12: The profiled disk throughput with InsightIQ when running Resnet50 with Isilon

To better understand the underlying storage system, the Isilon storage I/O performance was profiled using Isilon InsightIQ while training the model. Isilon InsightIQ was described in Section 2.5. Only the Isilon storage with the TFRecords dataset was profiled, since all storage systems displayed similar performance and the lessons from one profiling exercise should be broadly applicable to this use case.
cached into system memory on one compute node. The cache was not cleared after the benchmark stopped, which is why used memory drops close to zero while the cache does not decrease. Although memory cannot cache the whole dataset, the network and file system are fast enough to feed data to the GPUs, keeping them at high utilization and keeping the training speed at 2,940 images/sec. The average GPU utilization was around 380% across the four GPUs (roughly 95% per GPU).
On the Isilon F800, the network throughput is around 300 MB/s to maintain high GPU utilization. This is the data transfer throughput of Resnet50 training, and it remains nearly constant during the whole training. The network bandwidth is 56 Gb/s (7 GB/s) between the IB EDR switch and the FDR-40GigE gateway, as shown in Figure 2. Two more InfiniBand connections can be added (three connections in total) to match the bandwidth of the four Isilon nodes, which is 4 x 40 Gb/s.
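The link-count arithmetic above can be checked directly; all figures below are taken from the text:

```python
GATEWAY_LINK_GBITS = 56   # one IB link between the EDR switch and gateway
ISILON_PORT_GBITS = 40    # one 40GbE front-end port per Isilon node
ISILON_NODES = 4

link_gbytes = GATEWAY_LINK_GBITS / 8              # 7 GB/s, as stated
isilon_total = ISILON_NODES * ISILON_PORT_GBITS   # 160 Gb/s aggregate
three_links = 3 * GATEWAY_LINK_GBITS              # 168 Gb/s total

assert link_gbytes == 7.0
# Three gateway links slightly exceed the four Isilon nodes' aggregate:
assert three_links >= isilon_total
```

With a single 56 Gb/s link, the gateway, not the Isilon front end, would be the ceiling once multiple compute nodes read concurrently; three links remove that bottleneck.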
There are ongoing studies to further stress the storage subsystems with other models (including model-parallel models like seq2seq) and datasets to understand the performance implications of deep learning workloads at scale.

3.2 Deep Learning Inference
Inference is the end goal of Deep Learning. Inference performance tends to be either latency-focused or throughput-focused. Latency-focused scenarios are time sensitive.
Figure 15: Inference performance with INT8 vs FP32 for the Resnet50 model

Figure 15 also illustrates the performance difference when using different batch sizes. It can be seen that without batch processing the inference throughput is much lower. This is because the GPU is not assigned enough work in each iteration to keep it busy. The larger the batch size, the higher the inference throughput, although this advantage begins to flatten as batch size increases. The largest batch size is limited by GPU memory.
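The batch-size effect described above follows from a simple fixed-overhead cost model. A hypothetical sketch (the overhead and per-image costs are illustrative assumptions, not measurements from this study):

```python
# Each inference iteration pays a fixed launch/transfer overhead plus a
# per-image compute cost. Both constants below are hypothetical.
FIXED_OVERHEAD_MS = 2.0
PER_IMAGE_MS = 0.25

def throughput(batch_size):
    """Images per second when processing one batch per iteration."""
    latency_ms = FIXED_OVERHEAD_MS + batch_size * PER_IMAGE_MS
    return batch_size * 1000.0 / latency_ms

# Throughput rises with batch size but with diminishing returns,
# approaching the 1000 / PER_IMAGE_MS = 4000 images/sec asymptote.
rates = [throughput(b) for b in (1, 8, 32, 128)]
```

At batch size 1 the fixed overhead dominates each iteration; as the batch grows, that overhead is amortized over more images until per-image compute becomes the limit, which matches the flattening curve in Figure 15.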
  Network: FP32 Top-1 / FP32 Top-5 / INT8 Top-1 / INT8 Top-5 / Top-1 diff / Top-5 diff
  ResNet-152: 74.90% / 92.21% / 74.84% / 92.16% / 0.06% / 0.05%
  VGG-16: 68.35% / 88.45% / 68.30% / 88.42% / 0.05% / 0.03%
  VGG-19: 68.47% / 88.46% / 68.38% / 88.42% / 0.09% / 0.03%
  GoogLeNet: 68.95% / 89.12% / 68.77% / 89.00% / 0.18% / 0.12%
  AlexNet: 56.82% / 79.99% / 56.79% / 79.94% / 0.03% / 0.06%

Figure 16: Resnet50 inference performance on V100 vs P100
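The Top-1 accuracy loss from INT8 quantization can be tallied directly from the accuracy table above (values copied from the table):

```python
# FP32 and INT8 Top-1 accuracy (%) copied from the table above.
top1 = {
    "ResNet-152": (74.90, 74.84),
    "VGG-16":     (68.35, 68.30),
    "VGG-19":     (68.47, 68.38),
    "GoogLeNet":  (68.95, 68.77),
    "AlexNet":    (56.82, 56.79),
}

drops = {net: round(fp32 - int8, 2) for net, (fp32, int8) in top1.items()}
# No network loses more than 0.18% Top-1 accuracy under INT8 quantization.
assert max(drops.values()) <= 0.18
```

The worst case (GoogLeNet, 0.18%) is small enough that INT8's throughput gains come at essentially no accuracy cost for these models.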
Figure 17: How to initialize DIGITS on the Deep Learning solution

Once the DIGITS server is up and running, a web browser can be used to navigate to the DIGITS home page, as shown in Figure 18. For more detail on how to use DIGITS, refer to Chapter 2, which includes an example of handwritten digit recognition using a Caffe backend.

Figure 18: DIGITS home page
4 Containers for Deep Learning Deep Learning frameworks tend to be complex to install and build with a myriad of library dependencies. These frameworks and their requisite libraries are under constant development, which makes test environment and test result reproducibility a challenge for researchers. Another layer of complexity is that most enterprise data centers use Red Hat Enterprise Linux (or its derivatives) whereas Ubuntu is the default target for most Deep Learning frameworks.
The second solution is to implement the above workaround inside the container so that the container can use the driver-related files automatically. This feature has already been implemented in the development branch of the Singularity repository.
(b) MXNet
(c) Caffe2

Figure 19: Singularity container vs bare metal for Resnet50 using the ILSVRC2012 dataset

4.2 Running NVIDIA GPU Cloud with the Ready Solutions for AI - Deep Learning
NVIDIA GPU Cloud (NGC) is a cloud-hosted registry of Docker container images for many Deep Learning frameworks and HPC applications. The Deep Learning frameworks include Caffe, Caffe2, CNTK, TensorFlow, MXNet, Theano, and so on. These frameworks are optimized for NVIDIA GPUs.
Docker is a prominent containerization technology used widely for deep learning frameworks. Docker and Singularity each have pros and cons; a comparison between the two is given in the article "Singularity Containers for HPC & Deep Learning". NVIDIA has added GPU support to Docker. Docker is not installed by default in this solution; however, the administrator can install it by simply running a utility provided by Bright Cluster Manager.
Figure 21: NVIDIA GPU Cloud registry page

In Figure 21, clicking the link on the top right takes the user to the configuration page shown in Figure 22, where an API key specific to the registered user can be generated.

Figure 22: NVIDIA GPU Cloud configuration page
A user can download the Docker image of a framework directly with the command shown in Figure 23.

Figure 23: Steps to download a Docker image from NGC

If the user wants to use Singularity containers instead of Docker containers, Figure 24 shows how to create a Singularity image from the Docker image downloaded from NGC. The user must provide a set of environment variables to convert a Docker image to a Singularity image, and can then execute the script within the created Singularity container.
5 The Data Scientist Portal
The data scientist portal was developed by Dell EMC; it simplifies the user experience, allowing model implementation, training and inference tests to be run in a Jupyter Notebook or directly from the Linux terminal. The Jupyter Notebook allows a user to write explanatory text and intersperse it with code and the tables and figures that the code generates. It allows a user to directly execute the code embedded in the document as it is being created.
After clicking, the user is forwarded to a list of instance options (Figure 27). The user chooses how many GPUs to use; for each allocated GPU, 8 CPU cores and 48GB of system memory are assigned. By default, the session time is 8 hours.

Figure 27: Portal instance screen

After the resources are allocated, the portal goes to the landing screen (Figure 28).
launch a Tensorboard. Figure 31 is a screenshot after launching a terminal. How to use Tensorboard is shown in Section 5.2.

Figure 30: Portal kernel list
Figure 31: Portal terminal kernel

After opening the handwritten digits classification example notebook shown in Figure 32 and running the cells shown in Figure 33, the training of the handwritten digits classification will start. An example output is shown in Figure 34.

Figure 32: TensorFlow notebook
Figure 33: TensorFlow notebook handwritten digits classification
Figure 34: TensorFlow notebook handwritten digits classification output

After a user starts a kernel, the kernel will keep running until it is stopped. To stop a kernel, the user needs to click the control shown in Figure 30; the page will then go to the page shown in Figure 35.
Figure 35: Stop the server

5.2 Tensorboard Integration
Besides the TensorFlow framework, the portal also provides the Tensorboard visualization tool. Tensorboard is used to visualize the TensorFlow computation graph, plot quantitative metrics about the execution of a graph, and show additional data. To use Tensorboard, the user needs to use the TensorFlow FileWriter API to serialize the desired data into a directory.
Figure 37: An example Tensorboard output

5.3 Slurm Scheduler
Section 5.1 describes the data scientist portal, but the current portal version can allocate resources only within one node. To use resources on multiple nodes, the user can use the Slurm job scheduler. The Slurm scheduler manages resource allocation and job submission for all users in a cluster.
Figure 39: An example output of the command sinfo before running sample.job
Figure 40: An example output of the command squeue showing the job status
Figure 41: An example output file after running sample.job. The file name is slurm-27325.out.
6 Conclusions and Future Work This document describes the first integrated Dell EMC Ready Solutions for AI - Deep Learning with NVIDIA. The goal of this solution is to provide a complete, tuned and supported solution for Deep Learning training and inference use cases. The solution takes into account the ideal compute, storage, network, and software configuration for this workload.