Scaling Deep Learning on Multiple V100 Nodes
Authors: Rengan Xu, Frank Han, Nishanth Dandapanthula.
HPC Innovation Lab. November 2017
Abstract
In our previous blog, we presented the deep learning performance of a single Dell PowerEdge C4130 node with four V100 GPUs. For very
large neural network models, a single node is still not powerful enough to train those models quickly. Therefore, it is important to scale
the training to multiple nodes to meet the computation demand. In this blog, we will evaluate the multi-node performance of the deep
learning frameworks MXNet and Caffe2. The results will show that both frameworks scale well on multiple V100-SXM2 nodes.
Overview of MXNet and Caffe2
In this section, we give an overview of how MXNet and Caffe2 implement distributed training on multiple nodes. There are usually
two ways to parallelize neural network training across multiple devices: data parallelism and model parallelism. In data parallelism,
all devices hold the same model but each device works on a different piece of the data. In model parallelism, different devices
hold the parameters of different layers of the neural network. In this blog, we focus only on data parallelism
and will evaluate model parallelism in the future. Another choice in most deep learning frameworks is whether to use synchronous or
asynchronous weight updates. A synchronous implementation aggregates the gradients over all workers in each iteration (or mini-batch)
before updating the weights, whereas in an asynchronous implementation each worker updates the weights independently of the others.
Since the synchronous approach guarantees model convergence while convergence of the asynchronous approach is still an open question, we evaluate
only synchronous weight updates.
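To make the synchronous data-parallel scheme concrete, below is a minimal, framework-agnostic sketch in Python. It is an illustration only, not code from MXNet or Caffe2: each worker computes gradients on its own shard of the mini-batch, the gradients are averaged across all workers, and the same update is applied to every replica of the model.

```python
# Minimal sketch of synchronous data-parallel SGD on a toy linear model.
import numpy as np

rng = np.random.default_rng(0)
num_workers, lr = 4, 0.1
w = np.zeros(8)                      # shared model, replicated on all workers
X = rng.normal(size=(256, 8))        # toy dataset
y = X @ rng.normal(size=8)

for step in range(100):
    shards = np.array_split(rng.permutation(256), num_workers)
    grads = []
    for shard in shards:             # each worker sees a different data shard
        Xs, ys = X[shard], y[shard]
        err = Xs @ w - ys
        grads.append(2 * Xs.T @ err / len(shard))
    g = np.mean(grads, axis=0)       # "allreduce": aggregate before updating
    w -= lr * g                      # identical update applied on every worker
```

In a real framework the averaging step is the collective allreduce (NCCL within a node, Gloo or a parameter server across nodes), but the control flow is the same: no worker advances to the next mini-batch until the aggregated gradient has been applied.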
MXNet can launch jobs on a cluster in several ways, including SSH, YARN, and MPI. For this evaluation, SSH was chosen. In SSH mode,
the processes on the different nodes use rsync to synchronize the working directory from the root node to the slave nodes. The gradients
are then aggregated over all workers in each iteration (or mini-batch). Caffe2 uses the Gloo library for multi-node
training and the Redis library to facilitate the management of nodes in distributed training. Gloo is an MPI-like library that provides a number of
collective operations such as barrier, broadcast, and allreduce for machine learning applications. Redis is used by Gloo to
connect all participating nodes.
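As an illustration of the Caffe2 flow, the sketch below follows the pattern of Caffe2's public resnet50_trainer example; it is not the exact script used in these tests, and the host, port, shard, and prefix values are placeholders. Each node registers with a common Redis instance to discover its peers, and the resulting rendezvous is handed to data_parallel_model, which inserts the synchronous Gloo allreduce of gradients.

```python
# Sketch of Caffe2 multi-node setup via Redis + Gloo. Values are placeholders.
from caffe2.python import core, workspace, data_parallel_model

# Every node connects to the same Redis server to find the other shards.
workspace.RunOperatorOnce(core.CreateOperator(
    "RedisStoreHandlerCreate", [], ["store_handler"],
    host="redis-host",        # placeholder: node running the Redis server
    port=6379,
    prefix="resnet50_run",    # unique id so concurrent runs do not collide
))

rendezvous = dict(
    kv_handler="store_handler",
    shard_id=0,               # this node's rank, 0 .. num_shards-1
    num_shards=4,             # four nodes participating in the run
    engine="GLOO",            # Gloo supplies barrier/broadcast/allreduce
    transport="tcp",
    interface="",             # let Gloo pick the network interface
)

# The rendezvous dict is then passed to data_parallel_model.Parallelize(...),
# which replicates the model on the local GPUs and adds the cross-node
# allreduce of gradients, i.e. the synchronous weight update described above.
```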
Testing Methodology
We chose two deep learning frameworks for our testing, MXNet and Caffe2. As with our previous benchmarks, we will again
use the ILSVRC 2012 dataset, which contains 1,281,167 training images and 50,000 validation images. The neural network used for training
is ResNet-50, a computationally intensive network that both frameworks support. To get the best performance, the CUDA 9
compiler, the cuDNN 7 library, and NCCL 2.0 were used for both frameworks, since they are optimized for V100 GPUs. The testing platform
has four Dell EMC PowerEdge C4130 servers in configuration K. The system layout of configuration K is shown in Figure 1. As the figure
shows, each server has four V100-SXM2 GPUs and all GPUs are connected by NVLink. The other hardware and software details are shown
in Table 1. Table 2 shows the input parameters used to train the ResNet-50 neural network in both frameworks.
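For reference, this is roughly how the synchronous multi-node setup looks on the MXNet side. It is a sketch, not the training script used in these tests, and the hyperparameter values are placeholders rather than the Table 2 settings; the process only works when started on every node by MXNet's SSH launcher (tools/launch.py), which sets up the scheduler, servers, and workers.

```python
# Sketch of synchronous distributed training in MXNet. Must be started on all
# nodes via tools/launch.py --launcher ssh; the values below are placeholders.
import mxnet as mx
from mxnet import gluon

# 'dist_device_sync' aggregates gradients from every worker in each mini-batch
# and performs the update on the GPUs, i.e. synchronous weight updates.
kv = mx.kvstore.create('dist_device_sync')

ctx = [mx.gpu(i) for i in range(4)]          # four V100-SXM2 GPUs per node

net = gluon.model_zoo.vision.resnet50_v1(classes=1000)
net.initialize(mx.init.Xavier(), ctx=ctx)

trainer = gluon.Trainer(
    net.collect_params(), 'sgd',
    {'learning_rate': 0.1, 'momentum': 0.9, 'wd': 1e-4},   # placeholder values
    kvstore=kv)

# In the training loop, each worker computes gradients on its own data shard
# with autograd and calls trainer.step(batch_size); the kvstore then pushes
# and aggregates the gradients across all nodes before the weights change.
```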
