The P2P memory access speedup with MXNet and Caffe2 was measured to be 3.7x and 3.2x, respectively,
when using NVLink instead of PCIe across four GPUs in FP32 mode. However, because P2P memory accesses
make up only a small portion of the total application time, the overall application performance improvement with
V100-SXM2 is a more modest 5-20% over V100-PCIe, as shown in Figure 7. In Figure 7, up to four GPUs are
within one node, and eight GPUs span two nodes. The performance improvement percentage increases with the
number of GPUs within one node. When scaling beyond a single node, the improvement may drop because some
GPU-to-GPU communication must traverse PCIe and the EDR InfiniBand link in addition to NVLink, increasing
the overall GPU communication time. This performance delta between PCIe and SXM2 is expected to grow as the
GPU-to-GPU communication in a neural network model implementation increases (e.g., model parallelism instead
of data parallelism), and as software implementations take further advantage of the P2P architecture. Given the
performance advantage of the SXM2 modules today, and the potential for further improvements, this solution
recommends the SXM2 GPUs over PCIe.
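To verify the P2P topology on a given system, the following minimal sketch uses the CUDA runtime API (error handling is omitted for brevity); it reports which GPU pairs support direct peer access and enables it where available. On V100-SXM2 systems, pairs connected by NVLink report peer support and avoid staging transfers through host memory, which is the source of the P2P speedups quoted above.

```c
// Minimal sketch: probe and enable CUDA peer-to-peer (P2P) access
// between every pair of visible GPUs. Error checking omitted.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);

    for (int src = 0; src < ngpus; ++src) {
        for (int dst = 0; dst < ngpus; ++dst) {
            if (src == dst) continue;
            int canAccess = 0;
            // Returns 1 when src can map dst's memory directly
            // (over NVLink on SXM2 systems, over PCIe otherwise).
            cudaDeviceCanAccessPeer(&canAccess, src, dst);
            printf("GPU %d -> GPU %d : P2P %s\n",
                   src, dst, canAccess ? "supported" : "not supported");
            if (canAccess) {
                cudaSetDevice(src);
                // Enable direct loads/stores and cudaMemcpyPeer for this pair.
                cudaDeviceEnablePeerAccess(dst, 0);
            }
        }
    }
    return 0;
}
```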
Figure 7: Performance improvement of V100-SXM2 over V100-PCIe for Resnet50 with ILSVRC2012
3.1.4 Scaling Performance with Multiple GPUs
Figure 8 shows the scaling performance and speedup of one to eight V100-SXM2 GPUs. The GPUs are
either within a single compute node (up to four GPUs) or spread across two compute nodes; two PowerEdge
C4140 nodes were used for a total of eight GPUs.
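In data-parallel training, the dominant inter-GPU communication is the allreduce that sums gradients across GPUs after each batch. The sketch below shows this pattern with NCCL, the communication library commonly used for this purpose; it is an illustration, not the frameworks' actual implementation, and the buffer size and 8-GPU array bounds are placeholder assumptions.

```c
// Illustrative sketch: single-process, multi-GPU gradient allreduce
// with NCCL, the core communication step of data-parallel training.
#include <cstdio>
#include <cuda_runtime.h>
#include <nccl.h>

int main() {
    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);
    if (ngpus > 8) ngpus = 8;      // arrays below are sized for up to 8 GPUs

    ncclComm_t   comms[8];
    cudaStream_t streams[8];
    float*       grads[8];
    int          devs[8];
    const size_t count = 1 << 20;  // placeholder: 1M gradient elements per GPU

    // One gradient buffer and stream per GPU.
    for (int i = 0; i < ngpus; ++i) {
        devs[i] = i;
        cudaSetDevice(i);
        cudaMalloc(&grads[i], count * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }
    ncclCommInitAll(comms, ngpus, devs);

    // Sum gradients in place across all GPUs; NCCL routes the traffic
    // over NVLink where available, falling back to PCIe otherwise.
    ncclGroupStart();
    for (int i = 0; i < ngpus; ++i)
        ncclAllReduce(grads[i], grads[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < ngpus; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        ncclCommDestroy(comms[i]);
        cudaFree(grads[i]);
    }
    printf("Allreduce across %d GPUs complete\n", ngpus);
    return 0;
}
```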
As discussed in Section 3.1.1, there are two popular modes in Deep Learning training, FP32 and FP16, and
both modes were examined in this test. The most time-consuming part of the training phase is the matrix
operations. In FP32 mode, all floating point numbers use the standard 32-bit representation. In FP16 mode,
however, some floating point numbers are represented with only 16 bits, halving their storage and memory
bandwidth cost.
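As a small illustration of the representation difference, the following host-side sketch (assuming CUDA 9 or later, compiled with nvcc) converts a value between FP32 and CUDA's 16-bit __half type; the precision lost in the round trip reflects FP16's smaller mantissa.

```c
// Minimal sketch of the storage and precision difference between
// FP32 (float) and FP16 (__half, CUDA's 16-bit floating point type).
#include <cstdio>
#include <cuda_fp16.h>

int main() {
    float  f32 = 0.1f;               // 32 bits: ~7 decimal digits of precision
    __half f16 = __float2half(f32);  // 16 bits: ~3 decimal digits of precision

    printf("sizeof(float) = %zu bytes, sizeof(__half) = %zu bytes\n",
           sizeof(float), sizeof(__half));
    printf("FP32 value: %.8f  FP16 round trip: %.8f\n",
           f32, __half2float(f16));
    return 0;
}
```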
Figure 8 shows that, across the three frameworks and when using eight GPUs, MXNet scales the best, with a
7.8x speedup in FP32 mode and a 7.1x speedup in FP16 mode (a 7.8x speedup on eight GPUs corresponds to a
parallel efficiency of about 97.5%). TensorFlow achieves 7.1x in FP32 and 6.8x in FP16, while Caffe2 achieves
7.4x in FP32 and 6.9x in FP16. These results indicate that Deep Learning training performance scales well
across more than one compute node with this solution. The next test examines a multi-node solution.