Reference Guide

20 Dell EMC Ready Solutions for AI Deep Learning with NVIDIA | v1.0
Figure 8: The scaling performance of Deep Learning training on V100-SXM2; (c) Caffe2 across 8 GPUs using the ILSVRC2012 dataset with ResNet50
To demonstrate scalability beyond two compute nodes, the same Deep Learning training
benchmarks were executed on a solution with eight PowerEdge C4140 nodes, each with four V100-PCIe GPUs.
The multi-node test bed available at the time of writing provided PCIe-based GPUs. Given that the
performance difference between PCIe and SXM2 GPUs is well understood (5-20%, as presented in
Section 3.1.3), the PCIe GPUs were considered a reasonable proxy for understanding the scalability of the
frameworks. The SXM2 GPUs are expected to demonstrate similar scaling patterns.
The results are presented in Figure 9. Since eight compute nodes are used, there are 32 GPUs in total.
With two nodes and eight GPUs, the speedups of all three frameworks are similar to those measured with
V100-SXM2 GPUs (presented in Figure 8). When using 32 GPUs, MXNet scales best, with a 29.4x speedup in
FP32 mode and 25.8x in FP16 mode. TensorFlow also scales well, with a 22.0x speedup in FP32 mode and
23.7x in FP16 mode, while Caffe2 achieves 26.5x in FP32 and 27.3x in FP16.