Reference Guide

20 Dell EMC Ready Solutions for AI Deep Learning with NVIDIA | v1.0
Figure 8: The scaling performance of Deep Learning training on V100-SXM2; (c) Caffe2 across 8 GPUs using the ILSVRC2012 dataset with ResNet50
To demonstrate scalability beyond two compute nodes, the same Deep Learning training
benchmarks were executed on a solution with eight PowerEdge C4140 nodes, each with four V100-PCIe GPUs.
The multi-node test bed available at the time of writing provided PCIe-based GPUs. Given that the
performance difference between PCIe and SXM2 GPUs is well understood (5-20%, as presented in
Section 3.1.3), the PCIe GPUs were considered a reasonable proxy for understanding the scalability of the
frameworks. The SXM2 GPUs are expected to demonstrate similar scaling patterns.
The results are presented in Figure 9. Since eight compute nodes are used, there are 32 GPUs in total.
With two nodes and eight GPUs, the speedups of all three frameworks are similar to those measured with
V100-SXM2 GPUs (presented in Figure 8). When using 32 GPUs, MXNet scales best, with a 29.4x speedup in
FP32 mode and 25.8x in FP16 mode. TensorFlow also scales well, with a 22.0x speedup in FP32 mode and
23.7x in FP16 mode, while Caffe2 achieves 26.5x in FP32 and 27.3x in FP16.