                                          P100-SXM2   V100-SXM2
GPU Max Clock Rate (MHz)                  1481        1530
Tensor Cores                              N/A         640
Memory Bandwidth (GB/s)                   732         900
NVLink Uni-directional Bandwidth (GB/s)   80          150
Double Precision (TFLOPS)                 5.1         7.5
Deep Learning (Tensor TFLOPS)             0           120
TDP (Watts)                               300         300
Figure 6: Performance comparison between V100-SXM2 and P100-SXM2 for Resnet50 with ILSVRC2012 within one node (four GPUs)
3.1.3 V100-SXM2 vs V100-PCIe
V100-SXM2 GPUs are recommended over V100-PCIe GPUs in the Deep Learning solution described in this
document. When multiple GPUs are used, V100-SXM2 has the advantage of the faster NVLink interconnect for
GPU-to-GPU communication, whereas V100-PCIe must communicate over the PCIe bus. As described in
Section 2.2.1, each V100-SXM2 GPU has six NVLinks for bi-directional communication. The bandwidth of each
NVLink is 25 GB/s per direction, and all four GPUs within a node can communicate at the same time, so the
theoretical peak bi-directional bandwidth is 6 × 25 × 4 = 600 GB/s. In contrast, the theoretical peak bandwidth
over PCIe is only 16 × 2 = 32 GB/s, because the GPUs must share the bus and can only communicate in turn.
In theory, therefore, data communication over NVLink could be up to 600/32 ≈ 18x faster than over PCIe.
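As a rough illustration of how such a comparison can be made in practice, the following minimal CUDA sketch (not the benchmark used in this guide; the file name p2p_bw.cu and the 256 MiB buffer size are arbitrary choices) enables peer access between GPU 0 and GPU 1 and times a single cudaMemcpyPeer() transfer to estimate the achievable P2P bandwidth.

// p2p_bw.cu: illustrative sketch that times a device-to-device copy between
// GPU 0 and GPU 1 with peer access enabled. Build with: nvcc -o p2p_bw p2p_bw.cu
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256ULL << 20;          // 256 MiB test buffer (arbitrary size)
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);  // is direct P2P (NVLink or PCIe) possible?
    if (!canAccess) { printf("P2P not supported between GPU 0 and GPU 1\n"); return 0; }

    void *src, *dst;
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);           // let GPU 0 address GPU 1 memory
    cudaMalloc(&src, bytes);                    // source buffer on GPU 0
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);
    cudaMalloc(&dst, bytes);                    // destination buffer on GPU 1

    cudaSetDevice(0);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaMemcpyPeer(dst, 1, src, 0, bytes);      // explicit P2P copy: GPU 0 -> GPU 1
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("P2P copy bandwidth: %.2f GB/s\n", (bytes / 1e9) / (ms / 1e3));
    return 0;
}

On an NVLink-connected GPU pair the reported rate should approach the per-link NVLink figures discussed above, while on a PCIe-only topology it is bounded by roughly 16 GB/s per direction for a Gen3 x16 link.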
To evaluate the performance advantage of NVLink over PCIe, the Peer-to-Peer (P2P, i.e., GPU-to-GPU within
the same compute node) memory access time was profiled with nvprof while running three Deep Learning
frameworks. It was found that TensorFlow implements P2P without an explicit call to the cudaMemcpyPeer()
API, whereas nvprof attributes P2P communication based on that API; therefore the P2P speedup for
TensorFlow cannot be profiled with nvprof.
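To make this limitation concrete, the sketch below shows one way (purely hypothetical, and not a description of TensorFlow's internals) that P2P traffic can occur without any cudaMemcpyPeer() call: once peer access is enabled, a kernel running on GPU 0 can dereference memory that resides on GPU 1, so the data still crosses NVLink or PCIe, but the profiler records only a kernel launch rather than a P2P memcpy.

// implicit_p2p.cu: hypothetical example of implicit P2P traffic.
// Build with: nvcc -o implicit_p2p implicit_p2p.cu
#include <cstdio>
#include <cuda_runtime.h>

// Copy kernel: dst resides on the current GPU, src may live on a peer GPU.
__global__ void copyFromPeer(float *dst, const float *src, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[i];                  // remote read travels over NVLink/PCIe
}

int main() {
    const size_t n = 1 << 24;                    // 16M floats (~64 MiB), arbitrary size
    float *onGpu1, *onGpu0;

    cudaSetDevice(1);
    cudaMalloc(&onGpu1, n * sizeof(float));      // source buffer on GPU 1

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);            // let GPU 0 address GPU 1 memory
    cudaMalloc(&onGpu0, n * sizeof(float));      // destination buffer on GPU 0

    // No cudaMemcpyPeer() here: the kernel itself pulls data from the peer GPU.
    copyFromPeer<<<(n + 255) / 256, 256>>>(onGpu0, onGpu1, n);
    cudaDeviceSynchronize();
    printf("done: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}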