                                          P100-SXM2   V100-SXM2
GPU Max Clock Rate (MHz)                  1481        1530
Tensor Cores                              N/A         640
Memory Bandwidth (GB/s)                   732         900
NVLink Uni-directional Bandwidth (GB/s)   80          150
Double Precision (TFLOPS)                 5.1         7.5
Deep Learning (Tensor TFLOPS)             0           120
TDP (Watts)                               300         300
Figure 6: Performance comparison between V100-SXM2 and P100-SXM2 for Resnet50 with ILSVRC2012 within one node (four GPUs)
3.1.3 V100-SXM2 vs V100-PCIe
V100-SXM2 GPUs are recommended over V100-PCIe GPUs in the Deep Learning solution described in this
document. When multiple GPUs are used, V100-SXM2 has the advantage of the faster NVLink interconnect for
GPU-to-GPU communication, whereas V100-PCIe must communicate over the PCIe bus. As described in
Section 2.2.1, each V100-SXM2 GPU has six NVLinks for bi-directional communication. The bandwidth of each
NVLink is 25 GB/s per direction, and all four GPUs within a node can communicate at the same time, so the
theoretical peak bi-directional bandwidth is 6 × 25 × 4 = 600 GB/s. In contrast, the theoretical peak bandwidth
over PCIe is only 16 × 2 = 32 GB/s, because the GPUs must share the bus and can only communicate in turn.
In theory, therefore, data communication over NVLink could be up to 600/32 ≈ 18x faster than over PCIe.
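As a rough illustration of how such a comparison can be made in practice, the following minimal CUDA sketch (not the benchmark used in this guide; the file name p2p_bw.cu and the 256 MiB buffer size are arbitrary choices) enables peer access between GPU 0 and GPU 1 and times a single cudaMemcpyPeer() transfer to estimate the achievable P2P bandwidth.

// p2p_bw.cu: illustrative sketch that times a device-to-device copy between
// GPU 0 and GPU 1 with peer access enabled. Build with: nvcc -o p2p_bw p2p_bw.cu
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256ULL << 20;          // 256 MiB test buffer (arbitrary size)
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);  // is direct P2P (NVLink or PCIe) possible?
    if (!canAccess) { printf("P2P not supported between GPU 0 and GPU 1\n"); return 0; }

    void *src, *dst;
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);           // let GPU 0 address GPU 1 memory
    cudaMalloc(&src, bytes);                    // source buffer on GPU 0
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);
    cudaMalloc(&dst, bytes);                    // destination buffer on GPU 1

    cudaSetDevice(0);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaMemcpyPeer(dst, 1, src, 0, bytes);      // explicit P2P copy: GPU 0 -> GPU 1
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("P2P copy bandwidth: %.2f GB/s\n", (bytes / 1e9) / (ms / 1e3));
    return 0;
}

On an NVLink-connected GPU pair the reported rate should approach the per-link NVLink figures discussed above, while on a PCIe-only topology it is bounded by roughly 16 GB/s per direction for a Gen3 x16 link.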
To evaluate the performance advantage of NVLink over PCIe, the Peer-to-Peer (P2P, i.e., GPU-to-GPU within
the same compute node) memory access time was profiled with nvprof while running three Deep Learning
frameworks. It was found that TensorFlow implements P2P without an explicit call to the cudaMemcpyPeer()
API, whereas nvprof attributes P2P communication based on that API; therefore the P2P speedup for
TensorFlow cannot be profiled with nvprof.
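To make this limitation concrete, the sketch below shows one way (purely hypothetical, and not a description of TensorFlow's internals) that P2P traffic can occur without any cudaMemcpyPeer() call: once peer access is enabled, a kernel running on GPU 0 can dereference memory that resides on GPU 1, so the data still crosses NVLink or PCIe, but the profiler records only a kernel launch rather than a P2P memcpy.

// implicit_p2p.cu: hypothetical example of implicit P2P traffic.
// Build with: nvcc -o implicit_p2p implicit_p2p.cu
#include <cstdio>
#include <cuda_runtime.h>

// Copy kernel: dst resides on the current GPU, src may live on a peer GPU.
__global__ void copyFromPeer(float *dst, const float *src, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[i];                  // remote read travels over NVLink/PCIe
}

int main() {
    const size_t n = 1 << 24;                    // 16M floats (~64 MiB), arbitrary size
    float *onGpu1, *onGpu0;

    cudaSetDevice(1);
    cudaMalloc(&onGpu1, n * sizeof(float));      // source buffer on GPU 1

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);            // let GPU 0 address GPU 1 memory
    cudaMalloc(&onGpu0, n * sizeof(float));      // destination buffer on GPU 0

    // No cudaMemcpyPeer() here: the kernel itself pulls data from the peer GPU.
    copyFromPeer<<<(n + 255) / 256, 256>>>(onGpu0, onGpu1, n);
    cudaDeviceSynchronize();
    printf("done: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}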