Table 3 (continued): Comparison of Tesla V100-PCIe and V100-SXM2
                                           V100-PCIe    V100-SXM2
Tensor Cores                               640          640
Memory Bandwidth (GB/s)                    900          900
NVLink Bandwidth (GB/s, uni-direction)     N/A          300
Deep Learning (Tensor TFLOPS)              112          120
TDP (Watts)                                250          300
The Tesla V100 product line includes two variants, V100-PCIe and V100-SXM2; a comparison of the two is shown in Table 3. With V100-PCIe, all GPUs communicate with each other over the PCIe bus. With V100-SXM2, the GPUs are connected by NVIDIA NVLink. In use cases that require multiple GPUs, the V100-SXM2 model therefore provides faster GPU-to-GPU communication over the NVLink interconnect than PCIe can offer. Each V100-SXM2 GPU provides six NVLinks for bi-directional communication. The bandwidth of each NVLink is 25GB/s per direction, and all four GPUs within a node can communicate at the same time, so the theoretical peak bi-directional bandwidth is 6 × 25GB/s × 4 = 600GB/s. Over PCIe the theoretical peak bandwidth is only 16GB/s × 2 = 32GB/s, because the GPUs share the bus and must communicate one at a time rather than in parallel. In theory, GPU-to-GPU communication over NVLink can therefore be up to 600/32 ≈ 18x faster than over PCIe. This performance advantage on real models is evaluated in Section 3.1.3. Because of this advantage, the PowerEdge C4140 compute node in the Deep Learning solution uses V100-SXM2 rather than V100-PCIe GPUs.
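The bandwidth comparison above can be restated as a short back-of-the-envelope sketch. The constants below simply restate the figures quoted in this section; the script is illustrative only and not part of the solution.

    # Back-of-the-envelope peak-bandwidth comparison using the figures quoted above:
    # 6 NVLinks per V100-SXM2 GPU at 25 GB/s per direction, 4 GPUs per PowerEdge
    # C4140 node, and a PCIe Gen3 x16 link at roughly 16 GB/s per direction.

    NVLINKS_PER_GPU = 6
    NVLINK_GBPS_PER_DIRECTION = 25
    GPUS_PER_NODE = 4
    PCIE_GBPS_PER_DIRECTION = 16

    # All four GPUs can drive their NVLinks at the same time.
    nvlink_peak_gbps = NVLINKS_PER_GPU * NVLINK_GBPS_PER_DIRECTION * GPUS_PER_NODE  # 600 GB/s
    # PCIe transfers share the bus, so only one bi-directional transfer runs at a time.
    pcie_peak_gbps = PCIE_GBPS_PER_DIRECTION * 2                                    # 32 GB/s

    print(f"NVLink peak bandwidth: {nvlink_peak_gbps} GB/s")
    print(f"PCIe peak bandwidth:   {pcie_peak_gbps} GB/s")
    print(f"Theoretical speed-up:  {nvlink_peak_gbps / pcie_peak_gbps:.1f}x")       # ~18x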
2.3 Processor recommendation for Head Node and Compute Nodes
The processor chosen for the head node and compute nodes is the Intel® Xeon® Gold 6148 CPU. This is the latest Intel® Xeon® Scalable processor, with 20 physical cores supporting 40 threads. Previous studies, as described in Section 3.1, have concluded that 16 threads are sufficient to feed the I/O pipeline for state-of-the-art convolutional neural networks, so the Gold 6148 is a reasonable choice. Using the same CPU model for both the head node and the compute nodes also keeps the choice consistent across the cluster.
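As an illustration only (this guide does not prescribe a training framework), the finding that roughly 16 threads are enough to feed the input pipeline might translate into a data-loader configuration such as the following PyTorch-style sketch; the dataset path, transforms, and batch size are placeholder assumptions, not part of this solution.

    # Illustrative sketch: size the CPU-side input pipeline to ~16 workers,
    # per the study referenced above, on the 20-core / 40-thread Gold 6148.
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    # Hypothetical ImageNet-style dataset location; replace with real data.
    train_set = datasets.ImageFolder(
        "/data/imagenet/train",
        transform=transforms.Compose([
            transforms.RandomResizedCrop(224),
            transforms.ToTensor(),
        ]),
    )

    # 16 worker processes decode and augment images in parallel; pin_memory
    # speeds up host-to-GPU copies on the V100s.
    train_loader = DataLoader(
        train_set,
        batch_size=256,
        shuffle=True,
        num_workers=16,
        pin_memory=True,
    )

Keeping the worker count near 16 leaves the remaining cores free for the framework's own compute and communication threads.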
2.4 Memory recommendation for Head Node and Compute Nodes
The recommended memory for the head node is 24x 16GB 2666 MT/s DIMMs, for a total of 384GB of system memory. This configuration is chosen based on the following factors:
Capacity: An ideal configuration must provide system memory capacity larger than the total GPU memory. Each compute node has 4 GPUs and each GPU has 16GB of memory, so the system memory must be at least 16GB × 4 = 64GB. The head node memory also affects I/O performance: for the NFS service, larger memory reduces disk read operations, since data cached in memory can be served directly without going back to disk. 16GB DIMMs offer the best performance-per-dollar value.
DIMM configuration: Choices like 24x 16GB or 12x 32GB will provide the same capacity of 384GB
system memory, but according to our studies as shown in Figure 4, the combination of 24x 16GB
DIMMs provides 11% better performance than using 12x 32GB. The results shown here were obtained on the Intel Xeon Platinum 8180 processor, but the same trends apply across other models in the Intel Xeon Scalable Processor Family, including the Gold 6148, although the actual percentage differences across configurations may vary. More details can be found in our Skylake memory study.
Serviceability: The head node and compute node memory configurations are designed to be similar
to reduce parts complexity while satisfying performance and capacity needs. Fewer parts need to be
stocked for replacement, and in urgent cases if a memory module in the head node needs to be