
Deep Learning Performance: Scale-up vs Scale-out
Architectures & Technologies Dell EMC | Infrastructure Solutions Group
[Block diagram: CPU1 and CPU2 linked over UPI; each CPU exposes x16 PCIe ports (Ports 1-3) to 300 W double-wide full-length (GPU-DWFL) accelerators, plus PERC, a PCIe x16 slot, SAS SSDs, and Ethernet.]
Figure 11: Dell PowerEdge R740/R740xd
6 Framework Setup Details
6.1 Distributed Horovod-TensorFlow Setup
Horovod [8][9][10] is a distributed training framework for TensorFlow, Keras, and PyTorch, originally developed by Uber. It uses bandwidth-optimal communication protocols such as RDMA [2].
In this section, we briefly explain the software stack configuration we used to measure multi-node training throughput with distributed Horovod-TensorFlow over a high-speed Mellanox InfiniBand ConnectX-5 network adapter at 100 Gbit/s, using IPoIB and GPUDirect RDMA.
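The IPoIB and GPUDirect RDMA transport described above is typically steered through NCCL environment variables. The following is a minimal sketch; the variable names are standard NCCL knobs, but the device names (`mlx5_0`, `ib0`) are assumptions for a typical ConnectX-5 host and must be checked on each node (e.g. with `ibdev2netdev`):

```shell
# Hedged sketch: HCA and interface names below are placeholders for a
# typical ConnectX-5 host, not values taken from this paper's testbed.
export NCCL_IB_HCA=mlx5_0       # InfiniBand HCA NCCL should use for RDMA traffic
export NCCL_SOCKET_IFNAME=ib0   # IPoIB interface for NCCL bootstrap/socket traffic
export NCCL_NET_GDR_LEVEL=PHB   # allow GPUDirect RDMA when GPU and NIC share a PCIe host bridge
export NCCL_DEBUG=INFO          # log which transport (IB, GDR) NCCL actually selects
```

Setting `NCCL_DEBUG=INFO` is a convenient way to verify at run time that NCCL really chose the InfiniBand/GPUDirect path rather than falling back to plain TCP sockets.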
To set up the configuration, we used as our references the configuration procedure presented by Mellanox on its community blog [3] and the basic installation of Horovod in Docker [4].
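Following that reference procedure, a multi-node launch looks roughly like the following sketch. The image tag, hostnames, slot counts, and training script are illustrative placeholders, not the exact values from [3][4]:

```shell
# Hedged sketch of a Horovod-in-Docker launch; node names, process
# counts, and paths below are hypothetical placeholders.
docker run --gpus all --network host --rm -it \
    -v /mnt/share:/workspace \
    horovod/horovod:latest \
    horovodrun -np 8 -H node1:4,node2:4 \
        python /workspace/train.py
```

Here `-np 8` is the total number of worker processes and `-H node1:4,node2:4` assigns four GPU slots per node; `--network host` lets the containers reach each other over the IPoIB interface directly.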