
Deep Learning Performance: Scale-up vs Scale-out
Architectures & Technologies Dell EMC | Infrastructure Solutions Group
[Block diagram: CPU1 and CPU2 linked over UPI; each CPU exposes x16 PCIe ports (Ports 1-3) to 300 W double-wide full-length (GPU-DWFL) accelerators, plus PERC, a PCIe x16 slot, SAS SSDs, and Ethernet.]
Figure 11: Dell PowerEdge R740/R740xd
6 Framework Setup Details
6.1 Distributed Horovod-TensorFlow Setup
Horovod [8][9][10] is a distributed training framework for TensorFlow, Keras, and PyTorch, originally developed by Uber. It uses bandwidth-optimal communication protocols such as RDMA [2].
In this section, we briefly explain the software stack configuration we used to measure multi-node training throughput with distributed Horovod-TensorFlow over a high-speed Mellanox InfiniBand ConnectX-5 network adapter at 100 Gbit/s, using IPoIB and GPUDirect RDMA.
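The IPoIB and GPUDirect RDMA transport described above is typically steered through NCCL environment variables. The following is a minimal sketch; the variable names are standard NCCL knobs, but the device names (`mlx5_0`, `ib0`) are assumptions for a typical ConnectX-5 host and must be checked on each node (e.g. with `ibdev2netdev`):

```shell
# Hedged sketch: HCA and interface names below are placeholders for a
# typical ConnectX-5 host, not values taken from this paper's testbed.
export NCCL_IB_HCA=mlx5_0       # InfiniBand HCA NCCL should use for RDMA traffic
export NCCL_SOCKET_IFNAME=ib0   # IPoIB interface for NCCL bootstrap/socket traffic
export NCCL_NET_GDR_LEVEL=PHB   # allow GPUDirect RDMA when GPU and NIC share a PCIe host bridge
export NCCL_DEBUG=INFO          # log which transport (IB, GDR) NCCL actually selects
```

Setting `NCCL_DEBUG=INFO` is a convenient way to verify at run time that NCCL really chose the InfiniBand/GPUDirect path rather than falling back to plain TCP sockets.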
To set up the configuration, we used as our references the configuration procedure presented by Mellanox on its community blog [3] and the basic installation of Horovod in Docker [4].
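Following that reference procedure, a multi-node launch looks roughly like the following sketch. The image tag, hostnames, slot counts, and training script are illustrative placeholders, not the exact values from [3][4]:

```shell
# Hedged sketch of a Horovod-in-Docker launch; node names, process
# counts, and paths below are hypothetical placeholders.
docker run --gpus all --network host --rm -it \
    -v /mnt/share:/workspace \
    horovod/horovod:latest \
    horovodrun -np 8 -H node1:4,node2:4 \
        python /workspace/train.py
```

Here `-np 8` is the total number of worker processes and `-H node1:4,node2:4` assigns four GPU slots per node; `--network host` lets the containers reach each other over the IPoIB interface directly.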