Reference Guide

14 Dell EMC Ready Solutions for AI Deep Learning with NVIDIA | v1.0
3 Deep Learning Training and Inference Performance and Analysis
In this section, the performance of Deep Learning training and inference is measured using three open source Deep Learning frameworks: TensorFlow, MXNet, and Caffe2. The experiments were conducted on an instance of the solution architecture described in Section 2. The test cluster used a PowerEdge R740xd head node, PowerEdge C4140 compute nodes, different storage sub-systems including Isilon, and an InfiniBand EDR network. A detailed test bed description is provided in the following section.
3.1 Deep Learning Training
The well-known ILSVRC2012 dataset was used for benchmarking performance. This dataset contains 1,281,167 training images and 50,000 validation images, totaling 140 GB. All images are grouped into 1,000 categories or classes. The overall size of ILSVRC2012 leads to non-trivial training times, which makes it more interesting for analysis. Additionally, this dataset is commonly used by Deep Learning researchers for benchmarking and comparison studies. ResNet-50 is a computationally intensive network and was selected to stress the solution to its maximum capability. For the batch size parameter, the maximum batch size that does not cause memory errors was selected; this translated to a batch size of 64 per GPU for MXNet and Caffe2, and 128 per GPU for TensorFlow. Horovod, a distributed training framework for TensorFlow, was used to scale the training across multiple compute nodes. Throughout this document, performance was measured in images/sec, a throughput metric indicating how fast the system can complete training on the dataset.
The images/sec result was averaged across all iterations to account for variation. The total number of iterations is equal to num_epochs*num_images/(batch_size*num_gpus), where num_epochs is the number of passes through all images of the dataset, num_images is the total number of images in the dataset, batch_size is the number of images processed in parallel by one GPU, and num_gpus is the total number of GPUs involved in the training.
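The iteration count above can be sketched as follows (a minimal illustration using the numbers from this section; the helper name is ours, not part of any framework):

```python
import math

def total_iterations(num_epochs, num_images, batch_size, num_gpus):
    """Number of training steps: each step processes batch_size images on each GPU."""
    return math.ceil(num_epochs * num_images / (batch_size * num_gpus))

# ILSVRC2012 training set (1,281,167 images), one 4-GPU node, single epoch:
print(total_iterations(1, 1281167, 128, 4))  # TensorFlow, batch size 128 per GPU
print(total_iterations(1, 1281167, 64, 4))   # MXNet/Caffe2, batch size 64 per GPU
```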
Before running any benchmark, the file system caches on the head node and compute node(s) were cleared.
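The document does not reproduce the exact command used; on Linux, a common way to clear the page cache before a benchmark is the following (requires root):

```shell
# Flush dirty pages to disk, then drop the page cache, dentries, and inodes.
# Assumption: this is a typical approach, not necessarily the exact command
# used in these tests. The write to drop_caches requires root privileges.
sync
{ [ -w /proc/sys/vm/drop_caches ] && echo 3 > /proc/sys/vm/drop_caches; } || true
```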
The training tests were run for a single epoch, or one pass through the entire dataset, since the throughput is consistent across epochs for the MXNet and TensorFlow tests. Consistent throughput means that the performance variation across iterations was not significant; the tests measured less than 2% variation in performance. However, two epochs were used for Caffe2, as it needs two epochs to stabilize: its performance (throughput or images/sec) is not stable (the variation between iterations is large) when the dataset is not fully loaded in memory.
For the MXNet framework, 16 CPU threads were used for dataset decoding; the reason is explained in Deep Learning on V100. Caffe2 does not provide a parameter for users to set the number of CPU threads. For TensorFlow, the number of CPU threads used for dataset decoding is calculated by subtracting four threads per GPU from the total physical core count of the system. The four threads per GPU are used for GPU compute, memory copies, event monitoring, and sending/receiving tensors. The processors used in these tests have 20 cores each, giving 40 cores per server. Hence 24 threads (40 threads - 4 threads/GPU * 4 GPUs) were used for data decoding per compute node for the TensorFlow tests. Table 5 lists the hardware and software details of the testbed used for the results presented in the following sections.
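The TensorFlow thread budget described above can be expressed as a one-line calculation (a sketch; the function name is ours):

```python
def decode_threads(cores_per_server, gpus_per_node, threads_per_gpu=4):
    """CPU threads left for TensorFlow dataset decoding after reserving
    threads_per_gpu threads per GPU for compute, memory copies, event
    monitoring, and tensor send/receive."""
    return cores_per_server - threads_per_gpu * gpus_per_node

# Two 20-core processors per server, 4 GPUs per C4140 compute node:
print(decode_threads(40, 4))  # 24
```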