Reference Guide

14 Dell EMC Ready Solutions for AI Deep Learning with NVIDIA | v1.0
3 Deep Learning Training and Inference Performance and Analysis
In this section, the performance of Deep Learning training and inference is measured using three open source Deep Learning frameworks: TensorFlow, MXNet, and Caffe2. The experiments were conducted on an instance of the solution architecture described in Section 2. The test cluster used a PowerEdge R740xd head node, PowerEdge C4140 compute nodes, different storage sub-systems including Isilon, and an InfiniBand EDR network. A detailed test bed description is provided in the following section.
3.1 Deep Learning Training
The well-known ILSVRC2012 dataset was used for benchmarking performance. This dataset contains 1,281,167 training images and 50,000 validation images, totaling 140 GB. All images are grouped into 1,000 categories or classes. The overall size of ILSVRC2012 leads to non-trivial training times, which makes it more interesting for analysis. Additionally, this dataset is commonly used by Deep Learning researchers for benchmarking and comparison studies. ResNet-50 is a computationally intensive network and was selected to stress the solution to its maximum capability. For the batch size parameter, the maximum batch size that does not cause memory errors was selected; this translated to a batch size of 64 per GPU for MXNet and Caffe2, and 128 per GPU for TensorFlow. Horovod, a distributed training framework for TensorFlow, was used to scale the training across multiple compute nodes. Throughout this document, performance was measured in images/sec, a throughput metric indicating how fast the system can complete training on the dataset.
The images/sec result was averaged across all iterations to account for variation. The total number of iterations is equal to num_epochs*num_images/(batch_size*num_gpus), where num_epochs is the number of passes through all images of the dataset, num_images is the total number of images in the dataset, batch_size is the number of images processed in parallel by one GPU, and num_gpus is the total number of GPUs involved in the training.
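The iteration count above can be sketched as follows (a minimal illustration using the numbers from this section; the helper name is ours, not part of any framework):

```python
import math

def total_iterations(num_epochs, num_images, batch_size, num_gpus):
    """Number of training steps: each step processes batch_size images on each GPU."""
    return math.ceil(num_epochs * num_images / (batch_size * num_gpus))

# ILSVRC2012 training set (1,281,167 images), one 4-GPU node, single epoch:
print(total_iterations(1, 1281167, 128, 4))  # TensorFlow, batch size 128 per GPU
print(total_iterations(1, 1281167, 64, 4))   # MXNet/Caffe2, batch size 64 per GPU
```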
Before running any benchmark, the file system caches on the head node and compute node(s) were cleared.
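The document does not reproduce the exact command used; on Linux, a common way to clear the page cache before a benchmark is the following (requires root):

```shell
# Flush dirty pages to disk, then drop the page cache, dentries, and inodes.
# Assumption: this is a typical approach, not necessarily the exact command
# used in these tests. The write to drop_caches requires root privileges.
sync
{ [ -w /proc/sys/vm/drop_caches ] && echo 3 > /proc/sys/vm/drop_caches; } || true
```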
The training tests were run for a single epoch, or one pass through the entire dataset, since the throughput is consistent across epochs for the MXNet and TensorFlow tests. Consistent throughput means that the performance variation across iterations was not significant; the tests measured less than 2% variation in performance. However, two epochs were used for Caffe2, as it needs two epochs to stabilize: its performance (throughput or images/sec) is not stable (the variation between iterations is large) when the dataset is not fully loaded in memory.
For the MXNet framework, 16 CPU threads were used for dataset decoding; the reason is explained in Deep Learning on V100. Caffe2 does not provide a parameter for users to set the number of CPU threads. For TensorFlow, the number of CPU threads used for dataset decoding is calculated by subtracting four threads per GPU from the total physical core count of the system. The four threads per GPU are used for GPU compute, memory copies, event monitoring, and sending/receiving tensors. The processors used in these tests have 20 cores each, giving 40 cores per server. Hence 24 threads (40 threads - 4 threads/GPU * 4 GPUs) were used for data decoding per compute node for the TensorFlow tests. Table 5 lists the hardware and software details of the testbed used for the results presented in the following sections.
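The TensorFlow thread budget described above can be expressed as a one-line calculation (a sketch; the function name is ours):

```python
def decode_threads(cores_per_server, gpus_per_node, threads_per_gpu=4):
    """CPU threads left for TensorFlow dataset decoding after reserving
    threads_per_gpu threads per GPU for compute, memory copies, event
    monitoring, and tensor send/receive."""
    return cores_per_server - threads_per_gpu * gpus_per_node

# Two 20-core processors per server, 4 GPUs per C4140 compute node:
print(decode_threads(40, 4))  # 24
```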