
cuDNN library                 Version 5.1.3
Intel Compiler                Version 2017.0.098
Python                        Version 2.7.5
Deep Learning Frameworks:
  NV-Caffe                    Version 0.15.13
  Intel-Caffe                 Version 1.0.0-rc3
  MXNet                       Version 0.7.0
  TensorFlow                  Version 0.11.0-rc2
We measured two metrics: images/sec and training time. Images/sec measures the training speed,
while the training time is the wall-clock time spent on training, I/O operations, and other overhead.
The images/sec number was obtained from the "samples/sec" entries in the MXNet and TensorFlow
output log files. NV-Caffe instead reports "M s/N iter", meaning that M seconds were taken to process
N iterations (i.e., N batches), so images/sec was calculated as batch_size*N/M. The batch size is the
number of training samples in one forward/backward pass through all layers of a neural network. The
images/sec number was averaged across all iterations to take the deviations into account.
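
As an illustration, a minimal Python sketch of this calculation (the log file name, the batch size, and
the exact spelling of the "M s/N iter" lines are assumptions here, not values from our runs):

    import re

    BATCH_SIZE = 64                # hypothetical batch size used for the run
    LOG_FILE = "nv_caffe.log"      # hypothetical NV-Caffe output log

    # Match lines such as "12.34s/100iter"; the exact spacing and wording
    # may differ between NV-Caffe versions.
    pattern = re.compile(r"([\d.]+)\s*s\s*/\s*(\d+)\s*iter")

    rates = []
    with open(LOG_FILE) as f:
        for line in f:
            m = pattern.search(line)
            if m:
                seconds, iters = float(m.group(1)), int(m.group(2))
                # images/sec = batch_size * N / M for N iterations in M seconds
                rates.append(BATCH_SIZE * iters / seconds)

    # Average across all reported intervals to smooth out deviations
    print("images/sec: %.1f" % (sum(rates) / len(rates)))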
The training time was obtained from the "Time cost" entries in the MXNet output logs. The NV-Caffe
and TensorFlow output log files contain wall-clock timestamps throughout the training, so the training
time was calculated as the difference between the first and last timestamps.
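
A similar sketch for the training time, assuming each log line begins with a timestamp in a
hypothetical "YYYY-MM-DD HH:MM:SS" layout (the real NV-Caffe and TensorFlow formats differ and
would need a matching parse string):

    from datetime import datetime

    LOG_FILE = "training.log"          # hypothetical log file
    TS_FORMAT = "%Y-%m-%d %H:%M:%S"    # hypothetical timestamp layout

    timestamps = []
    with open(LOG_FILE) as f:
        for line in f:
            try:
                # Assume the first 19 characters of a line are the timestamp
                timestamps.append(datetime.strptime(line[:19], TS_FORMAT))
            except ValueError:
                continue  # skip lines that do not begin with a timestamp

    # Training time = wall-clock span from the first to the last timestamp
    print("training time:", timestamps[-1] - timestamps[0])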
Since NV-Caffe did not support distributed training, it was not run on multiple nodes. The MXNet
framework was able to run on multiple nodes. The caveat was that, by default, it could only use the
Ethernet interface (10 Gb/s) on the compute nodes, so the performance was not as high as expected.
To solve this issue, we manually changed its source code so that the high-speed InfiniBand interface
(EDR 100 Gb/s) was used instead (see the note after this paragraph). Training with TensorFlow on
multiple nodes was able to run, but with poor performance; the reason is still under investigation.
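
For reference, later builds of ps-lite (MXNet's distributed key-value store backend) expose a
DMLC_INTERFACE environment variable that selects the network interface, which may avoid a source
patch; whether the MXNet 0.7.0 build used here honors it is not confirmed. A hedged sketch, where
the "ib0" IPoIB device name is also an assumption:

    import os

    # Ask ps-lite to bind its sockets to the InfiniBand (IPoIB) interface.
    # "ib0" is the conventional device name and an assumption here.
    os.environ["DMLC_INTERFACE"] = "ib0"

    import mxnet as mx

    # Creating the distributed key-value store starts the ps-lite
    # communication layer, which reads DMLC_INTERFACE at that point.
    kv = mx.kvstore.create("dist_sync")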
Table 2 shows the input parameters used in the different deep learning frameworks. In all of them,
neural network training requires many epochs or iterations; which term is used depends on the
framework. An epoch is a complete pass through all samples in a given dataset, while one iteration
processes only one batch of samples. The relationship between iterations and epochs is therefore:
epochs = (iterations * batch_size) / training_samples. Each framework only needs either epochs or
iterations, since the other parameter can easily be determined from this formula.
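
A small Python sketch of this conversion (the ImageNet-scale sample count in the example is
illustrative, not one of our run parameters):

    import math

    def epochs_from_iterations(iterations, batch_size, training_samples):
        # epochs = (iterations * batch_size) / training_samples
        return iterations * batch_size / float(training_samples)

    def iterations_from_epochs(epochs, batch_size, training_samples):
        # Inverse of the formula above; round up so a final partial
        # batch still counts as one iteration.
        return int(math.ceil(epochs * training_samples / float(batch_size)))

    # Illustrative example: 1,281,167 training samples, batch size 256
    print(epochs_from_iterations(5005, 256, 1281167))   # ~1.0 epoch
    print(iterations_from_epochs(1, 256, 1281167))      # 5005 iterations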
Since our goal was to measure the performance and scalability of Dell's server, not to train an end-to-
end image classification model, the training was a subset of the full model training that was large
enough to reflect performance. We therefore chose a smaller number of epochs or iterations so that
the runs could finish in a reasonable time. Although only partial training was performed, the training
speed (images/sec) remained relatively constant over this period.