
cuDNN library                 Version 5.1.3
Intel Compiler                Version 2017.0.098
Python                        Version 2.7.5
Deep Learning Frameworks:
  NV-Caffe                    Version 0.15.13
  Intel-Caffe                 Version 1.0.0-rc3
  MXNet                       Version 0.7.0
  TensorFlow                  Version 0.11.0-rc2
We measured two metrics: images/sec and training time. Images/sec measures the training speed,
while the training time is the wall-clock time spent on training, I/O operations, and other overhead.
The images/sec number was obtained from the "samples/sec" entries in the MXNet and TensorFlow
output log files. NV-Caffe instead reports "M s/N iter", meaning that M seconds were taken to process
N iterations (i.e., N batches), so images/sec was calculated as batch_size*N/M. The batch size is the
number of training samples in one forward/backward pass through all layers of a neural network. The
images/sec number was averaged across all iterations to take the deviations into account.
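
As an illustration, a minimal Python sketch of this calculation (the log file name, the batch size, and
the exact spelling of the "M s/N iter" lines are assumptions here, not values from our runs):

    import re

    BATCH_SIZE = 64                # hypothetical batch size used for the run
    LOG_FILE = "nv_caffe.log"      # hypothetical NV-Caffe output log

    # Match lines such as "12.34s/100iter"; the exact spacing and wording
    # may differ between NV-Caffe versions.
    pattern = re.compile(r"([\d.]+)\s*s\s*/\s*(\d+)\s*iter")

    rates = []
    with open(LOG_FILE) as f:
        for line in f:
            m = pattern.search(line)
            if m:
                seconds, iters = float(m.group(1)), int(m.group(2))
                # images/sec = batch_size * N / M for N iterations in M seconds
                rates.append(BATCH_SIZE * iters / seconds)

    # Average across all reported intervals to smooth out deviations
    print("images/sec: %.1f" % (sum(rates) / len(rates)))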
The training time was obtained from the "Time cost" entries in the MXNet output logs. The NV-Caffe
and TensorFlow output log files contain wall-clock timestamps throughout the training, so the training
time was calculated as the difference between the first and last timestamps.
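
A similar sketch for the training time, assuming each log line begins with a timestamp in a
hypothetical "YYYY-MM-DD HH:MM:SS" layout (the real NV-Caffe and TensorFlow formats differ and
would need a matching parse string):

    from datetime import datetime

    LOG_FILE = "training.log"          # hypothetical log file
    TS_FORMAT = "%Y-%m-%d %H:%M:%S"    # hypothetical timestamp layout

    timestamps = []
    with open(LOG_FILE) as f:
        for line in f:
            try:
                # Assume the first 19 characters of a line are the timestamp
                timestamps.append(datetime.strptime(line[:19], TS_FORMAT))
            except ValueError:
                continue  # skip lines that do not begin with a timestamp

    # Training time = wall-clock span from the first to the last timestamp
    print("training time:", timestamps[-1] - timestamps[0])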
Since NV-Caffe did not support distributed training, it was not run on multiple nodes. The MXNet
framework was able to run on multiple nodes. The caveat was that, by default, it could only use the
Ethernet interface (10 Gb/s) on the compute nodes, so the performance was not as high as expected.
To solve this issue, we manually changed its source code so that the high-speed InfiniBand interface
(EDR 100 Gb/s) was used instead (see the note after this paragraph). Training with TensorFlow on
multiple nodes was able to run, but with poor performance; the reason is still under investigation.
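
For reference, later builds of ps-lite (MXNet's distributed key-value store backend) expose a
DMLC_INTERFACE environment variable that selects the network interface, which may avoid a source
patch; whether the MXNet 0.7.0 build used here honors it is not confirmed. A hedged sketch, where
the "ib0" IPoIB device name is also an assumption:

    import os

    # Ask ps-lite to bind its sockets to the InfiniBand (IPoIB) interface.
    # "ib0" is the conventional device name and an assumption here.
    os.environ["DMLC_INTERFACE"] = "ib0"

    import mxnet as mx

    # Creating the distributed key-value store starts the ps-lite
    # communication layer, which reads DMLC_INTERFACE at that point.
    kv = mx.kvstore.create("dist_sync")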
Table 2 shows the input parameters used in the different deep learning frameworks. In all of them,
neural network training requires many epochs or iterations; which term is used depends on the
framework. An epoch is a complete pass through all samples in a given dataset, while one iteration
processes only one batch of samples. The relationship between iterations and epochs is therefore:
epochs = (iterations * batch_size) / training_samples. Each framework only needs either epochs or
iterations, since the other parameter can easily be determined from this formula.
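
A small Python sketch of this conversion (the ImageNet-scale sample count in the example is
illustrative, not one of our run parameters):

    import math

    def epochs_from_iterations(iterations, batch_size, training_samples):
        # epochs = (iterations * batch_size) / training_samples
        return iterations * batch_size / float(training_samples)

    def iterations_from_epochs(epochs, batch_size, training_samples):
        # Inverse of the formula above; round up so a final partial
        # batch still counts as one iteration.
        return int(math.ceil(epochs * training_samples / float(batch_size)))

    # Illustrative example: 1,281,167 training samples, batch size 256
    print(epochs_from_iterations(5005, 256, 1281167))   # ~1.0 epoch
    print(iterations_from_epochs(1, 256, 1281167))      # 5005 iterations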
Since our goal was to measure the performance and scalability of Dell's server, not to train an end-to-
end image classification model, the training was a subset of the full model training that was large
enough to reflect performance. We therefore chose a smaller number of epochs or iterations so that
the runs could finish in a reasonable time. Although only partial training was performed, the training
speed (images/sec) remained relatively constant over this period.