The batch size is one of the hyperparameters the user needs to tune when training a neural network
model with mini-batch Stochastic Gradient Descent (SGD). The batch sizes in Table 2 are commonly used
values; whether these batch sizes are optimal for model accuracy is left for future work. For all neural
networks in all frameworks, we increased the batch size proportionally with the number of GPUs. At the
same time, the number of iterations was reduced so that the total number of training samples remained
fixed regardless of how many GPUs were used (a minimal sketch of this scheme is shown after Table 2).
Since the number of epochs is independent of the batch size, it was not changed when a different number
of GPUs was used. For MXNet GoogleNet, a runtime error occurred if different batch sizes were used for
different numbers of GPUs, so a constant batch size was used across all GPU counts. The learning rate is
another hyperparameter that needs to be tuned; in this experiment, the default value in each framework
was used.
Table 2: Input parameters used in different deep learning frameworks
| Framework / Network      | Hardware  | Batch size | Image shape | Iterations/Epochs |
|--------------------------|-----------|------------|-------------|-------------------|
| NV-Caffe GoogleNet       | CPU       | 128        | 224         | 4000 iterations   |
| NV-Caffe GoogleNet       | 1 P100    | 128        | 224         | 4000 iterations   |
| NV-Caffe GoogleNet       | 2 P100    | 256        | 224         | 2000 iterations   |
| NV-Caffe GoogleNet       | 4 P100    | 512        | 224         | 1000 iterations   |
| TensorFlow Inception-V3  | 1 P100    | 64         | 299         | 4000 iterations   |
| TensorFlow Inception-V3  | 2 P100    | 128        | 299         | 2000 iterations   |
| TensorFlow Inception-V3  | 4 P100    | 256        | 299         | 1000 iterations   |
| MXNet GoogleNet          | 1-16 P100 | 144        | 256         | 1 epoch           |
| MXNet Inception-BN       | 1 P100    | 64         | 224         | 1 epoch           |
| MXNet Inception-BN       | 2 P100    | 128        | 224         | 1 epoch           |
| MXNet Inception-BN       | 4 P100    | 256        | 224         | 1 epoch           |
| MXNet Inception-BN       | 8 P100    | 256        | 224         | 1 epoch           |
| MXNet Inception-BN       | 12 P100   | 256        | 224         | 1 epoch           |
| MXNet Inception-BN       | 16 P100   | 256        | 224         | 1 epoch           |
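
To make the scaling scheme concrete, the following is a minimal sketch (not taken from any of the
benchmarked frameworks) of how the batch size and iteration count can be derived from the GPU count.
The base values and the function name are illustrative; the base values follow the NV-Caffe GoogleNet
single-GPU row of Table 2.

```python
# Minimal sketch of the weak-scaling scheme described above: the batch size
# grows proportionally with the number of GPUs while the iteration count
# shrinks, so the total number of processed samples stays constant.
# Base values (128 images, 4000 iterations) follow the NV-Caffe GoogleNet
# single-GPU row of Table 2; the function name is illustrative.

def scaled_hyperparameters(num_gpus, base_batch_size=128, base_iterations=4000):
    """Return (batch_size, iterations) for a given GPU count.

    batch_size * iterations stays fixed at base_batch_size * base_iterations,
    so every run processes the same total number of images.
    """
    batch_size = base_batch_size * num_gpus
    iterations = base_iterations // num_gpus
    return batch_size, iterations

for gpus in (1, 2, 4):
    bs, iters = scaled_hyperparameters(gpus)
    print(f"{gpus} x P100: batch size {bs}, {iters} iterations, "
          f"total samples {bs * iters}")
```

Running this for 1, 2 and 4 GPUs reproduces the NV-Caffe GoogleNet rows of Table 2 (128/4000, 256/2000,
512/1000), each processing the same 512,000 images in total.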
Performance Evaluation
Figure 2 shows the training speed (images/sec) and training time (wall-clock time) of the GoogleNet neural
network in NV-Caffe using P100 GPUs. The training speed increased as the number of P100 GPUs increased
and, as a result, the training time decreased. The CPU result in Figure 2 was obtained with Intel-Caffe on
two Intel Xeon E5-2690 v4 CPUs (14-core Broadwell processors) within one node. We chose Intel-Caffe for
the pure CPU test because it has better CPU optimizations than NV-Caffe. From Figure 2, we can see that 1
P100 GPU is ~5.3x faster and 4 P100 GPUs are ~19.7x faster than the Broadwell-based CPU server. Since
NV-Caffe does not yet support distributed training, we only ran it on up to 4 P100 GPUs within one
node.
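
As a side note on how the two metrics in Figure 2 relate, the sketch below shows that training speed
(images/sec) is simply the fixed total image count divided by wall-clock time, and that speedup is the ratio
of throughputs. The timings used here are round placeholder numbers, not the measured results.

```python
# Illustrative sketch of how training speed and speedup are derived from
# wall-clock time. The timings below are placeholders, NOT the measured
# results; the real values are read from Figure 2.

TOTAL_IMAGES = 128 * 4000  # batch size x iterations (NV-Caffe GoogleNet, Table 2)

def images_per_sec(wall_clock_seconds):
    """Throughput = total processed images / training time."""
    return TOTAL_IMAGES / wall_clock_seconds

# Placeholder wall-clock times in seconds for the CPU server, 1 GPU and 4 GPUs.
timings = {"2x Broadwell CPU": 1000.0, "1 P100": 200.0, "4 P100": 50.0}

cpu_speed = images_per_sec(timings["2x Broadwell CPU"])
for label, t in timings.items():
    speed = images_per_sec(t)
    print(f"{label}: {speed:.1f} images/sec, {speed / cpu_speed:.1f}x vs CPU")
```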