The batch size is one of the hyperparameters the user needs to tune when training a neural network
model with mini-batch Stochastic Gradient Descent (SGD). The batch sizes in Table 2 are commonly used
values; whether these batch sizes are optimal for model accuracy is left for future work. For all neural
networks in all frameworks, we increased the batch size proportionally with the number of GPUs. At the
same time, the number of iterations was reduced so that the total number of training samples remained
fixed regardless of how many GPUs were used (a minimal sketch of this scheme is shown after Table 2).
Since the number of epochs is independent of the batch size, it was not changed when a different number
of GPUs was used. For MXNet GoogleNet, a runtime error occurred if different batch sizes were used for
different numbers of GPUs, so a constant batch size was used across all GPU counts. The learning rate is
another hyperparameter that needs to be tuned; in this experiment, the default value in each framework
was used.
Table 2: Input parameters used in different deep learning frameworks
| Framework / Network      | Hardware  | Batch size | Image shape | Iterations/Epochs |
|--------------------------|-----------|------------|-------------|-------------------|
| NV-Caffe GoogleNet       | CPU       | 128        | 224         | 4000 iterations   |
| NV-Caffe GoogleNet       | 1 P100    | 128        | 224         | 4000 iterations   |
| NV-Caffe GoogleNet       | 2 P100    | 256        | 224         | 2000 iterations   |
| NV-Caffe GoogleNet       | 4 P100    | 512        | 224         | 1000 iterations   |
| TensorFlow Inception-V3  | 1 P100    | 64         | 299         | 4000 iterations   |
| TensorFlow Inception-V3  | 2 P100    | 128        | 299         | 2000 iterations   |
| TensorFlow Inception-V3  | 4 P100    | 256        | 299         | 1000 iterations   |
| MXNet GoogleNet          | 1-16 P100 | 144        | 256         | 1 epoch           |
| MXNet Inception-BN       | 1 P100    | 64         | 224         | 1 epoch           |
| MXNet Inception-BN       | 2 P100    | 128        | 224         | 1 epoch           |
| MXNet Inception-BN       | 4 P100    | 256        | 224         | 1 epoch           |
| MXNet Inception-BN       | 8 P100    | 256        | 224         | 1 epoch           |
| MXNet Inception-BN       | 12 P100   | 256        | 224         | 1 epoch           |
| MXNet Inception-BN       | 16 P100   | 256        | 224         | 1 epoch           |
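
To make the scaling scheme concrete, the following is a minimal sketch (not taken from any of the
benchmarked frameworks) of how the batch size and iteration count can be derived from the GPU count.
The base values and the function name are illustrative; the base values follow the NV-Caffe GoogleNet
single-GPU row of Table 2.

```python
# Minimal sketch of the weak-scaling scheme described above: the batch size
# grows proportionally with the number of GPUs while the iteration count
# shrinks, so the total number of processed samples stays constant.
# Base values (128 images, 4000 iterations) follow the NV-Caffe GoogleNet
# single-GPU row of Table 2; the function name is illustrative.

def scaled_hyperparameters(num_gpus, base_batch_size=128, base_iterations=4000):
    """Return (batch_size, iterations) for a given GPU count.

    batch_size * iterations stays fixed at base_batch_size * base_iterations,
    so every run processes the same total number of images.
    """
    batch_size = base_batch_size * num_gpus
    iterations = base_iterations // num_gpus
    return batch_size, iterations

for gpus in (1, 2, 4):
    bs, iters = scaled_hyperparameters(gpus)
    print(f"{gpus} x P100: batch size {bs}, {iters} iterations, "
          f"total samples {bs * iters}")
```

Running this for 1, 2 and 4 GPUs reproduces the NV-Caffe GoogleNet rows of Table 2 (128/4000, 256/2000,
512/1000), each processing the same 512,000 images in total.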
Performance Evaluation
Figure 2 shows the training speed (images/sec) and training time (wall-clock time) of the GoogleNet neural
network in NV-Caffe using P100 GPUs. The training speed increased as the number of P100 GPUs increased
and, as a result, the training time decreased. The CPU result in Figure 2 was obtained with Intel-Caffe on
two Intel Xeon E5-2690 v4 CPUs (14-core Broadwell processors) within one node. We chose Intel-Caffe for
the pure CPU test because it has better CPU optimizations than NV-Caffe. From Figure 2, we can see that 1
P100 GPU is ~5.3x faster and 4 P100 GPUs are ~19.7x faster than the Broadwell-based CPU server. Since
NV-Caffe does not yet support distributed training, we only ran it on up to 4 P100 GPUs within one
node.
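
As a side note on how the two metrics in Figure 2 relate, the sketch below shows that training speed
(images/sec) is simply the fixed total image count divided by wall-clock time, and that speedup is the ratio
of throughputs. The timings used here are round placeholder numbers, not the measured results.

```python
# Illustrative sketch of how training speed and speedup are derived from
# wall-clock time. The timings below are placeholders, NOT the measured
# results; the real values are read from Figure 2.

TOTAL_IMAGES = 128 * 4000  # batch size x iterations (NV-Caffe GoogleNet, Table 2)

def images_per_sec(wall_clock_seconds):
    """Throughput = total processed images / training time."""
    return TOTAL_IMAGES / wall_clock_seconds

# Placeholder wall-clock times in seconds for the CPU server, 1 GPU and 4 GPUs.
timings = {"2x Broadwell CPU": 1000.0, "1 P100": 200.0, "4 P100": 50.0}

cpu_speed = images_per_sec(timings["2x Broadwell CPU"])
for label, t in timings.items():
    speed = images_per_sec(t)
    print(f"{label}: {speed:.1f} images/sec, {speed / cpu_speed:.1f}x vs CPU")
```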