White Papers

Dell - Internal Use - Confidential
Figure 2: The training speed and time of GoogleNet in NV-Caffe using P100 GPUs
Figure 3 and Figure 4 show the training speed and time of the GoogleNet and Inception-BN neural networks
in MXNet using P100 GPUs. In both figures, the 8-P100 runs used 2 nodes, the 12-P100 runs used 3 nodes,
and the 16-P100 runs used 4 nodes (4 GPUs per node). As both figures show, MXNet scaled well in training
speed and training time as more P100 GPUs were used. As mentioned in the Testing Methodology section,
using the Ethernet interfaces on all nodes degraded the training speed and training time significantly,
because the I/O was not fast enough to keep the GPU computations fed. In our observation, the training
speed over Ethernet was only about half of that over the InfiniBand interfaces. In both MXNet and
TensorFlow, the CPU implementation was extremely slow; we believe these frameworks are not optimized
for CPUs, so we did not compare their P100 performance with CPU performance.
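To make the "feed the GPUs" argument concrete, a minimal back-of-envelope sketch estimates the raw image bandwidth the fastest run demands. The ~110 KB average compressed ILSVRC12 image size is an assumption for illustration, not a value measured in this study:

```python
# Back-of-envelope estimate of the input I/O rate needed to sustain
# the fastest MXNet rate from Figure 3 (16 P100s). The average JPEG
# size below is an ASSUMPTION (a commonly quoted ballpark), not data
# from this paper.
images_per_sec = 2955          # aggregate training speed, 16 P100
avg_image_bytes = 110 * 1024   # assumed average compressed image size

bytes_per_sec = images_per_sec * avg_image_bytes
gbits_per_sec = bytes_per_sec * 8 / 1e9
print(f"Input stream: {bytes_per_sec / 1e6:.0f} MB/s "
      f"= {gbits_per_sec:.1f} Gbit/s")
```

Even this image stream alone, before any gradient-exchange traffic is added, is a noticeable load on an Ethernet link, which is consistent with the roughly halved training speed we observed over Ethernet versus InfiniBand.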
[Figure 2 chart data: NV-Caffe GoogleNet ILSVRC12]

Configuration    Training speed                Training time
                 (images/sec, higher better)   (seconds, lower better)
CPU              89                            5760
1 P100           468                           1151
2 P100           894                           593
4 P100           1755                          338
[Figure 3 chart data: MXNet GoogleNet ILSVRC12]

Configuration    Training speed                Training time
                 (images/sec, higher better)   (seconds, lower better)
1 P100           220                           5837
2 P100           411                           3115
4 P100           730                           1757
8 P100           1472                          871
12 P100          2227                          576
16 P100          2955                          435
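The scaling behavior described above can be quantified with a short Python sketch. The training speeds and times are read from Figure 3; "efficiency" here means speedup divided by GPU count, so 1.0 would be perfect linear scaling:

```python
# Training speeds (images/sec) and times (seconds) for MXNet
# GoogleNet, read from Figure 3; keys are P100 GPU counts.
speeds = {1: 220, 2: 411, 4: 730, 8: 1472, 12: 2227, 16: 2955}
times = {1: 5837, 2: 3115, 4: 1757, 8: 871, 12: 576, 16: 435}

base = speeds[1]
for gpus, speed in speeds.items():
    speedup = speed / base
    efficiency = speedup / gpus  # 1.0 = perfect linear scaling
    print(f"{gpus:2d} P100: speedup {speedup:5.2f}x, "
          f"efficiency {efficiency:.0%}")

# Sanity check: speed * time is roughly constant (~1.28M images for
# every configuration), which suggests the reported times cover one
# pass over the ILSVRC12 training set -- an inference from the
# numbers, not a statement from the text.
for gpus in speeds:
    assert abs(speeds[gpus] * times[gpus] - 1.28e6) < 2e4
```

Even at 16 GPUs across 4 nodes, the speedup over a single P100 stays above 13x, i.e. better than 80% parallel efficiency, which is what the figures show qualitatively.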