Figure 6 shows the speedup achieved with multiple P100 GPUs across different deep learning frameworks
and neural networks. The figure is intended to show how each framework scales as more GPUs are used.
It is not intended as a comparison among frameworks, since their input parameters differed. With
4 P100 GPUs, we observed speedups of up to 3.8x for NV-Caffe GoogleNet and 3.0x for TensorFlow
Inception-V3. For MXNet, 16 P100 GPUs achieved a 13.5x speedup with GoogleNet and a 14.7x speedup
with Inception-BN, both close to the ideal speedup of 16x. In particular, we observed linear speedup
with 8 and 12 P100 GPUs for the Inception-BN neural network.
Figure 6: Speedup of multiple P100 GPUs in different DL frameworks and networks
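One way to read these numbers is as scaling efficiency, that is, the measured speedup divided by the ideal linear speedup for the same GPU count. The sketch below is illustrative only: the speedup values are those reported in Figure 6, while the function and variable names are our own and do not come from the benchmark scripts.

```python
# Speedup values as reported in Figure 6 (relative to 1 P100, in images/sec).
# GPU counts on the x-axis of the figure.
GPU_COUNTS = [1, 2, 4, 8, 12, 16]

SPEEDUPS = {
    "NV-Caffe GoogleNet": [1.0, 1.9, 3.8],
    "MXNet GoogleNet": [1.0, 1.9, 3.3, 6.7, 10.1, 13.5],
    "MXNet Inception-BN": [1.0, 2.0, 3.9, 8.0, 12.0, 14.7],
    "TensorFlow Inception-V3": [1.0, 1.8, 3.0],
}


def scaling_efficiency(speedup: float, num_gpus: int) -> float:
    """Return the measured speedup divided by the ideal (linear) speedup."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    return speedup / num_gpus


def efficiency_table(speedups=SPEEDUPS, gpu_counts=GPU_COUNTS):
    """Map each series to a list of (num_gpus, speedup, efficiency) tuples."""
    table = {}
    for name, values in speedups.items():
        # zip stops at the shorter list, so series measured on fewer
        # GPU counts (1, 2, 4 only) are handled without special cases.
        table[name] = [
            (n, s, scaling_efficiency(s, n)) for n, s in zip(gpu_counts, values)
        ]
    return table


if __name__ == "__main__":
    for name, rows in efficiency_table().items():
        print(name)
        for n, s, eff in rows:
            print(f"  {n:>2} P100: speedup {s:4.1f}x, efficiency {eff:.0%}")
```

With the Figure 6 values, 16 P100 GPUs work out to roughly 84% efficiency for MXNet GoogleNet (13.5/16) and roughly 92% for MXNet Inception-BN (14.7/16).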
In practice, training a model for a real user application can take days or weeks. Although our
benchmark cases run in a few minutes or a few hours, they are only small snapshots of the much longer
runs needed to fully train a network. For example, training a real application might take 90 epochs
of 1.2M images. A Dell C4130 with P100 GPUs can deliver results in less than a day, while a CPU takes
more than a week. That is the real benefit to end users: in a real use case, the effect is saving
weeks of time per run, not seconds.
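The arithmetic behind this example can be sketched as follows. This is a rough estimate that assumes a constant sustained throughput and ignores data loading, validation passes, and checkpointing. The function names are hypothetical, and no throughput figure here is a measured result.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def total_images(epochs: int, images_per_epoch: int) -> int:
    """Total number of images processed over a full training run."""
    return epochs * images_per_epoch


def training_time_days(epochs: int, images_per_epoch: int,
                       images_per_sec: float) -> float:
    """Estimated wall-clock days, assuming constant throughput."""
    if images_per_sec <= 0:
        raise ValueError("images_per_sec must be positive")
    seconds = total_images(epochs, images_per_epoch) / images_per_sec
    return seconds / SECONDS_PER_DAY


def required_throughput(epochs: int, images_per_epoch: int,
                        max_days: float) -> float:
    """Sustained images/sec needed to finish within max_days."""
    if max_days <= 0:
        raise ValueError("max_days must be positive")
    return total_images(epochs, images_per_epoch) / (max_days * SECONDS_PER_DAY)


if __name__ == "__main__":
    # The example from the text: 90 epochs of 1.2M images.
    epochs, images_per_epoch = 90, 1_200_000
    print(total_images(epochs, images_per_epoch))            # total images
    print(required_throughput(epochs, images_per_epoch, 1))  # finish in 1 day
    print(required_throughput(epochs, images_per_epoch, 7))  # finish in 1 week
```

Under this simplification, 90 epochs of 1.2M images is 108M images in total. Finishing in under a day implies a sustained rate above 1,250 images/sec, while taking more than a week implies a rate below about 179 images/sec.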
Conclusions and Future Work
Overall, we observed strong speedup and scalability in neural network training when multiple P100 GPUs
were used in Dell’s PowerEdge C4130 server and when multiple server nodes were used. Training speed
increased and training time decreased as the number of P100 GPUs increased. The results show that
Dell’s PowerEdge C4130 cluster is a powerful tool for significantly speeding up neural network
training.
In future work, we will test the P100 in NVLink-optimized servers with the same deep learning
frameworks, neural networks, and dataset to see how much performance improvement can be
achieved. This blog tested the PowerEdge C4130 configuration G, in which only GPU 1 and GPU 2,
[Figure 6 chart data: speedup in images/sec (y-axis) versus number of P100 GPUs (x-axis: 1, 2, 4, 8, 12, 16). NV-Caffe GoogleNet: 1, 1.9, 3.8. MXNet GoogleNet: 1, 1.9, 3.3, 6.7, 10.1, 13.5. MXNet Inception-BN: 1, 2.0, 3.9, 8.0, 12.0, 14.7. TensorFlow Inception-V3: 1, 1.8, 3.0.]