White Papers

Dell - Internal Use - Confidential

Figure 6 shows the speedup obtained with multiple P100 GPUs across different deep learning frameworks and neural networks. The figure is intended to show how each framework scales as more GPUs are used; it is not a comparison among frameworks, because their input parameters differed. With 4 P100 GPUs, we observed speedups of up to 3.8x for NV-Caffe GoogleNet and 3.0x for TensorFlow Inception-V3. For MXNet, 16 P100 GPUs achieved a 13.5x speedup with GoogleNet and a 14.7x speedup with Inception-BN, both close to the ideal speedup of 16x. In particular, Inception-BN scaled linearly at 8 and 12 P100 GPUs.

Figure 6: Speedup of multiple P100 GPUs in different DL frameworks and networks
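One way to read these numbers is as scaling efficiency, that is, the measured speedup divided by the ideal (linear) speedup for the same GPU count. The following is a minimal sketch of that arithmetic, not part of the benchmark code; the function name scaling_efficiency and the REPORTED dictionary are hypothetical and hold only the speedups quoted in the text.

```python
# Speedups quoted in the text, keyed by (framework, network).
# Each value is (number of P100 GPUs, measured speedup).
# The names REPORTED and scaling_efficiency are illustrative assumptions.
REPORTED = {
    ("NV-Caffe", "GoogleNet"): (4, 3.8),
    ("TensorFlow", "Inception-V3"): (4, 3.0),
    ("MXNet", "GoogleNet"): (16, 13.5),
    ("MXNet", "Inception-BN"): (16, 14.7),
}


def scaling_efficiency(speedup: float, num_gpus: int) -> float:
    """Return the measured speedup as a fraction of the ideal linear speedup."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    return speedup / num_gpus


if __name__ == "__main__":
    for (framework, network), (gpus, speedup) in REPORTED.items():
        eff = scaling_efficiency(speedup, gpus)
        print(f"{framework} {network}: {speedup}x on {gpus} P100 "
              f"= {eff:.0%} of ideal")
```

By this measure the quoted results correspond to roughly 95% (NV-Caffe GoogleNet) and 75% (TensorFlow Inception-V3) of ideal at 4 GPUs, and roughly 84% (MXNet GoogleNet) and 92% (MXNet Inception-BN) at 16 GPUs. As noted above, these figures describe scaling within each framework and should not be used to rank the frameworks.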

In practice, training a model for a real user application can take days or weeks. Although our benchmark cases run for a few minutes to a few hours, they are only small snapshots of the much longer runs needed to fully train a network. For example, training a real application might take 90 epochs of 1.2M images. A Dell C4130 with P100 GPUs can deliver results in less than a day, while a CPU takes more than a week; that is the real benefit to end users. In a real use case, the saving is weeks of time per run, not seconds.
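To make the scale of such a run concrete, the arithmetic can be sketched as follows. This is an illustrative calculation only: the helper names are hypothetical, and the only inputs taken from the text are 90 epochs, 1.2M images per epoch, and the one-day and one-week time frames. It does not reproduce any measured throughput.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def total_images(epochs: int, images_per_epoch: int) -> int:
    """Total number of images processed over the whole training run."""
    return epochs * images_per_epoch


def required_throughput(images: int, days: float) -> float:
    """Average images/sec needed to process `images` within `days` days."""
    if days <= 0:
        raise ValueError("days must be positive")
    return images / (days * SECONDS_PER_DAY)


if __name__ == "__main__":
    images = total_images(epochs=90, images_per_epoch=1_200_000)
    print(f"Total images: {images:,}")
    print(f"Needed to finish in 1 day:  "
          f"{required_throughput(images, 1):,.0f} images/sec")
    print(f"Needed to finish in 1 week: "
          f"{required_throughput(images, 7):,.1f} images/sec")
```

Under these assumptions the run processes 108 million images, so finishing within a day requires a sustained average of about 1,250 images/sec, while a system averaging less than about 179 images/sec would need more than a week.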

Conclusions and Future Work

Overall, we observed substantial speedup and good scalability in neural network training when multiple P100 GPUs were used in Dell’s PowerEdge C4130 server and when multiple server nodes were used. Training speed increased, and training time decreased, as the number of P100 GPUs increased. The results show that a Dell PowerEdge C4130 cluster is a powerful tool for significantly speeding up neural network training.

In future work, we will test the P100 in NVLink-optimized servers with the same deep learning frameworks, neural networks and dataset to see how much performance improvement can be achieved. This blog tested the PowerEdge C4130 in configuration G, in which only GPU 1 and GPU 2,

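[Figure 6 data labels. Y-axis: speedup in images/sec; X-axis: number of P100 GPUs (1, 2, 4, 8, 12, 16).
NV-Caffe GoogleNet: 1.9 (2 P100), 3.8 (4 P100)
MXNet GoogleNet: 1.9 (2 P100), 3.3 (4 P100), 6.7 (8 P100), 10.1 (12 P100), 13.5 (16 P100)
MXNet Inception-BN: 2.0 (2 P100), 3.9 (4 P100), 8.0 (8 P100), 12.0 (12 P100), 14.7 (16 P100)
TensorFlow Inception-V3: 1.8 (2 P100), 3.0 (4 P100)]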