White Papers

Dell - Internal Use - Confidential
Figure 2: The training speed and time of GoogleNet in NV-Caffe using P100 GPUs
Figure 3 and Figure 4 show the training speed and time of the GoogleNet and Inception-BN neural networks
in MXNet using P100 GPUs. In both figures, the 8-P100 runs used 2 nodes, the 12-P100 runs used 3 nodes,
and the 16-P100 runs used 4 nodes (4 GPUs per node). As both figures show, MXNet scaled well in training
speed and training time as more P100 GPUs were used. As mentioned in the Testing Methodology section,
using the Ethernet interfaces on all nodes degraded the training speed and training time significantly,
because the I/O was not fast enough to keep the GPU computations fed. In our observation, the training
speed over Ethernet was only about half of that over the InfiniBand interfaces. In both MXNet and
TensorFlow, the CPU implementation was extremely slow; we believe these frameworks are not optimized
for CPUs, so we did not compare their P100 performance with CPU performance.
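To make the "feed the GPUs" argument concrete, a minimal back-of-envelope sketch estimates the raw image bandwidth the fastest run demands. The ~110 KB average compressed ILSVRC12 image size is an assumption for illustration, not a value measured in this study:

```python
# Back-of-envelope estimate of the input I/O rate needed to sustain
# the fastest MXNet rate from Figure 3 (16 P100s). The average JPEG
# size below is an ASSUMPTION (a commonly quoted ballpark), not data
# from this paper.
images_per_sec = 2955          # aggregate training speed, 16 P100
avg_image_bytes = 110 * 1024   # assumed average compressed image size

bytes_per_sec = images_per_sec * avg_image_bytes
gbits_per_sec = bytes_per_sec * 8 / 1e9
print(f"Input stream: {bytes_per_sec / 1e6:.0f} MB/s "
      f"= {gbits_per_sec:.1f} Gbit/s")
```

Even this image stream alone, before any gradient-exchange traffic is added, is a noticeable load on an Ethernet link, which is consistent with the roughly halved training speed we observed over Ethernet versus InfiniBand.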
[Figure 2 chart data: NV-Caffe GoogleNet ILSVRC12]

Configuration    Training speed                Training time
                 (images/sec, higher better)   (seconds, lower better)
CPU              89                            5760
1 P100           468                           1151
2 P100           894                           593
4 P100           1755                          338
[Figure 3 chart data: MXNet GoogleNet ILSVRC12]

Configuration    Training speed                Training time
                 (images/sec, higher better)   (seconds, lower better)
1 P100           220                           5837
2 P100           411                           3115
4 P100           730                           1757
8 P100           1472                          871
12 P100          2227                          576
16 P100          2955                          435
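The scaling behavior described above can be quantified with a short Python sketch. The training speeds and times are read from Figure 3; "efficiency" here means speedup divided by GPU count, so 1.0 would be perfect linear scaling:

```python
# Training speeds (images/sec) and times (seconds) for MXNet
# GoogleNet, read from Figure 3; keys are P100 GPU counts.
speeds = {1: 220, 2: 411, 4: 730, 8: 1472, 12: 2227, 16: 2955}
times = {1: 5837, 2: 3115, 4: 1757, 8: 871, 12: 576, 16: 435}

base = speeds[1]
for gpus, speed in speeds.items():
    speedup = speed / base
    efficiency = speedup / gpus  # 1.0 = perfect linear scaling
    print(f"{gpus:2d} P100: speedup {speedup:5.2f}x, "
          f"efficiency {efficiency:.0%}")

# Sanity check: speed * time is roughly constant (~1.28M images for
# every configuration), which suggests the reported times cover one
# pass over the ILSVRC12 training set -- an inference from the
# numbers, not a statement from the text.
for gpus in speeds:
    assert abs(speeds[gpus] * times[gpus] - 1.28e6) < 2e4
```

Even at 16 GPUs across 4 nodes, the speedup over a single P100 stays above 13x, i.e. better than 80% parallel efficiency, which is what the figures show qualitatively.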