Reference Guide

16 Dell EMC Ready Solutions for AI Deep Learning with NVIDIA | v1.0
training. This section compares the performance of using FP16 for training versus FP32.
In experiments where training was run in FP16 precision, the batch size was doubled, since FP16 stores each floating-point value in half the memory that FP32 requires. Doubling the batch size with FP16 ensures that GPU memory is utilized equally in both types of tests. The performance comparison of FP16 versus FP32 is shown in Figure 5 for all three frameworks used in this study. Tests were conducted on up to two PowerEdge C4140 compute nodes. Overall, FP16 is 65% to 91% faster than FP32.
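The memory accounting behind the doubled batch size can be sketched as follows. This is a minimal illustration, not part of the study's test harness; the batch sizes and the ImageNet-style input shape (224x224x3) are assumptions chosen only to show the arithmetic.

```python
# FP16 stores each value in 2 bytes, FP32 in 4 bytes, so doubling the
# FP16 batch size keeps the memory footprint of the input batch equal.
BYTES_FP32 = 4
BYTES_FP16 = 2

def batch_bytes(batch_size, elems_per_sample, bytes_per_elem):
    """Memory needed to hold one batch of input tensors."""
    return batch_size * elems_per_sample * bytes_per_elem

elems = 224 * 224 * 3  # one ImageNet-sized RGB image (hypothetical input shape)

# Batch of 64 in FP32 occupies the same memory as a batch of 128 in FP16.
assert batch_bytes(64, elems, BYTES_FP32) == batch_bytes(128, elems, BYTES_FP16)
```

The same reasoning applies to activations and gradients held in FP16, which is why the equal-memory comparison uses twice the batch size rather than the same one.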
Although FP16 makes training faster, it requires extra work in the neural network model implementation to match the accuracy achieved with FP32. This is because some neural networks require their gradient values to be shifted into the FP16-representable range, which may involve scaling and normalization during training. For more details, refer to NVIDIA's mixed precision training documentation. In future work, the image classification accuracy obtained with FP16 will be compared to that obtained with FP32.
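The gradient-shifting problem mentioned above can be sketched with loss scaling, the standard mixed-precision technique: small gradients that underflow to zero in FP16 are multiplied by a scale factor before the backward pass and unscaled afterwards. The gradient magnitude and scale factor below are hypothetical, chosen only to demonstrate the effect; the FP16 rounding uses Python's standard-library half-precision `struct` format.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest FP16 value (struct 'e' format)."""
    return struct.unpack('e', struct.pack('e', x))[0]

grad = 2e-8      # a tiny gradient, below the smallest FP16 subnormal (hypothetical)
scale = 65536.0  # loss-scale factor (hypothetical)

# Without scaling, the gradient underflows to zero and is lost entirely.
assert to_fp16(grad) == 0.0

# With scaling, the gradient survives the FP16 round trip to within ~1%.
scaled = to_fp16(grad * scale)     # shifted into FP16's representable range
recovered = scaled / scale         # unscaled after the backward pass
assert abs(recovered - grad) / grad < 0.01
```

Frameworks that support mixed precision apply this scaling automatically, but the model implementation still has to keep master weights and the unscaling step in FP32, which is the "extra work" referred to above.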
Figure 5: Performance improvement of FP16 over FP32 for Resnet50 with ILSVRC2012
3.1.2 V100 vs P100
Table 6 compares V100-SXM2 and its previous generation GPU, the P100-SXM2. To demonstrate the
performance advantages of the V100 GPU over its previous generation P100 GPU, the performance of one
node with four V100-SXM2 GPUs was compared to that of a node with four P100-SXM2 GPUs. Figure 6 shows this performance comparison. The results show that in FP32 mode the V100 is 26% faster than the P100 with TensorFlow, and 52% faster with MXNet. This is because the V100-SXM2 has more CUDA cores, a higher clock rate, and higher memory bandwidth than the P100-SXM2. In FP16 mode the V100 is 103% faster than the P100 with TensorFlow, and 124% faster with MXNet. The V100's much larger advantage in FP16 mode is because the P100 lacks the Tensor Cores that the V100 uses to accelerate FP16 matrix operations.
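For clarity, the percentages above are relative throughput gains: a 26% speedup means the V100 sustains 1.26x the images/second of the P100 on the same workload. A small sketch of the arithmetic, with hypothetical throughput numbers chosen only to illustrate the formula (they are not measurements from this study):

```python
def speedup_pct(new_ips: float, old_ips: float) -> float:
    """Percent throughput improvement (e.g. images/sec) of new over old."""
    return (new_ips / old_ips - 1.0) * 100.0

# Hypothetical example: 1260 img/s vs 1000 img/s is a 26% speedup.
assert round(speedup_pct(1260.0, 1000.0)) == 26

# A 100% speedup means throughput has doubled.
assert speedup_pct(2000.0, 1000.0) == 100.0
```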
Table 6: Comparison between P100-SXM2 and V100-SXM2
                P100-SXM2    V100-SXM2
CUDA Cores      3584         5120