Reference Guide

16 Dell EMC Ready Solutions for AI Deep Learning with NVIDIA | v1.0
training. This section compares the performance of using FP16 for training versus FP32.
In experiments where training was run in FP16 precision, the batch size was doubled, since FP16 stores each floating-point value in half the memory that FP32 requires. Doubling the batch size with FP16 ensures that GPU memory is utilized equally in both types of tests. The performance comparison of FP16 versus FP32 is shown in Figure 5 for all three frameworks used in this study. Tests were conducted on up to two PowerEdge C4140 compute nodes. Overall, FP16 is 65% to 91% faster than FP32.
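The memory accounting behind the doubled batch size can be sketched as follows. This is a minimal illustration, not part of the study's test harness; the batch sizes and the ImageNet-style input shape (224x224x3) are assumptions chosen only to show the arithmetic.

```python
# FP16 stores each value in 2 bytes, FP32 in 4 bytes, so doubling the
# FP16 batch size keeps the memory footprint of the input batch equal.
BYTES_FP32 = 4
BYTES_FP16 = 2

def batch_bytes(batch_size, elems_per_sample, bytes_per_elem):
    """Memory needed to hold one batch of input tensors."""
    return batch_size * elems_per_sample * bytes_per_elem

elems = 224 * 224 * 3  # one ImageNet-sized RGB image (hypothetical input shape)

# Batch of 64 in FP32 occupies the same memory as a batch of 128 in FP16.
assert batch_bytes(64, elems, BYTES_FP32) == batch_bytes(128, elems, BYTES_FP16)
```

The same reasoning applies to activations and gradients held in FP16, which is why the equal-memory comparison uses twice the batch size rather than the same one.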
Although FP16 makes training faster, it requires extra work in the neural network model implementation to match the accuracy achieved with FP32. This is because some neural networks require their gradient values to be shifted into the FP16-representable range, which may involve scaling and normalization during training. For more details, refer to NVIDIA's mixed precision training documentation. In future work, the image classification accuracy obtained with FP16 will be compared to that obtained with FP32.
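The gradient-shifting problem mentioned above can be sketched with loss scaling, the standard mixed-precision technique: small gradients that underflow to zero in FP16 are multiplied by a scale factor before the backward pass and unscaled afterwards. The gradient magnitude and scale factor below are hypothetical, chosen only to demonstrate the effect; the FP16 rounding uses Python's standard-library half-precision `struct` format.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest FP16 value (struct 'e' format)."""
    return struct.unpack('e', struct.pack('e', x))[0]

grad = 2e-8      # a tiny gradient, below the smallest FP16 subnormal (hypothetical)
scale = 65536.0  # loss-scale factor (hypothetical)

# Without scaling, the gradient underflows to zero and is lost entirely.
assert to_fp16(grad) == 0.0

# With scaling, the gradient survives the FP16 round trip to within ~1%.
scaled = to_fp16(grad * scale)     # shifted into FP16's representable range
recovered = scaled / scale         # unscaled after the backward pass
assert abs(recovered - grad) / grad < 0.01
```

Frameworks that support mixed precision apply this scaling automatically, but the model implementation still has to keep master weights and the unscaling step in FP32, which is the "extra work" referred to above.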
Figure 5: Performance improvement of FP16 over FP32 for Resnet50 with ILSVRC2012
3.1.2 V100 vs P100
Table 6 compares V100-SXM2 and its previous generation GPU, the P100-SXM2. To demonstrate the
performance advantages of the V100 GPU over its previous generation P100 GPU, the performance of one
node with four V100-SXM2 GPUs was compared to that of a node with four P100-SXM2 GPUs. Figure 6 shows this performance comparison. The results show that in FP32 mode the V100 is 26% faster than the P100 with TensorFlow, and 52% faster with MXNet. This is because the V100-SXM2 has more CUDA cores, a higher clock rate, and higher memory bandwidth than the P100-SXM2. In FP16 mode the V100 is 103% faster than the P100 with TensorFlow, and 124% faster with MXNet. The V100's much larger advantage in FP16 mode is because the P100 lacks the Tensor Cores that the V100 uses to accelerate FP16 matrix operations.
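For clarity, the percentages above are relative throughput gains: a 26% speedup means the V100 sustains 1.26x the images/second of the P100 on the same workload. A small sketch of the arithmetic, with hypothetical throughput numbers chosen only to illustrate the formula (they are not measurements from this study):

```python
def speedup_pct(new_ips: float, old_ips: float) -> float:
    """Percent throughput improvement (e.g. images/sec) of new over old."""
    return (new_ips / old_ips - 1.0) * 100.0

# Hypothetical example: 1260 img/s vs 1000 img/s is a 26% speedup.
assert round(speedup_pct(1260.0, 1000.0)) == 26

# A 100% speedup means throughput has doubled.
assert speedup_pct(2000.0, 1000.0) == 100.0
```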
Table 6: Comparison between P100-SXM2 and V100-SXM2
                P100-SXM2    V100-SXM2
CUDA Cores      3584         5120