Deep Learning on V100
Authors: Rengan Xu, Frank Han, Nishanth Dandapanthula.
HPC Innovation Lab. September 2017
Overview
In this blog, we introduce the NVIDIA Tesla Volta-based V100 GPU and evaluate it with different deep learning frameworks, comparing its performance against the P100 GPU. We also evaluate the two types of V100: V100-PCIe and V100-SXM2. The results indicate that in training the V100 is ~40% faster than the P100 with FP32 and more than 100% faster with FP16, and that in inference the V100 is 3.7x faster than the P100. This is one blog in our Tesla V100 series; another blog in the series covers the performance of general HPC applications on the V100, and you can read it here.
Introduction to V100 GPU
At the 2017 GPU Technology Conference (GTC), NVIDIA announced the Volta-based V100 GPU. As with the P100, there are two types of V100: V100-PCIe and V100-SXM2. V100-PCIe GPUs are interconnected over the PCIe bus, with a bi-directional bandwidth of up to 32 GB/s. V100-SXM2 GPUs are interconnected by NVLink; each GPU has six links, and each link has a bi-directional bandwidth of 50 GB/s, so the bi-directional bandwidth between GPUs is up to 6 x 50 GB/s = 300 GB/s.
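As a quick way to see what the interconnect delivers in practice, the sketch below (our own illustrative code, not the benchmark used in this study) times a device-to-device copy with the standard CUDA runtime API. Note that a single one-directional copy exercises only part of the 300 GB/s figure, which is an aggregate of both directions across all six links.

```cpp
// p2p_bandwidth.cu -- illustrative sketch; error checking omitted for brevity.
// Times a GPU0 -> GPU1 copy to estimate the achieved interconnect bandwidth.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    if (n < 2) { printf("Need at least 2 GPUs\n"); return 1; }

    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    if (can01 && can10) {                 // enable direct P2P when available
        cudaSetDevice(0); cudaDeviceEnablePeerAccess(1, 0);
        cudaSetDevice(1); cudaDeviceEnablePeerAccess(0, 0);
    }

    const size_t bytes = 256UL << 20;     // 256 MiB test buffer
    void *src = nullptr, *dst = nullptr;
    cudaSetDevice(0); cudaMalloc(&src, bytes);
    cudaSetDevice(1); cudaMalloc(&dst, bytes);

    cudaSetDevice(0);
    cudaEvent_t beg, end;
    cudaEventCreate(&beg);
    cudaEventCreate(&end);

    const int reps = 20;                  // repeat to amortize launch overhead
    cudaEventRecord(beg);
    for (int i = 0; i < reps; ++i)
        cudaMemcpyPeerAsync(dst, 1, src, 0, bytes, 0);
    cudaEventRecord(end);
    cudaEventSynchronize(end);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, beg, end);
    printf("GPU0 -> GPU1: %.1f GB/s\n", (double)bytes * reps / (ms * 1e-3) / 1e9);
    return 0;
}
```

On an NVLink-connected V100-SXM2 pair such a copy can use several links at once, while on V100-PCIe it is bounded by the ~16 GB/s per direction of PCIe Gen3 x16.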
V100 also adds a new type of core, the Tensor Core, designed specifically for deep learning. These cores are essentially a collection of ALUs for performing 4x4 matrix operations: specifically, a fused multiply-add (A*B+C) that multiplies two 4x4 FP16 matrices together and then adds the result to a 4x4 FP16/FP32 matrix to generate a final 4x4 FP16/FP32 matrix. By fusing the matrix multiplication and add in one unit, the GPU can achieve high FLOPS for this operation. A single Tensor Core performs the equivalent of 64 FMA operations per clock (128 FLOPS total), and with 8 such cores per Streaming Multiprocessor (SM), that is 1024 FLOPS per clock per SM. By comparison, even with pure FP16 operations, the standard CUDA cores in an SM generate only 256 FLOPS per clock. So in scenarios where these cores can be used, V100 is able to deliver 4x the performance of P100. Scaled to the full chip, 640 Tensor Cores x 128 FLOPS per clock x the 1530 MHz boost clock works out to the ~125 TFLOPS deep learning figure listed for V100-SXM2. The sketch below shows how this data path is exposed to programmers; Table 1 then gives the detailed comparison between V100 and P100.
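As an illustration, here is a minimal sketch using the warp-level WMMA API introduced in CUDA 9, the toolkit release that accompanied V100. The API exposes 16x16x16 tiles, which the hardware internally decomposes into the 4x4 Tensor Core operations described above; the kernel and variable names are our own illustrative choices.

```cpp
// wmma_tile.cu -- minimal Tensor Core sketch; compile with: nvcc -arch=sm_70
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes D = A*B + C on a single 16x16x16 tile, with FP16 inputs
// and an FP32 accumulator (C is initialized to zero in this sketch).
__global__ void wmma_16x16x16(const half *a, const half *b, float *d) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);            // C = 0
    wmma::load_matrix_sync(a_frag, a, 16);     // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(acc, a_frag, b_frag, acc);  // runs on the Tensor Cores
    wmma::store_matrix_sync(d, acc, 16, wmma::mem_row_major);
}
```

A host program would launch this with a single warp, e.g. wmma_16x16x16<<<1, 32>>>(a, b, d), after filling a and b with FP16 data. In practice, deep learning frameworks reach the Tensor Cores through cuDNN and cuBLAS rather than hand-written WMMA kernels.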
Table 1: The comparison between V100 and P100

|                     | P100-PCIe   | V100-PCIe   | Improvement | P100-SXM2   | V100-SXM2   | Improvement |
|---------------------|-------------|-------------|-------------|-------------|-------------|-------------|
| CUDA Cores          | 3584        | 5120        |             | 3584        | 5120        |             |
| Tensor Cores        | N/A         | 640         |             | N/A         | 640         |             |
| Boost Clock         | 1329 MHz    | 1380 MHz    |             | 1481 MHz    | 1530 MHz    |             |
| Memory Bandwidth    | 732 GB/s    | 900 GB/s    | 22.95%      | 732 GB/s    | 900 GB/s    | 22.95%      |
| NVLink Bi-bandwidth | N/A         | N/A         |             | 160 GB/s    | 300 GB/s    |             |
| Double Precision    | 4.7 TFLOPS  | 7 TFLOPS    | 1.5x        | 5.3 TFLOPS  | 7.8 TFLOPS  | 1.5x        |
| Single Precision    | 9.3 TFLOPS  | 14 TFLOPS   | 1.5x        | 10.6 TFLOPS | 15.7 TFLOPS | 1.5x        |
| Deep Learning       | 18.6 TFLOPS | 112 TFLOPS  | 6x          | 21.2 TFLOPS | 125 TFLOPS  | 6x          |
| Architecture        | Pascal      | Volta       |             | Pascal      | Volta       |             |
| TDP                 | 250 W       | 250 W       |             | 300 W       | 300 W       |             |
Testing Methodology
