New NVIDIA V100 32GB GPUs, Initial Performance Results
Deepthi Cherlopalle, HPC and AI Innovation Lab. June 2018
GPUs are useful for accelerating large matrix operations, analytics, deep learning workloads, and several other use cases. NVIDIA introduced the Pascal line of its Tesla GPUs in 2016 and the Volta line in 2017, and recently announced its latest Tesla GPU based on the Volta architecture with 32GB of GPU memory. The V100 is available in both PCIe and NVLink versions, allowing GPU-to-GPU communication over PCIe or over NVLink; the NVLink version of the GPU is also called an SXM2 module.

This blog gives an introduction to the new Volta V100-32GB GPUs and compares HPL performance across the different V100 models. Tests were performed using a Dell EMC PowerEdge C4140 in both PCIe and SXM2 configurations. Several other platforms also support GPUs: PowerEdge R740, PowerEdge R740XD, PowerEdge R840, and PowerEdge R940xa. A similar study was conducted in the past comparing the performance of the P100 and V100 GPUs with the HPL, HPCG, AMBER, and LAMMPS applications.
Table 1 below provides an overview of Volta device specifications.

Table 1: GPU Specifications

                                  Tesla V100-PCIe    Tesla V100-SXM2
  GPU Architecture                Volta              Volta
  NVIDIA Tensor Cores             640                640
  NVIDIA CUDA Cores               5120               5120
  GPU Max Clock Rate              1380 MHz           1530 MHz
  Double Precision Performance    7 TFlops           7.8 TFlops
  Single Precision Performance    14 TFlops          15.7 TFlops
  GPU Memory                      16/32 GB           16/32 GB
  Interconnect Bandwidth          32 GB/s            300 GB/s
  System Interface                PCIe Gen3          NVIDIA NVLink
  Max Power Consumption           250 W              300 W
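The peak double-precision figures in Table 1 follow directly from the core counts and clock rates: the GV100 chip pairs its 5120 FP32 CUDA cores with half as many FP64 units, each retiring one fused multiply-add (2 FLOPs) per cycle. A minimal Python sketch of that arithmetic (the function name is illustrative, not from any NVIDIA tool):

```python
# Derive theoretical peak FP64 throughput for the V100 from Table 1.
# GV100 has 5120 FP32 CUDA cores and half as many (2560) FP64 units;
# each FP64 unit performs one FMA = 2 FLOPs per clock cycle.

def peak_tflops(fp64_units, clock_mhz, flops_per_cycle=2):
    """Peak TFlops = units x clock (Hz) x FLOPs per cycle, scaled to 1e12."""
    return fp64_units * clock_mhz * 1e6 * flops_per_cycle / 1e12

FP64_UNITS = 5120 // 2  # 2560 FP64 units on GV100

print(f"V100-PCIe: {peak_tflops(FP64_UNITS, 1380):.1f} TFlops")  # ~7.1
print(f"V100-SXM2: {peak_tflops(FP64_UNITS, 1530):.1f} TFlops")  # ~7.8
```

The result matches the quoted 7 and 7.8 TFlops, and the higher SXM2 number comes entirely from its higher boost clock; HPL efficiency is then measured against these theoretical peaks.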
