Application Performance on P100-PCIe GPUs
Authors: Rengan Xu, Frank Han and Nishanth Dandapanthu. Dell EMC HPC Innovation Lab. Feb 2017
Introduction to P100-PCIe GPU
This blog presents a performance analysis of NVIDIA® Tesla® P100 GPUs on a cluster of Dell PowerEdge C4130 servers. The P100 comes in two variants: PCIe-based and SXM2-based. In PCIe-based servers, the GPUs are connected over PCIe buses, and one P100 delivers around 4.7 TeraFLOPS of double-precision and 9.3 TeraFLOPS of single-precision performance. In SXM2-based servers, the GPUs are connected by NVLink, and one P100 delivers around 5.3 and 10.6 TeraFLOPS of double- and single-precision performance, respectively. This blog focuses on the P100 for PCIe-based servers, i.e. P100-PCIe.
We have already analyzed P100 performance for several deep learning frameworks in a previous blog. The objective of this blog is to compare the performance of HPL, LAMMPS, NAMD, GROMACS, HOOMD-blue, Amber, ANSYS Mechanical and RELION on these GPUs. The hardware configuration of the cluster is the same as in the deep learning blog: four C4130 nodes, each with dual Intel Xeon E5-2690 v4 CPUs and four NVIDIA P100-PCIe GPUs, with all nodes connected by EDR InfiniBand. Table 1 shows the details of the hardware and software used in every compute node.
Table 1: Experiment Platform and Software Details

Platform: PowerEdge C4130 (configuration G)
Processor: 2 x Intel Xeon E5-2690 v4 @ 2.6 GHz (Broadwell)
Memory: 256 GB DDR4 @ 2400 MHz
Disk: 9 TB HDD
GPU: P100-PCIe with 16 GB GPU memory
Node Interconnect: Mellanox ConnectX-4 VPI (EDR 100 Gb/s InfiniBand)
InfiniBand Switch: Mellanox SB7890

Software and Firmware
Operating System: RHEL 7.2 x86_64
Linux Kernel Version: 3.10.0-327.el7
BIOS: Version 2.3.3
CUDA Version and Driver: CUDA 8.0.44 (driver 375.20)
MPI: Open MPI 2.0.1
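As a quick sanity check on the platform in Table 1, a short device-query program can confirm what the CUDA runtime sees on each node. The sketch below is illustrative, not part of the original study; the hard-coded 64 FP32 / 32 FP64 cores-per-SM values are specific to compute capability 6.0 (Pascal) parts such as the P100. It enumerates the GPUs and estimates their theoretical peaks from SM count and boost clock:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        // Pascal (sm_60, P100): 64 FP32 cores and 32 FP64 cores per SM.
        const int fp32PerSM = 64, fp64PerSM = 32;
        const double ghz = prop.clockRate / 1.0e6;  // clockRate is reported in kHz
        // Peak = cores x clock x 2 FLOP per fused multiply-add.
        const double fp32T = prop.multiProcessorCount * fp32PerSM * ghz * 2.0 / 1e3;
        const double fp64T = prop.multiProcessorCount * fp64PerSM * ghz * 2.0 / 1e3;
        printf("GPU %d: %s, %d SMs @ %.0f MHz, %.1f GB\n", d, prop.name,
               prop.multiProcessorCount, ghz * 1000.0,
               prop.totalGlobalMem / 1.0e9);
        printf("  theoretical peak: %.1f TFLOPS FP64, %.1f TFLOPS FP32\n",
               fp64T, fp32T);
    }
    return 0;
}

Compiled with nvcc (e.g., nvcc gpu_peek.cu -o gpu_peek, a hypothetical file name) and run on one compute node, this should report four P100-PCIe devices with 16 GB each, at roughly the double- and single-precision peaks quoted above.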
