Game-changing Extreme GPU Computing with the Dell PowerEdge C4130

A Dell Technical White Paper

This white paper describes the system architecture and performance characterization of the PowerEdge C4130. The C4130 offers a highly configurable system design and ultra-high accelerator/coprocessor density. The combination of these factors has resulted in substantial performance improvements for a number of industry-standard HPC benchmarks and applications.
THIS WHITE PAPER IS FOR INFORMATIONAL PURPOSES ONLY, AND MAY CONTAIN TYPOGRAPHICAL ERRORS AND TECHNICAL INACCURACIES. THE CONTENT IS PROVIDED AS IS, WITHOUT EXPRESS OR IMPLIED WARRANTIES OF ANY KIND. © 2015 Dell Inc. All rights reserved. Reproduction of this material in any manner whatsoever without the express written permission of Dell Inc. is strictly forbidden. For more information, contact Dell.
Contents

Figures ................................................................ 2
1. Introduction ........................................................ 3
2. The PowerEdge C4130 ................................................. 3
   Unique system configurability ......................................
Figures

Figure 1: The PowerEdge C4130 server ................................... 3
Figure 2: Block diagram of the four GPU board configurations available on the C4130 ... 4
Figure 3: Block diagram of the two GPU board configurations available on the C4130 .... 5
Figure 4: Unique placement of GPUs.
1. Introduction

GPU computing is now established and widespread in the HPC community. Ever-increasing demand for compute power has pushed server designs toward higher hardware accelerator density. However, most such designs use a standard system configuration, which may not be optimal for maximum performance across all application classes. The latest high-density design from Dell, the PowerEdge C4130, offers up to four GPU boards in a 1U form factor.
Figure 2: Block diagram of the four GPU board configurations available on the C4130 (Configurations A, B, and C)
Figure 3: Block diagram of the two GPU board configurations available on the C4130 (Configurations D and E)

Table 1: Characteristics of the C4130 configurations

Configuration  GPU Boards  CPUs  Switch Module  GPU:CPU Ratio  Comments
A              4           1     Yes            8:1            Single CPU, optimized for peer-to-peer communication
B              4           2     Yes            8:2            Dual CPUs, optimized for peer-to-peer communication
C              4           2     No             8:2            Dual CPUs, balanced with four GPU boards
D              2           2     No             4:2            Dual CPUs
E              2           1     No             4:2
Ultra-high density

The C4130 enables dense GPU computing, with up to four accelerators/coprocessors per U. Most current servers on the market offer densities of one or two accelerators/coprocessors per U. This high density combined with configurability proves to be a powerful combination.

Accelerator-friendly layout

The C4130 is a purpose-built server with an accelerator-friendly thermal design, shown in Figure 4. The accelerators/coprocessors are loaded in the front of the system.
3. The Tesla K80 GPU accelerator board

The Tesla K80 is the latest HPC-focused, general-purpose GPU released in NVIDIA's Tesla series. From an HPC perspective, its most important improvement is the 1.87 TFLOPS (double-precision) compute capacity, about 30% more than the K40, the previous Tesla card. The K80's autoboost feature automatically provides additional performance when additional power headroom is available.
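The quoted 1.87 TFLOPS figure can be reproduced from the board's published specifications. The sketch below assumes the GK210 GPU's FP64 core count (64 FP64 cores per SMX, 13 SMX per GPU), which comes from NVIDIA's documentation rather than from this paper:

```python
# Hedged sketch: derive the K80's quoted double-precision peak from its
# published specifications. The FP64 core count is an assumption based on
# NVIDIA's GK210 documentation (13 SMX x 64 FP64 cores per GPU).

def peak_dp_tflops(gpus_per_board, fp64_cores_per_gpu, clock_ghz):
    # A fused multiply-add counts as 2 FLOPs per cycle per FP64 core
    return gpus_per_board * fp64_cores_per_gpu * 2 * clock_ghz / 1000.0

k80 = peak_dp_tflops(gpus_per_board=2,        # two internal GPUs per K80 board
                     fp64_cores_per_gpu=832,  # assumed: 13 SMX x 64 FP64 cores
                     clock_ghz=0.562)         # base clock, per Appendix A
print(f"K80 peak DP: {k80:.2f} TFLOPS")       # ~1.87
```

At the 875 MHz boost clock the same arithmetic gives roughly 2.91 TFLOPS, which is why autoboost can add significant headroom-dependent performance.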
4. Performance characterization

4.1 Bandwidth between CPU and GPU

We measured the host-to-device (H2D) and device-to-host (D2H) bandwidth of the five C4130 configurations. Figures 6 and 7 show the measured bandwidths. Two CPUs and eight GPUs (two internal GPUs per K80 board) yield 16 CPU-to-GPU combinations. The CPU (host) to GPU (device) bandwidth measurements for each configuration are shown in the figures below.
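Bandwidth figures of this kind are typically derived the way NVIDIA's bandwidthTest sample does it: time a large pinned-memory transfer and divide bytes moved by elapsed time. A minimal sketch of that arithmetic, with purely illustrative numbers (not measurements from this paper):

```python
# Hedged sketch of how H2D/D2H bandwidth is conventionally computed:
# bandwidth = bytes transferred / elapsed seconds. The transfer size and
# timing below are illustrative assumptions, not results from this paper.

def bandwidth_gbs(bytes_moved, seconds):
    return bytes_moved / seconds / 1e9

# e.g. a 256 MiB pinned-memory transfer completing in 26 ms
size_bytes = 256 * 1024 * 1024
print(f"{bandwidth_gbs(size_bytes, 0.026):.1f} GB/s")
```

A result in the 10 GB/s range is what a single PCIe Gen3 x16 link typically sustains with pinned host memory.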
4.2 Accelerating High Performance Linpack (HPL)

Figure 8: HPL performance, efficiency, and acceleration compared to CPU-only (four and two K80 boards)

In this section, we evaluate the performance of the C4130 with up to four K80 GPU boards on HPL. Given the importance of HPL in comparing HPC systems, this section presents key performance characterization data for the C4130.
Figure 9: HPL power, performance/watt, and power consumption compared to CPU-only (four and two K80 boards)

Figure 9 shows the power consumption data for the HPL runs. In general, GPUs can consume substantial power on compute-intensive workloads. As shown above, the power consumption of configurations A, B, and C is significantly higher (2.9X to 3.3X) than that of CPU-only runs; this is due to the four K80 GPUs. Power consumption of D and E is lower (1.8X to 2.
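A simple additive power model is consistent with these ratios: system power is roughly a CPU-only baseline plus the per-GPU draw. The 400 W baseline below is an assumption for illustration only, not a figure from this paper:

```python
# Hedged sketch: rough power model -- system power ~ CPU-only baseline
# plus per-GPU draw at the K80's 300 W rating. The 400 W baseline is an
# assumed, illustrative value, not a measurement from this paper.

def power_ratio(baseline_w, n_gpus, gpu_w=300):
    return (baseline_w + n_gpus * gpu_w) / baseline_w

print(f"4 x K80: {power_ratio(400, 4):.1f}x")  # ~4.0x upper bound at full TDP
print(f"2 x K80: {power_ratio(400, 2):.1f}x")  # ~2.5x upper bound at full TDP
```

The measured ratios (2.9X to 3.3X for four boards) fall below this full-TDP upper bound, as expected when the GPUs do not draw their full 300 W throughout the run.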
4.3 Accelerating Molecular Dynamics with NAMD

Figure 10: NAMD performance and acceleration compared to CPU-only (four and two K80 boards)

The advent of hardware accelerators has influenced molecular dynamics by reducing the time to results, providing a tremendous boost in simulation capacity.
Figure 11: NAMD power consumption and relative power consumption compared to CPU-only (four and two K80 boards)

As shown in Figure 11, relative power consumption for the GPU configurations is about 2.1X to 2.3X, yielding accelerations of 4.4X to 7.8X. From a performance-per-watt perspective (a 7.8X acceleration for 2.3X more power), configuration C does best.
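The performance-per-watt comparison is simply acceleration divided by relative power draw. A sketch using the NAMD figures quoted above (the pairing of the low-end speedup with the low-end power ratio is an assumption for illustration):

```python
# Hedged sketch: performance-per-watt gain implied by the NAMD numbers
# above -- acceleration divided by relative power consumption.

def perf_per_watt_gain(speedup, relative_power):
    return speedup / relative_power

print(f"config C: {perf_per_watt_gain(7.8, 2.3):.1f}x")   # ~3.4x
print(f"low end:  {perf_per_watt_gain(4.4, 2.1):.1f}x")   # ~2.1x (assumed pairing)
```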
4.4 Accelerating Molecular Dynamics with LAMMPS

Figure 12: LAMMPS performance and acceleration compared to CPU-only (four and two K80 boards)

In this section, we evaluate the performance of a second common molecular dynamics code, LAMMPS. LAMMPS stands for "Large-scale Atomic/Molecular Massively Parallel Simulator" and is used to model solid-state materials and soft matter. LAMMPS performance is measured in jobs/day; a higher score is better.
Figure 13: LAMMPS power and relative power consumption compared to CPU-only (four and two K80 boards)

Figure 13 shows the power consumption of LAMMPS. The maximum power consumption is in configuration B, but the difference between configurations B and A is small, about 100 watts, implying that the extra CPU in B is not heavily loaded. For LAMMPS, the order of power consumption is B > A >= C > D > E. Overall, the performance of LAMMPS is substantially improved.
4.5 Accelerating Molecular Dynamics with GROMACS

Figure 14: GROMACS performance and acceleration compared to CPU-only (four and two K80 boards)

In this section, we evaluate the performance of a third molecular dynamics application, GROMACS, short for "Groningen Machine for Chemical Simulations." GROMACS is primarily used to simulate biochemical molecules (bonded interactions).
Figure 15: GROMACS power and relative power consumption compared to CPU-only (four and two K80 boards)

Power consumption is another critical factor to consider when using performance-optimized servers as dense as the Dell PowerEdge C4130, with 4 x 300-watt accelerators. Figure 15 shows how much power these platforms consume. For GROMACS, the order of power consumption is B >> A >= C > D > E.
5. Performance improvement compared to the previous-generation PowerEdge C410X solution

The compute power of GPU solutions has increased many times over in recent years. The latest GPU-based PowerEdge C4130 solution offers a substantial performance improvement over the previous PowerEdge C410X solution. In this section, we compare the relative performance of the C4130 solution to the C410X-based solution.
Figure 16: Comparison of HPL performance between C410X and C4130 with four GPU boards

Figure 16 shows the comparison on HPL with four GPU boards. Total peak performance goes from 2.2 TFLOPS to 8.4 TFLOPS, an improvement of 3.8X. The picture for achieved (sustained) performance is more nuanced. The first thing to note is that HPL efficiency has risen from 39.9% to 87.8%, mainly due to code enhancements and the two internal GPUs per K80 board.
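The sustained figures follow directly from the peak and efficiency numbers quoted above, since sustained performance is peak multiplied by HPL efficiency:

```python
# Hedged sketch: sustained HPL performance implied by the peak TFLOPS and
# efficiency figures quoted above (sustained = peak x efficiency).

def sustained_tflops(peak_tflops, efficiency):
    return peak_tflops * efficiency

c410x = sustained_tflops(2.2, 0.399)   # previous-generation C410X solution
c4130 = sustained_tflops(8.4, 0.878)   # C4130 with four K80 boards
print(f"C410X: {c410x:.2f} TFLOPS, C4130: {c4130:.2f} TFLOPS, "
      f"gain: {c4130 / c410x:.1f}x")
```

Because efficiency more than doubled, the sustained-performance gain (roughly 8.4X by this arithmetic) is considerably larger than the 3.8X improvement in peak.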
Figure 18: Comparison of NAMD performance between C410X and C4130 with four GPU boards

Now we consider the relative performance on a molecular dynamics application, NAMD. Figure 18 shows the performance of NAMD with four GPU boards. For NAMD, lower is better, so we see a 3.7X performance improvement on the STMV benchmark. Power consumption is 63% better on the C4130 due to improvements in GPU and system design. The performance per watt improvement can be estimated as 3.
6. Conclusion

The C4130 meets the current challenges of a high-density, accelerator-enabled compute node. Targeted specifically at the HPC market, it offers world-class performance and unique configurability options to fit extreme HPC requirements.
Appendix A: Hardware Configuration of the C4130

Server:           PowerEdge C4130
Processor:        1 or 2 x Intel Xeon E5-2690 v3 @ 2.6 GHz (12 cores)
Memory:           64 GB or 128 GB @ 2133 MHz
GPU:              2 or 4 x NVIDIA K80 (4,992 CUDA cores; base clock 562 MHz; boost clock 875 MHz; 300 W)
Power supply:     2 x 1,600 W
Operating system: RHEL 6.5, kernel 2.6.32-431.el6.
BIOS options:
References

1. For more information on GPUs, visit http://www.nvidia.com/tesla
2. For general information on the Top500 list, visit http://www.top500.org
3. For more information on the most energy-efficient supercomputers, visit http://www.green500.org/
4. For more information on NAMD, visit http://www.ks.uiuc.edu/Research/namd/
5. LAMMPS: http://lammps.sandia.