
Enhanced Molecular Dynamics Performance with K80 GPUs

By: Saeed Iqbal & Nishanth Dandapanthula

Hardware accelerators have shortened the time to results in Molecular Dynamics and thereby greatly increased simulation capacity (e.g., previous NAMD blogs). Over time, applications from several domains, including Molecular Dynamics, have been optimized for GPUs. A comprehensive, although constantly growing, list can be found here. LAMMPS and GROMACS are two open source Molecular Dynamics (MD) applications that can take advantage of these hardware accelerators.

LAMMPS stands for “Large-scale Atomic/Molecular Massively Parallel Simulator” and can be used to model solid state materials and soft matter. GROMACS is short for “GROningen MAchine for Chemical Simulations”. GROMACS is used primarily to simulate biochemical molecules (bonded interactions), but because it calculates non-bonded interactions (between atoms not linked by covalent bonds) efficiently, its user base is expanding to non-biological systems.

NVIDIA’s K80 offers significant improvements over the previous model, the K40. From the HPC perspective, the most important improvement is the 1.87 TFLOPS (double precision) compute capacity, which is about 30% more than the K40. The auto-boost feature in the K80 automatically provides additional performance when power headroom is available. The two internal GPUs are based on the GK210 architecture and have a total of 4,992 cores, a 73% improvement over the K40. The K80 has a total of 24 GB of memory, divided equally between the two internal GPUs; this is 100% more memory capacity than the K40. The memory bandwidth of the K80 is improved to 480 GB/s. The rated power consumption of a single K80 card is a maximum of 300 watts.
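The percentages quoted above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the K80 figures come from the text, while the K40 baseline values (1.43 TFLOPS double precision, 2,880 cores, 12 GB) are assumptions taken from NVIDIA's published K40 specifications and are not stated in this blog. The helper name `percent_gain` is hypothetical.

```python
def percent_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


# K80 figures as quoted in the text above.
K80 = {"dp_tflops": 1.87, "cores": 4992, "memory_gb": 24}

# ASSUMPTION: K40 baseline from NVIDIA's published specifications;
# these numbers do not appear in the blog itself.
K40 = {"dp_tflops": 1.43, "cores": 2880, "memory_gb": 12}

# Percentage gain of the K80 over the K40 for each quoted metric.
gains = {metric: percent_gain(K80[metric], K40[metric]) for metric in K80}

for metric, gain in gains.items():
    print(f"{metric}: {gain:.1f}% over K40")
```

Under these assumed baselines the results line up with the roughly 30%, 73% and 100% improvements cited in the text.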

Dell has introduced a new high density GPU server, the PowerEdge C4130, which offers five configurations, noted here as “A” through “E”. Part of the goal of this blog is to find out which configuration is best suited for LAMMPS and GROMACS. The three quad GPU configurations, “A”, “B” and “C”, are compared with one another. The two dual GPU configurations, “D” and “E”, are also compared for users interested in a lower GPU density of 2 GPUs per rack unit. The first two quad GPU configurations (“A” and “B”) have an internal PCIe switch module which allows seamless peer to peer GPU communication, and we also want to understand the impact of this switch module on LAMMPS and GROMACS. Figure 1 below shows the block diagrams for configurations A to E.
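As a rough aid for keeping the five configurations straight, they can be written down as a small data structure. This is a minimal sketch that encodes only what the text states: the GPU count of each configuration and the presence of the PCIe switch in A and B (and its absence in C). The class and field names are hypothetical, and the switch status of D and E is left as `None` because the text does not state it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class C4130Config:
    """One PowerEdge C4130 configuration (hypothetical record type)."""
    name: str
    gpus: int
    # True/False where the text is explicit, None where it is not stated.
    pcie_switch: Optional[bool]


CONFIGS = [
    C4130Config("A", gpus=4, pcie_switch=True),
    C4130Config("B", gpus=4, pcie_switch=True),
    C4130Config("C", gpus=4, pcie_switch=False),
    C4130Config("D", gpus=2, pcie_switch=None),  # not stated in the text
    C4130Config("E", gpus=2, pcie_switch=None),  # not stated in the text
]


def names_with_gpus(count):
    """Names of the configurations that have exactly `count` GPUs."""
    return [c.name for c in CONFIGS if c.gpus == count]


def switched_names():
    """Names of the configurations stated to have the PCIe switch module."""
    return [c.name for c in CONFIGS if c.pcie_switch is True]


print("Quad GPU:", names_with_gpus(4))
print("Dual GPU:", names_with_gpus(2))
print("Switched:", switched_names())
```

Grouping by GPU count mirrors the comparisons made in this blog: A, B and C against each other, and D against E.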

Combining K80s with the PowerEdge C4130 results in an extraordinarily powerful compute node. The C4130 can be configured with up to four K40 or K80 GPUs in a 1U form factor. The PowerEdge C4130 is also unique in offering several workload specific configurations, potentially making it a better fit for MD codes in general, and for LAMMPS and GROMACS specifically.

Summary of content (5 pages)

PAGE 2
Figure 1: C4130 Configuration Block Diagram

Recently we evaluated the performance of NVIDIA’s Tesla K80 GPUs on Dell’s PowerEdge C4130 server with standard benchmarks and applications (HPL and NAMD).

Performance Evaluation with LAMMPS and GROMACS

In this blog, we quantify the performance of two molecular dynamics applications, LAMMPS and GROMACS, by comparing their performance on K80s with their CPU-only performance.
PAGE 3
Node Interleaving: Disabled
CUDA version and driver: CUDA 6.5 (340.46)
BIOS firmware: 1.1.0
iDRAC firmware: 2.02.01.01
LAMMPS: 1 Feb 2014 stable version using lib/CUDA for GPU acceleration. Benchmark: LJ (128 x 128 x 128)
GROMACS: 4.6.
PAGE 4
Configurations “A”, “B” and “C” are the four GPU configurations. Configuration C performs better than A and B. This can be attributed to the PCIe switch in configurations A and B, which introduces the latency of an extra hop compared to “C”, a more balanced configuration. The two GPU configurations are D and E. Configuration D performs slightly better than E, which could again be attributed to the balanced nature of D. As mentioned previously, LAMMPS is not offset by the extra CPU in D.
PAGE 5
Performance is not the only criterion when a performance optimized server as dense as the Dell PowerEdge C4130, with 4 x 300 watt accelerators, is used. The other dominating factor is how much power these platforms consume. Figure 4 answers questions pertaining to power.

In the case of LAMMPS, the order of power consumption is as follows: B > A >= C > D > E

Configuration B is a switched configuration and has an extra CPU compared to Configuration A.