Dell HPC Omni-Path Fabric: Supported Architecture and Application Study

Deepthi Cherlopalle
Joshua Weage

Dell HPC Engineering
June 2016
Revisions

Date        Description
June 2016   Initial release – v1

THIS WHITE PAPER IS FOR INFORMATIONAL PURPOSES ONLY, AND MAY CONTAIN TYPOGRAPHICAL ERRORS AND TECHNICAL INACCURACIES. THE CONTENT IS PROVIDED AS IS, WITHOUT EXPRESS OR IMPLIED WARRANTIES OF ANY KIND.

Copyright © 2016 Dell Inc. All rights reserved. Dell and the Dell logo are trademarks of Dell Inc. in the United States and/or other jurisdictions. All other marks and names mentioned herein may be trademarks of their respective companies.
Executive Summary

In the world of High Performance Computing (HPC), servers with high-speed interconnects play a key role in the pursuit of exascale performance. Intel® Omni-Path Architecture (OPA) is the latest addition to the interconnect landscape and is part of the Intel® Scalable System Framework. It is based on innovations from Intel® True Scale technology, Cray's Aries interconnect, internal Intel IP, and several other open source platforms.
1 Introduction

The High Performance Computing (HPC) domain primarily deals with problems that surpass the capabilities of a standalone machine. With the advent of parallel programming, applications can scale past a single server. High-performance interconnects provide the low latency and high bandwidth that applications need to divide a computational problem among multiple nodes, distribute data, and then merge the partial results from each node into a final result.
Sending the data

Programmed I/O (PIO): Supports the on-load model. The host uses PIO for small messages, since the CPU can send these faster than the time required to set up an RDMA transfer.

Send DMA (SDMA): For larger messages, the CPU sets up an RDMA send and the 16 SDMA engines in the HFI transfer the data to the receiving host without further CPU intervention.

Receiving the data

Eager receive: The data is delivered to host memory and then copied into the application memory. The message-size thresholds that select between these send paths are tunable through the PSM2 library, as sketched below.
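For illustration only, the following shows how such thresholds are typically adjusted before launching an MPI job. The variable names and default values are assumptions based on the PSM2 user guide and should be verified against the installed PSM2 release; osu_latency is used here simply as a test program.

    # Assumed PSM2 tunables; verify exact names and defaults in the PSM2 documentation
    export PSM2_MQ_RNDV_HFI_THRESH=64000   # eager-to-rendezvous (RDMA) cutoff, in bytes
    export PSM2_MQ_EAGER_SDMA_SZ=8192      # PIO-to-SDMA cutoff for eager sends, in bytes
    mpirun -np 2 -host node001,node002 ./osu_latency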
2 Dell Networking H-Series Fabric

The Dell Networking H-Series Fabric is a comprehensive fabric solution that includes host adapters, edge and director-class switches, cabling, and complete software and management tools.

2.1 Intel® Omni-Path Host Fabric Interface (HFI)

Dell provides support for Intel® Omni-Path HFI 100 series cards [3], which use a PCIe Gen3 x16 interface and are capable of 100 Gbps per port.
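Once the HFI and host software stack are installed, the adapter and its link state can be checked from the host. This is a minimal sketch; the grep pattern depends on how the device reports itself, and opainfo is assumed to be available from the Intel® Omni-Path host tools.

    lspci | grep -i omni   # confirm the HFI adapter is visible on the PCIe bus
    opainfo                # report local port state, LID, and link speed/width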
Figure 2 Dell H-Series Switches

2.2.1 Dell H-Series Edge Switches

Dell H-Series Edge Switches based on the Intel® Omni-Path Architecture consist of two models supporting 100 Gbps on all ports: a 24-port switch targeting entry-level and small clusters, and a 48-port switch that can be combined with other edge switches and director-class switches to build larger clusters.
2.2.2 Dell H-Series Director-Class Switches

Dell H-Series Director-Class Switches based on the Intel® Omni-Path Architecture consist of two models supporting 100 Gbps on all ports: a 192-port switch and a 768-port switch. These switches support HPC clusters of all sizes, from mid-level clusters to supercomputers.
3 Intel® Omni-Path Fabric Software

3.1 Available Installation Packages

The following packages [4] are available for an Intel® Omni-Path Fabric:

3.1.1 Intel® Omni-Path Fabric Host Software – This is the basic installation package. It provides the Intel® Omni-Path Fabric host components needed to set up compute, I/O, and service nodes with drivers, stacks, and basic tools for local configuration and monitoring. This package is usually installed on compute nodes.
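As a minimal sketch, a typical host installation unpacks the basic package and runs the bundled INSTALL script. The package name, version placeholder, and script option below are assumptions that depend on the downloaded release; consult the Intel® Omni-Path Fabric Software installation guide for the exact procedure.

    tar xzf IntelOPA-Basic.RHEL72-x86_64.<version>.tgz
    cd IntelOPA-Basic.RHEL72-x86_64.<version>
    ./INSTALL -a    # assumed option for a default, non-interactive install; reboot or reload the hfi1 driver afterwards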
3.2 Intel® Omni-Path Fabric Manager GUI

The Fabric Manager GUI [5] provides a set of analysis tools for graphically monitoring fabric status and managing fabric health. This package is open source and can be run on a Linux or Windows system with TCP/IP connectivity to the Fabric Manager. To use the Fabric Manager GUI, enable the following switch settings and ensure that the opafm.xml files are identical on all switches.
3.3 Chassis Viewer

Chassis Viewer is a web interface that can be used to manage basic functionality on edge switches, both with and without management cards. The following figure shows the basic layout of Chassis Viewer:

Figure 7 Intel® Omni-Path Chassis Viewer Overview

1. The LED indicators at the center are green when a port is active, and white when the port is in the polling state or no cable is connected.
3.4 OPA FastFabric

FastFabric [6] is a set of fabric management tools used for fabric deployment, switch management, and host management. It is available when the Intel® Omni-Path Fabric Suite is installed on the nodes. The following screen appears when the opafastfabric command is run.

Figure 8
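In addition to the menu-driven opafastfabric interface, the same toolset includes stand-alone command-line utilities. The commands below are a short sketch of typical health checks; the available options should be confirmed against the FastFabric user guide for the installed version.

    opafabricinfo            # summary of SMs, switches, HFIs, and links in the fabric
    opareport -o errors      # report links whose error counters exceed configured thresholds
    opareport -o slowlinks   # report links running below their expected speed or width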
3.5 Fabric Manager

The Fabric Manager provides functions such as:

- Sweeping the fabric to discover topology changes and managing those changes when nodes are added or deleted
- Adaptive routing
- Congestion control
- Subnet administration

The Subnet Administrator (SA) actively engages with the Subnet Manager (SM) to store and retrieve fabric information; the SM/SA is a single unified entity. With the help of SA messages, nodes connected to the fabric can obtain node-to-node path information, fabric topology and configuration, event notifications, and so on.
Figure 9 Starting the Embedded Subnet Manager

Controlling the Subnet Manager using the switch CLI

The default login credentials for the switch are admin/adminpass. From the switch CLI, use the following commands to manage the SM:

smcontrol start
smcontrol stop
smcontrol restart
smcontrol status

3.5.2 Host-Based Fabric Manager

The host-based Fabric Manager can run on a compute node/host. This package is available in the Intel® Omni-Path Fabric Suite.
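As a sketch, on a host with the Fabric Suite installed, the host-based Fabric Manager typically runs as the opafm system service. The service name and configuration file location below are assumptions to verify against the installed release.

    systemctl start opafm    # start the host-based Fabric Manager
    systemctl enable opafm   # start it automatically at boot
    # the FM reads its configuration from opafm.xml (commonly under /etc/sysconfig or /etc/opa-fm, depending on release)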
4 Test bed and configuration

This section describes the server configuration, BIOS options, and application versions used for the application performance study. This study utilized the Zenith cluster located in the Dell HPC Innovation Lab.

Component          Details
Server             32 x PowerEdge R630
Processor          Intel® Xeon® CPU E5-2697 v3 @ 2.60GHz, 14 cores per processor,
                   Processor Base Freq: 2.6 GHz, AVX Base Freq: 2.2 GHz
Memory             8 x 8 GB @ 2133 MHz
Operating System   Red Hat Enterprise Linux Server release 7.
Application              Version     MPI                    Benchmark / dataset
OSU Micro-Benchmarks     4.4.1       OpenMPI-hfi-1.10       osu_latency, osu_bw, osu_bibw
NAMD                     2.11        Intel MPI 5.1.3        Apoa1, F1atpase, Stmv
WRF                      3.8         Intel MPI 5.1.3        Conus 2.5km
ANSYS® Fluent®           17.0        Platform MPI 9.1.3.1   eddy_417k, pump_2m, aircraft_wing_2m, ice_2m, fluidized_bed_2m, rotor_3m, sedan_4m, oil_rig_7m, combustor_12m, truck_poly_14m, aircraft_wing_14m, landing_gear_15m, lm6000_16m
CD-adapco® STAR-CCM+®    11.02.010   Platform MPI 9.1.4     EglinStoreSeparation, KcsWithPhysics, TurboCharger, Reactor_9m
5 Performance Benchmarking Results

5.1 Latency

OSU Micro-Benchmarks were used to determine latency. These latency tests were run in a ping-pong fashion. HPC applications need low latency and high throughput. As seen in the figure below, the back-to-back latency is 0.77 µs and the switch latency is 0.9 µs, which is on par with industry standards.

Figure 10 OSU_Latency: time (µs) versus message size (bytes) for back-to-back and switched configurations
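As a sketch, these tests are typically launched with two ranks, one per node; the hostnames are placeholders and the launch syntax assumes the OpenMPI-hfi build listed in the test bed table. The bandwidth tests in the next section (osu_bw and osu_bibw) are launched the same way.

    mpirun -np 2 -host node001,node002 ./osu_latency   # ping-pong latency between two hosts
    mpirun -np 2 -host node001,node002 ./osu_bw        # uni-directional streaming bandwidth
    mpirun -np 2 -host node001,node002 ./osu_bibw      # bi-directional streaming bandwidth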
5.2 Bandwidth

Figure 11 OSU Bandwidth: bandwidth (GB/s) versus message size (bytes), uni-directional and bi-directional, based on the Intel® Xeon® CPU E5-2697 v4 processor. At large message sizes the bandwidth reaches approximately 12.3 GB/s uni-directional and 24.5 GB/s bi-directional.

5.3 Weather Research Forecast

The Weather Research and Forecasting Model [8] is a weather prediction system designed for atmospheric research and operational forecasting needs.
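As a sketch, a distributed WRF benchmark run with the Intel MPI build listed earlier can be launched as follows; the rank counts, host file, and use of the Conus 2.5 km input are placeholders, and domain decomposition settings are omitted.

    # 16 nodes x 28 ranks per node = 448 MPI ranks, run from the directory containing the Conus 2.5km input files
    mpirun -np 448 -ppn 28 -f ./hosts ./wrf.exe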
5.4 NAMD

NAMD [9] is a parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems. Three proteins, ApoA1 (92,224 atoms), F1ATPase (327,506 atoms), and STMV (1,066,628 atoms), are used in this study because of their relatively large problem sizes. Figure 13 illustrates the performance of NAMD for the three datasets apoa1, f1atpase, and stmv. All datasets show results for node counts from 1 to 32.
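As a sketch, an MPI build of NAMD is typically launched as below with the Intel MPI runtime; the rank count, host file, and input path are placeholders for the apoa1 benchmark.

    # 32 nodes x 28 ranks per node = 896 MPI ranks on the apoa1 input
    mpirun -np 896 -f ./hosts ./namd2 apoa1/apoa1.namd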
5.5 ANSYS® Fluent®

Figure 14 ANSYS® Fluent® Relative Performance Graph (1/2): performance relative to 28 cores (1 node), for 28 to 896 cores (1 to 32 nodes), covering the aircraft_wing_14m, combustor_12m, fluidized_bed_2m, landing_gear_15m, lm6000_16m, pump_2m, rotor_3m, sedan_4m, and truck_poly_14m cases.

Figure 15 ANSYS® Fluent® Relative Performance Graph (2/2): performance relative to 28 cores (1 node) for the remaining benchmark cases.
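The following is a sketch of how a distributed Fluent benchmark run is typically launched; the journal file, case, host file, and core count are placeholders, and the interconnect and MPI selection options should be checked against the ANSYS Fluent 17.0 documentation for Omni-Path and Platform MPI.

    # 3-D double-precision solver on 448 cores, driven by a batch journal file
    fluent 3ddp -t448 -cnf=./hosts -g -i truck_poly_14m.jou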
5.6 CD-adapco® STAR-CCM+®

At the time of publication of this whitepaper, Intel Omni-Path was not officially supported by CD-adapco® STAR-CCM+®. In order to obtain preliminary performance data for this application, the MPI software was modified to use the appropriate Intel® Omni-Path library. Multiple cases from the STAR-CCM+ benchmark suite were tested on the lab test system. The relative performance of eight benchmark cases is presented in this section.
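The following is a sketch of how a distributed STAR-CCM+ benchmark is typically launched; the simulation file, host file, and worker count are placeholders, and the MPI and fabric selection flags used for this preliminary Omni-Path testing are not reproduced here.

    # 448 workers across 16 nodes, batch run of a benchmark simulation file
    starccm+ -np 448 -machinefile ./hosts -batch run Reactor_9m.sim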
Figure 17 STAR-CCM+® Relative Performance (2/2): performance relative to 28 cores (1 node), for 28 to 896 cores (1 to 32 nodes), covering the EglinStoreSeparation, KcsWithPhysics, and TurboCharger cases.
6 Conclusion and Future Work

The Intel® Omni-Path Architecture is a new option for low-latency, high-bandwidth cluster fabrics. The micro-benchmark results show that OPA is a strong candidate for HPC workloads, and good application scalability is demonstrated with NAMD, WRF, STAR-CCM+, and ANSYS Fluent. Users of Intel® Omni-Path also benefit from freely available fabric monitoring and management tools such as the Fabric Manager GUI, Chassis Viewer, and FastFabric.
7 References

[1] [Online]. Available: http://www.intel.com/content/dam/www/public/us/en/documents/product-briefs/transforming-economics-hpc-fabrics-opa-brief.pdf
[2] [Online]. Available: http://www.intel.com/content/www/us/en/high-performance-computing-fabrics/omni-path-architecture-fabric-overview.html
[3] [Online]. Available: http://www.intel.com/content/www/us/en/high-performance-computing-fabrics/omni-path-host-fabric-interface.html
[4] [Online]. Available: http://www.intel.