Technical White Paper

Dell EMC Ready Solution for HPC Digital Manufacturing—ANSYS® Performance

Abstract

This Dell EMC technical white paper discusses performance benchmarking results and analysis for ANSYS® CFX®, Fluent®, and Mechanical™ on the Dell EMC Ready Solution for HPC Digital Manufacturing.
Revisions

Date           Description
January 2018   Initial release with Intel® Xeon® Scalable processors (code name Skylake)
February 2019  Added Basic Building Block
April 2019     Revised with 2nd generation Intel Xeon Scalable processors (code name Cascade Lake)

Acknowledgements

This paper was produced by the following authors: Joshua Weage and Martin Feyereisen.

The information in this publication is provided “as is.” Dell Inc.
1 Introduction

This technical white paper discusses the performance of ANSYS® CFX®, Fluent®, and Mechanical™ on the Dell EMC Ready Solution for HPC Digital Manufacturing. This Ready Solution was designed and configured specifically for Digital Manufacturing workloads, where Computer Aided Engineering (CAE) applications are critical for virtual product development.
2 System Building Blocks

The Dell EMC Ready Solution for HPC Digital Manufacturing is designed using preconfigured building blocks. The building block architecture allows an HPC system to be optimally designed for specific end-user requirements, while still making use of standardized, domain-specific system recommendations. The available building blocks are infrastructure servers, storage, networking, and compute building blocks.
The recommended base configuration for an infrastructure server is:

• Dell EMC PowerEdge R640 server
• Dual Intel® Xeon® Bronze 3106 processors
• 192 GB of RAM (12 x 16GB 2667 MTps DIMMs)
• PERC H330 RAID controller
• 2 x 480GB Mixed-Use SATA SSD RAID 1
• Dell EMC iDRAC9 Enterprise
• 2 x 750 W power supply units (PSUs)
• Mellanox EDR InfiniBand™ (optional)
Table 1 Recommended Configurations for the Compute Building Block

Platforms: …

Processors:
• Dual Intel Xeon Gold 6242 (16 cores per socket)
• Dual Intel Xeon Gold 6248 (20 cores per socket)
• Dual Intel Xeon Gold 6252 (24 cores per socket)

Memory Options:
• 192 GB (12 x 16GB 2933 MTps DIMMs)
• 384 GB (12 x 32GB 2933 MTps DIMMs)
• 768 GB (24 x 32GB 2933 MTps DIMMs, R640 only)

Storage Options:
• PERC H330, H730P, or H740P RAID controller
• 2 x 480GB Mixed-Use SATA SSD RAID 0
• 4 x 480GB Mixed-Use SATA SSD …
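Memory capacity per core is a common sizing metric when pairing the processor and memory options above. The short sketch below derives it from the core counts and memory sizes listed in Table 1; the pairings are illustrative only, since any processor option can be combined with any memory option.

```python
# Memory per core for the CBB options in Table 1.
# Core counts: dual-socket totals for each processor option.
processors = {"Gold 6242": 2 * 16, "Gold 6248": 2 * 20, "Gold 6252": 2 * 24}
memory_gb = [192, 384, 768]  # memory options from Table 1

for name, cores in processors.items():
    for mem in memory_gb:
        print(f"{name}: {mem} GB / {cores} cores = {mem / cores:.1f} GB/core")
```

For example, the 48-core Gold 6252 configuration with 192 GB yields 4 GB/core, while the 32-core Gold 6242 with 768 GB yields 24 GB/core.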
SSDs in RAID 0 are used to provide fast local I/O. The compute nodes do not normally require extensive out-of-band (OOB) management capabilities; therefore, an iDRAC9 Express is recommended. Additionally, if more compute capability is required for each simulation run, two BBBs can be directly coupled together with a high-speed network cable, such as InfiniBand or Ethernet, without the need for an additional high-speed switch (HPC Couplet).
Experience with customers indicates that there is no “one size fits all” operational and archival storage solution. Many customers rely on their corporate enterprise storage for archival purposes and instantiate a high-performance operational storage system dedicated to their HPC environment. Operational storage is typically sized based on the number of expected users. For fewer than 30 users, a single storage server, such as the Dell PowerEdge R740xd, is often an appropriate choice.
For customers desiring a shared high-performance parallel file system, the Dell EMC Ready Solution for HPC Lustre Storage shown in Figure 3 is appropriate. This solution can scale up to multiple petabytes of storage.

Figure 3 Dell EMC Ready Solution for Lustre Storage Reference Architecture

2.5 System Networks

Most HPC systems are configured with two networks—an administration network and a high-speed/low-latency switched fabric.
2.7 Services and Support

The Dell EMC Ready Solution for HPC Digital Manufacturing is available with full hardware support and deployment services, including additional HPC system support options.
3 Reference System

The reference system was assembled in the Dell EMC HPC and AI Innovation Lab using the building blocks described in section 2. The building blocks used for the reference system are listed in Table 2.
The software versions used for the reference system are listed in Table 4.

Table 4 Software Versions

Component               Version
Operating System        RHEL 7.6; Windows Server 2016 (BBB)
Kernel                  3.10.0-957.el7.x86_64
OFED                    Mellanox 4.5-1.0.1.0
Bright Cluster Manager  8.2
ANSYS CFX               2019R1 (19.2 for Windows)
ANSYS Fluent            2019R1 (19.2 for Windows)
ANSYS Mechanical        2019R1 (19.2 for Windows)
4 ANSYS CFX Performance

ANSYS CFX is a Computational Fluid Dynamics (CFD) application recognized for its accuracy, robustness, and speed with rotating machinery applications. CFD applications typically scale well across multiple processor cores and servers, have modest memory capacity requirements, and perform minimal disk I/O while in the solver section. However, some simulations, such as large transient analyses, may have greater I/O demands.
Figure 5 ANSYS CFX Parallel Scaling—Intel Xeon Gold 6252 (performance relative to 48 cores for the Airfoil_10M, Airfoil_50M, Airfoil_100M, LeMans, and Pump benchmark models)

Figure 5 presents the parallel scalability when running CFX using up to eight CBB nodes configured with Intel Xeon Gold 6252 processors.
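Relative-performance data of the kind plotted in Figure 5 can be converted into parallel efficiency, which makes it easier to judge how well a model scales. The sketch below shows the arithmetic; the speedup values used here are hypothetical placeholders, not measurements from Figure 5.

```python
# Convert speedup (normalized to one 48-core node) into parallel
# efficiency relative to the single-node baseline.
def efficiency(speedup, nodes):
    """Parallel efficiency: ideal scaling gives 1.0 (100%)."""
    return speedup / nodes

# Hypothetical example data: {node count: relative speedup}.
hypothetical = {1: 1.0, 2: 1.9, 4: 3.7, 6: 5.3, 8: 6.8}
for nodes, speedup in hypothetical.items():
    print(f"{nodes} node(s): speedup {speedup:.1f}, "
          f"efficiency {efficiency(speedup, nodes):.0%}")
```

An efficiency near 100% indicates the model has enough cells per core to keep communication overhead small; efficiency typically falls as a fixed-size model is spread over more nodes.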
5 ANSYS Fluent Performance

ANSYS Fluent is a Computational Fluid Dynamics (CFD) application commonly used across a very wide range of CFD and multi-physics applications. CFD applications typically scale well across multiple processor cores and servers, have modest memory capacity requirements, and perform minimal disk I/O while in the solver section. However, some simulations, such as large transient analyses, may have greater I/O demands.
ANSYS Fluent Parallel Scaling—Intel Xeon Gold 6252 (performance relative to 48 cores)
6 ANSYS Mechanical Performance

ANSYS Mechanical is a multi-physics Finite Element Analysis (FEA) application commonly used in multiple engineering disciplines. Depending on the specific problem type, FEA codes may or may not scale well across multiple processor cores and servers. Implicit FEA problems often place large demands on the memory and disk I/O subsystems, particularly for out-of-core solutions, where the problem is too large to fit into the available system RAM.
The performance of the ten standard ANSYS Mechanical v19 benchmark cases was evaluated on the reference system. All ten benchmark cases run in-core with the 192 GB of RAM available per compute node on the reference system, so the local disk configuration has minimal performance impact on the standard benchmarks. Two types of solvers are available with ANSYS Mechanical: Distributed Memory Parallel (DMP) and Shared Memory Parallel (SMP).
Figure 9 Single Server Relative Performance—ANSYS Mechanical 2019R1 (V19sp-1 through V19sp-5; performance relative to the Gold 6242 for the E5-2697A v4, Gold 6142, Gold 6242, Gold 6248, and Gold 6252)

Figure 10 shows the scaling behavior of the ANSYS Mechanical benchmarks on a single server.
increases for most of the benchmark models. This data shows that using 32 cores per node works well for the standard benchmark models. Performance results for the ANSYS Mechanical solver using multiple servers are shown in Figure 11. The results are plotted relative to the benchmark performance when using a single server.
Table 7 PowerEdge C4140 System Configuration

Platform      Dell EMC PowerEdge C4140
Processor     Dual Intel Xeon Gold 6148
Memory        12 x 16GB 2666 MTps DIMMs (192 GB)
GPU           4 x NVIDIA® Tesla® V100 32GB SXM2
OS            Red Hat Enterprise Linux Server 7.4
CUDA Toolkit  9.2.88

The base software license for ANSYS Mechanical allows using up to four CPU cores and/or GPUs. For example, 4 CPU cores or 2 CPU cores plus 2 GPUs would each consume one software license.
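The base-license rule stated above (up to four CPU cores and/or GPUs combined per license) can be sketched as a small calculation. How resource counts above four map to additional licenses is an assumption made here purely for illustration; actual ANSYS license counting for larger runs may differ.

```python
import math

# Sketch of the stated rule: one base license covers up to four
# CPU cores and/or GPUs combined.  Rounding up beyond four units
# is an assumption for illustration, not ANSYS's documented policy.
def base_licenses(cpu_cores, gpus, units_per_license=4):
    return math.ceil((cpu_cores + gpus) / units_per_license)

print(base_licenses(4, 0))  # 4 CPU cores -> 1 license
print(base_licenses(2, 2))  # 2 cores + 2 GPUs -> 1 license
```

Both examples from the text above resolve to a single base license, since each uses four total compute resources.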
Server Performance with NVIDIA Tesla V100 GPUs—ANSYS Mechanical 2019R1 (performance relative to CPU only)
7 Basic Building Block Performance

The performance of systems created with Basic Building Blocks (BBBs) was tested for ANSYS CFX, Fluent, and Mechanical. Performance was measured on both a single BBB and a BBB couplet composed of two BBBs with a direct high-speed network connection. Testing was performed using both RHEL 7 and Windows 2016 Enterprise Edition.
The results using Windows are not quite as good as with Linux. On average, the performance of a single Windows-based BBB is about 1.5X that of the Linux-based CBB, and the Windows 25GbE-based couplet delivers about 2.5X the baseline CBB performance. However, for customers comfortable with managing Windows systems, this offers significant performance potential over most existing Windows-based workstations.
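The two relative figures quoted above also imply how well the Windows couplet scales from one node to two. The sketch below derives that node-to-node speedup and efficiency from the stated averages (1.5X for one BBB, 2.5X for the couplet, both relative to the CBB baseline).

```python
# Averages quoted above, both relative to the Linux CBB baseline.
single_bbb = 1.5  # one Windows-based BBB
couplet = 2.5     # two-BBB Windows couplet over 25GbE

speedup = couplet / single_bbb  # couplet vs. a single BBB
scaling_eff = speedup / 2       # two nodes in the couplet

print(f"couplet speedup over one BBB: {speedup:.2f}x "
      f"(scaling efficiency {scaling_eff:.0%})")
```

That is, the couplet runs roughly 1.67X faster than a single BBB, or about 83% scaling efficiency across the direct 25GbE link.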
Figure 16 BBB Performance with ANSYS Mechanical (V19cg-1, V19cg-2, V19cg-3, V19ln-1, V19ln-2, and V19sp-1 through V19sp-5; Linux and Windows at 32 and 64 cores)

Unlike the Fluent and CFX performance figures, the ANSYS Mechanical performance figure shows the performance of only a single BBB server.
8 Conclusion

This technical white paper presents the Dell EMC Ready Solution for HPC Digital Manufacturing. The detailed analysis of the building block configurations demonstrates that the system is architected for a specific purpose: to provide a comprehensive HPC solution for the manufacturing domain. The building block approach allows customers to easily deploy an HPC system optimized for their specific workload requirements.