Technical White Paper

Dell EMC Ready Solution for HPC Digital Manufacturing — Altair Performance

Abstract
This Dell EMC technical white paper discusses performance benchmarking results and analysis for Altair HyperWorks™ on the Dell EMC Ready Solution for HPC Digital Manufacturing.
Revisions
Date        Description
April 2019  Initial release

Acknowledgements
This paper was produced by the following authors: Joshua Weage and Martin Feyereisen.

The information in this publication is provided “as is.” Dell Inc. makes no representations or warranties of any kind with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose.
Table of contents
Revisions
Acknowledgements
Table of contents
1 Introduction
2 System Building Blocks
3 Reference System
4 Altair AcuSolve Performance
5 Altair Radioss Performance
6 Altair OptiStruct Performance
7 Altair Feko Performance
8 Basic Building Block Performance
9 Conclusion
1 Introduction
This technical white paper discusses the performance of various Altair HyperWorks™ products, including Altair OptiStruct™, Altair Radioss™, Altair AcuSolve™, and Altair Feko™, on the Dell EMC Ready Solution for HPC Digital Manufacturing, with workload management for the benchmarks handled by Altair PBS Professional™.
2 System Building Blocks
The Dell EMC Ready Solution for HPC Digital Manufacturing is designed using preconfigured building blocks. The building block architecture allows an HPC system to be optimally designed for specific end-user requirements, while still making use of standardized, domain-specific system recommendations. The available building blocks are infrastructure servers, storage, networking, and compute building blocks.
A recommended base configuration for infrastructure servers is:
• Dell EMC PowerEdge R640 server
• Dual Intel® Xeon® Bronze 3106 processors
• 192 GB of RAM (12 x 16GB 2667 MT/s DIMMs)
• PERC H330 RAID controller
• 2 x 480GB Mixed-Use SATA SSD in RAID 1
• Dell EMC iDRAC9 Enterprise
• 2 x 750 W power supply units (PSUs)
• Mellanox EDR InfiniBand™ (optional)
Table 1  Recommended Configurations for the Compute Building Block

Processors:
• Dual Intel Xeon Gold 6242 (16 cores per socket)
• Dual Intel Xeon Gold 6248 (20 cores per socket)
• Dual Intel Xeon Gold 6252 (24 cores per socket)

Memory Options:
• 192 GB (12 x 16GB 2933 MT/s DIMMs)
• 384 GB (12 x 32GB 2933 MT/s DIMMs)
• 768 GB (24 x 32GB 2933 MT/s DIMMs, R640 only)

Storage Options:
• PERC H330, H730P or H740P RAID controller
• 2 x 480GB Mixed-Use SATA SSD in RAID 0
• 4 x 480GB Mixed-Use SATA SSD
SSDs in RAID 0 are used to provide fast local I/O. The compute nodes do not normally require extensive out-of-band (OOB) management capabilities; therefore, an iDRAC9 Express is recommended. Additionally, if more compute capability is required for each simulation run, two BBBs can be coupled directly together with a high-speed network cable, such as InfiniBand or Ethernet, without the need for an additional high-speed switch (an HPC Couplet).
• Dell EMC PowerEdge R740xd server
• Dual Intel® Xeon® Bronze 4110 processors
• 96 GB of memory (12 x 8GB 2667 MT/s DIMMs)
• PERC H730P RAID controller
• 2 x 250GB Mixed-Use SATA SSD in RAID 1 (for OS)
• 12 x 12TB 3.5" NLSAS HDDs in RAID 6 (for data)
• Dell EMC iDRAC9 Express
• 2 x 750 W power supply units (PSUs)
• Mellanox EDR InfiniBand adapter
• Site-specific high-speed Ethernet adapter (optional)
This server configuration provides 144TB of raw storage.
Figure 3  Dell EMC Ready Solution for Lustre Storage Reference Architecture

2.5 System Networks
Most HPC systems are configured with two networks—an administration network and a high-speed/low-latency switched fabric. The administration network is typically Gigabit Ethernet connected to the onboard LOM/NDC of every server in the cluster. This network is used for provisioning, management and administration.
2.8 Workload Management
Workload management and job scheduling on the Dell EMC Ready Solution for HPC Digital Manufacturing can be handled efficiently with Altair PBS Professional, part of the Altair PBS Works™ suite. PBS Professional features include policy-based scheduling, OS provisioning, shrink-to-fit jobs, preemption, and failover. Its topology-aware scheduling optimizes task placement, improving application performance and reducing network contention.
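As an illustration of how a solver run might be submitted through PBS Professional on such a cluster, the sketch below generates a simple submission script. The queue name ("workq") and the solver command line are hypothetical examples, not taken from this study; the #PBS directives use standard PBS Professional resource-selection syntax (select/ncpus/mpiprocs).

```python
# Sketch: generate a PBS Professional submission script for a solver run.
# The queue name and the solver command line are hypothetical; the #PBS
# resource-selection directives follow standard PBS Professional syntax.

def pbs_script(jobname, nodes, cores_per_node, walltime, command):
    """Build a job script requesting whole nodes, one MPI rank per core."""
    return "\n".join([
        "#!/bin/bash",
        f"#PBS -N {jobname}",
        "#PBS -q workq",  # hypothetical queue name
        f"#PBS -l select={nodes}:ncpus={cores_per_node}:mpiprocs={cores_per_node}",
        f"#PBS -l walltime={walltime}",
        "#PBS -l place=scatter",  # place one chunk per physical node
        "cd $PBS_O_WORKDIR",
        command,
    ])

script = pbs_script("acusolve_riser", nodes=4, cores_per_node=48,
                    walltime="04:00:00",
                    command="acuRun -np 192 -inp riser.inp")  # hypothetical invocation
print(script)
```

The generated script would be submitted with qsub; PBS Professional's topology-aware scheduler then decides the actual node placement.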
3 Reference System
The reference system was assembled in the Dell EMC HPC and AI Innovation Lab using the building blocks described in section 2. The building blocks used for the reference system are listed in Table 2.
Table 4  Software Versions

Component                Version
Operating System         RHEL 7.6; Windows Server 2016 (BBB)
Kernel                   3.10.0-957.el7.x86_64
OFED                     Mellanox 4.5-1.0.1.0
Bright Cluster Manager   8.2
Altair PBS Professional  18.1.2
Altair OptiStruct        2017.2.3
Altair Radioss           2017.2.3
Altair AcuSolve          2017.
4 Altair AcuSolve Performance
Altair AcuSolve is a Computational Fluid Dynamics (CFD) tool commonly used across a very wide range of CFD and multi-physics applications. AcuSolve is a robust solver with proprietary numerical methods that yield stable simulations and accurate results regardless of the quality and topology of mesh elements.
Figure 5: AcuSolve Parallel Scaling (Riser, Windmill and Nozzle models; performance relative to 48 cores, from 48 cores/1 node to 384 cores/8 nodes)

These benchmarks were carried out on a cluster of eight servers, each with dual Intel Xeon Gold 6252 processors. The results are presented as performance relative to the single-node results.
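The relative-performance values plotted in these scaling figures are simple ratios of solver wall-clock times. A minimal sketch of the calculation, using made-up timings rather than measured data from this study:

```python
# Convert raw solver wall-clock times into the relative-performance and
# parallel-efficiency values used in the scaling figures.
# The timings below are illustrative, not measured values from this paper.

def relative_performance(times_by_nodes, base_nodes=1):
    """Speedup of each node count relative to the baseline node count."""
    base = times_by_nodes[base_nodes]
    return {n: base / t for n, t in times_by_nodes.items()}

def parallel_efficiency(times_by_nodes, base_nodes=1):
    """Fraction of ideal linear speedup achieved at each node count."""
    rel = relative_performance(times_by_nodes, base_nodes)
    return {n: rel[n] / (n / base_nodes) for n in times_by_nodes}

times = {1: 1000.0, 2: 520.0, 4: 280.0, 8: 160.0}  # seconds, illustrative
print(relative_performance(times))  # 1-node = 1.0, 8-node = 6.25
print(parallel_efficiency(times))   # 8-node efficiency = 0.78125
```

A model that fits into cache at higher node counts (like Riser) can show superlinear values here, which is why some curves in the figures exceed the ideal slope.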
Figure 6: AcuSolve Hybrid Parallel Scaling on EBB (Riser (R), Windmill (W) and Nozzle (N) models with 1, 2, 4 and 8 SMP threads; performance relative to 32 cores, from 48 cores/1 node to 384 cores/8 nodes)

Again, the Riser (R) model shows better overall parallel scaling than the larger Windmill (W) and Nozzle (N) models, primarily due to cache effects. All models display similar behavior when the number of shared-memory threads is varied.
5 Altair Radioss Performance
Altair Radioss is a leading structural analysis solver for highly non-linear problems under dynamic loading. It is used across all industries worldwide to improve the crashworthiness, safety, and manufacturability of structural designs. Radioss is similar to AcuSolve in that it typically scales well across multiple processor cores and servers, has modest memory capacity requirements, and performs minimal disk I/O during the solution phase.
Figure 8 presents Radioss parallel performance for benchmarks run on up to eight servers.

Figure 8: Radioss Parallel Performance (Neon and Taurus models on 6142- and 6252-based servers, 1 to 8 nodes; performance relative to a single 6142 node)

The figure uses a performance reference of 1.0 for a single node equipped with the Intel 6142.
Here, the results are significantly different from those obtained with the hybrid parallel version of AcuSolve. For the larger Taurus (T) model, the best performance was always obtained using a single shared-memory thread (equivalent to non-hybrid, pure distributed-memory MPI). For the smaller Neon model, there was a small benefit to using more than one thread on four or more nodes.
6 Altair OptiStruct Performance
Altair OptiStruct is a multi-physics Finite Element Analysis (FEA) solver commonly used in multiple engineering disciplines. Based on finite-element and multi-body dynamics technology, and through advanced analysis and optimization algorithms, OptiStruct helps designers and engineers rapidly develop innovative, lightweight and structurally efficient designs.
benchmark model and server. Typically, OptiStruct scales better using more DMP partitions per node with fewer SMP threads each, as compared with fewer DMP partitions and more SMP threads per partition. However, memory and I/O requirements typically grow with the number of DMP partitions, so it is possible to run out of memory or create I/O bottlenecks with too many DMP partitions per node.
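The DMP-versus-SMP trade-off can be made concrete by enumerating the hybrid launch combinations that exactly fill one node. A sketch for a 48-core node (dual Xeon Gold 6252, 2 x 24 cores); the layouts are generic arithmetic, not OptiStruct-specific launch syntax:

```python
# Enumerate hybrid DMP x SMP combinations that exactly fill a compute node:
# each layout satisfies dmp_partitions * smp_threads == cores_per_node.
# Per the text above, memory and I/O demands grow with the DMP count, so
# the largest-DMP layouts may be infeasible on memory-limited nodes.

def hybrid_layouts(cores_per_node):
    """Return all (dmp_partitions, smp_threads) pairs that fill the node."""
    return [(dmp, cores_per_node // dmp)
            for dmp in range(1, cores_per_node + 1)
            if cores_per_node % dmp == 0]

for dmp, smp in hybrid_layouts(48):  # dual Xeon Gold 6252: 2 x 24 cores
    print(f"{dmp:2d} DMP partitions x {smp:2d} SMP threads")
```

For a 48-core node this yields ten candidate layouts, from 1 x 48 to 48 x 1; the benchmarking in this section is effectively a search over this space for each model.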
Some benchmarks may perform well without such a high-performance network, but such a solution would not be considered robust for a variety of workloads. Figure 12 presents the relative performance for OptiStruct with the Engine model using the DMP solver across multiple servers.

Figure 12: OptiStruct DMP scalability across nodes (performance relative to 6142)
7 Altair Feko Performance
Altair Feko is a comprehensive computational electromagnetics (CEM) solution used widely in the telecommunications, automotive, aerospace and defense industries. Feko offers several frequency and time domain EM solvers under a single license. Hybridization of these methods enables the efficient analysis of a broad spectrum of EM problems, including the analysis of antennas, microstrip circuits, RF components and biomedical systems.
Figure 14: Feko Sedan (MoM) Parallel Scaling (700 MHz, 1.1 GHz and 1.6 GHz models on 6142- and 6252-based servers, 1 to 8 nodes; performance relative to a single 6142 node)

All the models show good scaling when running a single job on up to 8 nodes. It should be noted that Feko can efficiently distribute the problem datasets across nodes for larger direct full-wave problems, such as the larger sedan example solved at 1.6 GHz, which could not fit within a single 192GB node.
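Why a large MoM model cannot fit in a single 192GB node follows from the storage cost of the dense impedance matrix: N unknowns stored in complex double precision require 16 x N² bytes. A sketch with illustrative unknown counts (not the actual sizes of the sedan models in this study):

```python
# Estimate dense MoM impedance-matrix memory: N unknowns stored as
# complex double precision (16 bytes per entry) need 16 * N**2 bytes.
# The unknown counts below are illustrative, not actual model sizes.

def mom_matrix_gib(n_unknowns):
    """Memory (GiB) for the dense N x N complex-double impedance matrix."""
    return 16 * n_unknowns ** 2 / 2 ** 30

for n in (50_000, 100_000, 200_000):
    print(f"{n:>7,} unknowns -> {mom_matrix_gib(n):8.1f} GiB")
```

Around 100,000 unknowns the matrix alone (roughly 149 GiB) approaches a 192GB node's capacity, and doubling the unknown count quadruples the matrix, so larger models must spread the matrix across the memory of several nodes as Feko does.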
Both simulations showed a significant performance improvement up to all 32 cores on the system.
8 Basic Building Block Performance
We tested the performance of systems created with Basic Building Blocks (BBBs) for Altair AcuSolve, Radioss, and OptiStruct, on both a single BBB and a BBB couplet composed of two BBBs with a direct high-speed network connection, under both RHEL Linux and Windows Server 2016 Enterprise Edition.
Figure 17: BBB performance for AcuSolve (Riser and Windmill models; performance relative to EBB for Linux and Windows on one and two nodes)

Here, the two-node results for the Riser benchmark are artificially high because this model fits into cache on two BBBs but not on the reference 14G EBB server. This effect can also be seen in the parallel speedup of this model on a system of EBBs in Figure 4.
9 Conclusion
This technical white paper presents the Dell EMC Ready Solution for HPC Digital Manufacturing. The detailed analysis of the building block configurations demonstrates that the system is architected for a specific purpose: to provide a comprehensive HPC solution for the manufacturing domain. This building block approach allows customers to easily deploy an HPC system optimized for their specific workload requirements.