Reference Guide

Altair AcuSolve Performance
Dell EMC Ready Solutions for HPC Digital Manufacturing with AMD EPYC™ Processors
Altair Performance
These benchmarks were carried out on a cluster of eight servers, each with dual AMD EPYC 7452 processors. The
results are presented as performance relative to the single-node result. At first glance the results for the
“Riser” model appear surprising: performance increases by more than a factor of two going from one to two
nodes. This behavior can be explained by “cache effects”: as the data set is distributed among a greater
number of nodes, there can be a point where the entire problem fits into cache, and the speed of the solver
increases dramatically. Such cache effects are highly problem specific. In general, distributed memory
parallelism involves a tradeoff: cache performance typically improves as the problem is distributed across
more nodes, but communication overhead also increases, counteracting the caching benefit. The largest model,
“Nozzle”, displays nearly linear parallel scaling up to 8 nodes. The other models show limited performance
improvement at 4 nodes or above (256 cores and greater).
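The relative performance metric described above, and the superlinear behavior attributed to cache effects, can be illustrated with a short sketch. The function names and the elapsed times below are hypothetical placeholders chosen only to show the arithmetic; they are not measurements from this study.

```python
def relative_performance(elapsed_by_nodes):
    """Return {node_count: speedup} relative to the single-node run.

    elapsed_by_nodes maps node count -> elapsed solver time (any unit).
    A single-node entry (key 1) is required as the baseline.
    """
    baseline = elapsed_by_nodes[1]
    return {n: baseline / t for n, t in sorted(elapsed_by_nodes.items())}


def parallel_efficiency(elapsed_by_nodes):
    """Return {node_count: speedup / node_count}.

    A value of 1.0 is ideal linear scaling; a value above 1.0 is
    superlinear scaling, which is what a cache effect can produce.
    """
    speedups = relative_performance(elapsed_by_nodes)
    return {n: s / n for n, s in speedups.items()}


if __name__ == "__main__":
    # Hypothetical elapsed times in seconds (NOT data from this guide).
    example_times = {1: 1000.0, 2: 400.0, 4: 250.0, 8: 200.0}

    speedups = relative_performance(example_times)
    efficiencies = parallel_efficiency(example_times)
    for nodes in speedups:
        note = "superlinear" if efficiencies[nodes] > 1.0 else ""
        print(f"{nodes} node(s): {speedups[nodes]:.2f}x, "
              f"efficiency {efficiencies[nodes]:.2f} {note}")
```

In this made-up example the two-node run is 2.5 times faster than the single-node run, an efficiency above 1.0, while the gains flatten at higher node counts as communication overhead grows.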
AcuSolve is a hybrid parallel application: it can use shared memory parallelism within a node and
distributed memory parallelism both within a node and across nodes. Finding the proper balance between
shared memory and distributed memory parallelism within a node can be daunting. Figure 6 shows the parallel
performance for these models on the cluster used for Figure 5, where the number of shared memory parallel
threads is varied from 1 to 8 threads per domain. The number of domains per server is the total number of
cores per server divided by the number of threads per domain.
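The relationship between threads per domain and domains per server can be sketched as follows. This is a minimal illustration and the function names are hypothetical. The 64 cores per server follows from the dual 32-core processors used here, and the sketch assumes only thread counts that divide the core count evenly are of interest. It does not show how these values are passed to AcuSolve.

```python
def domains_per_server(cores_per_server, threads_per_domain):
    """Number of distributed memory domains on one server.

    Every core is used exactly once, so the thread count must divide
    the core count evenly.
    """
    if threads_per_domain < 1 or cores_per_server % threads_per_domain != 0:
        raise ValueError("threads_per_domain must evenly divide cores_per_server")
    return cores_per_server // threads_per_domain


def hybrid_layouts(cores_per_server, max_threads):
    """List (threads_per_domain, domains_per_server) pairs up to max_threads."""
    return [
        (threads, domains_per_server(cores_per_server, threads))
        for threads in range(1, max_threads + 1)
        if cores_per_server % threads == 0
    ]


if __name__ == "__main__":
    # 64 cores per server: two 32-core processors per node.
    for threads, domains in hybrid_layouts(64, 8):
        print(f"{threads} thread(s) per domain x {domains} domains = 64 cores")
```

For a 64-core server this yields four layouts: 1 x 64, 2 x 32, 4 x 16, and 8 x 8 (threads per domain x domains per server).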
Figure 5: AcuSolve Parallel Scaling. Performance relative to 64 cores versus number of cores (number of nodes): 64 (1), 128 (2), 256 (4), 512 (8), for the Riser, Windmill, and Nozzle models.