Parallel Programming Guide for HP-UX Systems

MPI
Tuning
Chapter 2
When a host is oversubscribed, application performance decreases because of increased
context switching.
Context switching can degrade application performance by slowing the computation phase,
increasing message latency, and lowering message bandwidth. Simulations that use
timing-sensitive algorithms can produce unexpected or erroneous results when run on an
oversubscribed system.
When your system is oversubscribed even though your MPI application itself is not, you can
use gang scheduling to improve performance.
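As one way to enable this, HP MPI releases document an MP_GANG environment variable for requesting gang scheduling. The snippet below is a sketch only; the application name and rank count are placeholders, and the exact variable name and launch syntax should be verified against your HP MPI release:

```shell
# Request gang scheduling from HP MPI (verify MP_GANG against your release).
export MP_GANG=ON

# Launch as usual; ./my_app and -np 8 are illustrative placeholders.
mpirun -np 8 ./my_app
```

With gang scheduling enabled, the processes of the MPI job are scheduled to run concurrently, which reduces the context-switching penalty on an oversubscribed host.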
MPI routine selection
To achieve the lowest message latencies and highest message bandwidths for point-to-point
synchronous communications, use the MPI blocking routines MPI_Send and MPI_Recv. For
asynchronous communications, use the MPI nonblocking routines MPI_Isend and MPI_Irecv.
When you do use blocking routines, try to avoid leaving nonblocking requests pending. MPI
must advance nonblocking messages internally, so a blocking receive may also have to advance
pending requests, occasionally resulting in lower application performance.
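The nonblocking pattern can be sketched as follows. The buffer size, message tag, and ring-neighbor exchange are illustrative assumptions, not part of the original text; the key point is that MPI_Isend/MPI_Irecv let independent computation proceed before MPI_Waitall completes both transfers:

```c
/* Sketch: overlapping communication with computation using the
 * nonblocking pair MPI_Isend/MPI_Irecv, completed by MPI_Waitall.
 * Buffer size and ring-neighbor scheme are illustrative. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    double sendbuf[1024], recvbuf[1024];
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int right = (rank + 1) % size;          /* ring neighbors */
    int left  = (rank + size - 1) % size;

    for (int i = 0; i < 1024; i++)
        sendbuf[i] = (double)rank;

    /* Post both transfers, then compute while they progress. */
    MPI_Irecv(recvbuf, 1024, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, 1024, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... independent computation can run here ... */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    printf("rank %d received data from rank %d\n", rank, left);

    MPI_Finalize();
    return 0;
}
```

For purely synchronous exchanges with no computation to overlap, the blocking MPI_Send/MPI_Recv pair is the simpler and lower-latency choice, as noted above.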
For tasks that require collective operations, use the appropriate MPI collective routine. HP
MPI takes advantage of shared memory to perform efficient data movement and maximize
your application’s communication performance.
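For example, a global sum that might otherwise be assembled from many point-to-point messages can be expressed as a single collective call. The partial values below are illustrative:

```c
/* Sketch: computing a global sum with one collective call,
 * MPI_Allreduce, instead of hand-written point-to-point exchanges.
 * Each rank's local value is an illustrative placeholder. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double local, global;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    local = (double)(rank + 1);   /* this rank's partial result */

    /* One call moves and combines data for all ranks; on a single
     * host the library can perform this through shared memory. */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);

    printf("rank %d: global sum = %g\n", rank, global);
    MPI_Finalize();
    return 0;
}
```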
Multilevel parallelism
There are several ways to improve the performance of applications that use multilevel
parallelism:
• Use the MPI library to provide coarse-grained parallelism and a parallelizing compiler to
provide fine-grained (that is, thread-based) parallelism. An appropriate mix of coarse- and
fine-grained parallelism provides better overall performance.
• Assign only one multithreaded process per host when placing application processes. This
ensures that enough processors are available as different process threads become active.
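A minimal sketch of this mix, assuming OpenMP as the compiler's thread-based parallelism (the loop body is illustrative, and MPI calls are kept outside the parallel region so plain MPI_Init suffices):

```c
/* Sketch of multilevel parallelism: MPI provides coarse-grained
 * parallelism across hosts; OpenMP threads provide fine-grained
 * parallelism within each process. Launching one process per host
 * leaves that host's processors available to the threads. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Fine-grained, thread-level work; the loop body is illustrative. */
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 1000000; i++)
        sum += 1.0 / (i + 1.0);

    /* Coarse-grained combination across processes. */
    double total;
    MPI_Reduce(&sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("combined result: %g\n", total);

    MPI_Finalize();
    return 0;
}
```

If threads must themselves make MPI calls, confirm the thread-support level of your MPI library first; the pattern above deliberately avoids that requirement.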
Table 2-7 Subscription types (Continued)

Subscription type    Description
Oversubscribed       More active processes than processors