Parallel Programming Guide for HP-UX Systems

Chapter 9: Troubleshooting
Triangular loops
While the compiler can usually auto-parallelize either the outer or the inner
loop, each choice typically causes a performance problem:
• If the outer loop is parallelized by assigning contiguous chunks of
  iterations to each of the threads, the load is severely unbalanced. For
  example, in the lower triangular example above, the thread doing the
  last chunk of iterations does far less work than the thread doing the
  first chunk.
• If the inner loop is auto-parallelized, then on each outer iteration in
  the J loop, the threads are assigned a different set of iterations in
  the I loop. Each thread therefore loses access to some of the elements
  of F it had previously cached, and the threads thrash each other’s
  caches in the process.
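The size of the imbalance in the first case can be estimated with a small model. The Python sketch below is an illustration only, not compiler output. It assumes a lower triangular nest in which outer iteration J runs the inner loop over I = J+1..N (the bounds of the example above may differ), and it counts the inner iterations each thread performs when the outer loop is split into contiguous chunks. The function name, N = 1000, and the thread count of 4 are hypothetical.

```python
def contiguous_chunk_work(n, num_threads):
    """Inner-loop iterations per thread when outer iterations J = 1..n
    are split into contiguous chunks (assumed inner loop: I = J+1..n)."""
    chunk = -(-n // num_threads)  # ceiling division: outer iterations per thread
    work = []
    for t in range(num_threads):
        first = t * chunk + 1
        last = min(first + chunk - 1, n)
        # Under the assumed bounds, outer iteration j performs n - j inner iterations.
        work.append(sum(n - j for j in range(first, last + 1)))
    return work


work = contiguous_chunk_work(1000, 4)
print(work)  # [218625, 156125, 93625, 31125]
```

Under these assumptions the thread with the first chunk performs roughly seven times as many inner iterations as the thread with the last chunk, so three threads sit idle while the first one finishes.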
By manually controlling the parallelization, you can greatly improve the
performance of a triangular loop. Parallelizing the outer loop is generally
more beneficial than parallelizing the inner loop. The next two sections
explain how to achieve the enhanced performance.
Parallelizing the outer loop
Certain directives allow you to control the parallelization of the outer
loop in a triangular loop to optimize the performance of the loop nest.
For the outer loop, assign iterations to threads in a balanced manner.
The simplest method is to assign iterations to the threads one at a time
using the CHUNK_SIZE attribute:
[Figure: Elements referenced in array X (shaded cells), plotted against I and J]
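To see why one-at-a-time assignment balances the load, the following Python sketch models outer iterations being dealt to the threads round-robin. It is a model of the scheduling idea, not HP directive syntax. It assumes that a chunk size of 1 hands successive J iterations to successive threads, and that the inner loop runs over I = J+1..N. The function name and the sample values (N = 1000, 4 threads) are hypothetical.

```python
def cyclic_work(n, num_threads, chunk_size=1):
    """Inner-loop iterations per thread when outer iterations J = 1..n are
    dealt out round-robin in chunks of chunk_size
    (assumed inner loop: I = J+1..n)."""
    work = [0] * num_threads
    for j in range(1, n + 1):
        # Chunk k (0-based) goes to thread k mod num_threads.
        thread = ((j - 1) // chunk_size) % num_threads
        work[thread] += n - j  # outer iteration j performs n - j inner iterations
    return work


work = cyclic_work(1000, 4)
print(work)  # [125250, 125000, 124750, 124500]
```

In this model the busiest and idlest threads differ by only 750 inner iterations out of about 125,000 each, because every thread receives a mix of long and short outer iterations.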