Parallel Programming Guide for HP-UX Systems

Troubleshooting

Triangular loops


While the compiler can usually auto-parallelize either the outer or the inner loop of a triangular loop nest, there are typically performance problems in either case:

• If the outer loop is parallelized by assigning contiguous chunks of iterations to each of the threads, the load is severely unbalanced. For example, in the lower triangular example above, the thread doing the last chunk of iterations does far less work than the thread doing the first chunk.

• If the inner loop is auto-parallelized, then on each outer iteration of the J loop the threads are assigned a different set of iterations in the I loop. Each thread therefore loses access to some of the elements of F it had previously cached, and the threads thrash each other's caches in the process.
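The load imbalance in the first case can be made concrete by counting inner-loop iterations per thread. The C sketch below is only an illustration, not part of any HP library: it assumes a lower triangular nest of the form `for (j = 0; j < n; j++) for (i = j; i < n; i++)`, and the function names `inner_trip_count` and `block_work` are hypothetical.

```c
#include <assert.h>

/* Assumed loop shape (not taken from the guide):
 *   for (j = 0; j < n; j++)
 *       for (i = j; i < n; i++) ...
 * Outer iteration j therefore runs n - j inner iterations. */
static long inner_trip_count(long n, long j)
{
    return n - j;
}

/* Inner iterations executed by thread t when the n outer iterations are
 * split into nthreads contiguous blocks. For brevity, n is assumed to be
 * a multiple of nthreads. */
long block_work(long n, int nthreads, int t)
{
    assert(nthreads > 0 && n % nthreads == 0 && t >= 0 && t < nthreads);

    long chunk = n / nthreads;
    long work = 0;
    for (long j = t * chunk; j < (t + 1) * chunk; j++)
        work += inner_trip_count(n, j);
    return work;
}
```

With `n = 1000` and four threads, `block_work` returns 218875 for thread 0 and 31375 for thread 3, so the first thread does roughly seven times the work of the last.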

By manually controlling the parallelization, you can greatly improve the performance of a triangular loop. Parallelizing the outer loop is generally more beneficial than parallelizing the inner loop. The next two sections explain how to achieve the enhanced performance.

Parallelizing the outer loop

Certain directives allow you to control how the outer loop of a triangular loop nest is parallelized, which lets you optimize the performance of the whole nest. For the outer loop, assign iterations to threads in a balanced manner. The simplest method is to assign iterations to the threads one at a time using the CHUNK_SIZE attribute:

[Figure: elements referenced in array X (shaded cells)]
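To see why handing out outer iterations one at a time balances the load, the following C sketch counts the inner-loop iterations each thread would execute under that assignment. It is a model of the scheduling only and does not use the CHUNK_SIZE attribute itself. It assumes a lower triangular nest of the form `for (j = 0; j < n; j++) for (i = j; i < n; i++)`, and the function name `cyclic_work` is hypothetical.

```c
#include <assert.h>

/* Inner iterations executed by thread t when outer iterations are dealt
 * out one at a time: thread t gets j = t, t + nthreads, t + 2*nthreads, ...
 * Assumed loop shape (not taken from the guide):
 *   for (j = 0; j < n; j++)
 *       for (i = j; i < n; i++) ...
 * so outer iteration j costs n - j inner iterations. */
long cyclic_work(long n, int nthreads, int t)
{
    assert(nthreads > 0 && t >= 0 && t < nthreads);

    long work = 0;
    for (long j = t; j < n; j += nthreads)
        work += n - j;
    return work;
}
```

With `n = 1000` and four threads, `cyclic_work` returns 125500 for thread 0 and 124750 for thread 3, a difference of less than one percent, because every thread receives a mix of long and short outer iterations.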
