Troubleshooting
Triangular loops
C$DIR PREFER_PARALLEL (CHUNK_SIZE = 1)
      DO J = 1, N
        DO I = J+1, N
          Y(I,J) = Y(I,J) + ...X(I,J)...
        ENDDO
      ENDDO
This causes each thread to execute in the following manner:
      DO J = MY_THREAD() + 1, N, NUM_THREADS()
        DO I = J+1, N
          Y(I,J) = Y(I,J) + ...X(I,J)...
        ENDDO
      ENDDO
where 0 <= MY_THREAD() < NUM_THREADS()
In this case, the first thread still does more work than the last, but the
imbalance is greatly reduced. For example, assume N = 128 and there
are 8 threads. The default parallel compilation would have thread 0
execute J = 1 to 16, resulting in 1912 inner iterations, while thread 7
executes J = 113 to 128, resulting in only 120 inner iterations. With
chunk_size = 1, thread 0 executes 1072 inner iterations and thread 7
executes 960.
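These counts follow directly from the loop bounds: for a given value of J,
the inner loop executes N - J iterations. The short program below (a check
written for this discussion, not part of the original example) tallies the
inner iterations each thread receives under the default block distribution
and under the chunk_size = 1 distribution, assuming N = 128 and 8 threads:
      PROGRAM TALLY
      INTEGER N, NT, T, J
      INTEGER BLK(0:7), CYC(0:7)
      N  = 128
      NT = 8
      DO T = 0, NT - 1
        BLK(T) = 0
        CYC(T) = 0
      ENDDO
C Default (block) distribution: thread T gets J = 16*T+1 to 16*(T+1)
      DO T = 0, NT - 1
        DO J = T*(N/NT) + 1, (T+1)*(N/NT)
          BLK(T) = BLK(T) + (N - J)
        ENDDO
      ENDDO
C CHUNK_SIZE = 1 (cyclic): thread T gets J = T+1, T+1+NT, ...
      DO T = 0, NT - 1
        DO J = T + 1, N, NT
          CYC(T) = CYC(T) + (N - J)
        ENDDO
      ENDDO
      PRINT *, 'Block:  thread 0 =', BLK(0), '  thread 7 =', BLK(7)
      PRINT *, 'Cyclic: thread 0 =', CYC(0), '  thread 7 =', CYC(7)
      END
Running it reports 1912 and 120 inner iterations for threads 0 and 7 under
the block distribution, and 1072 and 960 under the cyclic distribution,
matching the figures above.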
Parallelizing the inner loop
If the outer loop cannot be parallelized, it is recommended that you
parallelize the inner loop if possible. There are two issues to be aware of
when doing so:
Cache thrashing
Consider the parallelization of the following inner loop:
      DO J = I+1, N
        F(J) = F(J) + SQRT(A(J)**2 - B(I)**2)
      ENDDO
where I varies with each iteration of the enclosing outer loop.
The default iteration distribution gives each thread a contiguous
chunk containing approximately the same number of iterations, so the
amount of work per thread is about the same. However, because the
inner loop bounds (J = I+1 to N) change with every outer iteration,
the chunk boundaries shift, and each thread works on different
elements of F from one outer iteration to the next, resulting in
cache thrashing (see the sketch following these two items).
The overhead of parallelization
If the loop cannot be interchanged so that it becomes the outermost
(or at least an outermore) loop, the parallelization overhead is
incurred on every outer loop iteration and is therefore compounded by
the number of outer iterations.
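Both issues can be seen together by writing out the loop nest that the
inner loop belongs to. The sketch below assumes the outer index I runs
from 1 to N-1 and reuses the PREFER_PARALLEL directive shown earlier; the
outer bounds and the directive placement are illustrative rather than
taken from the original example.
C Illustrative loop nest; the outer bounds are assumed.
      DO I = 1, N-1
C       The parallel inner loop is started once per outer iteration,
C       so startup and synchronization costs are paid N-1 times.
C$DIR PREFER_PARALLEL
        DO J = I+1, N
C         Default distribution: contiguous chunks of J = I+1 to N.
C         The lower bound I+1 changes on every outer iteration, so the
C         chunk boundaries shift and each thread touches different
C         elements of F across outer iterations (cache thrashing).
          F(J) = F(J) + SQRT(A(J)**2 - B(I)**2)
        ENDDO
      ENDDO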