Troubleshooting
Triangular loops
C$DIR PREFER_PARALLEL (CHUNK_SIZE = 1)
      DO J = 1, N
        DO I = J+1, N
          Y(I,J) = Y(I,J) + ...X(I,J)...
        ENDDO
      ENDDO
This causes each thread to execute in the following manner:
      DO J = MY_THREAD() + 1, N, NUM_THREADS()
        DO I = J+1, N
          Y(I,J) = Y(I,J) + ...X(I,J)...
        ENDDO
      ENDDO
where 0 <= MY_THREAD() < NUM_THREADS()
In this case, the first thread still does more work than the last, but the
imbalance is greatly reduced. For example, assume N = 128 and there
are 8 threads. The default parallel compilation would have thread 0
execute J = 1 to 16, resulting in 1912 inner iterations, while thread 7
executes J = 113 to 128, resulting in only 120 inner iterations. With
chunk_size = 1, thread 0 executes 1072 inner iterations and thread 7
executes 960.
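These counts follow directly from the loop bounds: for a given value of J,
the inner loop executes N - J iterations. The short program below (a check
written for this discussion, not part of the original example) tallies the
inner iterations each thread receives under the default block distribution
and under the chunk_size = 1 distribution, assuming N = 128 and 8 threads:
      PROGRAM TALLY
      INTEGER N, NT, T, J
      INTEGER BLK(0:7), CYC(0:7)
      N  = 128
      NT = 8
      DO T = 0, NT - 1
        BLK(T) = 0
        CYC(T) = 0
      ENDDO
C Default (block) distribution: thread T gets J = 16*T+1 to 16*(T+1)
      DO T = 0, NT - 1
        DO J = T*(N/NT) + 1, (T+1)*(N/NT)
          BLK(T) = BLK(T) + (N - J)
        ENDDO
      ENDDO
C CHUNK_SIZE = 1 (cyclic): thread T gets J = T+1, T+1+NT, ...
      DO T = 0, NT - 1
        DO J = T + 1, N, NT
          CYC(T) = CYC(T) + (N - J)
        ENDDO
      ENDDO
      PRINT *, 'Block:  thread 0 =', BLK(0), '  thread 7 =', BLK(7)
      PRINT *, 'Cyclic: thread 0 =', CYC(0), '  thread 7 =', CYC(7)
      END
Running it reports 1912 and 120 inner iterations for threads 0 and 7 under
the block distribution, and 1072 and 960 under the cyclic distribution,
matching the figures above.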
Parallelizing the inner loop
If the outer loop cannot be parallelized, it is recommended that you
parallelize the inner loop if possible. There are two issues to be aware of
when doing so:
Cache thrashing
Consider the parallelization of the following inner loop:
      DO J = I+1, N
        F(J) = F(J) + SQRT(A(J)**2 - B(I)**2)
      ENDDO
where I varies with each iteration of the enclosing outer loop.
The default iteration distribution gives each thread a contiguous
chunk containing approximately the same number of iterations, so the
amount of work per thread is about the same. However, because the
inner loop bounds (J = I+1 to N) change with every outer iteration,
the chunk boundaries shift, and each thread works on different
elements of F from one outer iteration to the next, resulting in
cache thrashing (see the sketch following these two items).
The overhead of parallelization
If the loop cannot be interchanged so that it becomes the outermost
(or at least an outermore) loop, the parallelization overhead is
incurred on every outer loop iteration and is therefore compounded by
the number of outer iterations.
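Both issues can be seen together by writing out the loop nest that the
inner loop belongs to. The sketch below assumes the outer index I runs
from 1 to N-1 and reuses the PREFER_PARALLEL directive shown earlier; the
outer bounds and the directive placement are illustrative rather than
taken from the original example.
C Illustrative loop nest; the outer bounds are assumed.
      DO I = 1, N-1
C       The parallel inner loop is started once per outer iteration,
C       so startup and synchronization costs are paid N-1 times.
C$DIR PREFER_PARALLEL
        DO J = I+1, N
C         Default distribution: contiguous chunks of J = I+1 to N.
C         The lower bound I+1 changes on every outer iteration, so the
C         chunk boundaries shift and each thread touches different
C         elements of F across outer iterations (cache thrashing).
          F(J) = F(J) + SQRT(A(J)**2 - B(I)**2)
        ENDDO
      ENDDO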