Parallel Programming Guide for HP-UX Systems

Troubleshooting
Triangular loops
The scheme above causes thread 0 to do all work associated with the
cache lines starting at F(1), F(1+NTCHUNK), F(1+2*NTCHUNK), and so on.
Likewise, thread 1 does the work associated with the cache lines starting
at F(9), F(9+NTCHUNK), F(9+2*NTCHUNK), and so on.
Because ownership is fixed, any element of F that a thread assigns in
iteration I = 2 was brought into that same thread's cache in iteration
I = 1. No two threads write to the same cache line of F, which
eliminates cache thrashing among the threads.
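The ownership pattern described above can be sketched as a small serial program. This is an illustration only, not code from the guide: the function name OWNER is hypothetical, CHUNK = 8 is inferred from the chunk starts F(1) and F(9), and NTCHUNK is assumed to be the number of threads times CHUNK, with four threads chosen arbitrarily. The sketch uses free-form source and needs no parallel directives.

```fortran
! Sketch only: OWNER is a hypothetical helper, and CHUNK = 8 with
! NUM_THREADS = 4 are assumed values, not taken from the guide.
MODULE OWNERSHIP
  IMPLICIT NONE
CONTAINS
  ! Zero-based number of the thread that owns element F(J) when chunks
  ! of CHUNK consecutive elements are dealt out to threads in turn.
  INTEGER FUNCTION OWNER(J, CHUNK, NUM_THREADS)
    INTEGER, INTENT(IN) :: J, CHUNK, NUM_THREADS
    OWNER = MOD((J - 1) / CHUNK, NUM_THREADS)
  END FUNCTION OWNER
END MODULE OWNERSHIP

PROGRAM SHOW_OWNERSHIP
  USE OWNERSHIP
  IMPLICIT NONE
  INTEGER, PARAMETER :: CHUNK = 8, NUM_THREADS = 4
  ! Assumption: NTCHUNK is the span covered by one round of chunks.
  INTEGER, PARAMETER :: NTCHUNK = NUM_THREADS * CHUNK
  INTEGER :: J

  ! List the first element of every chunk in the first two rounds.
  DO J = 1, 2 * NTCHUNK, CHUNK
    PRINT '(A,I3,A,I2)', 'chunk starting at F(', J, ') -> thread ', &
          OWNER(J, CHUNK, NUM_THREADS)
  END DO
END PROGRAM SHOW_OWNERSHIP
```

With these assumed values, F(1) and F(33) map to thread 0, and F(9) and F(41) map to thread 1, which matches the pattern F(1), F(1+NTCHUNK) and F(9), F(9+NTCHUNK) described above.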
Examining the code
Having established the idea of assigning cache line ownership, consider
the following Fortran code in more detail:
C$DIR PARALLEL,PARALLEL_PRIVATE(ID,JS,JJ,J,I)
      ID = MY_THREAD() + 1 ! UNIQUE THREAD ID
      DO I = 1, N
        JS = ((I+1 + NTCHUNK-1 - ID*CHUNK ) / NTCHUNK) * NTCHUNK
     >       + (ID-1) * CHUNK + 1
        DO JJ = JS, N, NTCHUNK
          DO J = MAX (JJ, I+1), MIN (N, JJ+CHUNK-1)
            F(J) = F(J) + SQRT(A(J)**2 - B(I)**2)
          ENDDO
        ENDDO
      ENDDO
C$DIR END_PARALLEL
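One way to convince yourself that this loop nest does what is claimed is to emulate the parallel region serially, running the loop body once for each thread ID and recording which (I, J) pairs are visited. The sketch below is not from the guide. The names CHUNK_CHECK, COUNT_MISMATCHES, HITS, and UPDATER are hypothetical, the values N = 100, CHUNK = 8, and four threads are arbitrary, and NTCHUNK is assumed to equal the number of threads times CHUNK. The JS, JJ, and J expressions are copied from the code above.

```fortran
! Serial emulation sketch; all names and parameter values here are
! illustrative assumptions, not part of the guide's example.
MODULE CHUNK_CHECK
  IMPLICIT NONE
CONTAINS
  ! Returns the number of violations found; 0 means every pair with
  ! J > I is visited exactly once and each F(J) has a single owner.
  INTEGER FUNCTION COUNT_MISMATCHES(N, CHUNK, NUM_THREADS) RESULT(NBAD)
    INTEGER, INTENT(IN) :: N, CHUNK, NUM_THREADS
    INTEGER :: HITS(N, N), UPDATER(N)
    INTEGER :: NTCHUNK, ID, I, J, JJ, JS

    NTCHUNK = NUM_THREADS * CHUNK   ! assumed definition of NTCHUNK
    HITS = 0
    UPDATER = 0
    NBAD = 0

    ! One pass per thread ID stands in for the parallel region.
    DO ID = 1, NUM_THREADS
      DO I = 1, N
        JS = ((I+1 + NTCHUNK-1 - ID*CHUNK) / NTCHUNK) * NTCHUNK &
             + (ID-1) * CHUNK + 1
        DO JJ = JS, N, NTCHUNK
          DO J = MAX(JJ, I+1), MIN(N, JJ+CHUNK-1)
            HITS(I, J) = HITS(I, J) + 1
            ! F(J) must never be touched by two different threads.
            IF (UPDATER(J) /= 0 .AND. UPDATER(J) /= ID) NBAD = NBAD + 1
            UPDATER(J) = ID
          END DO
        END DO
      END DO
    END DO

    ! Compare against the triangle J = I+1 .. N.
    DO I = 1, N
      DO J = 1, N
        IF (J > I .AND. HITS(I, J) /= 1) NBAD = NBAD + 1
        IF (J <= I .AND. HITS(I, J) /= 0) NBAD = NBAD + 1
      END DO
    END DO
  END FUNCTION COUNT_MISMATCHES
END MODULE CHUNK_CHECK

PROGRAM CHECK_CHUNKED_TRIANGLE
  USE CHUNK_CHECK
  IMPLICIT NONE
  PRINT '(A,I0)', 'mismatches: ', COUNT_MISMATCHES(100, 8, 4)
END PROGRAM CHECK_CHUNKED_TRIANGLE
```

A count of zero indicates that, for the chosen values, the threads together cover the same iterations as a plain triangular loop over J = I+1 to N, and that each element of F is only ever updated by one thread. N is deliberately not a multiple of CHUNK here, so the MIN(N, JJ+CHUNK-1) bound is exercised as well.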
C$DIR PARALLEL, PARALLEL_PRIVATE(ID,JS,JJ,J,I)
Spawns threads, each of which begins executing the
statements in the parallel region. Each thread has a
private version of the variables ID, JS, JJ, J, and I.
ID = MY_THREAD() + 1 ! UNIQUE THREAD ID
Establishes a unique ID for each thread, in the
range 1 to num_threads().
DO I = 1, N
All threads execute the I loop redundantly (instead of thread 0
executing it alone).
JS = ((I+1 + NTCHUNK-1 - ID*CHUNK ) / NTCHUNK) * NTCHUNK
+ (ID-1) * CHUNK + 1