Parallel Programming Guide for HP-UX Systems

Troubleshooting
Triangular loops
Chapter 9
The scheme below assigns “ownership” of elements to threads on a cache
line basis so that threads always work on the same cache lines and retain
data locality from one iteration to the next. In addition, the parallel
directive is used to spawn threads just once. The outer, nonparallel loop
is replicated on all processors, and the inner loop iterations are manually
distributed to the threads.
C     F IS KNOWN TO BEGIN ON A CACHE LINE BOUNDARY
      NTHD = NUM_THREADS()
      CHUNK = 8                ! CHUNK * DATA SIZE (4 BYTES)
                               ! EQUALS PROCESSOR CACHE LINE SIZE;
                               ! A SINGLE THREAD WORKS ON CHUNK = 8
                               ! ITERATIONS AT A TIME
      NTCHUNK = NTHD * CHUNK   ! A CHUNK TO BE SPLIT AMONG THE THREADS
      ...
C$DIR PARALLEL,PARALLEL_PRIVATE(ID,JS,JJ,J,I)
      ID = MY_THREAD() + 1     ! UNIQUE THREAD ID
      DO I = 1, N
        JS = ((I+1 + NTCHUNK-1 - ID*CHUNK ) / NTCHUNK) * NTCHUNK
     >     + (ID-1) * CHUNK + 1
        DO JJ = JS, N, NTCHUNK
          DO J = MAX (JJ, I+1), MIN (N, JJ+CHUNK-1)
            F(J) = F(J) + SQRT(A(J)**2 - B(I)**2)
          ENDDO
        ENDDO
      ENDDO
C$DIR END_PARALLEL
The idea is to assign a fixed ownership of cache lines of F and to assign a
distribution of those cache lines to threads that keeps as many threads
busy computing whole cache lines for as long as possible. Using
CHUNK = 8 for 4-byte data makes each thread work on 8 iterations
covering a total of 32 bytes—the processor cache line size for V2250
servers.
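The ownership rule can be checked without any parallel hardware. The following is a minimal serial sketch, not part of the original example: it replaces MY_THREAD() with an ordinary loop over thread IDs and confirms that, for every outer iteration I, each element F(J) with J > I is visited exactly once and always by the thread that owns its cache line. The program name, the variable names visits and nbad, and the values chosen for nthd and n are assumptions made for illustration. It is written in free-form standard Fortran and uses none of the HP directives.

```fortran
! Serial check of the cache-line ownership scheme (free-form Fortran).
! nthd and n are illustrative values, not taken from the guide.
program check_ownership
  implicit none
  integer, parameter :: nthd = 4, chunk = 8, n = 100
  integer, parameter :: ntchunk = nthd * chunk
  integer :: visits(n)   ! visits to F(J) within one outer iteration I
  integer :: id, i, j, jj, js, nbad

  nbad = 0
  do i = 1, n
    visits = 0
    do id = 1, nthd                    ! stands in for MY_THREAD() + 1
      ! First chunk owned by thread id that reaches element I+1 or beyond
      js = ((i+1 + ntchunk-1 - id*chunk) / ntchunk) * ntchunk &
           + (id-1) * chunk + 1
      do jj = js, n, ntchunk
        do j = max(jj, i+1), min(n, jj+chunk-1)
          visits(j) = visits(j) + 1
          ! The owner of the cache line holding F(J) must be this thread
          if (mod((j-1)/chunk, nthd) + 1 /= id) nbad = nbad + 1
        end do
      end do
    end do
    ! Each J in I+1..N must be visited exactly once, and no other J at all
    do j = 1, n
      if (j > i .and. visits(j) /= 1) nbad = nbad + 1
      if (j <= i .and. visits(j) /= 0) nbad = nbad + 1
    end do
  end do

  if (nbad == 0) then
    print *, 'ownership scheme OK'
  else
    print *, 'violations found: ', nbad
  end if
end program check_ownership
```

The owner test mod((j-1)/chunk, nthd) + 1 restates the mapping implied by JS: consecutive cache lines of F are dealt out to threads 1 through NTHD in round-robin order, and that assignment does not depend on I.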
In general, set CHUNK equal to the smallest value that multiplies by the
data size to give a multiple of 32 (the processor cache line size on V2250
servers). Smaller values of CHUNK keep most threads busy most of the
time.
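As a worked illustration of this rule, the sketch below searches for the smallest CHUNK whose product with the data size is a multiple of the cache line size. The function name smallest_chunk and the list of data sizes are hypothetical and chosen only for this example. The 32-byte line size is the V2250 figure quoted above.

```fortran
! Smallest CHUNK such that CHUNK * data size is a multiple of the line size.
program pick_chunk
  implicit none
  integer, parameter :: line_size = 32   ! V2250 cache line size, per the text
  integer :: sizes(3) = (/ 4, 8, 16 /)   ! illustrative element sizes in bytes
  integer :: k

  do k = 1, 3
    print *, 'data size', sizes(k), ' -> CHUNK =', &
             smallest_chunk(sizes(k), line_size)
  end do

contains

  ! Hypothetical helper: linear search, ends no later than c = line
  integer function smallest_chunk(data_size, line) result(c)
    integer, intent(in) :: data_size, line
    c = 1
    do while (mod(c * data_size, line) /= 0)
      c = c + 1
    end do
  end function smallest_chunk

end program pick_chunk
```

For 4-byte data this gives CHUNK = 8, matching the example above. For 8-byte data it gives 4, and for 16-byte data it gives 2.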