Parallel Programming Guide for HP-UX Systems

Troubleshooting
False cache line sharing
      DO I = 1, N
        Y(I) = ...
      ENDDO
is transformed to:
C$DIR NO_PARALLEL
      DO I = 1, MIN (LREM, N)        ! 0 <= LREM < 8
        Y(I) = ...
      ENDDO
C$DIR PREFER_PARALLEL (CHUNK_SIZE = 16)
      DO I = LREM+1, N
        ! Y(LREM+1) IS ON A CACHE LINE BOUNDARY
        Y(I) = ...
      ENDDO
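The first loop handles the elements, if any, in the initial partial cache
line of data. The second loop begins on a cache line boundary. Its
CHUNK_SIZE of 16 is a multiple of the 8 elements per cache line implied by
the bound on LREM, so each thread's chunk covers whole cache lines, which
avoids false sharing among the threads.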
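The split can be sketched in ordinary Fortran. This is a minimal
illustration, not the compiler's actual method: the module name
PEEL_SKETCH, the routines PEEL_COUNT and FILL_SPLIT, and the OFFSET argument
(the position of Y(1) within its cache line, counted in elements) are all
hypothetical, and a cache line of 8 elements is assumed from the bound on
LREM above. The directives appear only as comments so that the sketch
compiles with any Fortran 90 compiler.

```fortran
      MODULE PEEL_SKETCH
      IMPLICIT NONE
      CONTAINS

      ! Number of leading elements to run serially so that the next
      ! element begins a cache line. OFFSET is the position of Y(1)
      ! within its cache line, counted in elements (0 = aligned).
      ! LINE is the number of elements per cache line (assumed 8).
      INTEGER FUNCTION PEEL_COUNT(OFFSET, LINE)
      INTEGER, INTENT(IN) :: OFFSET, LINE
      PEEL_COUNT = MOD(LINE - OFFSET, LINE)
      END FUNCTION PEEL_COUNT

      ! Fill Y(1:N) with the two-loop split shown above. On the HP
      ! compiler the first loop would carry C$DIR NO_PARALLEL and
      ! the second C$DIR PREFER_PARALLEL (CHUNK_SIZE = 16).
      ! The stored values are placeholders for the real loop body.
      SUBROUTINE FILL_SPLIT(Y, N, LREM)
      INTEGER, INTENT(IN) :: N, LREM
      REAL, INTENT(OUT) :: Y(N)
      INTEGER :: I
      DO I = 1, MIN(LREM, N)
         Y(I) = REAL(I)
      ENDDO
      DO I = LREM + 1, N
         Y(I) = REAL(I)
      ENDDO
      END SUBROUTINE FILL_SPLIT

      END MODULE PEEL_SKETCH
```

For example, if Y(1) sits 3 elements into a line of 8, PEEL_COUNT(3, 8)
returns 5, so Y(6) is the first element on a cache line boundary and the
second loop starts there.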
Working with dependences
Data dependences in loops can prevent parallelization and can also prevent
the elimination of false cache line sharing. If certain conditions are met,
some performance gains can still be achieved.
For example, consider the following code:
      COMMON /ALIGNED/ P(128,128), Q(128,128), R(128,128)
      REAL*4 P, Q, R
      DO J = 2, 128
        DO I = 2, 127
          P(I-1,J) = SQRT (P(I-1,J-1) + 1./3.)
          Q(I  ,J) = SQRT (Q(I  ,J-1) + 1./3.)
          R(I+1,J) = SQRT (R(I+1,J-1) + 1./3.)
        ENDDO
      ENDDO
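The effect of the three different row offsets on cache lines can be checked
numerically. The sketch below is an illustration under stated assumptions:
the I loop is the one divided among threads, a cache line holds 8 REAL*4
elements (consistent with the bound 0 <= LREM < 8 used earlier), and every
column of P, Q, and R starts on a cache line boundary. The module and
function names are hypothetical.

```fortran
      MODULE OFFSET_SKETCH
      IMPLICIT NONE
      CONTAINS

      ! .TRUE. if row index K (1-based) is the first element of a
      ! cache line, assuming each column starts on a line boundary
      ! and a line holds LINE elements.
      LOGICAL FUNCTION STARTS_LINE(K, LINE)
      INTEGER, INTENT(IN) :: K, LINE
      STARTS_LINE = MOD(K - 1, LINE) == 0
      END FUNCTION STARTS_LINE

      ! Number of arrays (0 to 3) in which two threads write to the
      ! same cache line when one chunk ends at I = S-1 and the next
      ! begins at I = S. The loop writes P(S-1,J), Q(S,J), R(S+1,J).
      INTEGER FUNCTION SHARED_AT(S, LINE)
      INTEGER, INTENT(IN) :: S, LINE
      INTEGER :: NS
      NS = 0
      IF (.NOT. STARTS_LINE(S - 1, LINE)) NS = NS + 1
      IF (.NOT. STARTS_LINE(S, LINE)) NS = NS + 1
      IF (.NOT. STARTS_LINE(S + 1, LINE)) NS = NS + 1
      SHARED_AT = NS
      END FUNCTION SHARED_AT

      ! Smallest SHARED_AT value over every possible chunk boundary
      ! S = 3, ..., 127 of the I loop (I runs from 2 to 127).
      INTEGER FUNCTION MIN_SHARED(LINE)
      INTEGER, INTENT(IN) :: LINE
      INTEGER :: S, BEST
      BEST = 3
      DO S = 3, 127
         BEST = MIN(BEST, SHARED_AT(S, LINE))
      ENDDO
      MIN_SHARED = BEST
      END FUNCTION MIN_SHARED

      END MODULE OFFSET_SKETCH
```

Under these assumptions MIN_SHARED(8) returns 2: wherever a chunk boundary
is placed, at most one of the three arrays begins a new cache line there, so
at least two arrays have a cache line written by two threads.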
Only the I loop is parallelized, because of the loop-carried dependences in
the J loop. Because P, Q, and R are indexed with the different offsets I-1,
I, and I+1, a chunk of I iterations that begins on a cache line boundary in
one array cannot begin on one in the other two. No distribution of the
iterations therefore eliminates false cache line sharing in the above loop.
If all loops that refer to these arrays always use the same offsets (which
is unlikely), you could adjust the array dimensions to allow a better
iteration distribution.
For example, the following would work well for 8 threads: