Parallel Programming Guide for HP-UX Systems

Troubleshooting
False cache line sharing
Chapter 9
      COMMON /ADJUSTED/ P(128,128), PAD1(15), Q(128,128),
     >   PAD2(15), R(128,128)
      DO J = 2, 128
C$DIR PREFER_PARALLEL (CHUNK_SIZE=16)
        DO I = 2, 127
          P(I-1,J) = SQRT (P(I-1,J-1) + 1./3.)
          Q(I  ,J) = SQRT (Q(I  ,J-1) + 1./3.)
          R(I+1,J) = SQRT (R(I+1,J-1) + 1./3.)
        ENDDO
      ENDDO
Padding 60 bytes (15 four-byte REAL elements) before the declarations
of both Q and R aligns P(1,J), Q(2,J), and R(3,J) on 64-byte boundaries
for all J. A CHUNK_SIZE of 16 covers 16 four-byte elements, which is
exactly one 64-byte cache line, so each thread writes only to whole
cache lines that no other thread writes.
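The alignment arithmetic above can be checked with a short stand-alone
program. The following is a minimal sketch and not part of the original
example: the program name PADCHK and its variables are hypothetical, and
it assumes 4-byte REAL elements, 64-byte cache lines, and a COMMON block
that itself begins on a cache line boundary.

```fortran
      PROGRAM PADCHK
C     Hypothetical check of the padded /ADJUSTED/ layout.
C     Assumptions: REAL is 4 bytes, a cache line is 64 bytes, and
C     the COMMON block starts on a cache line boundary.
      INTEGER ESIZE, LINE, N, NPAD
      PARAMETER (ESIZE = 4, LINE = 64, N = 128, NPAD = 15)
      INTEGER PBASE, QBASE, RBASE, OFFP, OFFQ, OFFR, J, NBAD
C     Byte offset of each array from the start of the COMMON block.
      PBASE = 0
      QBASE = PBASE + (N*N + NPAD)*ESIZE
      RBASE = QBASE + (N*N + NPAD)*ESIZE
      NBAD = 0
      DO J = 1, N
C       Element (I,J) lies ((I-1) + (J-1)*N)*ESIZE bytes past the
C       start of its array (column-major storage).
        OFFP = PBASE + (0 + (J-1)*N)*ESIZE
        OFFQ = QBASE + (1 + (J-1)*N)*ESIZE
        OFFR = RBASE + (2 + (J-1)*N)*ESIZE
C       Count every column start that misses a line boundary.
        IF (MOD(OFFP, LINE) .NE. 0) NBAD = NBAD + 1
        IF (MOD(OFFQ, LINE) .NE. 0) NBAD = NBAD + 1
        IF (MOD(OFFR, LINE) .NE. 0) NBAD = NBAD + 1
      ENDDO
      PRINT *, 'Misaligned column starts:', NBAD
      PRINT *, 'Bytes written per 16-iteration chunk:', 16*ESIZE
      END
```

Under these assumptions the count of misaligned column starts is zero,
and each chunk of 16 iterations writes 64 bytes per array, that is, one
whole cache line.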
CPU-intensive loops often exhibit a mix of the problems described
above. You cannot avoid all false cache line sharing, but by carefully
inspecting the problems and applying the workarounds shown here, you
can significantly improve the performance of your parallel loops.