Parallel Programming Guide for HP-UX Systems

Troubleshooting

False cache line sharing

      DO I = 1, N
        Y(I) = ...
      ENDDO

is transformed to:

C$DIR NO_PARALLEL
      DO I = 1, MIN (LREM, N) ! 0 <= LREM < 8
        Y(I) = ...
      ENDDO
C$DIR PREFER_PARALLEL (CHUNK_SIZE = 16)
      DO I = LREM+1, N
        ! Y(LREM+1) IS ON A CACHE LINE BOUNDARY
        Y(I) = ...
      ENDDO

The first loop handles the elements from the first (if any) partial cache line of data. The second loop begins on a cache line boundary, and is controlled with CHUNK_SIZE to avoid false sharing among the threads.
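The arithmetic behind this transformation can be checked with a small, self-contained program. This is only a sketch under stated assumptions: it takes 8 elements per cache line (as the bound on LREM suggests), and it models the alignment of Y(1) with a hypothetical OFFSET value instead of reading a real address. The names LINE_ELEMS, OFFSET, LEND, FIRST_LINE, LAST_LINE and PREV_LAST, and the formula used for LREM, are illustrative and do not come from the guide.

```fortran
      PROGRAM PEEL
      IMPLICIT NONE
      ! Assumptions (not from the guide): 8 array elements per cache
      ! line, and chunks of 16 iterations in the second loop.
      INTEGER, PARAMETER :: LINE_ELEMS = 8, CHUNK = 16, N = 100
      INTEGER :: OFFSET, LREM, I, LEND
      INTEGER :: FIRST_LINE, LAST_LINE, PREV_LAST
      ! OFFSET is a hypothetical position of Y(1) inside its cache
      ! line, counted in elements (0 means Y(1) is already aligned).
      DO OFFSET = 0, LINE_ELEMS - 1
         ! Elements left in the first partial line: 0 <= LREM < 8
         LREM = MOD(LINE_ELEMS - OFFSET, LINE_ELEMS)
         ! Last line touched so far (by the serial first loop, if any)
         PREV_LAST = -1
         IF (LREM > 0) PREV_LAST = (OFFSET + LREM - 1) / LINE_ELEMS
         ! Walk the chunks of the second (parallel) loop
         DO I = LREM + 1, N, CHUNK
            LEND = MIN(I + CHUNK - 1, N)
            FIRST_LINE = (OFFSET + I - 1) / LINE_ELEMS
            LAST_LINE = (OFFSET + LEND - 1) / LINE_ELEMS
            ! Each chunk must start on a cache line boundary ...
            IF (MOD(OFFSET + I - 1, LINE_ELEMS) /= 0) ERROR STOP 1
            ! ... and must not reuse a line touched by earlier work
            IF (FIRST_LINE <= PREV_LAST) ERROR STOP 2
            PREV_LAST = LAST_LINE
         END DO
         PRINT *, 'OFFSET =', OFFSET, ' LREM =', LREM
      END DO
      END PROGRAM PEEL
```

The program stops with an error if any chunk starts inside a cache line or overlaps a line used by an earlier chunk. Under these assumptions it runs to completion for every OFFSET, because the chunk size of 16 is a multiple of the 8 elements in a line.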

Working with dependences

Data dependences in loops can prevent parallelization and can also prevent the elimination of false cache line sharing. If certain conditions are met, some performance gains are still achievable.

For example, consider the following code:

      COMMON /ALIGNED / P(128,128), Q(128,128), R(128,128)
      REAL*4 P, Q, R
      DO J = 2, 128
        DO I = 2, 127
          P(I-1,J) = SQRT (P(I-1,J-1) + 1./3.)
          Q(I  ,J) = SQRT (Q(I  ,J-1) + 1./3.)
          R(I+1,J) = SQRT (R(I+1,J-1) + 1./3.)
        ENDDO
      ENDDO

Only the I loop is parallelized, due to the loop-carried dependences in the J loop. It is impossible to distribute the iterations so that there is no false cache line sharing in the above loop: the three arrays are indexed at I-1, I, and I+1, so a chunk boundary that coincides with a cache line boundary in one array falls inside a cache line in the other two. If all loops that refer to these arrays always used the same offsets (which is unlikely), you could make dimension adjustments that would allow a better iteration distribution.
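The claim that no distribution of the I iterations avoids false sharing can be explored with a brute-force count. The following is a minimal sketch, not the compiler's actual scheduling: it assumes 8 REAL*4 elements per cache line, columns that start on a cache line boundary, and chunks of consecutive I iterations dealt round-robin to 8 threads. The names OWNER, SHARED, NSHARED and BEST are hypothetical helpers introduced for the illustration.

```fortran
      PROGRAM SHARE
      IMPLICIT NONE
      ! Assumptions (not from the guide): 8 REAL*4 elements per cache
      ! line, each column starting on a line boundary, and chunks of
      ! CHUNK consecutive I iterations dealt round-robin to threads.
      INTEGER, PARAMETER :: LINE_ELEMS = 8, NTHREADS = 8
      INTEGER, PARAMETER :: ILO = 2, IHI = 127, NLINES = 16
      INTEGER :: OWNER(0:NLINES-1)
      LOGICAL :: SHARED(0:NLINES-1)
      INTEGER :: CHUNK, D, I, T, L, NSHARED, BEST
      BEST = HUGE(BEST)
      ! Try every chunk size that keeps at least two threads busy
      DO CHUNK = 1, (IHI - ILO + 1) / 2
         NSHARED = 0
         ! D is the subscript offset: -1 for P, 0 for Q, +1 for R
         DO D = -1, 1
            OWNER = -1
            SHARED = .FALSE.
            DO I = ILO, IHI
               ! Thread that runs iteration I, and the line it writes
               T = MOD((I - ILO) / CHUNK, NTHREADS)
               L = (I + D - 1) / LINE_ELEMS
               IF (OWNER(L) == -1) OWNER(L) = T
               IF (OWNER(L) /= T) SHARED(L) = .TRUE.
            END DO
            NSHARED = NSHARED + COUNT(SHARED)
         END DO
         BEST = MIN(BEST, NSHARED)
         IF (CHUNK == 8 .OR. CHUNK == 16) THEN
            PRINT *, 'CHUNK =', CHUNK, ' shared lines =', NSHARED
         END IF
      END DO
      PRINT *, 'fewest shared lines over all chunk sizes =', BEST
      ! Every chunk size tried leaves some line written by two threads
      IF (BEST <= 0) ERROR STOP 1
      END PROGRAM SHARE
```

For one column of each array, the program counts the cache lines written by more than one thread and stops with an error if any chunk size removes the sharing entirely. Under these assumptions no chunk size does, because the misalignment comes from the differing subscript offsets and not from the chunk size. That is why the remedy has to change the array dimensions.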

For example, the following would work well for 8 threads: