Parallel Programming Guide for HP-UX Systems

Troubleshooting
False cache line sharing
      REAL*4 B(112,100)
      COMMON /ALIGNED/ B
C$DIR PREFER_PARALLEL (CHUNK_SIZE=16)
      DO I = 1, 100
        DO J = 1, 100
          B(I,J) = ...B(I,J-1)...
        ENDDO
      ENDDO
You must specify a constant CHUNK_SIZE attribute. However, the ideal is
to distribute work so that all but one thread works on the same number
of whole cache lines, and the remaining thread works on any partial
cache line. For example, given the following:
NITS = number of iterations
NTHDS = number of threads
LSIZE = line size in words (8 for 4-byte data, 4 for 8-byte data, 2 for
16-byte data)
the ideal CHUNK_SIZE, computed with integer (truncating) division, would be:
CHUNK_SIZE = LSIZE * (1 + ((1 + (NITS - 1) / LSIZE) - 1) / NTHDS)
For the code above, these numbers are:
NITS = 100
LSIZE = 8 (aligns on V2250 boundaries for 4-byte data)
NTHDS = 8
CHUNK_SIZE = 8 * (1 + ((1 + (100 - 1) / 8) - 1) / 8)
           = 8 * (1 + ((1 + 12) - 1) / 8)
           = 8 * (1 + 12 / 8)
           = 8 * (1 + 1)
           = 16
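The arithmetic above can be checked with a short script. The following is a minimal sketch in Python rather than the guide's Fortran, chosen only for convenience; the function name ideal_chunk_size and its argument names are hypothetical and are not part of any HP compiler interface. It assumes all three inputs are positive, so that Python's floor division matches Fortran's truncating integer division.

```python
def ideal_chunk_size(nits, nthds, lsize):
    """Evaluate the CHUNK_SIZE formula with truncating integer division.

    nits  -- number of loop iterations (NITS)
    nthds -- number of threads (NTHDS)
    lsize -- cache line size in words (LSIZE)
    """
    if nits < 1 or nthds < 1 or lsize < 1:
        raise ValueError("nits, nthds and lsize must all be positive")
    # Cache lines touched by the loop, counting a trailing partial line.
    lines = 1 + (nits - 1) // lsize
    # Whole cache lines assigned to each thread, rounded up.
    lines_per_thread = 1 + (lines - 1) // nthds
    return lsize * lines_per_thread


if __name__ == "__main__":
    # Values from the worked example: NITS = 100, NTHDS = 8, LSIZE = 8.
    print(ideal_chunk_size(nits=100, nthds=8, lsize=8))
```

Because the result is always a multiple of lsize, every chunk boundary falls on a cache line boundary whenever the first iteration does.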
CHUNK_SIZE = 16 causes threads 0, 1, ..., 5 to execute iterations 1-16,
17-32, ..., 81-96, respectively. Thread 6 executes iterations 97-100, and
thread 7 receives no iterations. As a result there is no false cache line
sharing, and parallel performance is greatly improved.
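To see why this chunk size avoids false sharing, it can help to list the chunks and check which cache lines each one touches. The sketch below is a hedged illustration in Python, not compiler behavior: the helper names chunk_ranges and shared_lines are hypothetical. It assumes that iteration 1 starts on a cache line boundary, as the aligned array above is intended to ensure, and that each chunk is handled by a different thread.

```python
def chunk_ranges(nits, chunk_size):
    """Split iterations 1..nits into consecutive chunks (inclusive bounds)."""
    return [(first, min(first + chunk_size - 1, nits))
            for first in range(1, nits + 1, chunk_size)]


def shared_lines(ranges, lsize):
    """Return the cache line indices touched by more than one chunk.

    Assumes iteration 1 sits on a cache line boundary, so iteration i
    falls in line (i - 1) // lsize.
    """
    owner = {}
    shared = set()
    for chunk, (first, last) in enumerate(ranges):
        for line in range((first - 1) // lsize, (last - 1) // lsize + 1):
            # The first chunk to touch a line owns it; any later chunk
            # touching the same line marks it as shared.
            if owner.setdefault(line, chunk) != chunk:
                shared.add(line)
    return sorted(shared)


if __name__ == "__main__":
    ranges = chunk_ranges(100, 16)
    print(ranges)                    # chunks produced by CHUNK_SIZE = 16
    print(shared_lines(ranges, 8))   # lines shared between chunks
```

With a chunk size of 16 and 8 words per line, no line is reported as shared. A chunk size that is not a multiple of the line size, such as 13, leaves several lines split between neighboring chunks.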
You cannot specify the ideal CHUNK_SIZE for every loop. However, using
CHUNK_SIZE = x