Parallel Programming Guide for HP-UX Systems

Troubleshooting
False cache line sharing
Chapter 9, page 176
Aligning data to avoid false sharing
Because false cache line sharing is partially due to the layout of the data,
one step in avoiding it is to adjust the layout. Adjustments are typically
made by aligning data on cache line boundaries. Aligning arrays
generally improves performance, although it can occasionally decrease
it.
The second step in avoiding false cache line sharing is to adjust the
distribution of loop iterations. This is covered in “Distributing iterations
on cache line boundaries” on page 176.
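The effect of alignment can be checked with simple address arithmetic.
The following C sketch is illustrative only: it assumes a 64-byte cache
line (the boundary size cited in this section), and the names
CACHE_LINE_BYTES, line_of_offset, and share_line are hypothetical
helpers, not part of any HP library. It reports whether two elements of
an array fall on the same cache line, given how far the start of the
array is from a cache line boundary.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 64-byte cache lines, matching the boundaries cited here. */
#define CACHE_LINE_BYTES 64

/* Index of the cache line holding the byte at offset `off`, where
   offset 0 lies exactly on a cache line boundary. */
size_t line_of_offset(size_t off)
{
    return off / CACHE_LINE_BYTES;
}

/* Returns 1 if elements i and j share a cache line, 0 otherwise.
   start_off is the distance in bytes from the preceding cache line
   boundary to element 0; elem_size is the element size in bytes. */
int share_line(size_t start_off, size_t elem_size, size_t i, size_t j)
{
    return line_of_offset(start_off + i * elem_size) ==
           line_of_offset(start_off + j * elem_size);
}
```

With 8-byte elements and an aligned array (start_off of 0), elements 7
and 8 land on different cache lines, so two threads that split the work
between them do not share a line. If the array instead starts 32 bytes
past a boundary, elements 7 and 8 share a line, and the split between
lines moves to elements 3 and 4.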
Aligning arrays on cache line boundaries
Note that the previous example assumes that array B starts on a
cache line boundary. The methods below force arrays in Fortran to start
on cache line boundaries:
• Using uninitialized COMMON blocks (blocks with no DATA statements).
  These blocks start on 64-byte boundaries.
• Using ALLOCATE statements. These statements return addresses on
  64-byte boundaries. This only applies to parallel executables.
The methods below force arrays in C to start on cache line boundaries:
• Using the functions malloc or memory_class_malloc. These
  functions return pointers on 64-byte boundaries.
• Using uninitialized global arrays or structs that are at least 32 bytes.
  Such arrays and structs are aligned on 64-byte boundaries.
• Using uninitialized data of the external storage class in C that is at
  least 32 bytes. Data is aligned on 64-byte boundaries.
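The alignment guarantees listed above are specific to the HP-UX
compilers and libraries this guide describes. As a hedged sketch for
code that cannot rely on them, the C fragment below requests the
alignment explicitly with the standard C11 function aligned_alloc and
then verifies the result. The names CACHE_LINE_BYTES, is_line_aligned,
and alloc_line_aligned_doubles are hypothetical, and the 64-byte line
size is an assumption taken from the boundaries cited above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumption: 64-byte cache lines, matching the boundaries cited here. */
#define CACHE_LINE_BYTES 64

/* Returns 1 if p lies on a cache line boundary, 0 otherwise. */
int is_line_aligned(const void *p)
{
    return ((uintptr_t)p % CACHE_LINE_BYTES) == 0;
}

/* Hypothetical helper: allocates room for n doubles starting on a
   cache line boundary. C11 aligned_alloc expects the size to be a
   multiple of the alignment, so the byte count is rounded up.
   Overflow of n * sizeof(double) is not checked in this sketch.
   Returns NULL on failure; release the memory with free(). */
double *alloc_line_aligned_doubles(size_t n)
{
    size_t bytes = n * sizeof(double);
    size_t rounded = (bytes + CACHE_LINE_BYTES - 1)
                     / CACHE_LINE_BYTES * CACHE_LINE_BYTES;
    if (rounded == 0)
        rounded = CACHE_LINE_BYTES;   /* avoid a zero-size request */
    return aligned_alloc(CACHE_LINE_BYTES, rounded);
}
```

A caller might allocate an array with alloc_line_aligned_doubles(100),
confirm the placement with is_line_aligned before entering a parallel
loop, and release the array with free when it is no longer needed.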
Distributing iterations on cache line boundaries
Recall that the default iteration distribution causes thread 0 to work on
iterations 1-12 and thread 1 to work on iterations 13-25, and so on. Even
though the cache lines are aligned across the columns of the array (see
the description of the default iteration distribution and Table 9-1 on
page 175), the iteration distribution still needs to be changed. Use the
CHUNK_SIZE attribute to change the distribution: