Parallel Programming Guide for HP-UX Systems

Troubleshooting

Triangular loops

Chapter 9

The scheme below assigns “ownership” of elements to threads on a cache
line basis so that threads always work on the same cache lines and retain
data locality from one iteration to the next. In addition, the parallel
directive is used to spawn threads just once. The outer, nonparallel loop
is replicated on all processors, and the inner loop iterations are manually
distributed to the threads.

C     F IS KNOWN TO BEGIN ON A CACHE LINE BOUNDARY
      NTHD = NUM_THREADS()
      CHUNK = 8              ! CHUNK * DATA SIZE (4 BYTES)
                             ! EQUALS PROCESSOR CACHE LINE SIZE;
                             ! A SINGLE THREAD WORKS ON CHUNK = 8
                             ! ITERATIONS AT A TIME
      NTCHUNK = NTHD * CHUNK ! A CHUNK TO BE SPLIT AMONG THE THREADS
      ...
C$DIR PARALLEL,PARALLEL_PRIVATE(ID,JS,JJ,J,I)
      ID = MY_THREAD() + 1   ! UNIQUE THREAD ID
      DO I = 1, N
C       JS: START OF THE FIRST CHUNK OWNED BY THIS THREAD THAT
C       ENDS AT OR AFTER I+1
        JS = ((I+1 + NTCHUNK-1 - ID*CHUNK ) / NTCHUNK) * NTCHUNK
     >       + (ID-1) * CHUNK + 1
C       VISIT EVERY CHUNK THIS THREAD OWNS, CLIPPED TO I+1..N
        DO JJ = JS, N, NTCHUNK
          DO J = MAX (JJ, I+1), MIN (N, JJ+CHUNK-1)
            F(J) = F(J) + SQRT(A(J)**2 - B(I)**2)
          ENDDO
        ENDDO
      ENDDO
C$DIR END_PARALLEL

The idea is to give each thread fixed ownership of particular cache lines
of F, and to distribute those cache lines so that as many threads as
possible stay busy computing whole cache lines for as long as possible.
Using CHUNK = 8 for 4-byte data makes each thread work on 8 iterations
covering a total of 32 bytes, which is the processor cache line size for
V2250 servers.
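The ownership rule can be checked with a small model of the index
arithmetic. The following is a minimal sketch in Python rather than HP
Fortran, chosen only so that it runs without the HP directives. The
function names first_chunk_start, owned_iterations, and check_scheme are
hypothetical, threads are simulated one after another, and indices are
1-based as in the listing. It mirrors the JS formula and the JJ and J loop
bounds, then confirms that each J in I+1..N is executed exactly once, by
the thread that owns its cache line.

```python
def first_chunk_start(i, tid, nthd, chunk):
    """Mirror of the JS formula: start of the first chunk owned by
    thread `tid` (1-based) whose last element is at or after i + 1."""
    ntchunk = nthd * chunk
    # The numerator is always >= i >= 1, so Python's floor division
    # gives the same result as Fortran's truncating integer division.
    k = (i + 1 + ntchunk - 1 - tid * chunk) // ntchunk
    return k * ntchunk + (tid - 1) * chunk + 1


def owned_iterations(i, tid, n, nthd, chunk):
    """J values that thread `tid` executes for outer iteration `i`."""
    ntchunk = nthd * chunk
    result = []
    for jj in range(first_chunk_start(i, tid, nthd, chunk), n + 1, ntchunk):
        # Same bounds as: DO J = MAX(JJ, I+1), MIN(N, JJ+CHUNK-1)
        result.extend(range(max(jj, i + 1), min(n, jj + chunk - 1) + 1))
    return result


def check_scheme(n, nthd, chunk):
    """True if every J in I+1..N runs exactly once, on its owning thread."""
    for i in range(1, n + 1):
        seen = []
        for tid in range(1, nthd + 1):
            for j in owned_iterations(i, tid, n, nthd, chunk):
                # Chunk (cache line) index of j, dealt round-robin to threads.
                owner = ((j - 1) // chunk) % nthd + 1
                if owner != tid:
                    return False
                seen.append(j)
        if sorted(seen) != list(range(i + 1, n + 1)):
            return False
    return True


if __name__ == "__main__":
    print(check_scheme(n=100, nthd=4, chunk=8))
```

Because the owner of an element depends only on its index, not on I, a
thread revisits the same cache lines of F on every outer iteration, which
is the locality the scheme is designed to preserve.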

In general, set CHUNK to the smallest value that, when multiplied by the
data size in bytes, gives a multiple of 32 (the processor cache line size
on V2250 servers). Smaller values of CHUNK divide the shrinking inner
iteration range into more pieces, which keeps most threads busy most of
the time.
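The rule for choosing CHUNK reduces to a short calculation. As a hedged
sketch (Python, with a hypothetical helper named smallest_chunk, and
assuming the element size and the cache line size are both known in
bytes), the smallest CHUNK whose byte span is a multiple of the cache line
size is the line size divided by the greatest common divisor of the line
size and the element size.

```python
from math import gcd


def smallest_chunk(elem_size, line_size=32):
    """Smallest CHUNK such that CHUNK * elem_size is a multiple of
    line_size. Both sizes are in bytes; 32 is the V2250 line size
    quoted in the text."""
    return line_size // gcd(line_size, elem_size)


if __name__ == "__main__":
    for size in (4, 8, 16):
        print(size, smallest_chunk(size))
```

For the 4-byte data in the example this gives 8, matching CHUNK = 8 in the
listing. For 8-byte data it gives 4, and for 16-byte data it gives 2.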