Parallel Programming Guide for HP-UX Systems

Troubleshooting
Triangular loops
Chapter 9
Determines, for a given value of I+1, which NTCHUNK
the value I+1 falls in. It then assigns a unique
CHUNK of that NTCHUNK to each thread ID. Suppose that there are
ntc NTCHUNKs, where ntc is approximately N/NTCHUNK.
Then the expression:
((I+1 + NTCHUNK-1 - ID*CHUNK ) / NTCHUNK)
returns a value in the range 0 to ntc (integer division
truncates) for a given value of I+1. Then the expression:
((I+1 + NTCHUNK-1 - ID*CHUNK ) / NTCHUNK) * NTCHUNK
identifies the start of an NTCHUNK that contains I+1 or
is immediately above I+1 for a given value of ID.
For the NTCHUNK that contains I+1, if the cache lines
owned by a thread either contain I+1 or are above I+1
in memory, this expression returns this NTCHUNK. If the
cache lines owned by a thread are below I+1 in this
NTCHUNK, this expression returns the next higher
NTCHUNK. In other words, if there is no work for a
particular thread to do in this NTCHUNK, the thread
starts working in the next one.
(ID-1) * CHUNK + 1
identifies the start of the particular cache line for the
thread to compute within this NTCHUNK.
DO JJ = JS, N, NTCHUNK
runs through a unique set of cache lines, starting at the
thread's specific JS and continuing into succeeding
NTCHUNKs until all the work is done.
DO J = MAX (JJ, I+1), MIN (N, JJ+CHUNK-1)
performs the work within a single cache line. If the
starting index (I+1) is greater than the first element in
the cache line (JS), then start with I+1. If the ending
index (N) is less than the last element in the cache line,
then finish with N.
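The index arithmetic described above can be checked with a small serial sketch. It is not taken from the guide: the sizes N = 40, NTHD = 4 and CHUNK = 8, the OWNER array, and the serial loop over ID that stands in for the parallel threads are all assumptions chosen for illustration. For each I, the sketch computes JS for every thread ID, walks the JJ and J loops, and confirms that every J from I+1 to N is handled by exactly one thread.

```fortran
program triangular_chunks
  implicit none
  ! Illustrative sizes only (assumptions, not values from the guide).
  integer, parameter :: N = 40              ! extent of the J dimension
  integer, parameter :: NTHD = 4            ! number of threads
  integer, parameter :: CHUNK = 8           ! elements per cache line
  integer, parameter :: NTCHUNK = NTHD * CHUNK
  integer :: I, ID, JS, JJ, J
  integer :: OWNER(N)                       ! hypothetical bookkeeping array
  logical :: OK

  OK = .true.
  do I = 1, N - 1
     OWNER = 0
     ! Serial stand-in for the parallel loop over thread IDs.
     do ID = 1, NTHD
        ! Truncating integer division selects the NTCHUNK; the last
        ! term selects this thread's cache line inside that NTCHUNK.
        JS = ((I+1 + NTCHUNK-1 - ID*CHUNK) / NTCHUNK) * NTCHUNK &
             + (ID-1) * CHUNK + 1
        if (I == 10) print '(a,i2,a,i3)', 'I+1 = 11, ID = ', ID, ', JS = ', JS
        do JJ = JS, N, NTCHUNK
           do J = max(JJ, I+1), min(N, JJ+CHUNK-1)
              if (OWNER(J) /= 0) OK = .false.   ! J visited twice
              OWNER(J) = ID
           end do
        end do
     end do
     ! Each J in I+1..N needs exactly one owner; J <= I must have none.
     if (any(OWNER(I+1:N) == 0)) OK = .false.
     if (any(OWNER(1:I) /= 0)) OK = .false.
  end do

  if (OK) then
     print '(a)', 'each J in I+1..N is handled by exactly one thread'
  else
     print '(a)', 'coverage check failed'
  end if
end program triangular_chunks
```

With these assumed sizes and I+1 = 11, thread 1 owns elements 1 through 8, which lie below I+1, so its JS moves to the next NTCHUNK (JS = 33). Threads 2, 3 and 4 start in the current NTCHUNK (JS = 9, 17 and 25), and thread 2 begins its J loop at 11 rather than 9 because of the MAX.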
The following are observations of the preceding loops:
Most of the “complicated” arithmetic is in the outer loop iterations.