aggregate speedup of each because of these qualities; however, both CPU
time and wall-clock time can be reduced most dramatically when vector
and parallel processing are combined.
The qualities involved are as follows:
• Large numbers of vector loads and stores can create a memory
bottleneck in the system. On the other hand, small amounts of CPU
work can make the parallel-processing startup overhead itself the
bottleneck. Vector operations have a smaller startup overhead than
parallel processing, so they amortize this expense much sooner.
However, vector processing demands more from memory than parallel
processing (on scalar CPUs) because one vector load or store can affect
up to 64 elements, whereas a scalar load or store typically affects only
one element.
• Vector processing is "free" for the scalar CPUs because the work is
done on a vector processor; both wall-clock time and scalar CPU time
decrease. Parallel processing, on the other hand, is not free for the
scalar CPUs: it can never decrease CPU time, but it can reduce
wall-clock time more dramatically than vector processing can.
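The overhead tradeoff described in the first bullet can be illustrated with a toy cost model. The constants below are illustrative assumptions, not measured figures for any real machine:

```python
# Toy cost model for the vector vs. parallel startup tradeoff.
# All constants are hypothetical, chosen only to show the shape of the curve.

VECTOR_STARTUP = 10      # assumed cycles to issue one vector instruction
PARALLEL_STARTUP = 5000  # assumed cycles to fork and join parallel threads
WORK_PER_ELEMENT = 2     # assumed cycles of useful work per element

def vector_cost(n):
    """Cost of processing n elements in 64-element vector strips."""
    strips = (n + 63) // 64          # one vector instruction covers up to 64 elements
    return strips * VECTOR_STARTUP + n * WORK_PER_ELEMENT

def parallel_cost(n, cpus=4):
    """Wall-clock cost of splitting n elements across scalar CPUs."""
    return PARALLEL_STARTUP + (n * WORK_PER_ELEMENT) / cpus

for n in (64, 1024, 100_000):
    print(n, vector_cost(n), parallel_cost(n))
```

For small loops the vector version wins because its small startup cost is amortized almost immediately, while the parallel fork/join overhead dominates; for very large loops the parallel version's wall-clock time eventually pulls ahead.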
Vectorization can be effectively combined with decomposition:
1 Compile, debug, and run the program serially and in scalar.
2 Evaluate the algorithm and make suitable changes.
3 Unless your algorithm and system environment are especially suitable
for parallel processing, or you have already decomposed the program,
compile, debug, and run the program with /VECTOR first; as noted
above, vectorization is "free" for the scalar CPUs.
4 Using /VECTOR/PARALLEL=AUTOMATIC, recompile, debug, and
run the program.
5 Evaluate performance.
• If performance is adequate, stop.
• If performance is inadequate, review the /SHOW=LOOPS output
and the LSE diagnostics, and modify the source code as needed for
important loops that neither vectorized nor decomposed (most
probably by adding assertions to resolve unknown dependences).
Then retest the program. If performance is still not acceptable,
consider manually decomposing certain loops, and look for other
bottlenecks such as I/O or other performance inhibitors.
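The performance evaluation in step 5 can be guided by a rough Amdahl-style estimate of the combined speedup. This is a sketch; the fractions, gain, and CPU count below are hypothetical inputs you would replace with profile data:

```python
def estimated_speedup(vec_frac, vec_gain, par_frac, ncpus):
    """Amdahl-style estimate of combined vector + parallel speedup.

    vec_frac : fraction of run time spent in loops that vectorized
    vec_gain : assumed speedup factor for the vectorized portion
    par_frac : fraction of run time in loops that decomposed instead
    ncpus    : number of scalar CPUs available for decomposed loops
    """
    assert 0.0 <= vec_frac + par_frac <= 1.0
    serial = 1.0 - vec_frac - par_frac           # part that stays scalar/serial
    new_time = serial + vec_frac / vec_gain + par_frac / ncpus
    return 1.0 / new_time

# Hypothetical profile: 60% vectorized at 8x, 30% decomposed across 4 CPUs.
print(estimated_speedup(0.6, 8, 0.3, 4))
```

If the estimate is far above what you measure, that is a hint that startup overhead, memory traffic, or I/O is inhibiting the expected gains, which is exactly what the /SHOW=LOOPS review in step 5 is meant to uncover.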