aggregate speedup of each because of these qualities; however, both CPU
time and wall-clock time can be reduced most dramatically when vector
and parallel processing are combined.
The qualities involved are as follows:
• Large numbers of vector loads and stores can create a memory
bottleneck in the system. On the other hand, small amounts of CPU
work can make the parallel-processing startup overhead itself the
bottleneck. Vector operations have a smaller startup overhead than
parallel processing, so they amortize this expense much sooner.
However, vector processing demands more from memory than parallel
processing (on scalar CPUs) because one vector load or store can affect
up to 64 elements, whereas a scalar load or store typically affects only
one element.
• Vector processing is "free" for the scalar CPUs because the work is
done on a vector processor; both wall-clock time and scalar CPU time
decrease. Parallel processing, on the other hand, is not free for the
scalar CPUs: it can never decrease CPU time, but it can reduce
wall-clock time more dramatically than vector processing can.
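The overhead tradeoff described in the first bullet can be illustrated with a toy cost model. The constants below are illustrative assumptions, not measured figures for any real machine:

```python
# Toy cost model for the vector vs. parallel startup tradeoff.
# All constants are hypothetical, chosen only to show the shape of the curve.

VECTOR_STARTUP = 10      # assumed cycles to issue one vector instruction
PARALLEL_STARTUP = 5000  # assumed cycles to fork and join parallel threads
WORK_PER_ELEMENT = 2     # assumed cycles of useful work per element

def vector_cost(n):
    """Cost of processing n elements in 64-element vector strips."""
    strips = (n + 63) // 64          # one vector instruction covers up to 64 elements
    return strips * VECTOR_STARTUP + n * WORK_PER_ELEMENT

def parallel_cost(n, cpus=4):
    """Wall-clock cost of splitting n elements across scalar CPUs."""
    return PARALLEL_STARTUP + (n * WORK_PER_ELEMENT) / cpus

for n in (64, 1024, 100_000):
    print(n, vector_cost(n), parallel_cost(n))
```

For small loops the vector version wins because its small startup cost is amortized almost immediately, while the parallel fork/join overhead dominates; for very large loops the parallel version's wall-clock time eventually pulls ahead.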
Vectorization can be effectively combined with decomposition:
1 Compile, debug, and run the program serially and in scalar.
2 Evaluate the algorithm and make suitable changes.
3 Unless your algorithm and system environment are especially suitable
for parallel processing, or you have already decomposed the program,
compile, debug, and run the program with /VECTOR first; as noted
above, vectorization is "free" for the scalar CPUs.
4 Using /VECTOR/PARALLEL=AUTOMATIC, recompile, debug, and
run the program.
5 Evaluate performance.
• If performance is adequate, stop.
• If performance is inadequate, review the /SHOW=LOOPS output
and the LSE diagnostics, and modify the source code as needed for
important loops that neither vectorized nor decomposed (most
probably by adding assertions to resolve unknown dependences).
Then retest the program. If performance is still not acceptable,
consider manually decomposing certain loops, and look for other
bottlenecks such as I/O or other performance inhibitors.
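The performance evaluation in step 5 can be guided by a rough Amdahl-style estimate of the combined speedup. This is a sketch; the fractions, gain, and CPU count below are hypothetical inputs you would replace with profile data:

```python
def estimated_speedup(vec_frac, vec_gain, par_frac, ncpus):
    """Amdahl-style estimate of combined vector + parallel speedup.

    vec_frac : fraction of run time spent in loops that vectorized
    vec_gain : assumed speedup factor for the vectorized portion
    par_frac : fraction of run time in loops that decomposed instead
    ncpus    : number of scalar CPUs available for decomposed loops
    """
    assert 0.0 <= vec_frac + par_frac <= 1.0
    serial = 1.0 - vec_frac - par_frac           # part that stays scalar/serial
    new_time = serial + vec_frac / vec_gain + par_frac / ncpus
    return 1.0 / new_time

# Hypothetical profile: 60% vectorized at 8x, 30% decomposed across 4 CPUs.
print(estimated_speedup(0.6, 8, 0.3, 4))
```

If the estimate is far above what you measure, that is a hint that startup overhead, memory traffic, or I/O is inhibiting the expected gains, which is exactly what the /SHOW=LOOPS review in step 5 is meant to uncover.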