Technical data

Optimizing with MACRO-32
3.10 REGISTER REUSE
Example 3–9 reuses data that has already been loaded into a vector
register, a technique known as register reuse. Register
reuse can be extended further by using all available vector registers to
decrease the bytes/FLOP ratio and improve performance. With maximum
register reuse, programs on the VAX 6000 Model 400 vector processor can
approach a peak single-precision performance of 90 MFLOPs and a peak
double-precision performance of 45 MFLOPs.
To implement register reuse for matrix multiply, the J loop must be
unrolled. By precomputing 14 partial results, using only the first column
of A with 14 different columns of B, it is possible to use 14 vector registers
(instead of 14 memory locations) to hold the partial results. Thus, all N
rows of B can be accessed in groups of 14 columns to compute the first
14 columns of C. When the final row of B is reached, the results are
chained into a store to array C. The next set of 14 columns of C is then
calculated. The unrolling depth of 14 is chosen because of the number of
vector registers: the VAX vector architecture provides 16, so 14 can hold
partial results while two remain free for operands and intermediate
results. Example 3–10 shows the MACRO pseudocode
to accomplish this for values of N <= 64. Although the code is longer,
performance is greatly improved because whole segments of the code are
purely vector arithmetic. The bytes/FLOP ratio has dropped to better
than 4 to 14, allowing the algorithm to approach peak vector speeds.
When implemented in matrix solvers, speedups greater than 25 have been
realized on a VAX 6000 Model 410 system with a vector processor.
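
The loop structure described above can be sketched in a high-level
language. The following is a minimal illustration in Python, not the
MACRO-32 code of Example 3–10: plain lists stand in for vector registers,
and the names matmul_register_reuse, acc, and a_column_loads are
hypothetical, introduced only for this sketch. It assumes square N x N
matrices stored as lists of rows. The counter records how many times a
column of A is fetched, which is the memory traffic that register reuse
is meant to reduce.

```python
# Illustrative sketch only: Python lists stand in for vector registers.
# All names here are hypothetical and do not come from Example 3-10.
UNROLL = 14  # number of partial-result "registers" (the unrolling depth)


def matmul_register_reuse(a, b, n, unroll=UNROLL):
    """Return (C, a_column_loads) for C = A * B on n x n lists of rows."""
    c = [[0.0] * n for _ in range(n)]
    a_column_loads = 0
    # Walk the columns of C in groups of `unroll` (the unrolled J loop).
    for j0 in range(0, n, unroll):
        width = min(unroll, n - j0)
        # One accumulator per column of C in the group; each holds n
        # elements and plays the role of one vector register.
        acc = [[0.0] * n for _ in range(width)]
        for k in range(n):  # step through the rows of B
            # "Load" column k of A once ...
            a_col = [a[i][k] for i in range(n)]
            a_column_loads += 1
            # ... and reuse it for every column of B in the group.
            for jj in range(width):
                s = b[k][j0 + jj]  # scalar operand from row k of B
                reg = acc[jj]
                for i in range(n):
                    reg[i] += a_col[i] * s
        # After the final row of B, store the partial results into C.
        for jj in range(width):
            for i in range(n):
                c[i][j0 + jj] = acc[jj][i]
    return c, a_column_loads
```

In this sketch, calling matmul_register_reuse(a, b, 28) fetches a column
of A 56 times (28 rows of B for each of two groups of 14 columns), while
passing unroll=1 fetches a column 784 times for the same result. The
arithmetic performed is identical in both cases; only the number of loads
changes.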