Technical data

Optimizing with MACRO-32

3.10 REGISTER REUSE

The concept used in Example 3–9, reusing data that has already been
loaded into a vector register, is known as register reuse. Register
reuse can be extended further by using all available vector registers to
decrease the bytes/FLOP ratio and improve performance. With maximum
register reuse, the vector processor can approach a peak single-precision
performance of 90 MFLOPs and a peak double-precision performance of
45 MFLOPs.

To implement register reuse for matrix multiply, the J loop must be
unrolled. By precomputing 14 partial results, using only the first column
of A with 14 different columns of B, it is possible to use 14 vector
registers (instead of 14 memory locations) to hold the partial results.
Thus, all N rows of B can be accessed in groups of 14 columns to compute
the first 14 columns of C. When the final row of B is reached, the
results are chained into a store into array C. Then the next set of 14
columns of C is calculated. The unrolling depth of 14 is dictated by the
number of vector registers: 14 of the 16 hold partial results.
Example 3–10 shows the MACRO pseudocode to accomplish this for values of
N <= 64. Although the code is longer, performance is greatly improved by
the segments of code that are purely vector arithmetic. The bytes/FLOP
ratio has dropped to better than 4 to 14, allowing the algorithm to
approach peak vector speeds. When implemented in matrix solvers, speedups
greater than 25 have been realized on a VAX 6000 Model 410 vector
processor computer system.
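
The loop structure described above can be sketched in a high-level
language. The following Python fragment is an illustration only, not the
MACRO code of Example 3–10. The function name matmul_register_reuse, the
depth parameter, and the row-major list-of-lists layout are assumptions
made for this sketch. Each list in acc stands in for one vector register
holding a partial column of C, and a_col stands in for the register that
holds the current column of A.

```python
def matmul_register_reuse(a, b, n, depth=14):
    """Return C = A * B for n x n matrices given as lists of rows.

    Hypothetical sketch: the J loop is unrolled in groups of `depth`
    columns, and each group keeps `depth` accumulators live for the
    whole K loop, as vector registers would be.
    """
    c = [[0.0] * n for _ in range(n)]
    for j0 in range(0, n, depth):
        width = min(depth, n - j0)  # last group may be narrower
        # One accumulator per column of C in this group.
        acc = [[0.0] * n for _ in range(width)]
        for k in range(n):
            # Load column k of A once ...
            a_col = [a[i][k] for i in range(n)]
            # ... and reuse it against `width` scalars from row k of B.
            for jj in range(width):
                s = b[k][j0 + jj]
                reg = acc[jj]
                for i in range(n):
                    reg[i] += a_col[i] * s
        # Final row of B processed: store the finished columns into C.
        for jj in range(width):
            for i in range(n):
                c[i][j0 + jj] = acc[jj][i]
    return c
```

In this sketch each column of A is fetched once per group of columns of
C rather than once per column, which is the source of the reduced memory
traffic. The partial results are written to C only after the K loop
completes.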
