Technical data

Optimizing with MACRO-32
most efficient in that it runs sequentially through the data and makes full
use of all PTEs fetched.
An example of how to avoid large vector strides can be seen in a simple
matrix multiplication problem:
    DO I = 1, N
      DO J = 1, N
        DO K = 1, N
          C(I,J) = C(I,J) + A(I,K)*B(K,J)
        ENDDO
      ENDDO
    ENDDO
If coded as written, there is a choice of which variable to vectorize on.
If the K variable is chosen, the references to array A run along a
FORTRAN row. Because FORTRAN stores arrays by column, this is a nonunity
stride access. This choice also means that each element of C requires a
reduction operation to sum the products of A and B over K.
Although reduction functions vectorize, they are less efficient than other
methods.
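The cost of vectorizing on K can be made concrete with a short FORTRAN sketch. The routine name MATMUL_K_INNER and the scalar S are hypothetical and appear only for illustration. The sketch assumes single-precision N by N arrays stored by column, as FORTRAN does.

```fortran
! Illustrative sketch only; MATMUL_K_INNER and S are hypothetical names.
SUBROUTINE MATMUL_K_INNER(N, A, B, C)
  IMPLICIT NONE
  INTEGER, INTENT(IN) :: N
  REAL, INTENT(IN)    :: A(N,N), B(N,N)
  REAL, INTENT(INOUT) :: C(N,N)
  INTEGER :: I, J, K
  REAL :: S
  DO I = 1, N
    DO J = 1, N
      S = 0.0
      DO K = 1, N
        ! A(I,K) and A(I,K+1) are N elements apart in memory (stride N).
        ! B(K,J) and B(K+1,J) are adjacent in memory (stride 1).
        S = S + A(I,K)*B(K,J)   ! summing over K is the reduction
      ENDDO
      C(I,J) = C(I,J) + S
    ENDDO
  ENDDO
END SUBROUTINE MATMUL_K_INNER
```

With K innermost, every element of C is a dot product, so a vectorized version of this loop must finish each pass with a sum across the vector elements.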
A better choice is to vectorize on either I or J. J is not the best candidate
because it involves nonunity stride for both the B and the C arrays.
For large values of N, this is an inefficient use of the bus bandwidth,
the translation buffer, and the cache. The best solution is to vectorize
on the I variable, which gives unity stride for both the A and the C
arrays and treats B(K,J) as a scalar.
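Vectorizing on I corresponds to making I the innermost loop. The following FORTRAN sketch shows one such ordering (J, K, I). The routine name MATMUL_JKI is hypothetical, and the sketch assumes single-precision N by N arrays. It is meant to show the access pattern, not tuned code.

```fortran
! Illustrative sketch only; MATMUL_JKI is a hypothetical name.
SUBROUTINE MATMUL_JKI(N, A, B, C)
  IMPLICIT NONE
  INTEGER, INTENT(IN) :: N
  REAL, INTENT(IN)    :: A(N,N), B(N,N)
  REAL, INTENT(INOUT) :: C(N,N)
  INTEGER :: I, J, K
  DO J = 1, N
    DO K = 1, N
      ! B(K,J) does not depend on I, so it acts as a scalar here.
      ! Column K of A and column J of C are both walked with stride 1.
      DO I = 1, N
        C(I,J) = C(I,J) + A(I,K)*B(K,J)
      ENDDO
    ENDDO
  ENDDO
END SUBROUTINE MATMUL_JKI
```

The inner loop is a vector multiply by a scalar followed by a vector add, with no reduction and no nonunity stride.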
Example 3–8 shows a first attempt to code the matrix multiplication in
MACRO pseudocode for vectors. Although this example uses unity stride,
it is far from optimal. Notice that it is not necessary to load and store
C for different values of K because C is dependent only on the I and J
variables. By removing the load and store of C from the inner loop, the
bytes/FLOP ratio (loads and stores : arithmetic operations) drops from
12:2 to 4:2. Example 3–9 shows an improved version.
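The effect of moving C out of the inner loop can be sketched in FORTRAN. This is not a reproduction of Example 3–9. The routine name MATMUL_HOIST_C and the temporary array T are hypothetical, and T only stands in for a vector register; real vector code would process I in strips of the vector register length. One reading consistent with the quoted figures, assuming 4-byte elements: before the change, each inner iteration loads C, loads A, and stores C (12 bytes) for one multiply and one add (2 FLOPs); afterward only A is loaded (4 bytes) for the same 2 FLOPs.

```fortran
! Illustrative sketch only; MATMUL_HOIST_C and T are hypothetical names.
SUBROUTINE MATMUL_HOIST_C(N, A, B, C)
  IMPLICIT NONE
  INTEGER, INTENT(IN) :: N
  REAL, INTENT(IN)    :: A(N,N), B(N,N)
  REAL, INTENT(INOUT) :: C(N,N)
  REAL :: T(N)              ! stands in for a vector register
  INTEGER :: I, J, K
  DO J = 1, N
    DO I = 1, N
      T(I) = C(I,J)         ! load column J of C once per J
    ENDDO
    DO K = 1, N
      DO I = 1, N
        ! Only A is fetched from memory in this loop.
        T(I) = T(I) + A(I,K)*B(K,J)
      ENDDO
    ENDDO
    DO I = 1, N
      C(I,J) = T(I)         ! store column J of C once per J
    ENDDO
  ENDDO
END SUBROUTINE MATMUL_HOIST_C
```

The loads and stores of C now occur once per value of J rather than once per value of K.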