Technical data

Optimizing with MACRO-32

most efficient in that it runs sequentially through the data and makes full use of all PTEs fetched.

An example of how to avoid large vector strides can be seen in a simple matrix multiplication problem:

DO I = 1, N
  DO J = 1, N
    DO K = 1, N
      C(I,J) = C(I,J) + A(I,K)*B(K,J)
    ENDDO
  ENDDO
ENDDO

If coded as written, there is a choice of which variable to vectorize on. If the K variable is chosen, the references to array A run along a FORTRAN row, which is a nonunity stride. This choice also means that, for every element of C, a reduction operation is required to sum the products of A and B over K. Although reduction functions vectorize, they are less efficient than other methods.

A better choice is to vectorize on either I or J. J is not the best candidate because it involves nonunity stride for both the B and the C arrays. For large values of N, this is an inefficient use of the bus bandwidth, the translation buffer, and the cache. The optimal solution is to vectorize on the I variable, which gives unity stride for both A and C and leaves B(K,J) as a scalar within each vector operation.
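The stride argument above can be checked with a short sketch. The Python below is illustrative only and is not part of the MACRO examples. It assumes FORTRAN column-major storage of N x N arrays, and the names `stride`, `strides_for`, and `REFS` are hypothetical helpers introduced here.

```python
# Sketch: element strides seen by each array reference in
#     C(I,J) = C(I,J) + A(I,K)*B(K,J)
# when one loop index is chosen as the vector index.
# Assumption: FORTRAN column-major storage, so element (r, c) of an
# N x N array sits at offset (r - 1) + (c - 1) * N, counted in elements.

# Subscripts of each array reference: (row index name, column index name).
REFS = {"A": ("I", "K"), "B": ("K", "J"), "C": ("I", "J")}


def stride(subscripts, vector_index, n):
    """Elements between consecutive accesses as vector_index advances by 1.

    Returns 0 when the reference does not depend on vector_index, that is,
    when it behaves as a scalar for the duration of the vector operation.
    """
    row, col = subscripts
    row_step = 1 if row == vector_index else 0
    col_step = n if col == vector_index else 0
    return row_step + col_step


def strides_for(vector_index, n):
    """Stride of A, B, and C when vectorizing on the given index."""
    return {name: stride(subs, vector_index, n) for name, subs in REFS.items()}


if __name__ == "__main__":
    for index in "IJK":
        print(index, strides_for(index, 100))
```

Under these assumptions, vectorizing on K gives stride N for A and a stride of 0 for C (hence the reduction), vectorizing on J gives stride N for both B and C, and vectorizing on I gives stride 1 for A and C with B as a scalar.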

Example 3–8 shows a first attempt to code the matrix multiplication in MACRO pseudocode for vectors. Although this example uses unity stride, it is far from optimal. Notice that it is not necessary to load and store C for different values of K because C is dependent only on the I and J variables. By removing the load and store of C from the inner loop, the bytes/FLOP ratio (loads and stores : arithmetic operations) drops from 12:2 to 4:2. Example 3–9 shows an improved version.
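The effect of moving C out of the inner loop can be sketched by counting inner-loop traffic directly. The Python below is a rough model, not the MACRO code of Examples 3–8 and 3–9. It assumes 4-byte elements, treats B(K,J) as a scalar operand that is not counted as vector memory traffic, and counts only loads and stores issued inside the K loop. The function names and `ELEM_BYTES` are hypothetical.

```python
# Sketch: inner-loop bytes moved versus FLOPs for two I-vectorized versions
# of C(I,J) = C(I,J) + A(I,K)*B(K,J).
# Assumptions (not stated in the text): 4-byte elements; B(K,J) is a scalar
# operand and is not counted; only traffic inside the K loop is counted.

ELEM_BYTES = 4  # assumed element size in bytes


def matmul_c_in_inner_loop(a, b, c, n):
    """Load and store the column of C on every K iteration."""
    inner_bytes = flops = 0
    for j in range(n):
        for k in range(n):
            col_a = [a[i][k] for i in range(n)]      # vector load of A(:,K)
            col_c = [c[i][j] for i in range(n)]      # vector load of C(:,J)
            scalar_b = b[k][j]
            col_c = [cc + aa * scalar_b for aa, cc in zip(col_a, col_c)]
            for i in range(n):                       # vector store of C(:,J)
                c[i][j] = col_c[i]
            inner_bytes += 3 * n * ELEM_BYTES        # load A, load C, store C
            flops += 2 * n                           # one multiply, one add
    return inner_bytes, flops


def matmul_c_hoisted(a, b, c, n):
    """Keep the column of C in a register-like accumulator across K."""
    inner_bytes = flops = 0
    for j in range(n):
        acc = [c[i][j] for i in range(n)]            # load C once per J
        for k in range(n):
            col_a = [a[i][k] for i in range(n)]      # vector load of A(:,K)
            scalar_b = b[k][j]
            acc = [cc + aa * scalar_b for aa, cc in zip(col_a, acc)]
            inner_bytes += n * ELEM_BYTES            # load A only
            flops += 2 * n
        for i in range(n):                           # store C once per J
            c[i][j] = acc[i]
    return inner_bytes, flops
```

With these counting rules the first version moves 12 bytes for every 2 FLOPs in the inner loop and the second moves 4 bytes for every 2 FLOPs, while both produce the same C. The load and store of C that remain outside the K loop are deliberately left out of the count.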
