
Optimizing with MACRO-32

3.1.3 Algorithms

At times it is necessary to consider the algorithm that the code to be
optimized implements. Some algorithms are less well suited to
vectorization than others, and it may be more effective to change the
algorithm, or the way it is implemented, than to optimize the existing
code. Two effective methods to consider when optimizing an algorithm for
vectorization are increasing the work performed in each loop iteration
and increasing the ratio of arithmetic instructions to load/store
instructions. Using unity stride rather than nonunity stride, and using
longer vector lengths, are other approaches to consider.
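
As a rough illustration of raising the ratio of arithmetic to load/store
operations, the sketch below computes d = a * b + c in two ways: as two
separate loops that pass a temporary array through memory, and as one
fused loop. It is written in Python only to keep it language neutral; it
is not MACRO-32. The function names and the operation counts are
hypothetical, and the counts assume a simple model in which every array
element read or written costs one load or store.

```python
def two_pass(a, b, c):
    """Compute d = a * b + c with a temporary array and two loops."""
    n = len(a)
    arith_ops = 0
    mem_ops = 0                      # loads plus stores (assumed model)
    t = [0.0] * n
    for i in range(n):               # loop 1: t = a * b
        t[i] = a[i] * b[i]
        arith_ops += 1
        mem_ops += 3                 # load a[i], load b[i], store t[i]
    d = [0.0] * n
    for i in range(n):               # loop 2: d = t + c
        d[i] = t[i] + c[i]
        arith_ops += 1
        mem_ops += 3                 # load t[i], load c[i], store d[i]
    return d, arith_ops, mem_ops


def fused(a, b, c):
    """Compute d = a * b + c in one loop; the product never goes to memory."""
    n = len(a)
    arith_ops = 0
    mem_ops = 0
    d = [0.0] * n
    for i in range(n):
        d[i] = a[i] * b[i] + c[i]
        arith_ops += 2               # one multiply, one add
        mem_ops += 4                 # load a[i], b[i], c[i]; store d[i]
    return d, arith_ops, mem_ops
```

Both versions perform the same arithmetic, but under this model the
fused loop does more work per iteration and needs fewer loads and stores
for each arithmetic operation.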

3.2 CROSSOVER POINT

For any given instruction or sequence of instructions, there is a
particular vector length at which scalar and vector processing of
equivalent operations yield the same performance. This vector length is
referred to as the crossover point between scalar and vector processing
for the given instruction or instruction sequence, and it varies
depending on the particular instruction or sequence. For vector lengths
below the crossover point, scalar operations are faster; above the
crossover point, vector operations are more efficient. A low crossover
point is considered a benefit, since it indicates that it is relatively
easy to take advantage of the power of the vector processor.
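
One way to picture the crossover point is with a simple linear cost
model. The sketch below is an assumption made for illustration, not a
description of actual VAX 6000 timing: scalar processing is taken to
cost a fixed time per element, and vector processing a fixed startup
time plus a smaller time per element. All names and numbers are
hypothetical and the time units are arbitrary.

```python
def scalar_time(n, t_scalar):
    """Assumed cost of processing n elements one at a time."""
    return n * t_scalar


def vector_time(n, startup, t_vector):
    """Assumed cost of one vector operation of length n."""
    return startup + n * t_vector


def crossover_length(startup, t_scalar, t_vector):
    """Vector length at which both assumed costs are equal.

    Solves n * t_scalar == startup + n * t_vector for n. Returns None
    when vector processing is never faster per element under the model.
    """
    if t_vector >= t_scalar:
        return None
    return startup / (t_scalar - t_vector)
```

With the made-up values startup = 20, t_scalar = 6, and t_vector = 1,
crossover_length returns 4.0: under this model scalar processing is
cheaper for 3 elements and vector processing is cheaper for 5. A smaller
startup cost lowers the crossover point.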

For any single, isolated vector instruction, the crossover point on the
VAX 6000 is quite low, generally about 3 elements. But an instruction is
not performed in isolation. Taken in the context of a routine or
application, other factors affect the performance of operations on short
vectors, in particular whether the data of the short vector is also used
in other vector operations. In general, on the VAX 6000, vectorizing as
much code as possible, including sections with short vector lengths,
leads to higher performance through better use of the cache.
Specifically, once a set of data has been operated on by vector
instructions, that data will be in the vector cache. A subsequent scalar
operation on any of that same data requires that the data be moved out
of the vector cache into the scalar cache. A vector operation does not
require this data movement and thus is usually more efficient. Overall,
the crossover point on the VAX 6000 is low enough that scalar processing
is the faster alternative only for isolated operations on short vectors.
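
The effect of data movement between the caches can be added to a simple
cost comparison. The sketch below is a hypothetical model, not measured
VAX 6000 behavior: it assumes a fixed per-element cost for moving data
from the vector cache to the scalar cache, charged only when the data
was last touched by vector instructions. The function name, parameters,
and numbers are all invented for illustration.

```python
def choose_mode(n, startup, t_scalar, t_vector, move_cost,
                data_in_vector_cache):
    """Return "scalar" or "vector", whichever is cheaper under the model."""
    scalar = n * t_scalar
    if data_in_vector_cache:
        # Assumed penalty: each element must first be moved from the
        # vector cache to the scalar cache before scalar code can use it.
        scalar += n * move_cost
    vector = startup + n * t_vector
    return "vector" if vector < scalar else "scalar"
```

For example, with n = 2, startup = 20, t_scalar = 6, t_vector = 1, and
move_cost = 8, the model picks scalar processing for an isolated
operation but vector processing when the data is already in the vector
cache, which matches the reasoning given above for vectorizing short
sections.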
