Optimizing with MACRO-32
3.1.3 Algorithms
At times it is necessary to consider the algorithm that the code to be
optimized implements. Some algorithms are less well suited to
vectorization than others, and changing the algorithm or the way it is
implemented may be more effective than trying to optimize the existing
code. Two effective methods to consider when optimizing an algorithm for
vectorization are increasing the work performed in each loop iteration
and increasing the ratio of arithmetic to load/store instructions. Other
approaches to consider are using unity stride rather than nonunity
stride and using longer vector lengths.
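
As an illustration of raising the ratio of arithmetic to load/store
operations, the following sketch computes d = a*b + c two ways and counts
the operations each one performs. It is written in Python for readability
rather than in MACRO-32, and the function names and the counting scheme
(one load per operand read, one store per result written) are assumptions
made for this example, not measurements of any VAX instruction sequence.

```python
def separate_loops(a, b, c):
    """Two passes: t = a*b, then d = t + c.

    Returns the result and the ratio of arithmetic operations to
    load/store operations under the simple counting scheme described
    in the text (hypothetical, for illustration only).
    """
    loads = stores = arith = 0
    t = []
    for x, y in zip(a, b):
        loads += 2      # read a[i] and b[i]
        arith += 1      # one multiply
        stores += 1     # write the temporary t[i]
        t.append(x * y)
    d = []
    for x, y in zip(t, c):
        loads += 2      # read t[i] back, plus c[i]
        arith += 1      # one add
        stores += 1     # write d[i]
        d.append(x + y)
    return d, arith / (loads + stores)


def fused_loop(a, b, c):
    """One pass: d = a*b + c, with no temporary written to memory."""
    loads = stores = arith = 0
    d = []
    for x, y, z in zip(a, b, c):
        loads += 3      # read a[i], b[i], and c[i]
        arith += 2      # one multiply and one add
        stores += 1     # write d[i]
        d.append(x * y + z)
    return d, arith / (loads + stores)
```

Under this counting scheme the fused loop performs the same arithmetic with
four memory operations per element instead of six, so each iteration does
more work and the arithmetic to load/store ratio rises from 1/3 to 1/2.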
3.2 CROSSOVER POINT
For any given instruction or sequence of instructions, there is a particular
vector length at which scalar and vector processing of equivalent
operations yield the same performance. This vector length is referred to
as the crossover point between scalar and vector processing for the given
instruction or instruction sequence; it varies from one instruction or
sequence to another. For vector lengths below the crossover point,
scalar operations are faster; above the crossover point, vector operations
are faster. A low crossover point is considered a benefit, since it
indicates that it is relatively easy to take advantage of the power of the
vector processor.
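
The idea of a crossover point can be sketched with a simple linear cost
model, in which scalar processing costs a fixed amount per element and
vector processing costs a fixed startup overhead plus a smaller amount per
element. The model itself and every number in it are assumptions chosen
for illustration; they are not VAX 6000 timings.

```python
def scalar_cycles(n, per_element=5.0):
    """Hypothetical cost of processing n elements with scalar code."""
    return per_element * n


def vector_cycles(n, startup=20.0, per_element=1.0):
    """Hypothetical cost of processing n elements with vector code:
    a fixed startup overhead plus a per-element cost."""
    return startup + per_element * n


def crossover_length(scalar_per_element=5.0, startup=20.0,
                     vector_per_element=1.0):
    """Vector length at which both costs are equal in this model.

    Solves scalar_per_element * n == startup + vector_per_element * n.
    """
    if scalar_per_element <= vector_per_element:
        raise ValueError("vector code never catches up in this model")
    return startup / (scalar_per_element - vector_per_element)
```

With these made-up parameters the two costs are equal at a length of 5
elements; shorter vectors favor scalar code and longer vectors favor vector
code. Lowering the startup overhead lowers the crossover point in this
model.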
For any single, isolated vector instruction, the crossover point on the VAX
6000 is quite low, generally about 3 elements. But an instruction is not
performed in isolation. In the context of a routine or application,
other factors affect the performance of operations on short vectors,
in particular whether the data in the short vector is also used in other
vector operations. In general, on the VAX 6000, vectorizing as much
code as possible, including sections with short vector lengths, leads to
higher performance through better use of cache. Specifically, once a set
of data has been operated on by vector instructions, that data is in
the vector cache. A subsequent scalar operation on any of that data
requires the data to be moved out of the vector cache into the scalar
cache. A vector operation requires no such data movement and thus
is usually more efficient. Overall, the crossover point on the VAX 6000
is low enough that scalar processing is the faster alternative only for
isolated operations on short vectors.
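
The effect of cache residency on this choice can be sketched by adding a
data movement penalty to a linear cost model. All names and numbers here
are hypothetical: the model charges scalar code an extra per-element cost
when its data is already in the vector cache, which is a simplification of
the behavior described above and not a measurement of the VAX 6000.

```python
def scalar_cycles(n, per_element=5.0, move_per_element=0.0):
    """Hypothetical scalar cost for n elements, including any cost of
    moving each element out of the vector cache first."""
    return (per_element + move_per_element) * n


def vector_cycles(n, startup=20.0, per_element=1.0):
    """Hypothetical vector cost: startup overhead plus per-element cost.
    No movement penalty, since the data is used where it already is."""
    return startup + per_element * n


def faster_choice(n, data_in_vector_cache, move_per_element=3.0):
    """Return "scalar" or "vector" for an n-element operation.

    The movement penalty applies only when earlier vector instructions
    have already operated on the data (an assumption of this model).
    """
    move = move_per_element if data_in_vector_cache else 0.0
    if scalar_cycles(n, move_per_element=move) < vector_cycles(n):
        return "scalar"
    return "vector"
```

In this model a 3-element operation is cheaper as scalar code when it is
isolated, but cheaper as vector code when the same data has just been used
by vector instructions, because the movement penalty outweighs the vector
startup overhead.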