Technical data
Algorithm Optimization Examples
by 1000 obtained a vector speedup of approximately 25, as shown in
Figure A–1.
A.2 SIGNAL PROCESSING—FAST FOURIER TRANSFORMS
The Fourier transform decomposes a waveform, or more generally, a
collection of data, into component sine and cosine representation. The
discrete Fourier transform (DFT) of a data set of length N performs the
transformation following the strict mathematical definition which requires
O(N**2) floating-point operations. The fast Fourier transform (FFT),
developed by Cooley and Tukey in 1965, reduced the number of operations
to O(N x LOG[N]), improving computational speed significantly.
Figure A–2 shows that the complex data in the bottom butterfly is
multiplied in each stage by the appropriate weight. The result is then
added to the top butterfly and subtracted from the bottom butterfly. If the
algorithm is left in this configuration, it must use nonunity stride vectors,
very short vectors, or masked arithmetic operations to perform the very
small butterflies.
A.2.1 Optimized One-Dimensional Fast Fourier Transforms
The bit-reversal process that permutes the data to a form that enables the
Cooley-Tukey algorithm to work is also shown in Figure A–2. When using
vectors, a common approach to performing the bit-reversal reordering
is to use vector gather or scatter instructions. These instructions allow
vector loads and stores to be performed using an index register. Vector
loads and stores require a constant stride. However, vector gather and
scatter operations allow the user to build a vector of offsets to support
indirect addressing in vector mode. Both gather and scatter instructions
are available with VAX vectors.
A vector implementation of the FFT algorithm has been developed that is
well suited for the VAX vector architecture. One optimization made to the
algorithm involves moving the bit-reversal section of the code to a place
where the data permutation will benefit vector processing. By doing so,
two goals are accomplished. First, the slower vector gather operations are
moved to the center of the algorithm such that the data will already be in
the vector cache. In Figure A–2, the first FFT stage starts out with large
butterfly distances. After each stage the butterfly distance is halved. For
the optimized version shown in Figure A–3, the bit-reversal permutation
A–7