Technical data
Vector Processing Concepts
Because most arithmetic and memory operations can be broken down into
a series of one-cycle steps, the function units of a vector processor are
generally pipelined. Thus, after the initial pipeline latency, the function
units produce one vector element result per cycle, so an entire vector is
processed in a number of cycles equal to the length of the input vector.
The total time interval (known as a chime) is approximately equal, in
cycles, to the length of the vector plus the pipeline latency.
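The chime arithmetic can be illustrated with a minimal sketch. The function
name chime_cycles and the 6-cycle latency are assumptions chosen for
illustration only; they do not describe any particular machine.

```python
def chime_cycles(vector_length, pipeline_latency):
    """Approximate cycles for one vector instruction.

    Model (an assumption for illustration): the pipeline latency is paid
    once, then one element result is produced per cycle.
    """
    if vector_length < 0 or pipeline_latency < 0:
        raise ValueError("vector_length and pipeline_latency must be non-negative")
    return pipeline_latency + vector_length


# Hypothetical example: 64-element vector, assumed 6-cycle latency.
print(chime_cycles(64, 6))  # prints 70
```

For long vectors the latency term becomes a small fraction of the total,
which is why the chime is often approximated by the vector length alone.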
A vector instruction operates on an array of data, so the pipelined
execution of vector instructions allows the overlap of multiple iterations
of the same vector instruction operating on different data items. The
pipeline length equals its number of segments. The maximum number
of data elements operated on at any one time equals the pipeline length.
Pipelining accommodates the variable array lengths found in vector
instructions.
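The relationship between pipeline length and the number of elements in
flight can be sketched with a small simulation. It assumes a simplified
model in which one new element enters the pipeline each cycle and each
element spends exactly one cycle in each segment; the function name
occupancy_per_cycle is hypothetical.

```python
def occupancy_per_cycle(vector_length, segments):
    """Return how many vector elements are in the pipeline during each cycle.

    Simplified model: element i enters at cycle i and occupies
    segment (cycle - i) until it leaves after the last segment.
    """
    total_cycles = vector_length + segments - 1
    counts = []
    for cycle in range(total_cycles):
        # Oldest element still inside the pipeline during this cycle.
        first = max(0, cycle - segments + 1)
        # Newest element that has already entered the pipeline.
        last = min(cycle, vector_length - 1)
        counts.append(last - first + 1)
    return counts


# A 10-element vector in a 4-segment pipeline never has more than
# 4 elements in flight; a 2-element vector never has more than 2.
print(max(occupancy_per_cycle(10, 4)))
print(max(occupancy_per_cycle(2, 4)))
```

In this model the peak occupancy is the smaller of the pipeline length and
the vector length, so the same pipeline handles short and long vectors
without any change.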
Instruction pipelining can be enhanced by providing multiple parallel
pipelines, which operate on different vector elements, within a function
unit. As an example, assume a vector has 64 elements. If the vector
processor has a function unit with four pipelines, the following processing
can be executed in parallel:
Pipe 0 operates on elements 0, 4, 8, ... , 60
Pipe 1 operates on elements 1, 5, 9, ... , 61
Pipe 2 operates on elements 2, 6, 10, ... , 62
Pipe 3 operates on elements 3, 7, 11, ... , 63
This arrangement gives four results per cycle instead of one, so execution
is much faster than with a single pipeline. After the pipeline latency,
the 64 elements can be processed in 16 cycles rather than in 64.
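The interleaved assignment of elements to pipes described above can be
sketched as follows. The function names are hypothetical, and the cycle
count ignores the pipeline latency, as in the example.

```python
def elements_for_pipe(pipe, vector_length, num_pipes):
    """Element indices handled by one pipe under interleaved assignment."""
    return list(range(pipe, vector_length, num_pipes))


def cycles_after_latency(vector_length, num_pipes):
    """Cycles needed once the pipes are full: ceiling of length / pipes."""
    return (vector_length + num_pipes - 1) // num_pipes


# 64-element vector on a function unit with four pipes.
for pipe in range(4):
    indices = elements_for_pipe(pipe, 64, 4)
    print(f"Pipe {pipe}: {indices[:3]} ... {indices[-1]}")

print(cycles_after_latency(64, 4))  # prints 16
print(cycles_after_latency(64, 1))  # prints 64
```

The ceiling division covers vector lengths that are not a multiple of the
number of pipes; in that case some pipes are idle during the final cycle.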