Technical data

Vector Processing Concepts

Because most arithmetic and memory operations can be broken down into
a series of one-cycle steps, the function units of a vector processor are
generally pipelined. Thus, after the initial pipeline latency, a function
unit can process an entire vector in a number of cycles equal to the length
of the input vector, producing one vector element result per cycle. The
total time interval (known as a chime) is approximately equal, in cycles,
to the length of the vector plus the pipeline latency.
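The chime estimate described above can be written as a one-line calculation. The following is a minimal sketch, not a model of any particular machine: the function name `chime_cycles` and the latency value of 6 cycles are assumptions chosen only for illustration.

```python
def chime_cycles(vector_length: int, pipeline_latency: int) -> int:
    """Approximate cycles for one vector instruction (one chime).

    Assumes the function unit delivers one element result per cycle
    once the initial pipeline latency has elapsed.
    """
    if vector_length < 0 or pipeline_latency < 0:
        raise ValueError("vector_length and pipeline_latency must be non-negative")
    return pipeline_latency + vector_length


# Hypothetical example: a 64-element vector on a unit with a 6-cycle latency.
example_chime = chime_cycles(64, 6)
```

For long vectors the latency term becomes a small fraction of the total, which is why the chime is often treated as roughly the vector length.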

A vector instruction operates on an array of data, so the pipelined
execution of vector instructions allows the overlap of multiple iterations
of the same vector instruction operating on different data items. The
pipeline length equals its number of segments. The maximum number
of data elements operated on at any one time equals the pipeline length.
Pipelining accommodates the variable array lengths found in vector
instructions.
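The relationship between pipeline length and the number of elements in flight can be checked with a small simulation. This sketch assumes a simplified model in which one new element enters the pipeline each cycle and each element spends exactly one cycle in each segment; the function name `in_flight_per_cycle` is hypothetical.

```python
def in_flight_per_cycle(vector_length: int, segments: int) -> list[int]:
    """Return how many elements occupy the pipeline during each cycle.

    Simplified model (an assumption): element i enters segment 0 at
    cycle i and advances one segment per cycle until it leaves.
    """
    if vector_length <= 0 or segments <= 0:
        return []
    total_cycles = vector_length + segments - 1
    counts = []
    for cycle in range(total_cycles):
        oldest = max(0, cycle - segments + 1)    # oldest element still inside
        newest = min(cycle, vector_length - 1)   # newest element admitted
        counts.append(newest - oldest + 1)
    return counts


# An 8-element vector in a 3-segment pipeline never has more than
# 3 elements in flight; a 2-element vector never fills the pipeline.
occupancy_long = in_flight_per_cycle(8, 3)
occupancy_short = in_flight_per_cycle(2, 3)
```

In this model the occupancy ramps up, holds at the pipeline length, and drains at the end, and the same code handles any vector length without change.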

Instruction pipelining can be enhanced by providing multiple parallel
pipelines, which operate on different vector elements, within a function
unit. As an example, assume a vector has 64 elements. If the vector
processor has a function unit with four pipelines, the following processing
can be executed in parallel:

Pipe 0 operates on elements 0, 4, 8, ..., 60
Pipe 1 operates on elements 1, 5, 9, ..., 61
Pipe 2 operates on elements 2, 6, 10, ..., 62
Pipe 3 operates on elements 3, 7, 11, ..., 63

This results in faster execution than a single pipeline, giving four
results per cycle instead of one. After the pipeline latency, the
64 elements can be processed in 16 cycles rather than 64.
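The four-pipe example can be reproduced with a short sketch that interleaves the element indices across pipes and counts the cycles needed after the pipeline latency. The function names `assign_elements` and `cycles_after_latency` are hypothetical, and the code assumes each pipe delivers one result per cycle.

```python
def assign_elements(vector_length: int, num_pipes: int) -> list[list[int]]:
    """Interleave element indices across pipes: pipe p gets p, p + n, p + 2n, ..."""
    return [list(range(pipe, vector_length, num_pipes)) for pipe in range(num_pipes)]


def cycles_after_latency(vector_length: int, num_pipes: int) -> int:
    """Cycles to produce all results once the pipes are full.

    Assumes one result per pipe per cycle; rounds up when the vector
    length is not a multiple of the number of pipes.
    """
    return (vector_length + num_pipes - 1) // num_pipes


pipes = assign_elements(64, 4)
for pipe_number, elements in enumerate(pipes):
    print(f"Pipe {pipe_number}: {elements[:3]} ... {elements[-1]}")

print(cycles_after_latency(64, 4), "cycles with four pipes")
print(cycles_after_latency(64, 1), "cycles with one pipe")
```

The round-up in `cycles_after_latency` covers vectors whose length is not a multiple of the pipe count, a case the 64-element example does not exercise.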
