Technical data

Optimizing with MACRO-32

startup latency for the second arithmetic instruction (deferred arithmetic instruction) is a benefit in algorithms that require less than eight bytes/FLOP of load/store bandwidth.

Typical algorithms benefit greatly from the ability to chain an arithmetic operation into a store operation. The vector control unit, along with the ALU, implements this capability. The following sections describe, by instruction type, the flow of instructions through the machine.

3.4.1 Load Instruction

When a load instruction is received by the vector control unit, the destination vector register is checked against outstanding arithmetic instructions. A load instruction cannot begin execution if its destination register is the same as one of the registers used by a preceding arithmetic instruction that is still outstanding; it must wait until the conflict clears. Instruction execution overlap could occur if the load instruction were using a register different from the one the arithmetic instruction uses.
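The register check described above can be modeled in a few lines. This is only an illustrative sketch, not the hardware's actual logic: the function name load_must_wait, the register names, and the representation of each outstanding arithmetic instruction as a set of the registers it uses are all assumptions made for the example.

```python
def load_must_wait(load_dest, outstanding):
    """Return True if the load's destination register is used by any
    outstanding arithmetic instruction.

    load_dest   -- name of the load's destination vector register
    outstanding -- list of sets; each set holds the registers (sources
                   and destination) used by one arithmetic instruction
                   that has not yet completed (hypothetical model)
    """
    return any(load_dest in registers for registers in outstanding)


# One outstanding vector add that reads V0 and V1 and writes V2.
outstanding = [{"V0", "V1", "V2"}]

conflict = load_must_wait("V1", outstanding)      # V1 is in use: load waits
no_conflict = load_must_wait("V3", outstanding)   # V3 is free: load may overlap
print(conflict, no_conflict)
```

In this model, choosing V3 rather than V1 as the load destination is what allows the load to overlap with the arithmetic instruction.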

If there are no register usage conflicts, the instruction is dispatched to the load/store unit. An example of a memory access instruction in assembler notation is as follows:

VLDL base, stride, Vc

where:

VLD    = vector load (load memory data into vector register)
L      = longword (Q would equal quadword)
base   = beginning of first element
stride = number of memory locations (bytes) between the starting address of one element and the starting address of the next element
Vc     = vector register destination result

This instruction means:

Load the vector register (Vc) from memory, starting at the base address (base), incrementing consecutive addresses by the stride in bytes. The load operation writes the data from memory into the destination register. The store operation writes the data from the vector register back to memory.
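The addressing rule for the load can be illustrated with a small software model. This is a sketch under stated assumptions, not a description of the hardware: the function name vldl and the vector_length parameter (standing in for however many elements the instruction transfers) are hypothetical, and elements are read as signed 32-bit little-endian longwords. Element i is read from address base + i * stride.

```python
import struct


def vldl(memory, base, stride, vector_length):
    """Model a strided longword load: element i is read from the
    byte address base + i * stride (stride is given in bytes)."""
    return [
        struct.unpack_from("<i", memory, base + i * stride)[0]
        for i in range(vector_length)
    ]


# Eight longwords (0, 10, 20, ..., 70) stored back to back, 4 bytes each.
memory = struct.pack("<8i", *range(0, 80, 10))

# A stride of 4 bytes walks consecutive longwords.
contiguous = vldl(memory, base=0, stride=4, vector_length=4)

# A stride of 8 bytes starting at byte 4 picks every other longword.
every_other = vldl(memory, base=4, stride=8, vector_length=3)

print(contiguous)
print(every_other)
```

A stride equal to the element size (4 bytes for a longword) gives a contiguous load; a larger stride skips memory between elements.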
