Technical data
Optimizing with MACRO-32
startup latency for the second arithmetic instruction (deferred arithmetic
instruction) is a benefit in algorithms that require less than eight
bytes/FLOP of load/store bandwidth.
Typical algorithms benefit greatly from the ability to chain an arithmetic
operation into a store operation. The vector control unit, along with the
ALU, implements this capability. The following sections describe the flow
of instructions through the machine, by instruction type.
3.4.1 Load Instruction
When a load instruction is received by the vector control unit, the
destination vector register is checked against outstanding arithmetic
instructions. A load instruction cannot begin execution until the
register to which it will write is free. A register conflict may occur
if the destination register of a load instruction is the same as one of
the registers used by a preceding arithmetic instruction. If the load
instruction could otherwise execute in overlap with that arithmetic
instruction, the register conflict can be eliminated simply by changing
the register that the load uses.
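The destination-register check described above can be modeled in a few lines. The following Python sketch is an illustration only, not a description of the hardware logic: the ArithmeticOp record, the load_conflicts function, and the register numbers are hypothetical names chosen for the example. It assumes that a conflict exists whenever the load destination matches any source or destination register of an outstanding arithmetic instruction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArithmeticOp:
    """Hypothetical record of an outstanding arithmetic instruction."""
    name: str
    sources: tuple  # vector register numbers the instruction reads
    dest: int       # vector register number the instruction writes


def load_conflicts(load_dest, outstanding):
    """Return the outstanding instructions that use the load's destination."""
    return [
        op for op in outstanding
        if load_dest in op.sources or load_dest == op.dest
    ]


# One outstanding add that reads registers 0 and 1 and writes register 2.
outstanding = [ArithmeticOp("add", sources=(0, 1), dest=2)]

blocked = load_conflicts(1, outstanding)  # register 1 is still being read
free = load_conflicts(3, outstanding)     # register 3 is not in use
```

In this model a load into register 1 must wait for the add, while the same load retargeted to register 3 has no conflict and could be dispatched at once, which is the register change the text recommends.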
If there are no register usage conflicts, the instruction is dispatched to the
load/store unit. An example of a memory access instruction in assembler
notation is as follows:
    VLDL base, stride, Vc

where:

    VLD    = vector load (load memory data into vector register)
    L      = longword (Q would equal quadword)
    base   = beginning of first element
    stride = number of memory locations (bytes) between the starting
             address of the first element and the next element
    Vc     = vector register destination result
This instruction means:
Load the vector register (Vc) from memory, starting at the base address
(base), incrementing consecutive addresses by the stride in bytes. The
load operation writes the data from memory into the destination register.
The store operation writes the data from the vector register back to
memory.
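The addressing rule for a strided load and store can be sketched as a small Python model. This is an illustration under stated assumptions, not the machine's implementation: the function names load_longwords and store_longwords and the explicit length argument are hypothetical, memory is modeled as a byte array, and a longword is treated as a 32-bit little-endian signed integer. Element i is taken from address base + i * stride.

```python
import struct


def load_longwords(memory, base, stride, length):
    """Model of a strided longword load into a vector register.

    Element i is read from byte address base + i * stride.
    """
    return [
        struct.unpack_from("<i", memory, base + i * stride)[0]
        for i in range(length)
    ]


def store_longwords(memory, base, stride, vc):
    """Model of a strided longword store from a vector register."""
    for i, value in enumerate(vc):
        struct.pack_into("<i", memory, base + i * stride, value)


# Eight longwords (10..17) laid out contiguously, 4 bytes each.
memory = bytearray(struct.pack("<8i", *range(10, 18)))

every_other = load_longwords(memory, 0, 8, 4)  # stride of 8 bytes
store_longwords(memory, 4, 8, [1, 2])          # overwrite elements 1 and 3
first_four = load_longwords(memory, 0, 4, 4)   # stride of 4 bytes
```

With a stride of 4 bytes the longwords are contiguous; a stride of 8 bytes selects every other longword. A quadword variant would differ only in the element size.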