Optimizing with MACRO-32
A cache miss is serviced by a hexword fill. On the XMI, a hexword
transfer is 80 percent efficient since one address is sent to receive four
quadwords of data. An octaword transfer is 67 percent efficient since one
address is sent to receive two quadwords of data. A quadword transfer is
only 50 percent efficient since one address is sent to receive one quadword
of data. For this reason, stores are more efficient with unity stride than
with nonunity stride: a single address can reference a larger piece of
memory, so fewer memory references are required.
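
The efficiency figures above can be reproduced with a short calculation. The following is a minimal sketch that assumes each transfer costs one address cycle plus one cycle per quadword of data; that cycle model is inferred from the percentages quoted in the text, and the function and table names are hypothetical.

```python
# Sketch: bus efficiency of a transfer, modeled as data cycles divided by
# total cycles. Assumption: one address cycle plus one cycle per quadword.

def transfer_efficiency(quadwords: int) -> float:
    """Return the fraction of bus cycles that carry data for one transfer."""
    address_cycles = 1  # one address is sent per transfer
    return quadwords / (quadwords + address_cycles)


# Number of quadwords moved by each transfer size named in the text.
TRANSFER_QUADWORDS = {"quadword": 1, "octaword": 2, "hexword": 4}


def efficiency_table() -> dict:
    """Return the efficiency of each transfer size as a rounded percentage."""
    return {
        name: round(100 * transfer_efficiency(n))
        for name, n in TRANSFER_QUADWORDS.items()
    }


if __name__ == "__main__":
    for name, percent in efficiency_table().items():
        print(f"{name}: {percent} percent")
```

Under this model the single address cycle is amortized over more data as the transfer grows, which is why the hexword transfer comes out ahead.
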
In the case of load instructions, the comparison of unity and nonunity
stride is less straightforward. A nonunity stride cache miss load causes a
full hexword to be read from memory even though the load requires only
a longword or quadword of data. If the additional data is not referenced
by subsequent load instructions, then the nonunity stride load is much
less efficient than a unity stride load. If subsequent loads do reference
the extra data, then nonunity stride load performance improves due
to high cache hit rates for the subsequent loads. For double-precision
data there is little degradation due to nonunity stride in this case. For
single-precision data, unity stride loads will show significantly higher
performance because of the load/store pipeline optimization for single-
precision unity stride loads.
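
The cost of unused fill data can be illustrated with a simple count of the hexwords touched by a strided load stream. This is a rough sketch only: it assumes a cold cache, naturally aligned elements that never straddle a hexword, and a stride counted in elements. It does not model the load/store pipeline, and all names are hypothetical.

```python
# Sketch: how much of the data fetched by hexword fills is actually used
# by a strided sequence of loads. Assumes a cold cache and aligned elements.

HEXWORD_BYTES = 32


def hexword_fills(element_bytes: int, stride_elements: int, count: int,
                  base: int = 0) -> int:
    """Count the distinct hexwords touched by `count` strided loads."""
    touched = set()
    for i in range(count):
        address = base + i * stride_elements * element_bytes
        touched.add(address // HEXWORD_BYTES)
    return len(touched)


def useful_fraction(element_bytes: int, stride_elements: int,
                    count: int) -> float:
    """Return bytes requested by the loads divided by bytes fetched."""
    fetched = hexword_fills(element_bytes, stride_elements, count) * HEXWORD_BYTES
    return (count * element_bytes) / fetched


if __name__ == "__main__":
    # Quadword data: unity stride versus a stride of four elements.
    print(useful_fraction(8, 1, 64))
    print(useful_fraction(8, 4, 64))
```

In this model a stride of four quadwords or more uses only one quadword of every hexword fetched, unless later loads come back for the rest of the data.
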
3.9 STRIDE/TRANSLATION BUFFER MISS
A vector’s stride is the number of memory locations (bytes) between the
starting addresses of consecutive vector elements. A vector with a stride of
1 is contiguous; it has no gaps in memory between vector elements.
Consider the vector arrays A and B in the following DO loop. Vector A
has a stride of 1; vector B has a stride of 2.
      DO 100 I=1,5
        A(I) = B(I*2)
100   CONTINUE
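
The strides quoted for A and B can be checked by listing the element indices the loop touches. The sketch below counts stride in elements, as the loop example does, and uses hypothetical helper names.

```python
# Sketch: element indices referenced by the DO loop, and the stride
# (in elements) between consecutive references to each array.

def element_stride(indices):
    """Return the constant gap between consecutive indices."""
    gaps = {later - earlier for earlier, later in zip(indices, indices[1:])}
    if len(gaps) != 1:
        raise ValueError("indices do not have a constant stride")
    return gaps.pop()


# I runs from 1 to 5, as in the loop: A(I) = B(I*2)
a_indices = [i for i in range(1, 6)]
b_indices = [i * 2 for i in range(1, 6)]

if __name__ == "__main__":
    print("A:", a_indices, "stride", element_stride(a_indices))
    print("B:", b_indices, "stride", element_stride(b_indices))
```
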
When a translation buffer (TB) miss occurs, two page table entries (PTEs),
which together occupy one quadword, are fetched from cache. If this fetch
results in a cache miss, then a hexword (eight PTEs) is loaded into cache
from memory, but only two PTEs are installed in the TB.
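
The arithmetic behind this fill behavior can be sketched as follows. The sketch assumes a PTE is one longword (4 bytes) and a page is 512 bytes; these sizes are consistent with the counts given in the text but are stated here as assumptions, and the names are hypothetical.

```python
# Sketch: how much virtual memory is mapped by one TB fill and by one
# hexword of PTEs. Assumptions: 4-byte PTE, 512-byte page.

LONGWORD_BYTES = 4
QUADWORD_BYTES = 8
HEXWORD_BYTES = 32
PTE_BYTES = LONGWORD_BYTES  # assumption: one PTE occupies one longword
PAGE_BYTES = 512            # assumption: page size in bytes


def ptes_per(fetch_bytes: int) -> int:
    """Return the number of PTEs contained in a fetch of the given size."""
    return fetch_bytes // PTE_BYTES


def bytes_mapped(fetch_bytes: int) -> int:
    """Return the bytes of memory mapped by the PTEs in one fetch."""
    return ptes_per(fetch_bytes) * PAGE_BYTES


# One TB fill installs the PTEs held in a quadword.
tb_fill_span = bytes_mapped(QUADWORD_BYTES)
# One cache fill brings in the PTEs held in a hexword.
cache_fill_span = bytes_mapped(HEXWORD_BYTES)

if __name__ == "__main__":
    print("TB fill maps", tb_fill_span // LONGWORD_BYTES, "longwords")
    print("Hexword of PTEs maps", cache_fill_span // LONGWORD_BYTES, "longwords")
```

Consecutive elements that lie at least `tb_fill_span` bytes apart cannot share a TB fill, and elements at least `cache_fill_span` bytes apart cannot share a hexword of PTEs.
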
This handling of TB misses has a large effect on the performance of
nonunity stride vectors. A stride of two pages (256 longwords or 128
quadwords) or more can result in a TB miss for each data item. A stride
of eight pages (1024 longwords or 512 quadwords) or more can result in a
TB miss that can cause a cache miss for each data item. Unity stride is