Technical data

Optimizing with MACRO-32

A cache miss is serviced by a hexword fill. On the XMI, a hexword
transfer is 80 percent efficient, since one address is sent to receive four
quadwords of data. An octaword transfer is 67 percent efficient, since one
address is sent to receive two quadwords of data. A quadword transfer is
only 50 percent efficient, since one address is sent to receive one quadword
of data. For this reason, stores are more efficient with unity stride than
with nonunity stride: a larger piece of memory can be referenced by a
single address, so fewer memory references are required.
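
The percentages above can be reproduced by counting bus cycles. The
following is a minimal sketch, assuming each transfer costs one address
cycle plus one data cycle per quadword. The function names and this cycle
model are illustrative assumptions, not taken from the XMI specification.

```python
# Quadwords of data returned per transfer size, as stated in the text.
TRANSFER_QUADWORDS = {"quadword": 1, "octaword": 2, "hexword": 4}


def xmi_transfer_efficiency(data_quadwords: int) -> float:
    """Fraction of bus cycles that carry data.

    Assumed model: one address cycle, then one data cycle per quadword.
    """
    if data_quadwords < 1:
        raise ValueError("a transfer must carry at least one quadword")
    address_cycles = 1
    return data_quadwords / (data_quadwords + address_cycles)


def efficiency_table() -> dict:
    """Efficiency of each transfer size, rounded to a whole percent."""
    return {
        name: round(100 * xmi_transfer_efficiency(quadwords))
        for name, quadwords in TRANSFER_QUADWORDS.items()
    }


if __name__ == "__main__":
    for name, percent in efficiency_table().items():
        print(f"{name:9s} {percent} percent")
```

Under this model the single address cycle is amortized over more data
cycles as the transfer grows, which is why the hexword transfer comes out
ahead.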

In the case of load instructions, the comparison of unity and nonunity
stride is less straightforward. A nonunity stride load that misses the
cache causes a full hexword to be read from memory even though the load
requires only a longword or quadword of data. If the additional data is not
referenced by subsequent load instructions, then the nonunity stride load is
much less efficient than a unity stride load. If subsequent loads do
reference the extra data, then nonunity stride load performance improves
due to high cache hit rates for the subsequent loads. For double-precision
data there is little degradation due to nonunity stride in this case. For
single-precision data, unity stride loads will show significantly higher
performance because of the load/store pipeline optimization for
single-precision unity stride loads.
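
To make the cost of the unused fill data concrete, the sketch below counts
how many hexword fills a single vector load would need and what fraction of
the fetched bytes that load actually uses. It is only a rough model: it
assumes a cold cache, a hexword-aligned base address by default, and no
reuse by later loads. The function name and parameters are hypothetical.

```python
HEXWORD_BYTES = 32  # a hexword is four quadwords


def fetched_data_used(n_elements, element_bytes, stride_elements, base=0):
    """Return (hexword_fills, fraction_of_fetched_bytes_used) for one load.

    Assumed model: every distinct hexword touched costs exactly one fill,
    and nothing fetched is reused by a later instruction.
    """
    if n_elements < 1 or element_bytes < 1 or stride_elements < 1:
        raise ValueError("counts, sizes, and strides must be positive")
    stride_bytes = stride_elements * element_bytes
    hexwords = set()
    for i in range(n_elements):
        first_byte = base + i * stride_bytes
        last_byte = first_byte + element_bytes - 1
        # An element may straddle two hexwords if the base is unaligned.
        hexwords.update(
            range(first_byte // HEXWORD_BYTES, last_byte // HEXWORD_BYTES + 1)
        )
    fills = len(hexwords)
    used = (n_elements * element_bytes) / (fills * HEXWORD_BYTES)
    return fills, used


if __name__ == "__main__":
    # 64 quadword elements at strides of 1, 2, and 4 elements.
    for stride in (1, 2, 4):
        print(stride, fetched_data_used(64, 8, stride))
```

In this model a quadword load with a stride of four elements uses only a
quarter of each hexword it fetches, which corresponds to the "much less
efficient" case described above when no later load picks up the remainder.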

3.9 STRIDE/TRANSLATION BUFFER MISS

A vector’s stride is the number of memory locations (bytes) between the
starting addresses of consecutive vector elements. Counted in elements, a
vector with a stride of 1 is contiguous; it has no gaps in memory between
vector elements.

Consider the vector arrays A and B in the following DO loop. Vector A
has a stride of 1; vector B has a stride of 2.

      DO 100 I=1,5
        A(I) = B(I*2)
  100 CONTINUE

When a translation buffer (TB) miss occurs, two PTEs (one quadword) are
fetched from cache. If this fetch results in a cache miss, then a hexword
(eight PTEs) is loaded into cache from memory, but only two PTEs are
installed in the TB.
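
The span of virtual memory covered by each kind of PTE fetch follows from
the sizes involved. The sketch below works through that arithmetic,
assuming the VAX page size of 512 bytes and a longword (4-byte) PTE; the
helper names are hypothetical and chosen only for this illustration.

```python
PAGE_BYTES = 512       # VAX page size
PTE_BYTES = 4          # one PTE occupies a longword
LONGWORD_BYTES = 4
QUADWORD_BYTES = 8
HEXWORD_BYTES = 32


def bytes_mapped(fetch_bytes: int) -> int:
    """Bytes of virtual memory mapped by the PTEs held in one fetch."""
    return (fetch_bytes // PTE_BYTES) * PAGE_BYTES


def as_units(n_bytes: int) -> dict:
    """Express a byte span in pages, longwords, and quadwords."""
    return {
        "pages": n_bytes // PAGE_BYTES,
        "longwords": n_bytes // LONGWORD_BYTES,
        "quadwords": n_bytes // QUADWORD_BYTES,
    }


if __name__ == "__main__":
    # One TB fill installs the two PTEs held in a quadword.
    print("TB fill covers   ", as_units(bytes_mapped(QUADWORD_BYTES)))
    # One cache fill brings in the eight PTEs held in a hexword.
    print("cache fill covers", as_units(bytes_mapped(HEXWORD_BYTES)))
```

A stride at or beyond the first span steps past the pages mapped by each TB
fill; a stride at or beyond the second also steps past the PTEs brought in
by each cache fill.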

This handling of TB misses has a large effect on the performance of
nonunity stride vectors. A stride of two pages (256 longwords or 128
quadwords) or more can result in a TB miss for each data item. A stride
of eight pages (1024 longwords or 512 quadwords) or more can result in a
TB miss for each data item in which the PTE fetch also misses the cache.
Unity stride is
