Use the 3DNow! PREFETCH and PREFETCHW Instructions
22007E/0 November 1999, AMD Athlon Processor x86 Code Optimization
The following optimization rules were applied to this example:
- Loops should be unrolled so that the data stride per loop
  iteration equals the length of a cache line. This avoids
  overlapping PREFETCH instructions and thus makes optimal
  use of the available number of outstanding PREFETCHes.
- Since the array "array_a" is written rather than read,
  PREFETCHW is used instead of PREFETCH to avoid the
  overhead of switching cache lines to the correct MESI
  state. The PREFETCH lookahead has been optimized such
  that each loop iteration works on three cache lines
  while six active PREFETCHes bring in the next six cache
  lines.
- Index arithmetic has been reduced to a minimum by using
  complex addressing modes and biasing the array base
  addresses, in order to cut down on loop overhead.
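
The rules above can be sketched in C roughly as follows. This is
only an illustration under stated assumptions, not the manual's
own listing: the function name multiply_arrays, the element-wise
multiply, the LOOKAHEAD_LINES value, and the GCC/Clang builtin
__builtin_prefetch are all choices made for this example. It also
assumes 8-byte doubles and arrays aligned to 64-byte cache lines.
Whether the write hint is compiled to PREFETCHW depends on the
compiler and target flags.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_BYTES 64   /* cache line length used in the text */
#define DOUBLES_PER_LINE (CACHE_LINE_BYTES / sizeof(double))
#define LOOKAHEAD_LINES  2    /* assumed lookahead; tune for the target system */

/* Hypothetical example: a[i] = b[i] * c[i] for n doubles. */
void multiply_arrays(double *a, const double *b, const double *c, size_t n)
{
    const size_t ahead = LOOKAHEAD_LINES * DOUBLES_PER_LINE;

    /* The sketch handles whole cache lines only. */
    assert(n % DOUBLES_PER_LINE == 0);

    /* Each iteration advances every array by exactly one cache line. */
    for (size_t i = 0; i < n; i += DOUBLES_PER_LINE) {
        /* Guard keeps the prefetch address expression inside the arrays. */
        if (i + ahead < n) {
            /* One prefetch per array per iteration. */
            __builtin_prefetch(&a[i + ahead], 1);  /* a is written */
            __builtin_prefetch(&b[i + ahead], 0);  /* b is read */
            __builtin_prefetch(&c[i + ahead], 0);  /* c is read */
        }
        /* Fixed trip count (8 with 8-byte doubles), easy to unroll. */
        for (size_t k = 0; k < DOUBLES_PER_LINE; ++k) {
            a[i + k] = b[i + k] * c[i + k];
        }
    }
}
```

In C the addressing-mode and base-biasing rule is left to the
compiler; hand-written assembly would apply it directly.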
Determining Prefetch Distance
Given the latency of a typical AMD Athlon processor system
and expected processor speeds, the following formula should be
used to determine the prefetch distance in bytes for a single
array:
Prefetch Distance = 200 (DS/C) bytes
Round up to the nearest 64-byte cache line.
- The number 200 is a constant based upon expected
  AMD Athlon processor clock frequencies and typical system
  memory latencies.
- DS is the data stride in bytes per loop iteration.
- C is the number of cycles for one loop to execute entirely
  from the L1 cache.
The prefetch distances for multiple arrays are typically even
longer.
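
As a worked illustration of the single-array formula, the
following C sketch computes 200 (DS/C) and rounds the result up
to a whole 64-byte cache line. The function name
prefetch_distance_bytes and the sample DS and C values are
hypothetical, chosen only to show the arithmetic.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_BYTES 64u   /* cache line length from the text */
#define LATENCY_CONSTANT 200u  /* constant from the formula */

/* Returns 200 * (DS / C), rounded up to a multiple of 64 bytes. */
size_t prefetch_distance_bytes(size_t data_stride_bytes,
                               size_t cycles_per_iteration)
{
    assert(cycles_per_iteration > 0);

    /* Integer ceiling of 200 * DS / C. */
    size_t scaled = LATENCY_CONSTANT * data_stride_bytes;
    size_t raw = (scaled + cycles_per_iteration - 1) / cycles_per_iteration;

    /* Round up to the next whole cache line. */
    return (raw + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES * CACHE_LINE_BYTES;
}
```

For example, with an assumed stride of DS = 64 bytes and a loop
that takes C = 60 cycles, 200 (64/60) is about 213.3 bytes, which
rounds up to 256 bytes, or four cache lines ahead.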
Prefetch at Least 64 Bytes Away from Surrounding Stores
The PREFETCH and PREFETCHW instructions can be
affected by false dependencies on stores. If there is a store to an
address that matches a request, that request (the PREFETCH
or PREFETCHW instruction) may be blocked until the store is
written to the cache. Therefore, code should prefetch data that
is located at least 64 bytes away from any surrounding store's
data address.
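
One way to guard against this in C is a debug-time check of the
gap between a prefetch address and a nearby store address. The
sketch below reads the 64-byte rule literally as a byte distance.
The helper names prefetch_clear_of_store and checked_prefetchw
are hypothetical, and __builtin_prefetch is the GCC/Clang builtin
rather than anything defined by this manual.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MIN_STORE_GAP_BYTES 64u  /* minimum distance given in the text */

/* True when the two addresses are at least 64 bytes apart. */
bool prefetch_clear_of_store(const void *prefetch_addr, const void *store_addr)
{
    uintptr_t p = (uintptr_t)prefetch_addr;
    uintptr_t s = (uintptr_t)store_addr;
    uintptr_t gap = (p > s) ? (p - s) : (s - p);
    return gap >= MIN_STORE_GAP_BYTES;
}

/* Issues a write prefetch; debug builds assert the gap first. */
void checked_prefetchw(const void *prefetch_addr, const void *store_addr)
{
    assert(prefetch_clear_of_store(prefetch_addr, store_addr));
    __builtin_prefetch(prefetch_addr, 1);
}
```

Building with NDEBUG defined removes the assertion, so the check
adds no cost to release code.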