
Use the 3DNow!™ PREFETCH and PREFETCHW Instructions 49

22007E/0—November 1999 AMD Athlon™ Processor x86 Code Optimization

The following optimization rules were applied to this example.

■ Loops should be unrolled to make sure that the data stride

per loop iteration is equal to the length of a cache line. This

avoids overlapping PREFETCH instructions and thus makes

optimal use of the available number of outstanding

PREFETCHes.

■ Since the array "array_a" is written rather than read,

PREFETCHW is used instead of PREFETCH to avoid

overhead for switching cache lines to the correct MESI

state. The PREFETCH lookahead has been optimized such

that each loop iteration is working on three cache lines

while six active PREFETCHes bring in the next six cache

lines.

■ Index arithmetic has been reduced to a minimum by use of

complex addressing modes and biasing of the array base

addresses in order to cut down on loop overhead.
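The first two rules can be sketched in C roughly as follows. This is an illustration only, not the example the rules refer to: it assumes 64-byte cache lines and 8-byte doubles, uses the GCC/Clang `__builtin_prefetch` builtin as a portable stand-in for the PREFETCH and PREFETCHW instructions, and the names `multiply_arrays`, `DOUBLES_PER_LINE`, and `LINES_AHEAD` are hypothetical. With three arrays and a lookahead of two lines per array, six cache lines are requested ahead of the three being worked on.

```c
#include <assert.h>
#include <stddef.h>

/* Assumptions: 64-byte cache lines and 8-byte doubles, so one line
   holds 8 elements. LINES_AHEAD is an illustrative lookahead per array. */
enum { DOUBLES_PER_LINE = 8, LINES_AHEAD = 2 };

/* Computes a[i] = b[i] * c[i] for i in [0, n). */
void multiply_arrays(double *a, const double *b, const double *c, size_t n)
{
    assert(a && b && c);
    size_t i = 0;

    /* Unrolled so that each iteration consumes exactly one cache line
       per array; one prefetch per array per iteration then never
       targets the same line twice. */
    for (; i + DOUBLES_PER_LINE <= n; i += DOUBLES_PER_LINE) {
        size_t ahead = i + LINES_AHEAD * DOUBLES_PER_LINE;
        if (ahead < n) {
            __builtin_prefetch(&a[ahead], 1); /* written: prefetch with write intent */
            __builtin_prefetch(&b[ahead], 0); /* read only */
            __builtin_prefetch(&c[ahead], 0); /* read only */
        }
        a[i + 0] = b[i + 0] * c[i + 0];
        a[i + 1] = b[i + 1] * c[i + 1];
        a[i + 2] = b[i + 2] * c[i + 2];
        a[i + 3] = b[i + 3] * c[i + 3];
        a[i + 4] = b[i + 4] * c[i + 4];
        a[i + 5] = b[i + 5] * c[i + 5];
        a[i + 6] = b[i + 6] * c[i + 6];
        a[i + 7] = b[i + 7] * c[i + 7];
    }

    /* Remainder when n is not a multiple of DOUBLES_PER_LINE. */
    for (; i < n; i++)
        a[i] = b[i] * c[i];
}
```

The bounds check on `ahead` keeps the address expressions inside the arrays; a hand-tuned assembly loop would normally avoid that branch, and would also apply the third rule (biased base addresses with a single index register), which this sketch leaves to the compiler.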

Determining Prefetch Distance

Given the latency of a typical AMD Athlon processor system

and expected processor speeds, the following formula should be

used to determine the prefetch distance in bytes for a single

array:

Prefetch Distance = 200 (DS/C) bytes

■ Round up to the nearest 64-byte cache line.

■ The number 200 is a constant based upon expected

AMD Athlon processor clock frequencies and typical system

memory latencies.

■ DS is the data stride in bytes per loop iteration.

■ C is the number of cycles for one loop to execute entirely

from the L1 cache.

The prefetch distances for multiple arrays are typically even

longer.
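As a worked illustration of the single-array formula, the small C helper below multiplies the constant 200 by the data stride DS, divides by the cycle count C, and rounds the result up to a multiple of 64 bytes. The function name `prefetch_distance_bytes` and the enum names are hypothetical, and the input values in the note that follows are made up for the arithmetic, not measurements.

```c
#include <assert.h>

/* Constants taken from the text: 64-byte cache lines and the
   latency constant 200. */
enum { CACHE_LINE_BYTES = 64, LATENCY_CONSTANT = 200 };

/* ds: data stride in bytes per loop iteration (DS).
   c:  cycles for one loop iteration running from the L1 cache (C).
   Returns 200 * (DS / C), rounded up to a whole cache line. */
unsigned long prefetch_distance_bytes(unsigned long ds, unsigned long c)
{
    assert(c > 0);
    /* Integer ceiling of 200 * ds / c. */
    unsigned long raw = (LATENCY_CONSTANT * ds + c - 1) / c;
    /* Round up to the next multiple of the cache-line size. */
    return (raw + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES * CACHE_LINE_BYTES;
}
```

For example, with a stride of DS = 64 bytes and a hypothetical C = 30 cycles, 200 × 64 / 30 is about 427 bytes, which rounds up to 448 bytes, or seven cache lines ahead.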

Prefetch at Least 64 Bytes Away from Surrounding Stores

The PREFETCH and PREFETCHW instructions can be

affected by false dependencies on stores. If there is a store to an

address that matches a request, that request (the PREFETCH

or PREFETCHW instruction) may be blocked until the store is

written to the cache. Therefore, code should prefetch data that

is located at least 64 bytes away from any surrounding store’s

data address.
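When tuning a loop, the rule can be checked with a small helper such as the one below. This is a minimal sketch: the name `prefetch_clear_of_store` is hypothetical, and it simply compares the two data addresses as the rule is worded above, without modelling how the processor actually matches addresses.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true when the prefetch target lies at least 64 bytes
   away (in either direction) from the store's data address. */
bool prefetch_clear_of_store(const void *prefetch_addr, const void *store_addr)
{
    uintptr_t p = (uintptr_t)prefetch_addr;
    uintptr_t s = (uintptr_t)store_addr;
    uintptr_t gap = (p > s) ? (p - s) : (s - p);
    return gap >= 64;
}
```

A loop that stores to the current cache line of an array while prefetching two lines (128 bytes) ahead in the same array passes this check.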