22007E/0—November 1999 AMD Athlon™ Processor x86 Code Optimization

These instructions are much faster than the classical approach using FSTSW, because FSTSW is essentially a serializing instruction on the AMD Athlon processor. When FSTSW cannot be avoided (for example, to keep code backward compatible with older processors), no FPU instruction should occur between an FCOM[P], FICOM[P], FUCOM[P], or FTST and a dependent FSTSW. This ordering allows the AMD Athlon processor FPU to use its internal fast-forwarding mechanism for the FPU condition codes, which increases performance.

Use the FXCH Instruction Rather than FST/FLD Pairs

Increase parallelism by breaking up dependency chains, or by evaluating multiple dependency chains simultaneously and explicitly switching execution between them. Although the AMD Athlon processor FPU has a deep scheduler, which in most cases can extract sufficient parallelism from existing code, long dependency chains can stall the scheduler while issue slots are still available. The maximum dependency chain length that the scheduler can absorb is about six 4-cycle instructions.
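
As a rough C-level sketch of breaking up a dependency chain, the two functions below sum the same array. The function names and the choice of four accumulators are illustrative assumptions, not taken from this guide. In the first function every addition depends on the previous one; in the second, the four accumulators form independent chains that the scheduler can overlap.

```c
#include <stddef.h>

/* One long dependency chain: each add must wait for the previous add. */
double sum_single_chain(const double *x, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += x[i];
    return acc;
}

/* Four independent chains (the count of four is an illustrative choice):
   the adds into acc0..acc3 do not depend on one another. */
double sum_four_chains(const double *x, size_t n)
{
    double acc0 = 0.0, acc1 = 0.0, acc2 = 0.0, acc3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc0 += x[i];
        acc1 += x[i + 1];
        acc2 += x[i + 2];
        acc3 += x[i + 3];
    }
    for (; i < n; i++)          /* leftover elements when n is not a multiple of 4 */
        acc0 += x[i];
    return (acc0 + acc1) + (acc2 + acc3);
}
```

Splitting the sum changes the order of the additions, so for general floating-point data the two results can differ in the last bits. Whether the split pays off depends on the compiler and the surrounding code.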

To switch execution between dependency chains, use the FXCH instruction: it has an apparent latency of zero cycles and generates only one OP. The AMD Athlon processor FPU contains special hardware to handle up to three FXCH instructions per cycle. Prefer FXCH over FST/FLD pairs, even when the FST/FLD pair operates on a register. An FST/FLD pair adds two cycles of latency and consists of two OPs.
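
FXCH itself is visible only in assembly, so the following is a source-level analogue rather than a direct use of the instruction. The hypothetical function `horner_pair` evaluates two polynomials with Horner's rule and alternates between the two chains in the loop body. Because x87 arithmetic operates on the top of the register stack, a compiler targeting the x87 FPU has to bring the active chain's value to the top each time it switches. That such a compiler uses FXCH for this, rather than an FST/FLD pair, is an assumption about the compiler and is not guaranteed by the C code.

```c
#include <stddef.h>

/* Evaluates two polynomials with n coefficients each at x.
   a[i] and b[i] are the coefficients of x^i.
   pa and pb are separate dependency chains: pa depends only on pa,
   pb only on pb, and the loop alternates between them. */
void horner_pair(const double *a, const double *b, size_t n, double x,
                 double *out_a, double *out_b)
{
    double pa = 0.0, pb = 0.0;
    for (size_t i = n; i-- > 0; ) {
        pa = pa * x + a[i];     /* one step of chain A */
        pb = pb * x + b[i];     /* one step of chain B, independent of chain A */
    }
    *out_a = pa;
    *out_b = pb;
}
```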

Avoid Using Extended-Precision Data

Store data as either single-precision or double-precision quantities. Loading and storing extended-precision data is slower by comparison.
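
A minimal C sketch of this advice, under the assumption that `long double` maps to the x87 extended-precision format (true for some x86 ABIs, not all): the array stays in memory as `double`, and any extra precision is confined to a local accumulator. The function name is hypothetical, and whether the accumulator stays in a register is up to the compiler.

```c
#include <stddef.h>

/* The data in memory is double precision, so every array access is a
   double-precision load. Only the local accumulator is long double;
   it is never stored as an array element. */
double mean_double_storage(const double *x, size_t n)
{
    long double acc = 0.0L;     /* assumption: may be x87 extended precision */
    for (size_t i = 0; i < n; i++)
        acc += x[i];
    return n ? (double)(acc / (long double)n) : 0.0;
}
```

Declaring the array itself as `long double` would instead turn each element access into an extended-precision load or store on such targets, which is the case this section advises against.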