user manual

Use the FXCH Instruction Rather than FST/FLD Pairs 99
22007E/0November 1999 AMD Athlon Processor x86 Code Optimization
These instructions are much faster than the classical approach
using FSTSW, because FSTSW is essentially a serializing
instruction on the AMD Athlon processor. When FSTSW cannot
be avoided (for example, backward compatibility of code with
older processors), no FPU instruction should occur between an
FCOM[P], FICOM[P], FUCOM[P], or FTST and a dependent
FSTSW. This optimization allows the use of a fast forwarding
mechanism for the FPU condition codes internal to the
AMD Athlon processor FPU and increases performance.
Use the FXCH Instruction Rather than FST/FLD Pairs
Increase parallelism by breaking up dependency chains or by
evaluating multiple dependency chains simultaneously by
explicitly switching execution between them. Although the
AMD Athlon processor FPU has a deep scheduler, which in
most cases can extract sufficient parallelism from existing code,
long dependency chains can stall the scheduler while issue slots
are still available. The maximum dependency chain length that
the scheduler can absorb is about six 4-cycle instructions.
To switch execution between dependency chains, use of the
FXCH instruction is recommended because it has an apparent
latency of zero cycles and generates only one OP. The
AMD Athlon processor FPU contains special hardware to
handle up to three FXCH instructions per cycle. Using FXCH is
preferred over the use of FST/FLD pairs, even if the FST/FLD
pair works on a register. An FST/FLD pair adds two cycles of
latency and consists of two OPs.
Avoid Using Extended-Precision Data
Store data as either single-precision or double-precision
quantities. Loading and storing extended-precision data is
comparatively slower.