22007E/0 November 1999 AMD Athlon Processor x86 Code Optimization
quadword alignment), so that quadword operands might be
misaligned, even if this technique is used and the compiler does
allocate variables in the order they are declared.
The following example demonstrates the reordering of local
variable declarations:
Original ordering (Avoid):
short ga, gu, gi;
long foo, bar;
double x, y, z[3];
char a, b;
float baz;
Improved ordering (Preferred):
double z[3];
double x, y;
long foo, bar;
float baz;
short ga, gu, gi;
char a, b;
See "Sort Variables According to Base Type Size" on page 56 for
more information from a different perspective.
Accelerating Floating-Point Divides and Square Roots
Divides and square roots have a much longer latency than other
floating-point operations, even though the AMD Athlon
processor provides significant acceleration of these two
operations. In some code, these operations occur often enough to
seriously impact performance. In these cases, it is
recommended to port the code to 3DNow! inline assembly or to
use a compiler that can generate 3DNow! code. If the code has hot
spots that use only single-precision arithmetic (i.e., all
computation involves data of type float) and for some reason
cannot be ported to 3DNow!, the following technique may be
used to improve performance.
The x87 FPU has a precision-control field as part of the FPU
control word. The precision-control setting determines the
precision to which results are rounded. It affects the basic arithmetic
operations, including divides and square roots. AMD Athlon
and AMD-K6® family processors implement divide and square
root in such fashion as to only compute the number of bits