user manual

Use 3DNow! Instructions for Fast Division 109
22007E/0 November 1999 AMD Athlon Processor x86 Code Optimization
Pipelined Pair of 24-Bit Precision Divides
This divide operation executes with a total latency of 21 cycles,
assuming that the program hides the latency of the first
MOVD/MOVQ instructions within preceding code.
Example:
MOVQ      MM0, [DIVISORS]   ; y   | x
PFRCP     MM1, MM0          ; 1/x | 1/x (approximate)
MOVQ      MM2, MM0          ; y   | x
PUNPCKHDQ MM0, MM0          ; y   | y
PFRCP     MM0, MM0          ; 1/y | 1/y (approximate)
PUNPCKLDQ MM1, MM0          ; 1/y | 1/x (approximate)
MOVQ      MM0, [DIVIDENDS]  ; z   | w
PFRCPIT1  MM2, MM1          ; 1/y | 1/x (intermediate)
PFRCPIT2  MM2, MM1          ; 1/y | 1/x (final)
PFMUL     MM0, MM2          ; z/y | w/x
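The lane layout in the comments above can be easier to follow next to a plain scalar reference. The sketch below is an illustration only, not part of the original example: the function name divide_pair_ref and its array parameters are hypothetical, index 0 stands for the low 32 bits of an MMX register (x and w) and index 1 for the high 32 bits (y and z). It uses ordinary C division, so its results may differ in the last bit from the 24-bit precision 3DNow! sequence.

```c
#include <stddef.h>

/* Scalar reference model of the paired divide (hypothetical helper).
   Index 0 = low 32 bits of the MMX register, index 1 = high 32 bits. */
static void divide_pair_ref(const float dividends[2],  /* {w, z} in memory order */
                            const float divisors[2],   /* {x, y} in memory order */
                            float quotients[2])        /* receives {w/x, z/y}    */
{
    for (size_t lane = 0; lane < 2; ++lane) {
        /* Each lane is divided independently, as in the PFMUL step. */
        quotients[lane] = dividends[lane] / divisors[lane];
    }
}
```

Such a reference can serve as a baseline when checking the output of the 3DNow! code path.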
Newton-Raphson Reciprocal
Consider the quotient q = a/b. An (on-chip) ROM-based table
lookup can be used to quickly produce a 14-to-15-bit precision
approximation of 1/b using just one PFRCP instruction. A full
24-bit precision reciprocal can then be quickly computed from
this approximation using a Newton-Raphson algorithm.
The general Newton-Raphson recurrence for the reciprocal is as
follows:
Z_{i+1} = Z_i (2 - b Z_i)
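A short derivation, added here as an illustration rather than taken from the source, suggests why this recurrence converges so quickly. The symbol e_i, the relative error of Z_i, is introduced for this sketch only, and exact arithmetic is assumed.

```latex
Z_i = \frac{1 - e_i}{b}
\;\Longrightarrow\;
Z_{i+1} = Z_i\,(2 - b Z_i)
        = \frac{(1 - e_i)\,(1 + e_i)}{b}
        = \frac{1 - e_i^{2}}{b}
```

Under that assumption the relative error is squared on each step, so the number of correct bits roughly doubles per iteration. Rounding in the actual hardware is not modelled here.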
Given that the initial approximation is accurate to at least 14
bits, and that a full IEEE single-precision mantissa contains 24
bits, just one Newton-Raphson iteration is required. The
following sequence shows the 3DNow! instructions that produce
the initial reciprocal approximation, compute the full precision
reciprocal from the approximation, and finally, complete the
desired divide of a/b.
X_0 = PFRCP(b)
X_1 = PFRCPIT1(b, X_0)
X_2 = PFRCPIT2(X_1, X_0)
q = PFMUL(a, X_2)
The 24-bit final reciprocal value is X_2. In the AMD Athlon
processor 3DNow! technology implementation the operand X_2
contains the correct round-to-nearest single precision
reciprocal for approximately 99% of all arguments.
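The four-step sequence above can be modelled in portable C to experiment with the arithmetic. This is a minimal sketch under stated assumptions, not the hardware behavior: all function names are hypothetical, the table lookup is imitated by truncating an exact reciprocal to roughly 14 significant bits, and the split of the Newton-Raphson step into 2 - b*X_0 followed by a multiply by X_0 is an assumed arithmetic reading of the PFRCPIT1/PFRCPIT2 roles. The real instructions use their own internal formats and rounding.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for PFRCP: the exact reciprocal with the low
   10 fraction bits cleared, leaving roughly 14 significant bits. */
static float model_pfrcp(float b)
{
    float r = 1.0f / b;
    uint32_t bits;
    memcpy(&bits, &r, sizeof bits);   /* reinterpret float as raw bits */
    bits &= 0xFFFFFC00u;              /* keep sign, exponent, top 13 fraction bits */
    memcpy(&r, &bits, sizeof r);
    return r;
}

/* Assumed model of the first refinement step: the correction factor
   2 - b * X0, which is close to 1 when X0 is close to 1/b. */
static float model_pfrcpit1(float b, float x0)
{
    return 2.0f - b * x0;
}

/* Assumed model of the second refinement step: scale the initial
   approximation by the correction factor, giving X0 * (2 - b * X0). */
static float model_pfrcpit2(float x1, float x0)
{
    return x0 * x1;
}

/* q = a / b computed as a * X2, following the four-step sequence. */
static float model_fast_div(float a, float b)
{
    float x0 = model_pfrcp(b);          /* X_0: initial approximation */
    float x1 = model_pfrcpit1(b, x0);   /* X_1: intermediate value    */
    float x2 = model_pfrcpit2(x1, x0);  /* X_2: refined reciprocal    */
    return a * x2;                      /* q:   final quotient        */
}
```

Calling model_fast_div(7.0f, 3.0f) and comparing against 7.0 / 3.0 in double precision gives a rough feel for how much one iteration improves the truncated starting value. The model makes no claim about the 99% correct-rounding figure quoted above.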