
Use 3DNow!™ Instructions for Fast Division 109

22007E/0—November 1999 AMD Athlon™ Processor x86 Code Optimization

Pipelined Pair of 24-Bit Precision Divides

This divide operation executes with a total latency of 21 cycles, assuming that the program hides the latency of the first MOVD/MOVQ instructions within preceding code.

Example:

MOVQ      MM0, [DIVISORS]   ; y | x
PFRCP     MM1, MM0          ; 1/x | 1/x (approximate)
MOVQ      MM2, MM0          ; y | x
PUNPCKHDQ MM0, MM0          ; y | y
PFRCP     MM0, MM0          ; 1/y | 1/y (approximate)
PUNPCKLDQ MM1, MM0          ; 1/y | 1/x (approximate)
MOVQ      MM0, [DIVIDENDS]  ; z | w
PFRCPIT1  MM2, MM1          ; 1/y | 1/x (intermediate)
PFRCPIT2  MM2, MM1          ; 1/y | 1/x (final)
PFMUL     MM0, MM2          ; z/y | w/x
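
The register comments in the example can be followed with a small software model. The Python sketch below is only an illustration of the lane shuffling, not an emulator: each 64-bit MMX register is modelled as a (high, low) tuple, `pfrcp` uses an exact reciprocal instead of the table approximation, and the split of the refinement between `pfrcpit1` and `pfrcpit2` is an assumption made for readability (the hardware's intermediate format is not described here). The function name `divide_pair` is hypothetical.

```python
# Each 64-bit MMX register is modelled as a (high, low) pair of floats.

def pfrcp(src):
    """Reciprocal of the LOW element of src, duplicated into both halves."""
    r = 1.0 / src[1]
    return (r, r)

def punpckhdq(dst, src):
    """Interleave high elements: result is (src.high, dst.high)."""
    return (src[0], dst[0])

def punpckldq(dst, src):
    """Interleave low elements: result is (src.low, dst.low)."""
    return (src[1], dst[1])

def pfrcpit1(dst, src):
    """Assumed first refinement step, per lane: t = 2 - b * x0."""
    return tuple(2.0 - b * x0 for b, x0 in zip(dst, src))

def pfrcpit2(dst, src):
    """Assumed second refinement step, per lane: x0 * t."""
    return tuple(t * x0 for t, x0 in zip(dst, src))

def pfmul(dst, src):
    """Packed multiply of the two lanes."""
    return tuple(a * b for a, b in zip(dst, src))

def divide_pair(dividends, divisors):
    """dividends = (z, w), divisors = (y, x); returns (z/y, w/x)."""
    mm0 = divisors               # MOVQ      MM0, [DIVISORS]  ; y | x
    mm1 = pfrcp(mm0)             # PFRCP     MM1, MM0         ; 1/x | 1/x
    mm2 = mm0                    # MOVQ      MM2, MM0         ; y | x
    mm0 = punpckhdq(mm0, mm0)    # PUNPCKHDQ MM0, MM0         ; y | y
    mm0 = pfrcp(mm0)             # PFRCP     MM0, MM0         ; 1/y | 1/y
    mm1 = punpckldq(mm1, mm0)    # PUNPCKLDQ MM1, MM0         ; 1/y | 1/x
    mm0 = dividends              # MOVQ      MM0, [DIVIDENDS] ; z | w
    mm2 = pfrcpit1(mm2, mm1)     # PFRCPIT1  MM2, MM1         ; intermediate
    mm2 = pfrcpit2(mm2, mm1)     # PFRCPIT2  MM2, MM1         ; 1/y | 1/x
    return pfmul(mm0, mm2)       # PFMUL     MM0, MM2         ; z/y | w/x

if __name__ == "__main__":
    print(divide_pair((8.0, 6.0), (4.0, 2.0)))
```

Because PFRCP only reads the low element of its source, the example needs the PUNPCKHDQ copy to bring y into the low half before the second PFRCP, and PUNPCKLDQ to merge the two scalar results back into one packed register.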

Newton-Raphson Reciprocal

Consider the quotient q = a/b. An on-chip ROM-based table lookup can be used to quickly produce a 14-to-15-bit precision approximation of 1/b using just one PFRCP instruction. A full 24-bit precision reciprocal can then be quickly computed from this approximation using a Newton-Raphson algorithm.

The general Newton-Raphson recurrence for the reciprocal is as follows:

$$Z_{i+1} = Z_i \cdot (2 - b \cdot Z_i)$$
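
A short derivation, not part of the original text, suggests why this recurrence converges so quickly. It introduces one symbol for illustration: the relative error of the i-th estimate, defined as e_i = 1 - b·Z_i. Substituting the recurrence gives:

```latex
\begin{aligned}
e_{i+1} &= 1 - b\,Z_{i+1} \\
        &= 1 - b\,Z_i\,(2 - b\,Z_i) \\
        &= 1 - 2\,b\,Z_i + (b\,Z_i)^2 \\
        &= (1 - b\,Z_i)^2 = e_i^{\,2}
\end{aligned}
```

In exact arithmetic the error is squared at every step, so the number of correct bits roughly doubles per iteration: a starting error of at most 2^-14 becomes at most 2^-28 after one step, before any rounding of the result.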

Given that the initial approximation is accurate to at least 14 bits, and that a full IEEE single-precision mantissa contains 24 bits, just one Newton-Raphson iteration is required. The following sequence shows the 3DNow! instructions that produce the initial reciprocal approximation, compute the full precision reciprocal from the approximation, and finally, complete the desired divide of a/b:

X_0 = PFRCP(b)
X_1 = PFRCPIT1(b, X_0)
X_2 = PFRCPIT2(X_1, X_0)
q = PFMUL(a, X_2)
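
The claim that a single iteration is enough can be checked numerically. The Python sketch below is a rough model under stated assumptions, not the processor's algorithm: `approx_recip` is a hypothetical stand-in for PFRCP that simply truncates the true reciprocal to about 14 bits (it is not the actual ROM table), the Newton-Raphson step is evaluated in double precision and then rounded to single precision, and the PFRCPIT1/PFRCPIT2 pair is collapsed into one `newton_step` call.

```python
import math
import struct


def to_f32(x: float) -> float:
    """Round a Python float (double) to the nearest IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def approx_recip(b: float, bits: int = 14) -> float:
    """Hypothetical stand-in for PFRCP (NOT the real ROM table):
    truncate 1/b so that its relative error stays below 2**-bits."""
    m, e = math.frexp(1.0 / b)          # 1/b = m * 2**e with 0.5 <= |m| < 1
    scale = 1 << (bits + 1)
    return math.ldexp(math.trunc(m * scale) / scale, e)


def newton_step(b: float, z: float) -> float:
    """One iteration of Z_{i+1} = Z_i * (2 - b * Z_i)."""
    return z * (2.0 - b * z)


def divide(a: float, b: float) -> float:
    """Model of q = a/b computed by reciprocal refinement."""
    x0 = approx_recip(b)                # X_0: low-precision seed
    x2 = to_f32(newton_step(b, x0))     # X_2: stands in for PFRCPIT1 + PFRCPIT2
    return to_f32(a * x2)               # q:   PFMUL


if __name__ == "__main__":
    # Print the relative error of the seed and of the refined reciprocal.
    for b in (3.0, 7.0, 10.0):
        x0 = approx_recip(b)
        x2 = to_f32(newton_step(b, x0))
        print(b, abs(1.0 - b * x0), abs(1.0 - b * x2))
```

In this model the refined value is limited by single-precision rounding rather than by the iteration itself, which is consistent with one step being sufficient for a 24-bit mantissa.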

The 24-bit final reciprocal value is X_2. In the AMD Athlon processor 3DNow! technology implementation, the operand X_2 contains the correct round-to-nearest single-precision reciprocal for approximately 99% of all arguments.