AMD Athlon Processor x86 Code Optimization
22007E/0 November 1999
Use 3DNow! Instructions for Fast Square Root and
Reciprocal Square Root
3DNow! instructions can be used to compute a very fast, highly
accurate square root and reciprocal square root.
Optimized 15-Bit Precision Square Root
This square root operation can be executed in only 7 cycles,
assuming a program hides the latency of the first MOVD
instruction within previous code. The reciprocal square root
operation requires four fewer cycles than the square root
operation, because it omits the final PFMUL.
Example:
MOVD      MM0, [MEM]   ; 0         | a
PFRSQRT   MM1, MM0     ; 1/sqrt(a) | 1/sqrt(a) (approximate)
PUNPCKLDQ MM0, MM0     ; a         | a (MMX instr.)
PFMUL     MM0, MM1     ; sqrt(a)   | sqrt(a)
Optimized 24-Bit Precision Square Root
This square root operation can be executed in only 19 cycles,
assuming a program hides the latency of the first MOVD
instruction within previous code. The reciprocal square root
operation requires four fewer cycles than the square root
operation, because it omits the final PFMUL.
Example:
MOVD      MM0, [MEM]   ; 0         | a
PFRSQRT   MM1, MM0     ; 1/sqrt(a) | 1/sqrt(a) (approx.)
MOVQ      MM2, MM1     ; X_0 = 1/sqrt(a) (approx.)
PFMUL     MM1, MM1     ; X_0 * X_0 | X_0 * X_0 (step 1)
PUNPCKLDQ MM0, MM0     ; a         | a (MMX instr.)
PFRSQIT1  MM1, MM0     ; (intermediate) (step 2)
PFRCPIT2  MM1, MM2     ; 1/sqrt(a) | 1/sqrt(a) (step 3)
PFMUL     MM0, MM1     ; sqrt(a)   | sqrt(a)