
22007E/0 November 1999 AMD Athlon Processor x86 Code Optimization
Newton-Raphson Reciprocal Square Root
The general Newton-Raphson reciprocal square root recurrence
is:
$$Z_{i+1} = \tfrac{1}{2}\, Z_i \,(3 - b\, Z_i^2)$$
To reduce the number of iterations, the initial approximation
is read from a table. The 3DNow! reciprocal square root
approximation is accurate to at least 15 bits. Accordingly, to
obtain a single-precision 24-bit reciprocal square root of an
input operand b, one Newton-Raphson iteration is required,
using the following sequence of 3DNow! instructions:
X0 = PFRSQRT(b)
X1 = PFMUL(X0, X0)
X2 = PFRSQIT1(b, X1)
X3 = PFRCPIT2(X2, X0)
X4 = PFMUL(b, X3)
The 24-bit final reciprocal square root value is X3. In the
AMD Athlon processor 3DNow! implementation, the estimate
contains the correct round-to-nearest value for approximately
87% of all arguments. The remaining arguments differ from the
correct round-to-nearest value by one unit-in-the-last-place. The
square root (X4) is formed in the last step by multiplying by the
input operand b.
Use MMX PMADDWD Instruction to Perform Two 32-Bit
Multiplies in Parallel
The MMX PMADDWD instruction can be used to perform two
signed 16×16→32-bit multiplies in parallel, with much higher
performance than can be achieved using the IMUL instruction.
The PMADDWD instruction is designed to perform four
16×16→32-bit signed multiplies and accumulate the results
pairwise. If one product in each pair is forced to zero, each
sum contains a single product, leaving just two independent
multiplies. The following example shows how to
multiply 16-bit signed numbers a, b, c, d into signed 32-bit
products a×c and b×d: