Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 1, Basic Architecture
PROGRAMMING WITH STREAMING SIMD EXTENSIONS 2 (SSE2)
operands in XMM registers or memory (the latter for at most one source operand).
When the conversion is inexact, the result is rounded according to the rounding mode
selected in the MXCSR register.
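The effect of the MXCSR rounding mode on an inexact conversion can be observed from C. The following is a minimal sketch, assuming an x86-64 compiler (GCC, Clang, or compatible) that provides the SSE2 intrinsics in <emmintrin.h>. It uses CVTSD2SI (via _mm_cvtsd_si32) as one representative conversion, and the helper name cvt_with_mode is hypothetical. Because compilers do not always model MXCSR as a dependency of floating-point instructions, treat it as a demonstration rather than production code.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also provides the MXCSR helpers from xmmintrin.h */

/* Hypothetical helper: convert x to a 32-bit integer with CVTSD2SI under the
 * given MXCSR rounding mode (_MM_ROUND_NEAREST, _MM_ROUND_DOWN, _MM_ROUND_UP
 * or _MM_ROUND_TOWARD_ZERO), then restore the caller's MXCSR.
 * Note: compilers may not treat MXCSR as a dependency of the conversion, so
 * this is a demonstration sketch, not production code. */
static int cvt_with_mode(double x, unsigned int mode)
{
    unsigned int saved = _mm_getcsr();          /* save rounding mode and flags */
    _MM_SET_ROUNDING_MODE(mode);                /* change only the MXCSR.RC field */
    int r = _mm_cvtsd_si32(_mm_set_sd(x));      /* CVTSD2SI rounds per MXCSR.RC */
    _mm_setcsr(saved);                          /* restore the caller's MXCSR */
    return r;
}
```

For example, converting 2.5 yields 2 under round-to-nearest (ties to even) but 3 under round-up, and -2.5 yields -3 under round-down. The truncating form CVTTSD2SI (_mm_cvttsd_si32) always rounds toward zero and ignores the MXCSR rounding-control field.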
11.4.2 SSE2 64-Bit and 128-Bit SIMD Integer Instructions
SSE2 extensions add several 128-bit packed integer instructions to the IA-32
architecture. Where appropriate, a 64-bit version of each of these instructions is also
provided. The 128-bit versions of the instructions operate on data in XMM registers;
the 64-bit versions operate on data in MMX registers. The instructions are described below.
The MOVDQA (move aligned double quadword) instruction transfers a double quadword
operand from memory to an XMM register, from an XMM register to memory, or between
XMM registers. A memory operand must be aligned on a 16-byte boundary; otherwise,
a general-protection exception (#GP) is generated.
The MOVDQU (move unaligned double quadword) instruction performs the same
operations as the MOVDQA instruction, except that 16-byte alignment of a memory
address is not required.
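The alignment difference between MOVDQA and MOVDQU can be sketched with the SSE2 intrinsics _mm_loadu_si128 (MOVDQU) and _mm_store_si128 (MOVDQA), assuming a C11 compiler with <emmintrin.h>. The helper names are hypothetical, and a compiler may emit an equivalent instruction (for example MOVUPS/MOVAPS or a VEX-encoded form) rather than the exact mnemonics named in the comments.

```c
#include <emmintrin.h>  /* SSE2: __m128i, _mm_loadu_si128, _mm_store_si128 */
#include <stdint.h>

/* Hypothetical helper: copy 16 bytes from an arbitrary address into a
 * 16-byte-aligned buffer.
 * src may have any alignment: _mm_loadu_si128 corresponds to MOVDQU.
 * dst must be 16-byte aligned: _mm_store_si128 corresponds to MOVDQA, which
 * raises #GP on a misaligned memory operand. */
static void copy16_to_aligned(void *dst, const void *src)
{
    __m128i v = _mm_loadu_si128((const __m128i *)src);
    _mm_store_si128((__m128i *)dst, v);
}

/* Hypothetical helper: nonzero if p satisfies MOVDQA's 16-byte alignment rule. */
static int is_movdqa_aligned(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}
```

A buffer declared with _Alignas(16) is a valid MOVDQA operand at offsets 0, 16, 32, and so on; an address such as buf + 3 must go through the MOVDQU form.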
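The PADDQ (packed quadword add) instruction adds two packed quadword integer
operands or two single quadword integer operands and stores the results in an XMM
or MMX register, respectively. This instruction can operate on either unsigned or
signed (two’s complement notation) integer operands, because both interpretations
produce the same 64-bit bit pattern; a result that overflows wraps around, and no
overflow or carry is indicated in the EFLAGS register.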
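The following sketch, assuming a C compiler with the SSE2 intrinsics in <emmintrin.h>, uses _mm_add_epi64 (PADDQ) to add two pairs of quadwords. The helper name paddq2 is hypothetical; the same call serves both signed and unsigned data.

```c
#include <emmintrin.h>  /* SSE2: _mm_add_epi64 (PADDQ) */
#include <stdint.h>

/* Hypothetical helper: out[i] = a[i] + b[i] for i = 0, 1 using PADDQ.
 * Element 0 occupies the low quadword of the XMM register, element 1 the high.
 * A carry out of bit 63 is discarded (wraparound), and no flags are set. */
static void paddq2(uint64_t out[2], const uint64_t a[2], const uint64_t b[2])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi64(va, vb));
}
```

For example, adding 1 to 0xFFFFFFFFFFFFFFFF in one lane yields 0 (unsigned wraparound), while adding 5 and the two's-complement encoding of -7 in the other lane yields the bit pattern of -2 when read back as int64_t.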
The PSUBQ (packed quadword subtract) instruction subtracts two packed quadword
integer operands or two single quadword integer operands, and stores the results in
an XMM or MMX register, respectively. Like the PADDQ instruction, PSUBQ can
operate on either unsigned or signed (two’s complement notation) integer operands.
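A matching sketch for PSUBQ, under the same assumptions, uses _mm_sub_epi64; the helper name psubq2 is hypothetical.

```c
#include <emmintrin.h>  /* SSE2: _mm_sub_epi64 (PSUBQ) */
#include <stdint.h>

/* Hypothetical helper: out[i] = a[i] - b[i] for i = 0, 1 using PSUBQ.
 * A borrow out of bit 63 is discarded, so 0 - 1 yields 0xFFFFFFFFFFFFFFFF,
 * which is UINT64_MAX as unsigned and -1 as two's complement. */
static void psubq2(uint64_t out[2], const uint64_t a[2], const uint64_t b[2])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_sub_epi64(va, vb));
}
```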
The PMULUDQ (multiply packed unsigned doubleword integers) instruction multiplies
unsigned doubleword integers and returns quadword results. Both 64-bit and 128-bit
versions of this instruction are available. The 64-bit version multiplies the two
unsigned doubleword integers stored in the low doubleword of each source operand and
returns the quadword result to an MMX register. The 128-bit version performs a packed
multiply of two pairs of doubleword integers taken from the first and third doublewords
(bits 31:0 and 95:64) of the source operands; the two quadword products are stored in
the low and high quadwords of an XMM register.
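The lane selection of the 128-bit PMULUDQ can be checked with the _mm_mul_epu32 intrinsic. This is a minimal sketch assuming a C compiler with <emmintrin.h>; the helper name pmuludq_pairs is hypothetical.

```c
#include <emmintrin.h>  /* SSE2: _mm_mul_epu32 (PMULUDQ) */
#include <stdint.h>

/* Hypothetical helper: out[0] = a[0] * b[0], out[1] = a[2] * b[2], each a full
 * 64-bit unsigned product. Doublewords 1 and 3 of both sources are ignored. */
static void pmuludq_pairs(uint64_t out[2], const uint32_t a[4], const uint32_t b[4])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_mul_epu32(va, vb));
}
```

For example, 0xFFFFFFFF multiplied by itself in doubleword 0 produces the full product 0xFFFFFFFE00000001 in the low quadword, with no truncation to 32 bits.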
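The PSHUFLW (shuffle packed low words) instruction shuffles the word integers
packed into the low quadword of the source operand and stores the shuffled result in
the low quadword of the destination operand; the high quadword of the source operand
is copied to the destination unchanged. An 8-bit immediate operand specifies the
shuffle order: each 2-bit field of the immediate selects which of the four source words
is written to the corresponding destination word.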
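As an illustration, the sketch below reverses the four low words with _mm_shufflelo_epi16 (PSHUFLW), assuming a C compiler with <emmintrin.h>. The helper name reverse_low_words is hypothetical; _MM_SHUFFLE is the standard macro from xmmintrin.h that packs four 2-bit selectors into an immediate.

```c
#include <emmintrin.h>  /* SSE2: _mm_shufflelo_epi16; _MM_SHUFFLE from xmmintrin.h */
#include <stdint.h>

/* Hypothetical helper: reverse words 0..3 with PSHUFLW.
 * imm8 = _MM_SHUFFLE(0, 1, 2, 3) = 0x1B; destination word i takes source
 * word (imm8 >> 2*i) & 3, so word 0 <- word 3, ..., word 3 <- word 0.
 * Words 4..7 (the high quadword) pass through unchanged. */
static void reverse_low_words(uint16_t out[8], const uint16_t in[8])
{
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    _mm_storeu_si128((__m128i *)out, _mm_shufflelo_epi16(v, _MM_SHUFFLE(0, 1, 2, 3)));
}
```

Applied to the words {0, 1, 2, 3, 4, 5, 6, 7}, this yields {3, 2, 1, 0, 4, 5, 6, 7}. The immediate must be a compile-time constant.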
The PSHUFHW (shuffle packed high words) instruction shuffles the word integers
packed into the high quadword of the source operand and stores the shuffled result
in the high quadword of the destination operand; the low quadword of the source
operand is copied to the destination unchanged. An 8-bit immediate operand specifies
the shuffle order in the same way as for PSHUFLW.
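The selection rule for PSHUFHW can be written as a scalar model and compared with the _mm_shufflehi_epi16 intrinsic. This is a sketch assuming a C compiler with <emmintrin.h>; the helper names and the example order 0xA0 are hypothetical choices for illustration.

```c
#include <emmintrin.h>  /* SSE2: _mm_shufflehi_epi16; _MM_SHUFFLE from xmmintrin.h */
#include <stdint.h>

/* Arbitrary example order: _MM_SHUFFLE(2, 2, 0, 0) == 0xA0, which duplicates
 * high-quadword words 0 and 2 (register words 4 and 6). */
#define HW_ORDER _MM_SHUFFLE(2, 2, 0, 0)

/* Scalar model of PSHUFHW: words 0..3 are copied unchanged; destination
 * word 4+i takes source word 4 + ((imm8 >> 2*i) & 3). */
static void pshufhw_model(uint16_t out[8], const uint16_t in[8], unsigned imm8)
{
    for (int i = 0; i < 4; i++)
        out[i] = in[i];
    for (int i = 0; i < 4; i++)
        out[4 + i] = in[4 + ((imm8 >> (2 * i)) & 3u)];
}

/* The same shuffle with the intrinsic; the immediate must be a compile-time constant. */
static void pshufhw_example(uint16_t out[8], const uint16_t in[8])
{
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    _mm_storeu_si128((__m128i *)out, _mm_shufflehi_epi16(v, HW_ORDER));
}
```

With the words {0, 1, 2, 3, 4, 5, 6, 7} as input, both functions produce {0, 1, 2, 3, 4, 4, 6, 6}.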