Intel 64 and IA-32 Architectures Software Developers Manual Volume 1, Basic Architecture
PROGRAMMING WITH SSE3 AND SUPPLEMENTAL SSE3
12.6.3 Multiply and Add Packed Signed and Unsigned Bytes
There are two multiply-and-add-packed-signed-unsigned-byte instructions (represented
by one mnemonic). One operates on 128-bit operands and the other operates
on 64-bit operands. Multiplications are performed on each vertical pair of data
elements. The data elements in the source operand are signed byte values; the input
data elements of the destination operand are unsigned byte values.
• PMADDUBSW multiplies each unsigned byte value with the corresponding signed
byte value to produce an intermediate, 16-bit signed integer. Each adjacent pair
of 16-bit signed values is added horizontally. The signed, saturated 16-bit
results are packed to the destination operand.
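The multiply, horizontal add, and saturate steps can be sketched as a scalar
model in C. This is an illustrative sketch of the 128-bit form only, not the
hardware definition; the names pmaddubsw_model and saturate_i16 are
hypothetical helpers introduced here, and the result is written to a separate
array rather than back into the destination operand.

```c
#include <stdint.h>

/* Hypothetical helper: clamp a 32-bit value to the signed 16-bit range. */
static int16_t saturate_i16(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/*
 * Hypothetical scalar model of the 128-bit form of PMADDUBSW.
 * dst_in holds the 16 unsigned bytes of the destination operand,
 * src holds the 16 signed bytes of the source operand,
 * result receives the 8 signed, saturated 16-bit sums.
 */
static void pmaddubsw_model(const uint8_t dst_in[16],
                            const int8_t src[16],
                            int16_t result[8])
{
    for (int i = 0; i < 8; i++) {
        /* Vertical multiplies: unsigned byte times signed byte.
           Each product fits in a signed 16-bit integer. */
        int32_t even = (int32_t)dst_in[2 * i]     * (int32_t)src[2 * i];
        int32_t odd  = (int32_t)dst_in[2 * i + 1] * (int32_t)src[2 * i + 1];

        /* Horizontal add of the adjacent pair, then signed saturation. */
        result[i] = saturate_i16(even + odd);
    }
}
```

Saturation matters because the sum of two products can leave the 16-bit
range: with both unsigned bytes equal to 255 and both signed bytes equal to
127, the unsaturated sum is 64770, which the model clamps to 32767.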
12.6.4 Packed Multiply High with Round and Scale
There are two packed-multiply-high-with-round-and-scale instructions (represented
by one mnemonic). One operates on 128-bit operands and the other operates on
64-bit operands. Multiplications are performed on each vertical pair of 16-bit data
elements. The data elements in both the source operand and the destination
operand are signed integers.
• PMULHRSW multiplies vertically each signed 16-bit integer from the destination
operand with the corresponding signed 16-bit integer of the source operand,
producing intermediate, signed 32-bit integers. Each intermediate 32-bit integer
is truncated to the 18 most significant bits. Rounding is always performed by
adding 1 to the least significant bit of the 18-bit intermediate result. The final
result is obtained by selecting the 16 bits immediately to the right of the most
significant bit of each 18-bit intermediate result; the selected bits are packed
to the destination operand.
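The truncate, round, and select steps for one 16-bit element can be sketched
in scalar C as follows. This is an illustrative model under stated
assumptions, not the hardware definition: the name pmulhrsw_element is
hypothetical, and the code assumes that right-shifting a negative signed
value is an arithmetic shift and that converting an out-of-range value to
int16_t wraps modulo 2^16 (both are implementation-defined in C, and both
hold on mainstream compilers such as GCC and Clang).

```c
#include <stdint.h>

/*
 * Hypothetical scalar model of PMULHRSW for a single pair of elements.
 * Assumes arithmetic right shift of negative values and wrapping
 * conversion to int16_t (implementation-defined behavior in C).
 */
static int16_t pmulhrsw_element(int16_t dst_in, int16_t src)
{
    /* Signed 32-bit intermediate product. */
    int32_t product = (int32_t)dst_in * (int32_t)src;

    /* Keep the 18 most significant bits of the 32-bit product. */
    int32_t top18 = product >> 14;

    /* Round by adding 1 to the least significant bit of the 18-bit value. */
    int32_t rounded = top18 + 1;

    /* Drop the rounding bit and keep bits 16..1 as the 16-bit result. */
    return (int16_t)(rounded >> 1);
}
```

If the inputs are read as fixed-point fractions with 15 fractional bits, the
result is the rounded product in the same format: pmulhrsw_element(16384,
16384) returns 8192, that is, 0.5 times 0.5 gives 0.25.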
12.6.5 Packed Shuffle Bytes
There are two packed-shuffle-bytes instructions (represented by one mnemonic).
One operates on 128-bit operands and the other operates on 64-bit operands. The
shuffle operations are performed bytewise on the destination operand using the
source operand as a control mask.
• PSHUFB permutes each byte in place, according to a shuffle control mask. The
least significant three bits (64-bit operands) or four bits (128-bit operands)
of each shuffle control byte of the control mask form the shuffle index. The
shuffle mask is unaffected. If the most significant bit (bit 7) of a shuffle
control byte is set, the constant zero is written to the result byte.
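The index selection and zeroing rules can be sketched as a scalar model in C.
This is an illustrative sketch of the 128-bit form only; the name
pshufb_model is hypothetical, and the model writes to a separate result array
(which must not alias dst_in) instead of permuting the destination in place.

```c
#include <stdint.h>

/*
 * Hypothetical scalar model of the 128-bit form of PSHUFB.
 * dst_in holds the 16 bytes of the destination operand,
 * mask holds the 16 shuffle control bytes of the source operand,
 * result receives the shuffled bytes and must not alias dst_in.
 */
static void pshufb_model(const uint8_t dst_in[16],
                         const uint8_t mask[16],
                         uint8_t result[16])
{
    for (int i = 0; i < 16; i++) {
        if (mask[i] & 0x80) {
            /* Bit 7 of the control byte set: write constant zero. */
            result[i] = 0;
        } else {
            /* Low four bits select the byte; bits 4 to 6 are ignored. */
            result[i] = dst_in[mask[i] & 0x0F];
        }
    }
}
```

For example, a mask holding the values 15 down to 0 reverses the byte order,
and replacing any control byte with 0x80 zeroes the corresponding result
byte. A model of the 64-bit form would loop over 8 bytes and use 0x07 as the
index mask.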