Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 1: Basic Architecture
PROGRAMMING WITH SSE3 AND SUPPLEMENTAL SSE3
• Thread synchronization instructions
— Two instructions that improve synchronization between multi-threaded
agents
The instructions are discussed in more detail in the following paragraphs.
12.3.1 x87 FPU Instruction for Integer Conversion
The FISTTP instruction (x87 FPU Store Integer and Pop with Truncation) behaves like
FISTP, but always uses truncation, regardless of the rounding mode specified in the
x87 FPU control word. The instruction converts the value at the top of the stack
(ST0) to an integer, rounding toward zero, and then pops the stack.
The FISTTP instruction is available in three precisions: short integer (word, 16-bit),
integer (doubleword, 32-bit), and long integer (64-bit). With FISTTP, applications
no longer need to change the x87 FPU control word (FCW) when truncation is required.
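The truncating conversion described above can be sketched in portable C. This is
only a model of the 32-bit form, not the instruction itself: the function name
fisttp32_model is hypothetical, and returning INT32_MIN for NaN and out-of-range
inputs is an assumption that mirrors the "integer indefinite" value stored when
the invalid-operation exception is masked.

```c
#include <stdint.h>

/* Hypothetical helper: models the 32-bit form of FISTTP in portable C.
 * The real instruction operates on ST0 and pops the x87 stack; here the
 * value is simply passed in as a double. */
int32_t fisttp32_model(double st0)
{
    /* Assumption: NaN and values whose truncated result does not fit in
     * 32 bits yield the integer indefinite value (0x80000000).
     * The negated comparison is also true for NaN. */
    if (!(st0 > -2147483649.0 && st0 < 2147483648.0))
        return INT32_MIN;

    /* A C floating-point to integer conversion truncates toward zero,
     * independent of the current rounding mode. */
    return (int32_t)st0;
}
```

For example, fisttp32_model(2.7) yields 2 and fisttp32_model(-2.7) yields -2.
Because the C conversion has the same truncating semantics, a compiler targeting
the x87 FPU can use FISTTP for such casts instead of saving, modifying, and
restoring the control word.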
12.3.2 SIMD Integer Instruction for Specialized 128-bit Unaligned Data Load
The LDDQU instruction is a special 128-bit unaligned load designed to avoid cache
line splits. If the address of a 16-byte load is aligned on a 16-byte boundary, LDDQU
loads the 16 bytes requested. If the address is not aligned on a 16-byte boundary,
LDDQU loads a 32-byte block starting at the 16-byte aligned address immediately
below the requested address, and then extracts the requested 16 bytes.
The instruction provides significant performance improvement on 128-bit unaligned
memory accesses at the cost of some usage model restrictions.
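The load-and-extract behavior described above can be modeled with a short C
sketch. It is an illustration under stated assumptions, not how the hardware is
implemented: lddqu_model is a hypothetical name, memory is simulated as a byte
array indexed by addr, and the caller is assumed to provide at least 32 readable
bytes starting at the aligned base address. In compiled code the instruction is
normally reached through the _mm_lddqu_si128 intrinsic in <pmmintrin.h>.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: models LDDQU on a simulated memory array.
 * mem  - simulated memory, assumed to start on a 16-byte boundary
 * addr - byte index of the requested 16-byte load
 * out  - receives the 16 requested bytes */
void lddqu_model(const uint8_t *mem, size_t addr, uint8_t out[16])
{
    size_t offset = addr & 15u;   /* distance above the 16-byte boundary */

    if (offset == 0) {
        /* Aligned request: load exactly the 16 bytes requested. */
        memcpy(out, mem + addr, 16);
        return;
    }

    /* Unaligned request: load the 32-byte block that starts at the
     * aligned address immediately below, then extract 16 bytes. */
    size_t base = addr - offset;
    uint8_t block[32];
    memcpy(block, mem + base, 32);
    memcpy(out, block + offset, 16);
}
```

Note that the unaligned path reads bytes outside the requested 16, which is one
way to see why the instruction comes with usage restrictions.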
12.3.3 SIMD Floating-Point Instructions That Enhance LOAD/MOVE/DUPLICATE Performance
The MOVSHDUP instruction loads/moves 128 bits, duplicating the second and fourth
32-bit data elements (elements 1 and 3).
• MOVSHDUP OperandA, OperandB
— OperandA (128 bits, four data elements): a3, a2, a1, a0
— OperandB (128 bits, four data elements): b3, b2, b1, b0
— Result (stored in OperandA): b3, b3, b1, b1
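The MOVSHDUP element mapping listed above can be checked with a small C model.
The vec4f type and movshdup_model function are hypothetical names introduced for
illustration; index 0 is taken to be the lowest 32-bit element. In compiled code
the instruction is normally reached through the _mm_movehdup_ps intrinsic in
<pmmintrin.h>.

```c
/* Hypothetical 128-bit vector model: e[0] is element 0 (lowest 32 bits),
 * e[3] is element 3 (highest 32 bits). */
typedef struct {
    float e[4];
} vec4f;

/* Models MOVSHDUP OperandA, OperandB: the result depends only on
 * OperandB and would be stored in OperandA. */
vec4f movshdup_model(vec4f b)
{
    /* Result, listed from element 0 up: b1, b1, b3, b3.
     * Read from high to low this is b3, b3, b1, b1. */
    vec4f r = { { b.e[1], b.e[1], b.e[3], b.e[3] } };
    return r;
}
```

With OperandB holding elements 0 through 3 equal to 10.0, 11.0, 12.0, and 13.0,
the result holds 11.0, 11.0, 13.0, and 13.0 in elements 0 through 3.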
The MOVSLDUP instruction loads/moves 128 bits, duplicating the first and third
32-bit data elements (elements 0 and 2).
• MOVSLDUP OperandA, OperandB
— OperandA (128 bits, four data elements): a3, a2, a1, a0