Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 1: Basic Architecture

12-4 Vol. 1

PROGRAMMING WITH SSE3 AND SUPPLEMENTAL SSE3

• Thread synchronization instructions

— Two instructions that improve synchronization between multi-threaded agents

The instructions are discussed in more detail in the following paragraphs.

12.3.1 x87 FPU Instruction for Integer Conversion

The FISTTP instruction (x87 FPU Store Integer and Pop with Truncation) behaves like FISTP, but always uses truncation, regardless of the rounding mode specified in the x87 FPU control word. The instruction converts the value at the top of the stack (ST0) to an integer by rounding toward zero, stores the result, and pops the stack.

The FISTTP instruction is available in three precisions: short integer (word or 16-bit), integer (double word or 32-bit), and long integer (64-bit). With FISTTP, applications no longer need to modify the x87 FPU control word (FCW) when truncation is required.
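The difference between the two store behaviors can be sketched in portable C. This is a minimal model, not the instruction itself: the function names `fisttp_model` and `fistp_nearest_model` are hypothetical, the second function assumes the control word is left at round-to-nearest (ties to even), and both assume the input lies well within the 32-bit integer range.

```c
#include <stdint.h>

/* Hypothetical model of FISTTP (32-bit form): always truncate toward zero.
 * A C floating-point-to-integer cast has the same rounding behavior.
 * Assumes st0 is within the range of int32_t. */
static int32_t fisttp_model(double st0) {
    return (int32_t)st0;
}

/* Hypothetical model of FISTP when the rounding mode is assumed to be
 * round-to-nearest, ties to even. Assumes st0 is well within int32_t range. */
static int32_t fistp_nearest_model(double st0) {
    int32_t t = (int32_t)st0;          /* truncated value */
    double frac = st0 - (double)t;     /* discarded fraction, in (-1, 1) */
    if (frac > 0.5 || (frac == 0.5 && (t & 1))) return t + 1;
    if (frac < -0.5 || (frac == -0.5 && (t & 1))) return t - 1;
    return t;
}
```

For an input of 2.7 the truncating model yields 2 while the round-to-nearest model yields 3; for -2.7 they yield -2 and -3 respectively. Obtaining the truncated result with FISTP would require changing the rounding-control field first, which is the step FISTTP removes.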

12.3.2 SIMD Integer Instruction for Specialized 128-bit Unaligned Data Load

The LDDQU instruction is a special 128-bit unaligned load designed to avoid cache line splits. If the address of a 16-byte load is on a 16-byte boundary, LDDQU loads the bytes requested. If the address of the load is not aligned on a 16-byte boundary, LDDQU loads a 32-byte block starting at the 16-byte aligned address immediately below the load request. It then extracts the requested 16 bytes.

The instruction provides significant performance improvement on 128-bit unaligned

memory accesses at the cost of some usage model restrictions.
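The load behavior described above can be modeled in portable C. This is only a sketch of the described semantics, not of how any processor implements the instruction; the function name `lddqu_model` is hypothetical. Note that in the unaligned case the model reads bytes outside the 16 requested ones, so it assumes the whole 32-byte aligned block is readable.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of the LDDQU load described in the text.
 * Assumption: when src is unaligned, the 32 bytes starting at the
 * 16-byte aligned address below src are all readable. */
static void lddqu_model(uint8_t dst[16], const uint8_t *src) {
    size_t off = (size_t)((uintptr_t)src & 15u);  /* offset within 16-byte line */

    if (off == 0) {
        /* Aligned: load exactly the 16 bytes requested. */
        memcpy(dst, src, 16);
        return;
    }

    /* Unaligned: load a 32-byte block from the aligned address
     * immediately below the request, then extract the 16 bytes. */
    const uint8_t *base = src - off;
    uint8_t block[32];
    memcpy(block, base, 32);
    memcpy(dst, block + off, 16);
}
```

With a 16-byte aligned buffer `buf`, calling `lddqu_model(out, buf + 21)` reads `buf[16]` through `buf[47]` and copies `buf[21]` through `buf[36]` into `out`. The wider read is what the model shares with the description above: data beyond the requested bytes is touched.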

12.3.3 SIMD Floating-Point Instructions That Enhance LOAD/MOVE/DUPLICATE Performance

The MOVSHDUP instruction loads/moves 128 bits, duplicating the second and fourth 32-bit data elements (elements 1 and 3).

• MOVSHDUP OperandA, OperandB

— OperandA (128 bits, four data elements): 3a, 2a, 1a, 0a

— OperandB (128 bits, four data elements): 3b, 2b, 1b, 0b

— Result (stored in OperandA): 3b, 3b, 1b, 1b
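The element shuffle performed by MOVSHDUP can be sketched in portable C. The type `vec4f` and the function `movshdup_model` are hypothetical names used only for illustration; index 0 is the lowest element, so the list above (written from element 3 down to element 0) appears in reverse order here. In real code the corresponding compiler intrinsic is `_mm_movehdup_ps` from `<pmmintrin.h>`.

```c
/* Hypothetical four-element vector; e[0] is the lowest 32-bit element. */
typedef struct { float e[4]; } vec4f;

/* Hypothetical model of MOVSHDUP: the odd-indexed source elements
 * (1 and 3) are each duplicated into the element below them. */
static vec4f movshdup_model(vec4f b) {
    vec4f r;
    r.e[0] = b.e[1];
    r.e[1] = b.e[1];
    r.e[2] = b.e[3];
    r.e[3] = b.e[3];
    return r;
}
```

The prior contents of the destination operand do not affect the result, which is why the model takes only the source as an argument.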

The MOVSLDUP instruction loads/moves 128 bits, duplicating the first and third 32-bit data elements (elements 0 and 2).

• MOVSLDUP OperandA, OperandB

— OperandA (128 bits, four data elements): 3a, 2a, 1a, 0a