Intel 64 and IA-32 Architectures Software Developers Manual Volume 1, Basic Architecture

11-36 Vol. 1

PROGRAMMING WITH STREAMING SIMD EXTENSIONS 2 (SSE2)

• Use of the 64-bit shift by bit instructions (PSRLQ, PSLLQ) can be extended to 128 bits in either of two ways:
— Use of PSRLQ and PSLLQ, along with masking logic operations.
— Rewriting the code sequence to use PSRLDQ and PSLLDQ (shift double quadword operand by bytes).
• Loop counters need to be updated, since each 128-bit SIMD integer instruction operates on twice the amount of data as its 64-bit SIMD integer counterpart.
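
The two points above can be sketched with SSE2 intrinsics in C. This is a minimal illustration, not a definitive implementation: the function names (shl128_bits, shl128_u64, add_i16) are hypothetical, the shift routine assumes a count between 0 and 63, and the loop assumes the element count is a multiple of 8.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Shift a full 128-bit value left by n bits (assumed 0 <= n <= 63).
   PSLLQ shifts each 64-bit lane separately, so the bits that leave the
   low lane are recovered with PSRLQ, moved into the high lane with
   PSLLDQ (a byte shift), and merged with POR. */
static __m128i shl128_bits(__m128i v, int n)
{
    __m128i cnt     = _mm_cvtsi32_si128(n);
    __m128i rcnt    = _mm_cvtsi32_si128(64 - n);
    __m128i shifted = _mm_sll_epi64(v, cnt);    /* PSLLQ  */
    __m128i carry   = _mm_srl_epi64(v, rcnt);   /* PSRLQ  */
    carry = _mm_slli_si128(carry, 8);           /* PSLLDQ by 8 bytes */
    return _mm_or_si128(shifted, carry);        /* POR    */
}

/* Convenience wrapper: in[0] is the low quadword, in[1] the high one. */
void shl128_u64(const uint64_t in[2], int n, uint64_t out[2])
{
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    _mm_storeu_si128((__m128i *)out, shl128_bits(v, n));
}

/* A 128-bit PADDW processes 8 words per iteration, where the 64-bit MMX
   form processed 4, so the loop counter advances by 8 instead of 4.
   Assumes n is a multiple of 8. */
void add_i16(int16_t *dst, const int16_t *a, const int16_t *b, size_t n)
{
    for (size_t i = 0; i < n; i += 8) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_add_epi16(va, vb));
    }
}
```

The unaligned load and store forms are used so the sketch works on any buffer; code that guarantees 16-byte alignment could use the aligned forms instead.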

11.6.12 Branching on Arithmetic Operations

There are no condition codes in SSE or SSE2 states. A packed-data comparison instruction generates a mask which can then be transferred to an integer register. The following code sequence provides an example of how to perform a conditional branch, based on the result of an SSE2 arithmetic operation.

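cmppd XMM0, XMM1, 0 ; generates a mask in XMM0 (imm8 = 0 selects the equal predicate, as an example)
movmskpd EAX, XMM0 ; moves a 2-bit mask to EAX
cmp EAX, 3 ; compare with desired result (3 = both elements true, as an example)
jne BRANCH_TARGET ; taken when the mask differs from the desired result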

The COMISD and UCOMISD instructions update the EFLAGS register according to the result of a scalar comparison. A conditional branch can then be scheduled immediately following COMISD/UCOMISD.
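
Both branching styles can be sketched in C with SSE2 intrinsics. The function names here (less_mask_pd, scalar_less_sd) are hypothetical, and the less-than predicate is chosen only for illustration; any CMPPD predicate follows the same pattern.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Packed path: CMPPD produces an all-ones or all-zeros mask per element,
   and MOVMSKPD copies the two sign bits into an integer.
   Bit 0 corresponds to element 0, bit 1 to element 1. */
int less_mask_pd(const double a[2], const double b[2])
{
    __m128d va   = _mm_loadu_pd(a);
    __m128d vb   = _mm_loadu_pd(b);
    __m128d mask = _mm_cmplt_pd(va, vb);  /* CMPPD, less-than predicate */
    return _mm_movemask_pd(mask);         /* MOVMSKPD */
}

/* Scalar path: COMISD compares the low elements and sets EFLAGS,
   so the result can be branched on directly. Returns 1 if a < b. */
int scalar_less_sd(double a, double b)
{
    return _mm_comilt_sd(_mm_set_sd(a), _mm_set_sd(b));  /* COMISD */
}
```

A caller would then branch on the integer result, for example testing whether the mask equals 3 (both elements true) or is nonzero (at least one element true).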

11.6.13 Cacheability Hint Instructions

SSE and SSE2 cacheability control instructions enable the programmer to control prefetching, caching, loading and storing of data. When correctly used, these instructions improve application performance.

To make efficient use of the processor’s super-scalar microarchitecture, an application needs to supply a steady stream of data to the executing code to avoid stalling the processor. PREFETCHh instructions minimize the latency of data accesses in performance-critical sections of application code by allowing data to be fetched into the processor cache hierarchy in advance of actual usage.

PREFETCHh instructions do not change the user-visible semantics of a program, although they may affect performance. The operation of these instructions is implementation-dependent. Programmers may need to tune code for each IA-32 processor implementation. Excessive usage of PREFETCHh instructions may waste memory bandwidth and reduce performance. For more detailed information on the use of prefetch hints, refer to Chapter 6, “Optimizing Cache Usage”, in the Intel® 64 and IA-32 Architectures Optimization Reference Manual.
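
As a rough sketch of how a prefetch hint might be placed in a loop, the C function below issues one PREFETCHT0 per assumed cache line, a fixed distance ahead of the element being read. The function name, the 64-byte line size, and the prefetch distance are all assumptions made for illustration; suitable values depend on the processor implementation and would need tuning.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0 */

#define DOUBLES_PER_LINE 8    /* assumes a 64-byte cache line */
#define PREFETCH_AHEAD   128  /* hypothetical distance in elements; tune per processor */

/* Sums an array while hinting that data a fixed distance ahead will be
   needed soon. The prefetch does not change the result, only (possibly)
   the time taken to compute it. */
double sum_with_prefetch(const double *data, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* One hint per assumed cache line, to avoid wasting bandwidth. */
        if ((i % DOUBLES_PER_LINE) == 0 && i + PREFETCH_AHEAD < n)
            _mm_prefetch((const char *)&data[i + PREFETCH_AHEAD], _MM_HINT_T0);
        sum += data[i];
    }
    return sum;
}
```

Removing the _mm_prefetch call leaves the returned sum unchanged, which reflects the point that prefetch hints affect performance but not program semantics.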

The non-temporal store instructions (MOVNTI, MOVNTPD, MOVNTPS, MOVNTDQ, MOVNTQ, MASKMOVQ, and MASKMOVDQU) minimize cache pollution when writing

non-temporal data to memory (see Section 10.4.6.2, “Caching of Temporal vs. Non-