Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 1: Basic Architecture

PROGRAMMING WITH STREAMING SIMD EXTENSIONS 2 (SSE2)
The 64-bit shift-by-bit instructions (PSRLQ, PSLLQ) can be extended to 128
bits in either of two ways:
• Use PSRLQ and PSLLQ, along with masking logic operations.
• Rewrite the code sequence to use PSRLDQ and PSLLDQ (shift double
quadword operand by bytes).
Loop counters need to be updated, since each 128-bit SIMD integer instruction
operates on twice as much data as its 64-bit SIMD integer counterpart.
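The first approach can be sketched in C with SSE2 intrinsics. This is a minimal illustration rather than code from the manual: the helper name shl128_bits is hypothetical, and the shift count is assumed to lie in the range 0 to 63.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: shift the full 128-bit value in v left by n bits.
 * Assumes 0 <= n < 64. */
static __m128i shl128_bits(__m128i v, int n)
{
    __m128i count     = _mm_cvtsi32_si128(n);       /* shift count in the low bits */
    __m128i inv_count = _mm_cvtsi32_si128(64 - n);

    /* PSLLQ: each 64-bit lane is shifted independently, so the bits that
     * leave the top of the low lane are lost. */
    __m128i shifted = _mm_sll_epi64(v, count);

    /* PSRLQ: recover the bits shifted out of each lane, right-aligned. */
    __m128i carry = _mm_srl_epi64(v, inv_count);

    /* PSLLDQ by 8 bytes: move the low lane's carry into the high lane.
     * The high lane's carry falls off the 128-bit value and is discarded. */
    carry = _mm_slli_si128(carry, 8);

    /* POR: merge the carry into the shifted result. */
    return _mm_or_si128(shifted, carry);
}
```

When the shift count is a multiple of 8 bits, the second approach is simpler: a single PSLLDQ or PSRLDQ (for example, `_mm_slli_si128(v, 2)` for a 16-bit shift) moves the whole 128-bit operand by bytes.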
11.6.12 Branching on Arithmetic Operations
There are no condition codes in SSE or SSE2 states. A packed-data comparison
instruction generates a mask which can then be transferred to an integer register.
The following code sequence provides an example of how to perform a conditional
branch, based on the result of an SSE2 arithmetic operation.
cmppd XMM0, XMM1, 0 ; generates a mask in XMM0 (imm8 predicate 0 = equal)
movmskpd EAX, XMM0 ; moves a 2-bit mask to EAX
test EAX, 3 ; check the two mask bits for the desired result
jne BRANCH_TARGET ; taken if the comparison held for either element
The COMISD and UCOMISD instructions update the EFLAGS register with the result
of a scalar comparison. A conditional branch can then be scheduled immediately
following COMISD/UCOMISD.
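The same two patterns can be written in C with SSE2 intrinsics. The sketch below is illustrative only: the function names are hypothetical, and the less-than predicate is an arbitrary choice for the example.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: packed compare followed by mask extraction.
 * Returns a 2-bit mask: bit 0 is set if a[0] < b[0], bit 1 if a[1] < b[1]. */
static int packed_lt_mask(__m128d a, __m128d b)
{
    __m128d mask = _mm_cmplt_pd(a, b);  /* CMPPD with the less-than predicate */
    return _mm_movemask_pd(mask);       /* MOVMSKPD: sign bits to an integer */
}

/* Hypothetical helper: scalar compare of the low elements.
 * COMISD sets EFLAGS directly, so no mask transfer is needed. */
static int low_lt(__m128d a, __m128d b)
{
    return _mm_comilt_sd(a, b);         /* 1 if a[0] < b[0], otherwise 0 */
}
```

A caller would then branch on the integer result, for example `if (packed_lt_mask(a, b) != 0) { ... }` to take the branch when the comparison holds for at least one element, or `== 3` to require both.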
11.6.13 Cacheability Hint Instructions
SSE and SSE2 cacheability control instructions enable the programmer to control
prefetching, caching, loading and storing of data. When correctly used, these
instructions can improve application performance.
To make efficient use of the processor’s super-scalar microarchitecture, an
application needs to supply a steady stream of data to the executing code to avoid
stalling the processor. PREFETCHh instructions minimize the latency of data accesses in
performance-critical sections of application code by allowing data to be fetched into
the processor cache hierarchy in advance of actual usage.
PREFETCHh instructions do not change the user-visible semantics of a program,
although they may affect performance. The operation of these instructions is
implementation-dependent. Programmers may need to tune code for each IA-32
processor implementation. Excessive usage of PREFETCHh instructions may waste
memory bandwidth and reduce performance. For more detailed information on the
use of prefetch hints, refer to Chapter 6, “Optimizing Cache Usage”, in the Intel® 64
and IA-32 Architectures Optimization Reference Manual.
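As a rough illustration of software prefetching, the following C sketch issues a PREFETCHT0 hint ahead of the element currently being processed. The function name, the prefetch distance, and the 64-byte cache line size are assumptions made for the example; suitable values depend on the processor implementation and must be found by measurement.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0 */

/* Assumed tuning values, not taken from the manual. */
#define PREFETCH_DISTANCE 64  /* elements to run ahead of the current index */
#define DOUBLES_PER_LINE   8  /* assumes a 64-byte cache line */

/* Hypothetical helper: sum an array while hinting upcoming data into cache. */
static double sum_with_prefetch(const double *data, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* Issue one hint per assumed cache line and stay inside the array.
         * The hint does not change the result, only (possibly) the timing. */
        if (i % DOUBLES_PER_LINE == 0 && i + PREFETCH_DISTANCE < n)
            _mm_prefetch((const char *)&data[i + PREFETCH_DISTANCE], _MM_HINT_T0);
        sum += data[i];
    }
    return sum;
}
```

For a simple sequential scan like this one, hardware prefetchers often make the hint unnecessary; the pattern is more relevant to access sequences that the hardware cannot predict.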
The non-temporal store instructions (MOVNTI, MOVNTPD, MOVNTPS, MOVNTDQ,
MOVNTQ, MASKMOVQ, and MASKMOVDQU) minimize cache pollution when writing
non-temporal data to memory (see Section 10.4.6.2, “Caching of Temporal vs. Non-