Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 1: Basic Architecture

PROGRAMMING WITH STREAMING SIMD EXTENSIONS (SSE)
The memory type of the region being written to can override the non-temporal hint
if the memory address specified for the non-temporal store is in uncacheable
memory. Uncacheable, as used here, means that the region being written to has
been mapped with either an uncacheable (UC) or a write-protected (WP) memory type.
In general, write-combining (WC) semantics require software to ensure coherence
with respect to other processors and other system agents (such as graphics cards).
Producer-consumer usage models must use appropriate synchronization and fencing.
Fencing ensures that all system agents have global visibility of the stored
data; for instance, failure to fence may result in a written cache line staying within a
processor and not being visible to other agents.
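The producer side of this pattern can be sketched with the SSE intrinsics. This
is a minimal illustration, not a definitive implementation: stream_fill and
produce are hypothetical helper names, the buffer is assumed to be 16-byte
aligned with a length that is a multiple of 4, and the ready flag stands in for
whatever synchronization the program actually uses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE: _mm_set1_ps, _mm_stream_ps, _mm_sfence */

/* Hypothetical helper: fill dst[0..n) with value using non-temporal stores.
 * Assumes dst is 16-byte aligned and n is a multiple of 4. */
static void stream_fill(float *dst, size_t n, float value)
{
    assert(((uintptr_t)dst & 15u) == 0);  /* MOVNTPS needs 16-byte alignment */
    assert(n % 4 == 0);

    __m128 v = _mm_set1_ps(value);
    for (size_t i = 0; i < n; i += 4) {
        _mm_stream_ps(dst + i, v);        /* non-temporal store (MOVNTPS) */
    }
    /* SFENCE: the streamed stores become globally visible before any
     * store that follows the fence in program order. */
    _mm_sfence();
}

/* Hypothetical producer: write the payload, then publish a ready flag. */
static void produce(float *buf, size_t n, float value, atomic_int *ready)
{
    stream_fill(buf, n, value);           /* ends with SFENCE */
    atomic_store_explicit(ready, 1, memory_order_release);
}
```

A consumer would wait for the flag to become nonzero before reading the buffer.
Without the fence, the flag could become visible while the streamed data is
still held in the processor's write-combining buffers.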
For processors that implement non-temporal stores by updating in place data that
already resides in the cache hierarchy, the destination region should also be mapped
as WC. If the region is mapped as WB or WT, speculative processor reads can
bring the data into the caches; non-temporal stores would then update the data
in place, and a subsequent fencing operation would not flush the data from the
processor.
The memory type visible on the bus in the presence of memory type aliasing is
implementation-specific. As one possible example, the memory type written to the bus
may reflect the memory type for the first store to this line, as seen in program order;
other alternatives are possible. This behavior should be considered reserved, and
dependence on the behavior of any particular implementation risks future
incompatibility.
10.4.6.3 PREFETCHh Instructions
The PREFETCHh instructions permit programs to load data into the processor at a
suggested cache level, so that the data is closer to the processor’s load and store unit
when it is needed. These instructions fetch 32 aligned bytes (or more, depending on
the implementation) containing the addressed byte to a location in the cache
hierarchy specified by the temporal locality hint (see Table 10-1). In this table, the
first-level cache is closest to the processor and second-level cache is farther away from
the processor than the first-level cache. The hints specify a prefetch of either
temporal or non-temporal data (see Section 10.4.6.2, “Caching of Temporal vs.
Non-Temporal Data”). Subsequent accesses to temporal data are treated like normal
accesses, while those to non-temporal data will continue to minimize cache pollution.
If the data is already present at a level of the cache hierarchy that is closer to the
processor, the PREFETCHh instruction will not result in any data movement. The
PREFETCHh instructions do not affect the functional behavior of the program.
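As a rough illustration of the text above, the following sketch issues a
prefetch hint ahead of a sequential read using the _mm_prefetch intrinsic. The
function name sum_with_prefetch and the distance PREFETCH_AHEAD are
hypothetical; the distance is an untuned assumption, and issuing one hint per
8 floats simply matches the 32-byte minimum fetch size mentioned above.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: _mm_prefetch, _MM_HINT_T0 */

/* Hypothetical tuning value: how many elements ahead to prefetch. */
#define PREFETCH_AHEAD 64

/* Sum a[0..n), hinting the processor to fetch data that will be read soon. */
static float sum_with_prefetch(const float *a, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        /* One hint per 8 floats (32 bytes); stay inside the array bounds. */
        if (i % 8 == 0 && i + PREFETCH_AHEAD < n) {
            _mm_prefetch((const char *)(a + i + PREFETCH_AHEAD), _MM_HINT_T0);
        }
        sum += a[i];
    }
    return sum;
}
```

Because the prefetch is only a hint, removing the _mm_prefetch call leaves the
result unchanged; only the timing may differ. Replacing _MM_HINT_T0 with
_MM_HINT_NTA would request the non-temporal variant instead.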
See Section 11.6.13, “Cacheability Hint Instructions,” for additional information
about the PREFETCHh instructions.