Intel 64 and IA-32 Architectures Software Developers Manual Volume 3A, System Programming Guide, Part 1

MEMORY CACHE CONTROL
10.3.1 Buffering of Write Combining Memory Locations
Writes to the WC memory type are not cached in the usual sense of the word.
They are retained in an internal write combining buffer (WC buffer) that is
separate from the internal L1, L2, and L3 caches and the store buffer. The WC buffer
is not snooped and thus does not provide data coherency. Buffering of writes to WC
memory is done to allow software a small window of time to supply more modified
data to the WC buffer while remaining as non-intrusive to software as possible. The
buffering of writes to WC memory also causes data to be collapsed; that is, multiple
writes to the same memory location will leave the last data written in the location and
the other writes will be lost.
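
The collapsing behavior described above can be illustrated with a small software model. This is a sketch only: the type wc_buffer_model, the functions wc_model_init and wc_model_store, and the 64-byte line size are hypothetical names and assumptions introduced for illustration, not a hardware interface or a description of any processor's actual implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WC_LINE_BYTES 64u  /* assumed buffer size; the text gives 32 for P6 */

/* Hypothetical software model of a single WC buffer. */
typedef struct {
    uint64_t base;                 /* line-aligned address covered by the buffer */
    uint8_t  data[WC_LINE_BYTES];  /* most recent value written to each byte */
    uint64_t valid;                /* bit i set: byte i has been written */
} wc_buffer_model;

/* Start with an empty buffer that covers [base, base + WC_LINE_BYTES). */
static void wc_model_init(wc_buffer_model *b, uint64_t base)
{
    assert(base % WC_LINE_BYTES == 0);
    b->base = base;
    b->valid = 0;
    memset(b->data, 0, sizeof b->data);
}

/* Record a store of len bytes at addr. A later store to the same bytes
 * overwrites the earlier one in place, so only the last value survives. */
static void wc_model_store(wc_buffer_model *b, uint64_t addr,
                           const void *src, size_t len)
{
    assert(addr >= b->base);
    assert(addr - b->base + len <= WC_LINE_BYTES);
    size_t off = (size_t)(addr - b->base);
    memcpy(&b->data[off], src, len);
    for (size_t i = 0; i < len; i++)
        b->valid |= (uint64_t)1 << (off + i);
}
```

In this model, two 4-byte stores to the same address leave only the second value in the buffer, and only that value would reach memory when the buffer is evicted. The earlier value never appears on the bus.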
The size and structure of the WC buffer are not architecturally defined. For the Intel
Core 2 Duo, Intel Core Duo, Pentium M, Pentium 4, and Intel Xeon processors, the WC
buffer is made up of several 64-byte WC buffers. For the P6 family processors, the
WC buffer is made up of several 32-byte WC buffers.
When software begins writing to WC memory, the processor begins filling the WC
buffers one at a time. When one or more WC buffers have been filled, the processor
has the option of evicting the buffers to system memory. The protocol for evicting the
WC buffers is implementation dependent and should not be relied on by software for
system memory coherency. When using the WC memory type, software must be
sensitive to the fact that the writing of data to system memory is being delayed and
must deliberately empty the WC buffers when system memory coherency is
required.
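
One common way for software to empty the WC buffers deliberately is to execute a store fence after the writes. The sketch below assumes an x86 compiler with SSE2 intrinsics and uses non-temporal stores (_mm_stream_si32), which use the write combining protocol even on ordinary write-back memory, so the example can run without mapping a WC region. The function name wc_copy_and_flush is hypothetical. Treat it as an illustration, not as the only or required flushing method.

```c
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* _mm_stream_si32 (SSE2); also provides _mm_sfence */

/* Copy n 32-bit words with non-temporal stores, then drain the WC buffers.
 * Until the fence executes, some of the stored data may still be held in
 * WC buffers and not yet be visible in system memory. */
static void wc_copy_and_flush(uint32_t *dst, const uint32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        _mm_stream_si32((int *)&dst[i], (int)src[i]);

    /* SFENCE makes all earlier stores, including those pending in WC
     * buffers, globally visible before any later store. */
    _mm_sfence();
}
```

A caller that hands dst to another agent (for example, by writing a "ready" flag or a device doorbell) would perform that write only after wc_copy_and_flush returns, so that the data cannot be overtaken by the notification.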
Once the processor has started to evict data from the WC buffer into system
memory, it decides which style of bus transaction to use based on how much of the
buffer contains valid data. If the buffer is full (that is, all bytes are valid), the
processor will execute a burst-write transaction on the bus. This results in all 32
bytes (P6 family processors) or 64 bytes (Pentium 4 and more recent processors)
being transmitted on the data bus in a single burst transaction. If one or more of the
WC buffer’s bytes are invalid (that is, have not been written by software), the
processor will transmit the data to memory using “partial write” transactions (one
chunk at a time, where a “chunk” is 8 bytes).
This will result in a maximum of 4 partial write transactions (for P6 family processors)
or 8 partial write transactions (for the Pentium 4 and more recent processors) for one
WC buffer of data sent to memory.
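
The arithmetic above can be sketched as a small function that takes a per-byte valid mask and reports which transactions an eviction would use. The names wc_eviction and wc_model_eviction are hypothetical. The model also assumes that a chunk containing no valid bytes generates no transaction; the text only states the maximum count, so that detail is an assumption.

```c
#include <assert.h>
#include <stdint.h>

#define WC_CHUNK_BYTES 8u  /* a "chunk" is 8 bytes, as stated in the text */

/* Hypothetical summary of the bus transactions for one buffer eviction. */
typedef struct {
    unsigned burst_writes;    /* 1 if the whole buffer goes out as one burst */
    unsigned partial_writes;  /* number of 8-byte partial write transactions */
} wc_eviction;

/* valid: bit i set means byte i of the buffer was written by software.
 * line_bytes: 32 (P6 family) or 64 (Pentium 4 and more recent). */
static wc_eviction wc_model_eviction(uint64_t valid, unsigned line_bytes)
{
    assert(line_bytes == 32 || line_bytes == 64);

    wc_eviction e = {0, 0};
    uint64_t full = (line_bytes == 64) ? UINT64_MAX
                                       : (((uint64_t)1 << line_bytes) - 1);
    valid &= full;

    if (valid == full) {      /* every byte valid: single burst write */
        e.burst_writes = 1;
        return e;
    }

    /* Otherwise one partial write per chunk that holds any valid byte
     * (assumption: chunks with no valid bytes are skipped). */
    for (unsigned c = 0; c < line_bytes / WC_CHUNK_BYTES; c++) {
        if ((valid >> (c * WC_CHUNK_BYTES)) & 0xFFu)
            e.partial_writes++;
    }
    return e;
}
```

For a 64-byte buffer with only the last byte unwritten, this model reports 8 partial writes in place of 1 burst. That is why software writing to WC memory generally benefits from filling whole buffers.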
The WC memory type is weakly ordered by definition. Once the eviction of a WC
buffer has started, the data is subject to the weak ordering semantics of its
definition. Ordering is not maintained between the successive allocation/deallocation of
WC buffers (for example, writes to WC buffer 1 followed by writes to WC buffer 2 may
appear as buffer 2 followed by buffer 1 on the system bus). When a WC buffer is
evicted to memory as partial writes, there is no guaranteed ordering between
successive partial writes (for example, a partial write for chunk 2 may appear on the bus
before the partial write for chunk 1 or vice versa).
The only elements of WC propagation to the system bus that are guaranteed are
those provided by transaction atomicity. For example, with a P6 family processor, a