Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 1: Basic Architecture

wasted decode bandwidth due to branches or branch targets in the middle of
cache lines
The operation of the pipeline’s trace cache addresses these issues. Instructions are
constantly being fetched and decoded by the translation engine (part of the
fetch/decode logic) and built into sequences of µops called traces. At any time,
multiple traces (representing prefetched branches) are being stored in the trace
cache. The trace cache is searched for the instruction that follows the active branch.
If the instruction also appears as the first instruction in a pre-fetched branch, the
fetch and decode of instructions from the memory hierarchy ceases and the
pre-fetched branch becomes the new source of instructions (see Figure 2-2).
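
The lookup described above can be pictured with a small software model. This
is only a sketch under simplifying assumptions: the real trace cache is a
hardware structure, and the names TraceCache, build_trace, lookup, and
next_uops are hypothetical, chosen for illustration.

```python
# Toy model only. All names here are hypothetical; the real trace cache is
# hardware and is not organized as a Python dictionary.

class TraceCache:
    def __init__(self):
        # Maps the address of a trace's first instruction to its µop sequence.
        self.traces = {}

    def build_trace(self, start_address, uops):
        """Store a decoded µop sequence keyed by its first instruction."""
        self.traces[start_address] = list(uops)

    def lookup(self, next_address):
        """Return the stored trace that starts at next_address, or None."""
        return self.traces.get(next_address)


def next_uops(cache, next_address, decode_from_memory):
    """Use a stored trace on a hit; otherwise fall back to fetch and decode."""
    trace = cache.lookup(next_address)
    if trace is not None:
        return "trace cache", trace
    return "fetch/decode", decode_from_memory(next_address)
```

On a hit, the stored trace becomes the source of µops and the fetch/decode
path is not consulted; on a miss, the model falls back to decoding from memory.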
The trace cache and the translation engine have cooperating branch prediction
hardware. Branch targets are predicted based on their linear addresses using branch
target buffers (BTBs) and fetched as soon as possible.
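
As a rough illustration of predicting a target from a branch's linear address,
the sketch below models a BTB as a simple lookup table. It is an assumption-laden
toy: real BTBs have finite capacity and more elaborate prediction logic, and the
class name, method names, and the fall-through default are all hypothetical.

```python
class BranchTargetBuffer:
    """Toy BTB: remembers the last target seen for each branch address."""

    def __init__(self):
        # branch linear address -> most recently recorded target address
        self.targets = {}

    def predict(self, branch_address, fall_through):
        # Assumption: with no recorded target, predict the fall-through path.
        return self.targets.get(branch_address, fall_through)

    def update(self, branch_address, actual_target):
        # Record the resolved target so later predictions can use it.
        self.targets[branch_address] = actual_target
```

A predicted target obtained this way is what allows fetching to begin before
the branch itself has been resolved.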
2.2.2.2 Out-Of-Order Execution Core
The out-of-order execution core’s ability to execute instructions out of order is a key
factor in enabling parallelism. This feature enables the processor to reorder
instructions so that if one µop is delayed, other µops may proceed around it. The processor
employs several buffers to smooth the flow of µops.
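
The idea of µops proceeding around a delayed one can be sketched with a
dataflow model in which a µop starts as soon as its inputs are available.
This is an illustrative simplification that assumes unlimited execution
resources; the function name schedule and the (name, deps, latency) tuple
format are hypothetical.

```python
def schedule(uops):
    """Return {name: start_cycle} for µops listed in program order.

    uops: list of (name, deps, latency), where deps names earlier µops
    whose results this µop needs. Assumes unlimited execution resources.
    """
    start = {}
    finish = {}
    for name, deps, latency in uops:
        # A µop can start once every µop it depends on has finished.
        begin = max((finish[d] for d in deps), default=0)
        start[name] = begin
        finish[name] = begin + latency
    return start


# A slow load delays the dependent add, but the independent mul is not held up.
program = [("load", [], 10), ("add", ["load"], 1), ("mul", [], 3)]
print(schedule(program))  # {'load': 0, 'add': 10, 'mul': 0}
```

Although mul follows add in program order, it starts in cycle 0 because it
does not wait on the load.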
The core is designed to facilitate parallel execution. It can dispatch up to six µops per
cycle (this exceeds trace cache and retirement µop bandwidth). Most pipelines can
start executing a new µop every cycle, so several instructions can be in flight at a
time for each pipeline. A number of arithmetic logic unit (ALU) instructions can
start at a rate of two per cycle; many floating-point instructions can start once
every two cycles.
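
The start rates quoted above translate into simple arithmetic. The sketch
below counts how many cycles one pipeline needs to start a run of independent
µops at a given rate; the function name cycles_to_start and its parameters
are hypothetical, and dependencies and resource conflicts are ignored.

```python
def cycles_to_start(n_uops, starts, per_cycles):
    """Cycles needed to start n_uops independent µops on one pipeline.

    The pipeline is assumed to start `starts` µops every `per_cycles` cycles.
    """
    if n_uops == 0:
        return 0
    # The k-th µop (counting from 0) starts in cycle floor(k * per_cycles / starts).
    last_start_cycle = ((n_uops - 1) * per_cycles) // starts
    return last_start_cycle + 1


print(cycles_to_start(8, 2, 1))  # two per cycle: 4
print(cycles_to_start(8, 1, 1))  # one per cycle: 8
print(cycles_to_start(8, 1, 2))  # one every two cycles: 15
```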
2.2.2.3 Retirement Unit
The retirement unit receives the results of the executed µops from the out-of-order
execution core and processes the results so that the architectural state updates
according to the original program order.
When a µop completes and writes its result, it is retired. Up to three µops may be
retired per cycle. The Reorder Buffer (ROB) is the unit in the processor which buffers
completed µops, updates the architectural state in order, and manages the ordering
of exceptions. The retirement section also keeps track of branches and sends
updated branch target information to the BTB. The BTB then purges pre-fetched
traces that are no longer needed.
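
In-order retirement with a limit of three µops per cycle can be sketched as a
queue that releases entries only from its head. This is a minimal model, not
the hardware design: the entry layout and the names RETIRE_WIDTH and
retire_cycle are hypothetical, and exception handling and BTB updates are
omitted.

```python
from collections import deque

RETIRE_WIDTH = 3  # the text states that up to three µops may retire per cycle


def retire_cycle(rob):
    """Retire up to RETIRE_WIDTH completed µops from the head of the ROB.

    rob: deque of [name, completed] entries held in program order.
    Retirement stops at the first µop that has not completed, so the
    architectural state is always updated in program order.
    """
    retired = []
    while rob and rob[0][1] and len(retired) < RETIRE_WIDTH:
        retired.append(rob.popleft()[0])
    return retired


rob = deque([["a", True], ["b", False], ["c", True], ["d", True]])
print(retire_cycle(rob))  # ['a']  (b has not completed, so c and d must wait)
rob[0][1] = True          # b completes
print(retire_cycle(rob))  # ['b', 'c', 'd']
```

Even though c and d finished before b, they leave the buffer only after b does,
which is what keeps exceptions and state updates in program order.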