Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 1, Basic Architecture


• wasted decode bandwidth due to branches or branch targets in the middle of cache lines

The operation of the pipeline’s trace cache addresses these issues. Instructions are constantly being fetched and decoded by the translation engine (part of the fetch/decode logic) and built into sequences of µops called traces. At any time, multiple traces (representing prefetched branches) are being stored in the trace cache. The trace cache is searched for the instruction that follows the active branch. If the instruction also appears as the first instruction in a prefetched branch, the fetch and decode of instructions from the memory hierarchy ceases and the prefetched branch becomes the new source of instructions (see Figure 2-2).
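
The lookup described above can be pictured with a small software model. This is only a sketch under stated assumptions, not a description of the hardware: the names `TraceCache`, `next_uops`, and the `decode` callback are hypothetical, and a real trace cache has a fixed capacity and a replacement policy that this model ignores.

```python
from typing import Callable, Dict, List, Optional, Tuple


class TraceCache:
    """Toy model: maps the address of a trace's first instruction to its µops."""

    def __init__(self) -> None:
        self._traces: Dict[int, List[str]] = {}

    def store(self, start_addr: int, uops: List[str]) -> None:
        # Record an already-decoded trace under its first instruction's address.
        self._traces[start_addr] = list(uops)

    def lookup(self, next_addr: int) -> Optional[List[str]]:
        # Return the stored trace that begins at next_addr, if any.
        return self._traces.get(next_addr)


def next_uops(
    cache: TraceCache,
    next_addr: int,
    decode: Callable[[int], List[str]],
) -> Tuple[str, List[str]]:
    """Pick the µop source for the instruction that follows the active branch."""
    trace = cache.lookup(next_addr)
    if trace is not None:
        # Hit: fetch/decode from the memory hierarchy is skipped.
        return "trace-cache", trace
    # Miss: fall back to the (hypothetical) fetch/decode path.
    return "decoder", decode(next_addr)
```

In this model a hit returns the stored µops without ever calling `decode`, which mirrors the statement that fetch and decode from the memory hierarchy cease once a matching trace is found.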

The trace cache and the translation engine have cooperating branch prediction hardware. Branch targets are predicted based on their linear addresses using branch target buffers (BTBs) and fetched as soon as possible.
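
To illustrate prediction keyed on a branch's linear address, the following sketch models a BTB as a small direct-mapped table. The organization shown (entry count, modulo indexing, full-address tags) is an assumption made for illustration; the text above does not specify how the hardware BTBs are organized, and the class name `BranchTargetBuffer` is hypothetical.

```python
from typing import List, Optional, Tuple


class BranchTargetBuffer:
    """Toy direct-mapped BTB: linear address of a branch -> predicted target."""

    def __init__(self, entries: int = 16) -> None:
        self.entries = entries
        # Each slot holds (branch_address, target) or None when empty.
        self.table: List[Optional[Tuple[int, int]]] = [None] * entries

    def _index(self, addr: int) -> int:
        # Assumed indexing scheme: low-order part of the linear address.
        return addr % self.entries

    def predict(self, addr: int) -> Optional[int]:
        slot = self.table[self._index(addr)]
        if slot is not None and slot[0] == addr:
            return slot[1]  # predicted target for this branch
        return None  # no prediction available

    def update(self, addr: int, target: int) -> None:
        # Overwrites whatever branch previously occupied the slot.
        self.table[self._index(addr)] = (addr, target)
```

Because the table is direct-mapped in this model, two branches whose addresses share an index evict each other; real predictors use more elaborate structures to reduce such conflicts.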

2.2.2.2 Out-Of-Order Execution Core

The out-of-order execution core’s ability to execute instructions out of order is a key factor in enabling parallelism. This feature enables the processor to reorder instructions so that if one µop is delayed, other µops may proceed around it. The processor employs several buffers to smooth the flow of µops.

The core is designed to facilitate parallel execution. It can dispatch up to six µops per cycle (this exceeds trace cache and retirement µop bandwidth). Most pipelines can start executing a new µop every cycle, so several instructions can be in flight at a time for each pipeline. A number of arithmetic logic unit (ALU) instructions can start at a rate of two per cycle; many floating-point instructions can start once every two cycles.
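
The idea that a delayed µop does not hold up independent µops can be shown with a toy scheduler. This is a minimal sketch, assuming each µop has a fixed latency and a list of producer µops; the `Uop` type and `schedule` function are hypothetical, and only the dispatch width of six is taken from the text above.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Uop:
    name: str
    latency: int                 # cycles until the result is available (>= 1)
    deps: Tuple[str, ...] = ()   # names of µops whose results are needed


def schedule(uops: List[Uop], width: int = 6) -> Dict[str, int]:
    """Return the cycle in which each µop starts executing.

    Assumes every name in `deps` refers to a µop present in `uops`.
    """
    done_at: Dict[str, int] = {}  # name -> cycle when its result is ready
    start: Dict[str, int] = {}
    pending = list(uops)          # kept in program order
    cycle = 0
    while pending:
        issued = 0
        for u in list(pending):   # iterate over a copy; pending is mutated
            if issued == width:
                break
            ready = all(d in done_at and done_at[d] <= cycle for d in u.deps)
            if ready:
                start[u.name] = cycle
                done_at[u.name] = cycle + u.latency
                pending.remove(u)
                issued += 1
        cycle += 1
    return start


program = [
    Uop("load", latency=3),
    Uop("add", latency=1, deps=("load",)),  # must wait for the load
    Uop("mul", latency=1),                  # independent of both
]
starts = schedule(program)
```

Here `add` waits until cycle 3 for the load's result, while `mul`, which comes later in program order, starts in cycle 0 alongside the load.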

2.2.2.3 Retirement Unit

The retirement unit receives the results of the executed µops from the out-of-order execution core and processes the results so that the architectural state updates according to the original program order.

When a µop completes and writes its result, it is retired. Up to three µops may be retired per cycle. The Reorder Buffer (ROB) is the unit in the processor that buffers completed µops, updates the architectural state in order, and manages the ordering of exceptions. The retirement section also keeps track of branches and sends updated branch target information to the BTB. The BTB then purges prefetched traces that are no longer needed.
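
In-order retirement with a limit of three µops per cycle can be sketched as a queue drained from its head. The model below is illustrative only: the function `retire` and its inputs are hypothetical, and it assumes a µop may retire in the same cycle in which it completes, which the text above does not state.

```python
from collections import deque
from typing import Dict, List

RETIRE_WIDTH = 3  # up to three µops retired per cycle, per the text


def retire(rob_order: List[str], completed_at: Dict[str, int]) -> Dict[str, int]:
    """Return the cycle in which each µop retires.

    rob_order    -- µop names in original program order
    completed_at -- cycle in which each µop finished executing
    """
    rob = deque(rob_order)
    retired: Dict[str, int] = {}
    cycle = 0
    while rob:
        count = 0
        # Only the oldest µop may retire; younger completed µops wait behind it.
        while rob and count < RETIRE_WIDTH and completed_at[rob[0]] <= cycle:
            retired[rob.popleft()] = cycle
            count += 1
        cycle += 1
    return retired


result = retire(
    ["a", "b", "c", "d", "e"],
    {"a": 2, "b": 0, "c": 0, "d": 0, "e": 0},
)
```

Although `b` through `e` complete in cycle 0, none can retire before `a` does in cycle 2; the width limit then pushes `d` and `e` to cycle 3.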