user manual

132 AMD Athlon Processor Microarchitecture
AMD Athlon Processor x86 Code Optimization
22007E/0November 1999
replacement is based on a least-recently used (LRU)
replacement algorithm.
The L1 instruction cache has an associated two-level translation
look-aside buffer (TLB) structure. The first-level TLB is fully
associative and contains 24 entries (16 that map 4-Kbyte pages
and eight that map 2-Mbyte or 4-Mbyte pages). The second-level
TLB is four-way set associative and contains 256 entries, which
can map 4-Kbyte pages.
Predecode
Predecoding begins as the L1 instruction cache is filled.
Predecode information is generated and stored alongside the
instruction cache. This information is used to help efficiently
identify the boundaries between variable length x86
instructions, to distinguish DirectPath from VectorPath
early-decode instructions, and to locate the opcode byte in each
instruction. In addition, the predecode logic detects code
branches such as CALLs, RETURNs and short unconditional
JMPs. When a branch is detected, predecoding begins at the
target of the branch.
Branch Prediction
The fetch logic accesses the branch prediction table in parallel
with the instruction cache and uses the information stored in
the branch prediction table to predict the direction of branch
instructions.
The AMD Athlon processor employs combinations of a branch
target address buffer (BTB), a global history bimodal counter
(GHBC) table, and a return address stack (RAS) hardware in
order to predict and accelerate branches. Predicted-taken
branches incur only a single-cycle delay to redirect the
instruction fetcher to the target instruction. In the event of a
mispredict, the minimum penalty is ten cycles.
The BTB is a 2048-entry table that caches in each entry the
predicted target address of a branch.
In addition, the AMD Athlon processor implements a 12-entry
return address stack to predict return addresses from a near or
far call. As CALLs are fetched, the next EIP is pushed onto the