Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3B: System Programming Guide, Part 2

DEBUGGING AND PERFORMANCE MONITORING
The extended cascading feature can be adapted to the sampling usage model for
performance monitoring. However, because of an erratum, performance counters do
not generate a PMI in cascade mode or extended cascade mode. This
erratum applies to Pentium 4 and Intel Xeon processors with model encoding of 2.
For Pentium 4 and Intel Xeon processors with model encoding of 0 and 1, the erratum
applies to processors with stepping encoding greater than 09H.
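
The applicability rule above can be written as a small predicate. This is a minimal sketch, not vendor code: the function name is hypothetical, the model and stepping encodings are assumed to have been read already (for example from CPUID), and returning False for models other than 0, 1, and 2 is an assumption, since the text does not cover them.

```python
def pmi_cascade_erratum_applies(model: int, stepping: int) -> bool:
    """Return True if the cascade-mode PMI erratum described above applies.

    `model` and `stepping` are the model and stepping encodings of a
    Pentium 4 or Intel Xeon processor. Obtaining them is outside this sketch.
    """
    if model == 2:
        return True               # the text names model 2 without a stepping limit
    if model in (0, 1):
        return stepping > 0x09    # only steppings greater than 09H
    return False                  # assumption: other models are not affected


if __name__ == "__main__":
    for model, stepping in [(2, 0x04), (1, 0x09), (0, 0x0A)]:
        print(model, hex(stepping), pmi_cascade_erratum_applies(model, stepping))
```

A sampling tool could call such a check before relying on a PMI from a cascaded counter and fall back to a non-cascaded configuration when it returns True.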
Counters 16 and 17 in the IQ block are frequently used in precise event-based
sampling or at-retirement counting of events indicating a stalled condition in the
pipeline. Neither counter 16 nor counter 17 can initiate the cascading of counter
pairs using the cascade bit in a CCCR.
Extended cascading permits performance monitoring tools to use counters 16 and 17
to initiate cascading of two counters in the IQ block. Extended cascading from
counters 16 and 17 is conceptually similar to cascading other counters, but instead of
using the CASCADE bit of a CCCR, one of the four CASCNTxINTOy bits is used.
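
The CASCNTxINTOy naming can be read as "counter x of the IQ block cascades into counter y of the IQ block". The sketch below parses such a name and converts the block-relative indices to the global counter numbers used in this section. The helper name is hypothetical, and the offset of 12 (IQ counter 0 is counter 12, so IQ counters 4 and 5 are counters 16 and 17) is an assumption inferred from the numbering used in this section.

```python
import re
from typing import Tuple

# Assumption: the IQ block's counters 0..5 are global counters 12..17.
IQ_BLOCK_FIRST_COUNTER = 12


def parse_cascnt_bit(name: str) -> Tuple[int, int]:
    """Map a CASCNTxINTOy bit name to (source, destination) global counters."""
    match = re.fullmatch(r"CASCNT(\d)INTO(\d)", name)
    if match is None:
        raise ValueError(f"not a CASCNTxINTOy name: {name!r}")
    source = IQ_BLOCK_FIRST_COUNTER + int(match.group(1))
    destination = IQ_BLOCK_FIRST_COUNTER + int(match.group(2))
    return source, destination


if __name__ == "__main__":
    # Counter 16 (IQ counter 4) cascades into counter 12 (IQ counter 0).
    print(parse_cascnt_bit("CASCNT4INTO0"))
```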
Example 18-2. Scenario for Extended Cascading
A usage scenario for extended cascading is to sample instructions retired on logical
processor 1 after the first 4096 instructions retired on logical processor 0. A
procedure to program extended cascading in this scenario is outlined below:
1. Write the value 0 to counter 12.
2. Write the value 04000603H to MSR_CRU_ESCR0 (corresponding to selecting the
NBOGNTAG and NBOGTAG event masks with qualification restricted to logical
processor 1).
3. Write the value 04038800H to MSR_IQ_CCCR0. This enables CASCNT4INTO0
and OVF_PMI. An ISR can sample on instruction addresses in this case (do not
set ENABLE or CASCADE).
4. Write the value FFFFF000H into counter 16.
5. Write the value 0400060CH to MSR_CRU_ESCR2 (corresponding to selecting the
NBOGNTAG and NBOGTAG event masks with qualification restricted to logical
processor 0).
6. Write the value 00039000H to MSR_IQ_CCCR4 (set ENABLE bit, but not
OVF_PMI).
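
The values in the six steps above can be checked by decoding them into bit fields. The sketch below is illustrative only: the field positions are assumptions based on the general Pentium 4 ESCR and CCCR layouts and are not stated in this section, the function names are hypothetical, and the actual register writes (WRMSR in privileged code) are not shown. The last check only observes that FFFFF000H equals -4096 as a 32-bit two's complement value, matching the 4096-instruction threshold; how the hardware extends the value to the full counter width is not covered here.

```python
def bits(value: int, low: int, high: int) -> int:
    """Extract bits low..high (inclusive) of value."""
    return (value >> low) & ((1 << (high - low + 1)) - 1)


def decode_escr(value: int) -> dict:
    """Decode an ESCR value (assumed layout, see lead-in)."""
    return {
        "T1_USR": bits(value, 0, 0),
        "T1_OS": bits(value, 1, 1),
        "T0_USR": bits(value, 2, 2),
        "T0_OS": bits(value, 3, 3),
        "event_mask": bits(value, 9, 24),
        "event_select": bits(value, 25, 30),
    }


def decode_cccr(value: int) -> dict:
    """Decode a CCCR value (assumed layout, see lead-in)."""
    return {
        "bit_11": bits(value, 11, 11),   # CASCNT4INTO0 when used in MSR_IQ_CCCR0
        "ENABLE": bits(value, 12, 12),
        "escr_select": bits(value, 13, 15),
        "FORCE_OVF": bits(value, 25, 25),
        "OVF_PMI_T0": bits(value, 26, 26),
        "CASCADE": bits(value, 30, 30),
    }


# The procedure as data: (step, target, value), copied from the list above.
PROCEDURE = [
    (1, "counter 12", 0x00000000),
    (2, "MSR_CRU_ESCR0", 0x04000603),
    (3, "MSR_IQ_CCCR0", 0x04038800),
    (4, "counter 16", 0xFFFFF000),
    (5, "MSR_CRU_ESCR2", 0x0400060C),
    (6, "MSR_IQ_CCCR4", 0x00039000),
]

if __name__ == "__main__":
    for step, target, value in PROCEDURE:
        if "ESCR" in target:
            print(step, target, decode_escr(value))
        elif "CCCR" in target:
            print(step, target, decode_cccr(value))
        else:
            print(step, target, f"{value:08X}H")
    # Preload for an overflow after 4096 counts, seen as a 32-bit value.
    print(0xFFFFF000 == (1 << 32) - 4096)
```

Under these assumed layouts, the two ESCR values differ only in the low four bits: step 2 sets the two logical-processor-1 qualifier bits and step 5 sets the two logical-processor-0 bits, while both select the same event with the low two event mask bits set. The CCCR in step 3 has bit 11 and OVF_PMI set with ENABLE clear, and the CCCR in step 6 has ENABLE set with OVF_PMI clear, which matches the description in the steps.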
Another use for cascading is to locate stalled execution in a multithreaded
application. Assume MOB replays in thread B cause thread A to stall. Getting a sample of the
stalled execution in this scenario could be accomplished by:
1. Set up counter B to count MOB replays on thread B.
2. Set up counter A to count resource stalls on thread A; set its force overflow bit
and the appropriate CASCNTxINTOy bit.
3. Use the performance monitoring interrupt to capture the program execution data
of the stalled thread.
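
Step 2 of this second scenario amounts to composing a CCCR value with the force overflow bit and a CASCNTxINTOy bit set. The following is a minimal sketch of such a composition, under loud assumptions: the bit positions (ENABLE at bit 12, ESCR select in bits 13 to 15, bits 16 and 17 always set, FORCE_OVF at bit 25, OVF_PMI for logical processor 0 at bit 26) follow the general Pentium 4 CCCR layout and are not stated in this section; `build_cccr` is a hypothetical helper; and the ESCR select and CASCNT bit position used for counter A are placeholders, because the text does not give the event encodings for MOB replays or resource stalls.

```python
from typing import Optional

# Assumed CCCR bit positions (see lead-in).
CCCR_ENABLE = 1 << 12
CCCR_ALWAYS_SET = 0b11 << 16      # bits 16-17, set in both CCCR values of Example 18-2
CCCR_FORCE_OVF = 1 << 25
CCCR_OVF_PMI_T0 = 1 << 26


def build_cccr(escr_select: int, *, enable: bool = False,
               force_ovf: bool = False, ovf_pmi_t0: bool = False,
               cascnt_bit: Optional[int] = None) -> int:
    """Compose a CCCR value from named settings (hypothetical helper)."""
    if not 0 <= escr_select <= 7:
        raise ValueError("escr_select must fit in three bits")
    value = (escr_select << 13) | CCCR_ALWAYS_SET
    if enable:
        value |= CCCR_ENABLE
    if force_ovf:
        value |= CCCR_FORCE_OVF
    if ovf_pmi_t0:
        value |= CCCR_OVF_PMI_T0
    if cascnt_bit is not None:
        value |= 1 << cascnt_bit   # position of the chosen CASCNTxINTOy bit
    return value


if __name__ == "__main__":
    # Sanity check against the two CCCR values given in Example 18-2.
    print(f"{build_cccr(4, ovf_pmi_t0=True, cascnt_bit=11):08X}H")
    print(f"{build_cccr(4, enable=True):08X}H")
    # Counter A: force overflow plus a CASCNT bit; ESCR select 4 and bit 11
    # are placeholders, not the real encoding for a resource-stall event.
    print(f"{build_cccr(4, force_ovf=True, ovf_pmi_t0=True, cascnt_bit=11):08X}H")
```

With the force overflow bit set, counter A is meant to overflow on each counted event once cascading from counter B has started it, so the PMI in step 3 can capture the stalled thread's state; setting OVF_PMI on counter A is an assumption that follows from step 3.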