HP Caliper 5.3 User Guide (5900-1558, February 2011)

The average memory read latency on the dual-core Itanium 2 processor will appear greater
than on previous Itanium 2 processors. This is because the reported latency also includes the
latency that the arbiter adds to both the outbound request and inbound data transfer.
Avg Outstand
Average number of outstanding reads per cycle gives some idea of the memory request density,
that is, the probability of one or more memory requests per cycle. For control-dominated code
or for workloads that seldom miss the internal caches, this value will be very small. For
data-flow-type workloads that employ extensive prefetching, this number can be quite
high, up to a maximum of 16, which is the Itanium 2 bus limit.
The reported average latency value will be incorrect on Itanium 2 steppings earlier than B2.
CPU
CPU transaction component is a measure of the percentage of all bus transactions generated
by all CPUs on a shared front side bus (FSB).
I/O
I/O transaction component is a measure of the percentage of all bus transactions initiated by
any I/O agent on a shared FSB.
Util Adrs
Average address bus utilization gives an estimate of total address bus utilization resulting
from all bus transactions, including cache misses, I/O port reads/writes, interprocessor
interrupts, writebacks, cache line invalidates (FC instruction, store hit on shared line), and
clean castouts (if enabled). The utilization is computed as follows:
ADRS UTIL = 100.0 * (total transactions/sec * 3.0) / bus cycles/sec
The constant value (3.0) is the number of address cycles needed for each bus transaction.
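The address bus formula above can be sketched in a few lines of Python. This is only an illustration, assuming the per-second rates have already been obtained from the measurement; the function name, parameter names, and sample values are hypothetical and are not part of HP Caliper.

```python
# Number of address cycles needed for each bus transaction (from the text above).
ADDRESS_CYCLES_PER_TRANSACTION = 3.0


def address_bus_utilization(transactions_per_sec: float,
                            bus_cycles_per_sec: float) -> float:
    """Return address bus utilization as a percentage.

    Both arguments are rates per second; the names are illustrative only.
    """
    if bus_cycles_per_sec <= 0:
        raise ValueError("bus_cycles_per_sec must be positive")
    busy_cycles = transactions_per_sec * ADDRESS_CYCLES_PER_TRANSACTION
    return 100.0 * busy_cycles / bus_cycles_per_sec


# Hypothetical sample: 20 million transactions/sec on a bus
# running at 200 million cycles/sec.
print(address_bus_utilization(20e6, 200e6))  # 30.0
```

With these sample rates, the address bus is busy for 60 million of the 200 million available cycles each second, or 30 percent.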
Util Data
Data bus utilization gives a lower bound estimate of total data bus utilization resulting from
bus transactions that result in a data transfer, that is, BRL, BRIL, BWL, and nonzero byte
BRP/BWP transactions. A lower bound data bus utilization is computed as follows:
DATA BUS CYCLES/SEC = ((BRL + BRIL + BWL + IMPLICIT WB)/sec * 4.0)
                      + ((nonzero byte BRPs/BWPs)/sec * 1.0)
DATA UTIL = 100.0 * (DATA BUS CYCLES/SEC) / (BUS CYCLES/SEC)
The constants (4.0 and 1.0) represent the number of cycles that the data bus is occupied to
perform the requisite data transfer. All cache line transfers (BRL, BRIL, BWL) require four cycles.
The nonzero BRPs/BWPs require one or two cycles (16, 32, 64 bytes). Because most of the
nonzero BRPs/BWPs are to I/O ports and semaphores, the computation assumes a
single-cycle transfer. Thus, there is a small possibility of undercounting cycles.
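The lower-bound data bus computation described above can be sketched as follows. This is a minimal illustration, assuming each transaction count is already expressed as a per-second rate; the function name, parameter names, and sample values are hypothetical and are not part of HP Caliper.

```python
# Data bus cycles per transfer, as given in the text above.
CACHE_LINE_DATA_CYCLES = 4.0  # BRL, BRIL, BWL, and implicit writebacks
PARTIAL_DATA_CYCLES = 1.0     # nonzero byte BRP/BWP, assumed single-cycle


def data_bus_utilization(brl: float, bril: float, bwl: float,
                         implicit_wb: float, nonzero_partial: float,
                         bus_cycles_per_sec: float) -> float:
    """Return a lower-bound data bus utilization as a percentage.

    All transaction arguments are rates per second. nonzero_partial is the
    combined rate of nonzero byte BRP and BWP transactions.
    """
    if bus_cycles_per_sec <= 0:
        raise ValueError("bus_cycles_per_sec must be positive")
    line_transfers = brl + bril + bwl + implicit_wb
    data_cycles = (line_transfers * CACHE_LINE_DATA_CYCLES
                   + nonzero_partial * PARTIAL_DATA_CYCLES)
    return 100.0 * data_cycles / bus_cycles_per_sec


# Hypothetical sample rates (transactions/sec) on a 200 million cycles/sec bus.
print(data_bus_utilization(brl=10e6, bril=2e6, bwl=3e6, implicit_wb=1e6,
                           nonzero_partial=4e6, bus_cycles_per_sec=200e6))  # 34.0
```

Because every nonzero byte BRP/BWP is charged a single cycle even though some need two, the result is a floor on the true utilization rather than an exact figure.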
BRL
Bus Read Line is the transaction used to read cache lines, due either to an instruction cache
miss or to a load data miss.
BRIL
Bus Read Invalidate Line is the transaction used when a store miss occurs, that is, a read for
ownership. In Itanium 2, this transaction is also used when a store hit occurs on a shared line.
In this case, the BRIL is used to invalidate all remote copies of this cache line and to have the
memory controller return the line that the cache already holds. Itanium 2 does not implement
the BIL optimization, which would have allowed remote copies to be invalidated without
performing a superfluous memory request.