HP Caliper User Guide Release 5.5 HP Part Number: 5900-2351 Published: September 2012 Edition: 5.
© Copyright 2012 Hewlett-Packard Development Company, L.P. Confidential computer software. Valid license from HP required for possession, use or copying. Consistent with FAR 12.211 and 12.212, Commercial Computer Software, Computer Software Documentation, and Technical Data for Commercial Items are licensed to the U.S. Government under vendor's standard commercial license. The information contained herein is subject to change without notice.
Contents
About This Document (p. 12)
1 HP Caliper at a Glance (p. 16)
What Is HP Caliper? (p. 16)
What Does HP Caliper Run On?
Simultaneous fprof Sampling on Multiple PMU Counters (p. 45)
Location of Measurement Configuration Files (p. 45)
Specifying Option Values in Measurement Configuration Files (p. 45)
Using the Command Line to Override Measurement Configuration File Parameters (p. 46)
5 HP Caliper Options
--exclude-caliper (HP-UX only) (p. 61)
--exclude-idle (HP-UX only) (p. 61)
--fold (p. 61)
--frame-depth
--version (p. 75)
6 Using the HP Caliper Advisor (p. 76)
What Is the HP Caliper Advisor? (p. 76)
Example of an HP Caliper Advisor Report
Function Details (p. 110)
Disassembly Listing (p. 110)
Branch Targets in Disassembly Listings (p. 111)
Source Position Correlation
Process Memory Usage Table (p. 156)
Measuring System Usage Concurrently with Other Measurements (HP-UX only) (p. 157)
Example Report Output (p. 157)
Interpreting the Data
Metrics for Integrity Servers Intel® Itanium® 9500 Processors Systems (p. 188)
cycles Measurement Metrics (p. 190)
How cycles Metrics Are Obtained (p. 190)
dcache Measurement Report Description (p. 191)
Example Command Line for Text Report
How fprof Metrics Are Obtained (p. 212)
icache Measurement Report Description (p. 213)
Example Command Line for Text Report (p. 213)
Example Command Line for CSV Report
fp Event Set (p. 241)
Correspondence Between Floating-Point Instructions and Operations (p. 241)
Metrics Available from this Measurement (p. 242)
l1dcache Event Set
About This Document This document describes how to use HP Caliper to measure the performance of native applications running on HP-UX and Linux Integrity servers. NOTE: For the latest version of this document, go to the HP Caliper Web site at the following URL and click on Documentation in the Product Information box: http://hp.com/go/caliper This document is sometimes updated after a release. The document publication date appears on the title page.
For information about the HP Caliper Advisor, read this chapter: • “Using the HP Caliper Advisor” (p. 76). For information about how to configure HP Caliper to collect data and report the results, read these chapters: • “Configuring HP Caliper ” (p. 91) describes how you can configure HP Caliper to collect data. • “Controlling the Content of Reports” (p. 104) describes how to control the content of reports based on the data collected.
GUI item A graphical user interface (GUI) item such as a button or menu name.
[] The contents are optional in syntax. If the contents are a list separated by |, you must choose one of the items.
{} The contents are required in syntax. If the contents are a list separated by |, you must choose one of the items.
... The preceding element can be repeated an arbitrary number of times.
| Separates items in a list of choices.
• Using HP Caliper to analyze effective floating-point load latency • Using HP Caliper with an application program to characterize the Itanium memory hierarchy • Using HP Caliper to measure performance data related to translation lookaside buffers (TLBs) You can also read these technical reports about the microarchitecture used in HP Integrity servers: • Dual-Core Update to the Intel® Itanium® 2 Processor Reference Manual for Software Development and Optimization, Document Number 308065-001.
1 HP Caliper at a Glance What Is HP Caliper? HP Caliper is a general-purpose performance analysis tool for applications on HP-UX and Linux systems running on HP Integrity Servers. HP Caliper allows you to understand the performance and execution of your application and to identify ways to improve its run-time performance. HP Caliper works with any native Integrity Server application.
Figure 1 HP Caliper Components (User Interfaces)
[Diagram labels: HP Caliper CLI; Application; Performance reports; HP Caliper; HP Caliper GUI (local); X11 server; HP Caliper GUI (remote); HP Caliper database(s); Integrity Server (HP-UX or Linux); X86 desktop (Windows or Linux)]
HP Caliper selectively measures the processes, threads, and load modules of your application.
In general, HP Caliper runs do one of the following: • Collect data • Collect data and generate a report • Generate a report based on previously collected data • Analyze previously collected data For the last item above, HP Caliper provides the HP Caliper Advisor, a rules-based expert system designed to provide guidance about improving the performance of an application. Users can write their own rules to analyze applications or use the default rules provided.
Summary of HP Caliper Features HP Caliper's most important features include the following: • Performance data is automatically saved in databases, which you can use to generate reports without having to remake the measurements. Multiple databases can also be combined for aggregated results. • All reports are available in text format and comma-separated-value (CSV) format for use with spreadsheets.
2 Getting Started with the HP Caliper Command-Line Interface This chapter provides some example programs to show you how to get started using the HP Caliper command-line interface. The programs are chosen for illustration purposes and are not necessarily representative of programs you might actually want to analyze. Example: Running fprof on a Short Program, with Default Output HP Caliper provides many types of performance measurements.
Figure 2 fprof Measurement Report for matmul, with Default Report Output ================================================================================ HP Caliper 4.3.
Target Execution Time [10]
  Real time: 0.428 seconds
  User time: 0.415 seconds
  System time: 0.008 seconds
Sampling Specification [11]
  Number of samples: 1319
  Data sampled: IP
Metrics Summed for Entire Run [12]
-----------------------------------------------
                     PLM
Event Name           U..K  TH  Count
-----------------------------------------------
CPU_CYCLES           x___  0   659001879
BACK_END_BUBBLE.ALL  x___  0   99866365
BE_EXE_BUBBLE.
3 ~83 >
--------------------------------------------------
[Minimum function entries: 0, percent cutoff: 1.00, cumulative percent cutoff: 100.00]

1 HP Caliper 4.3.0: Report Summary for Flat Profile: The heading for the report, including the HP Caliper version number and the measurement (Flat Profile).
2 Collection Run 1: (Flat Profile): The heading for the run.
3 Processor Information: Information about your processor.
4 Run Information: Information about the run.
• Line | Slot | Col,Offset: The column contains one of these:
  ◦ A source-code line number for rows showing statements
  ◦ An instruction slot number for rows showing instructions not on a bundle boundary
  ◦ A source-code column number followed by an offset from the beginning address of a function for rows showing instructions on a bundle boundary
• [16] >Statement | Instruction: The column contains either a source statement, preceded by “>”, or a disassembled instruction.
Figure 3 fprof Measurement Report for matmul, with IP Sample Counts for One Function

Function Details
---------------------------------------------------
% Total             Line|
IP        IP        Slot|        >Statement|
Samples   Samples   Col,Offset   Instruction
---------------------------------------------------
[1] 96.88 [matmul::main, 0x40009a0, matmul.c]
1275 ~38 Function Totals
[2] ------------------------------------------
[3] [/home/meagher/matmul.c]
[4] (32) ~16 *> mata[i][j] = matb[i][j] = (float) rand() ;
~9,0x0280:0 M nop.m 0
:1 M nop.
Types of Measurements HP Caliper is capable of three types of performance measurement: • A global measurement of total run metrics • A sampled measurement based on the granularity you specify • A precise measurement of every execution path in your code (HP-UX only) See Table 1 (page 43). Global Measurement A global measurement gives you a single value for a specific metric of your program, such as total CPU time used. The only global measurement available in HP Caliper is ecount (total CPU time).
Precise measurements are best used for: • Identifying the most and least used functions in your program • Identifying all the branch paths executed in the program Collecting precise measurements requires more system resources than sampled measurements. Collecting precise measurements also affects the performance of the program being measured. The performance effects may vary from a few percent to 300 percent, depending on how much measurement you request.
caliper_options Parameters used to customize the performance analysis. For more information, see “HP Caliper Options” (p. 47). program The name of the executable program you want HP Caliper to measure. program_arguments Any number of arguments expected by your executable. You can use an options file to specify command-line information, including the measurement, options, program, and program arguments. See “-f or --options-file” (p. 49) for details.
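The parts of the command line described above can be assembled programmatically, for example when a build or test script launches measurements. The following is a minimal Python sketch, not part of HP Caliper. The helper name build_caliper_command and the file name fprof_report.txt are hypothetical, and the ordering (measurement, options, program, program arguments) is assumed from the syntax shown in this guide.

```python
import shlex


def build_caliper_command(measurement, program, program_args=(), caliper_options=()):
    """Assemble argv as: caliper measurement [caliper_options] program [program_arguments].

    Hypothetical helper for illustration; it only builds the list and runs nothing.
    """
    if not measurement or not program:
        raise ValueError("measurement and program are both required")
    return ["caliper", measurement, *caliper_options, program, *program_args]


# Example: an fprof run of ./matmul with the text report written to a file.
argv = build_caliper_command(
    "fprof", "./matmul", caliper_options=["-o", "fprof_report.txt"]
)
print(shlex.join(argv))
```

The resulting list can be passed to subprocess.run on a system where HP Caliper is installed. Keeping the pieces as a list avoids shell quoting problems when program arguments contain spaces.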
The first command produces a call graph by sampling. The second command (on HP-UX only) produces an exact call graph. They both produce an enhanced gprof-like output. Creating a Text Report for Analysis To save the report produced by HP Caliper to a file, specify an output file name: $ caliper measurement -o filename [caliper_options] program [program_arguments] Creating a Report Based on Your Collected Data By default, HP Caliper saves the results of a measurement to a database.
Additional HP Caliper Commands In addition to the caliper measurement command, there are three more HP Caliper commands you can use. For information about these commands, including required syntax, see the references below: • caliper info Displays reference information about the CPU counters or reports. See “How to Display Reference Information About CPU Counters or HP Caliper Report Types” (p. 101). • caliper report | merge | diff Creates a report from an HP Caliper database.
3 Getting Started with the HP Caliper GUI In addition to the command-line interface, HP Caliper supports a full-featured, intuitive graphical user interface (GUI). This chapter describes how to get started using the GUI. For information on the command-line interface, see Chapter 2 (page 20). What Is the HP Caliper GUI? The GUI has the same underlying measurement technology and capabilities as the command-line interface. With the GUI, however, you can dynamically interact with HP Caliper.
• Diagnostics view • Help view As is typical of most GUIs, the HP Caliper GUI lets you reconfigure, resize, and reposition all of the views to suit your needs. Views that are not currently needed can be closed (and reopened when needed) to make more room for others.
Each HP Caliper measurement run produces several datasets. These datasets are shown in the Projects view for each run. The figure below shows the Projects view: Figure 5 Projects View Collect View The Collect view allows you to set up and make performance measurements. It consists of a series of tabbed pages (which are not themselves views) containing all the information needed to run your application and all the measurement parameters that you can control.
Figure 6 Collect View Analyze View The Analyze view lets you explore the performance data you collect. When displayed, the Analyze view is located, by default, to the right of the Projects view, overlaying the Collect and Advisor views. Any performance data you have available for viewing is shown in the Projects view. To open the Analyze view, double-click a performance data icon of interest in the Projects view.
Figure 7 Analyze View
Advisor View
The Advisor view contains a set of suggestions for improving the performance of your application based on the data collected so far. When displayed, the Advisor view is located, by default, to the right of the Projects view, overlaying the Collect and Analyze views. To open the Advisor view, click the Generate Advice button or toolbar choice to analyze the collected data and produce advice output.
Figure 8 Advisor View Console View The Console view displays any output your application writes to standard output and standard error streams. You can also use the Console view to provide any input your application expects to read from standard input. The Console view is below the Collect view, by default, and is visible when your application is being measured.
Diagnostics View The Diagnostics view contains any warning messages that HP Caliper might generate when measuring your application or retrieving its performance data for viewing. By default, this view overlays the Console view at the bottom of the GUI window. Any errors produced will appear in popup dialogs.
Tips for Using Views All views have the following features: • Each view has its own Maximize and Minimize buttons (top right), and many views have their own pull-down menus (also top right). • Double-clicking a view's tab causes the view to take up the entire GUI window. Double-clicking a view's tab a second time returns it to its previous size and restores the previous GUI layout. This feature is particularly useful when viewing performance data.
• A measurement run will end whenever one of the following occurs: ◦ The measured application completes. ◦ All the attached processes terminate. ◦ The measurement duration you set on the Target page expires. ◦ You select the Kill/Stop button. The application program being measured will be terminated immediately if you select the Kill button. • When a measurement run completes, its performance data is automatically added to the current project within the Projects view.
Getting Help Several forms of online help are available in the GUI: • “Getting started” help Select Help→Help Contents and then choose Getting Started. • Dynamic/context help Select Help→Context-sensitive Help or use the F1 key. This help provides detailed information specific to the view that currently has focus. • Reference help Select Help→Help Contents.
You will need to copy the appropriate GUI client (in the gui_clients subdirectory) to your Windows or Linux desktop system and unpack it. Then, start the GUI from your desktop using the following executable file. Invoke it from a shell prompt or double-click it in a folder: • On Windows: Caliper.exe • On Linux: Caliper At startup, the GUI prompts you for the login information needed to connect it to the remote HP Caliper server on the Integrity system where you want to make measurements. See Figure 12.
4 HP Caliper Measurement Configuration Files Each run of HP Caliper uses a particular measurement, which you can specify in the command line. Each measurement corresponds to a particular measurement configuration file supplied by HP Caliper. The measurement configuration files contain variables that control the types of measurements performed and the content of the reports.
• dtlb The dtlb measurement measures and reports sampled data translation lookaside buffer (TLB) misses. See “dtlb Measurement Report Description” (p. 199). • ecount The ecount measurement measures and reports total CPU event counts. See “ecount Measurement Report Description” (p. 204). • fcount (HP-UX only) The fcount measurement measures and reports function call counts in a program. See “fcount Measurement Report Description ” (p. 206).
Table 1 Available Measurements in Each Measurement Type (continued) Global Sampled Precise (HP-UX only) itlb pmu_trace scgprof traps NOTE: The cgprof measurement performs both sampled and precise measurements. The measurements in the sampled category, with the exception of cpu and pmu_trace, show results grouped by function. A report produced by any of these measurements is referred to as a PMU histogram report.
use 5% as the sampling period variation. The cstack measurement takes one sample every 250 milliseconds. On HP-UX, the per-process overview measurement requires an 11i v3 system with the kernel patch PHKL_38072 installed.
Simultaneous fprof Sampling on Multiple PMU Counters
Up until Caliper 5.1, the fprof measurement sampled the instruction pointer (IP) on only one counter (every 500,000 CPU cycles by default). As of Caliper 5.
Using the Command Line to Override Measurement Configuration File Parameters You can use the HP Caliper command line to override parameters specified in measurement configuration files.
5 HP Caliper Options This chapter describes basic information about options and presents them in alphabetical order. For a listing of the most commonly used options, see the HP Caliper Quick Start reference card. Basic Information About Options Options are used to customize the performance analysis. You can specify one or more options on the command line when you start HP Caliper. You can abbreviate options and their modifiers as long as they are unambiguous.
Hierarchy for Processing an Option Value
HP Caliper uses this sequential order to process an option value:
1. Default value for an option
2. Option variable setting in the specified measurement configuration file
3. Option variable setting in the .caliperinit file, if the file exists
4. Option value from the command line
Thus:
• The command line overrides everything.
• The .caliperinit file overrides the measurement configuration file.
• The measurement configuration file overrides the default value.
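The precedence order can be modeled as a layered lookup in which the first source that defines a value wins. The following Python sketch illustrates only the ordering. The function name resolve_options and the option keys and values are hypothetical and do not reflect how HP Caliper stores its settings internally.

```python
from collections import ChainMap


def resolve_options(defaults, measurement_config, caliperinit, command_line):
    """Return the effective option values.

    ChainMap searches its mappings from left to right, so the sources are
    listed from highest priority (command line) to lowest (defaults).
    """
    return dict(ChainMap(command_line, caliperinit, measurement_config, defaults))


# Hypothetical settings, for illustration only.
defaults = {"context_lines": "5", "threads": "all"}
measurement_config = {"context_lines": "0"}
caliperinit = {"threads": "sum-all"}
command_line = {"context_lines": "all"}

resolved = resolve_options(defaults, measurement_config, caliperinit, command_line)
print(resolved)
```

In this example the command line supplies context_lines, the .caliperinit layer supplies threads, and the defaults are consulted only for keys that no other source sets.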
-f or --options-file -file options_file Specifies a text file containing a list of HP Caliper command-line options separated by spaces or line breaks. You can also use an options file to specify an HP Caliper measurement as well as the application to be profiled and its arguments. Any option you specify on the command line overrides the corresponding setting in the options file. HP Caliper places the contents of the options file in the position occupied by the -f option in the command line.
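The placement rule can be pictured as splicing the tokens of the options file into the argument list at the position of the -f option. The following Python sketch models that splice only. The function expand_options_file and the file name my_options are hypothetical, shell-style quoting inside the file is an assumption, and the sketch does not model how a command-line option overrides the corresponding setting from the file.

```python
import shlex


def expand_options_file(argv, read_file):
    """Replace '-f FILE' or '--options-file FILE' with the tokens read from FILE.

    read_file is any callable that returns the file's text. Tokens are split
    on spaces and line breaks; quoting rules are an assumption of this sketch.
    """
    expanded = []
    i = 0
    while i < len(argv):
        if argv[i] in ("-f", "--options-file") and i + 1 < len(argv):
            expanded.extend(shlex.split(read_file(argv[i + 1])))
            i += 2  # skip the option and its file name
        else:
            expanded.append(argv[i])
            i += 1
    return expanded


def fake_read(_name):
    # Stand-in for reading a real options file from disk.
    return "--context-lines 3\n-o report.txt\n"


print(expand_options_file(["caliper", "fprof", "-f", "my_options", "./matmul"], fake_read))
```

Passing the reader as a parameter keeps the sketch testable without touching the file system. In real use it could be a function that opens the named file and returns its contents.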
If you use this option but do not specify an event, or if the option value is set to the empty string (""), then no metrics will be reported. You can use the caliper info command to list available CPU events and their descriptions. cpu_event Specifies a CPU event to measure. The name is not case-sensitive. For information about CPU events you can specify, see “Specifying Which CPU Events to Measure” (p. 93).
per-process Creates individual report files for each process with program name appended to each file. shared Creates a single file containing the results for all processes. This is the default setting. unique Appends the process ID to the data file name. When saving information per process, you can specify a nonexistent directory for the file name and HP Caliper will create the directory and put all the files in the new directory.
Default value is module:directory:file:function:unknown. module Shows data by load module. directory Groups data by source directory. file Generates Summary Report by source file. function Shows function level detail by source file. unknown When used together with the other report options, provides additional information about functions from unknown source files in the summary and detail coverage reports.
using a percent symbol (%). For example: -s CPU_CYCLES,10000,10% The default value is 5 percent. cpu_event Specifies a CPU event to measure. The name is not case-sensitive. For information about CPU events you can specify, see “Specifying Which CPU Events to Measure” (p. 93). threshold=int An integer value that specifies how HP Caliper counts events: • If the value is zero, HP Caliper counts all events.
--advice-cutoff Used only with the caliper advise command. See “Command Line to Invoke the Advisor” (p. 78). --advice-details Used only with the caliper advise command. See “Command Line to Invoke the Advisor” (p. 78). --analysis-focus Used only with the caliper advise command. See “Command Line to Invoke the Advisor” (p. 78).
runs in user space (user). The privilege levels available are: • user • kernel • all You can abbreviate this qualifier to PLM. The qualifier is not case-sensitive. This option overrides the branch_sampling_spec setting in the scgprof measurement. When you specify the setting by using the command-line option, you can override all or just part of the specification. This allows you to, in effect, create your own default settings.
For more information, see Chapter 11 (page 133). --context-lines --context-lines all|count_source[,count_disassembly] For a PMU histogram report, specifies the number of source lines to show before and after a source line entry with associated performance data. Default value is --context-lines 5 for source-only reports or --context-lines 0 for reports with disassembly. Specify all to report all source lines for reported functions.
The CSV reports are formatted so that they load in an easy-to-read layout in a spreadsheet that uses a fixed-width typeface such as Courier, and so that they are suitable for further processing. The filename is the destination. You can generate reports in CSV or text formats on any given run.
append Adds the report results to the end of an existing file that has the specified name.
create Creates a file with the specified name and writes the report results to the file. Replaces any existing file with the specified name.
PLM specifies the privilege level setting. The privilege levels available are: "user", "kernel", and "all". ADDR_MATCH is the 64-bit address to match. ADDR_MASK is the 56-bit address mask to apply before matching the ADDR_MATCH bits. PROC_FLAGS is a comma-separated list of none , d, io, or iod. none indicates no constraint. d indicates data address matching only. io indicates instruction address and opcode matching. iod indicates instruction address, opcode and data address matching.
HP Caliper stops reporting information when it reaches either a percent cutoff or a cumulative percent cutoff: • You can limit the report only to functions that exceed a specified percentage of the total for the sorting/cutoff metric. Once HP Caliper encounters this percent cutoff, it stops reporting functions. • You can limit the report by having HP Caliper stop reporting functions once the cumulative percent of the functions so far listed exceeds the cumulative percent cutoff value.
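The interaction of the two cutoffs can be illustrated with a short Python sketch. It is an approximation under stated assumptions, not HP Caliper's implementation: rows are sorted by descending count, a row must exceed the percent cutoff to be listed, listing stops once the cumulative percent already listed exceeds the cumulative cutoff, and a minimum row count takes priority over both cutoffs. The function name apply_cutoffs and the sample data are hypothetical.

```python
def apply_cutoffs(rows, percent_cutoff=1.0, cum_percent_cutoff=100.0, min_count=0):
    """Return (name, percent, cumulative percent) for the rows that would be listed.

    rows is a list of (name, count) pairs. The exact boundary behavior is an
    assumption made for this illustration.
    """
    total = sum(count for _, count in rows)
    if total == 0:
        return []
    shown = []
    cumulative = 0.0
    for name, count in sorted(rows, key=lambda row: row[1], reverse=True):
        percent = 100.0 * count / total
        past_cutoff = percent <= percent_cutoff or cumulative > cum_percent_cutoff
        if past_cutoff and len(shown) >= min_count:
            break  # a cutoff was reached and the minimum row count is satisfied
        cumulative += percent
        shown.append((name, round(percent, 2), round(cumulative, 2)))
    return shown


# Hypothetical IP sample counts per function.
samples = [("main", 600), ("rand", 250), ("init", 100), ("exit", 50)]
for row in apply_cutoffs(samples, cum_percent_cutoff=80.0):
    print(row)
```

With a cumulative cutoff of 80, the sketch lists main (60 percent) and rand (25 percent) and then stops, because the 85 percent already listed exceeds the cutoff.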
Example To produce a flat profile based on execution stalls due to GR/GR or GR/load: $ caliper cycles -ra -s 100000,5%,BE_EXE_BUBBLE.GRALL --etb-freeze-delay 12 --etb-walkback-cycles 22 Choosing correct values for the --etb-freeze-delay and --etb-walkback-cycles options requires knowledge of the Integrity servers dual-core Itanium 2 or Itanium 9300 quad-core processor pipeline. For information about the ETB and how it is configured to collect IP values, see Performance Profiling for Fun and Profit.
Example Assume that you want to capture samples containing the: • Number of cycles during which three or more FP_OPS_RETIRED events occurred, while executing in kernel space • Number of cycles during which four or more NOPS_RETIRED events occurred, in user space • Total number of NOPS_RETIRED at all levels In addition, assume that you want the samples to be captured every time 10,000 cycles occur, during which two or more IA64_INST_RETIRED events occur, at all privilege levels.
executable Specifies that “matching” processes should have their data combined when possible. This is the default for all HP Caliper reports and for all caliper report, caliper merge, and caliper diff commands. module Specifies that “matching” modules should have their data combined. This produces a module-centric report. In a module-centric report, there is no data about individual processes in the collection runs.
download.intel.com/design/Itanium2/manuals/30806501.pdf and Intel Itanium 2 Processor Reference Manual for Software Development and Optimization, Document Number 251110-003. http://www.intel.com/design/itanium2/manuals/251110.htm
--info
--info
Causes HP Caliper to append help information to the end of textual reports.
--inlines
--[no]inlines
Causes HP Caliper to collect data for inline functions. The default value is --noinlines.
By default, HP Caliper uses this kernel file for symbol lookup and disassembly: • /stand/current/vmunix NOTE: On Linux, a default kernel path is not defined for a sampling level of kernel or all, so reports show only kernel module and function information for samples. To show disassembled instructions for kernel modules, use the --kernel-path option. To produce an uncompressed kernel image for HP Caliper to work with, do the following: TMP_FILE=path_to_use_for_kernel_image gunzip -c /boot/efi/...
NOTE: With this HP Caliper option, you must use a qualifier or an equals sign (=). You cannot use --memory-usage as the option. You must use --memory-usage=. If you specify --memory-usage=, then --memory-usage=all is assumed.
See “Specifying Which Load Modules to Collect Data For” (p. 94). --module-search-path --module-search-path directory[:directory[:...]] Specifies a list of directories to search when a load module file (executable or shared library) cannot be found by using the path obtained from the process. This typically happens if the measured process uses chroot(2) or chdir(2) and then loads libraries or executes other binaries using relative paths.
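The fallback search can be sketched in Python as follows. This is an illustration under assumptions, not HP Caliper's lookup code: it assumes the recorded path is tried first and that each search directory is then checked for a file with the same base name. The function name find_module and the example paths are hypothetical.

```python
import posixpath


def find_module(recorded_path, search_path, exists):
    """Return the first usable location of a load module, or None.

    recorded_path is the path obtained from the process, search_path is a
    colon-separated directory list, and exists is a callable that reports
    whether a candidate file is present. Matching by base name is an
    assumption of this sketch.
    """
    if exists(recorded_path):
        return recorded_path
    basename = posixpath.basename(recorded_path)
    for directory in search_path.split(":"):
        if not directory:
            continue  # ignore empty entries such as a trailing colon
        candidate = posixpath.join(directory, basename)
        if exists(candidate):
            return candidate
    return None


# Hypothetical file system contents for the example.
present = {"/opt/app/lib/libwork.so"}
print(find_module("../lib/libwork.so", "/usr/lib:/opt/app/lib", present.__contains__))
```

A relative path such as ../lib/libwork.so is the kind of path that stops resolving after a process changes its working directory, which is the situation the option addresses.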
--options-file See “-f or --options-file” (p. 49). --output-file See “-o or --output-file” (p. 50). --overflow-block --overflow-block True|False Specifies whether the target application should be blocked when the PMU sampling buffer is full. The default is TRUE (i.e., the target application will be blocked until HP Caliper has completed processing all the samples in the buffer). This option is valid only for PMU based per-process measurements on Linux.
This option has these parameters: percent_cutoff The percentage of the total for the blocked samples that a given primitive must exceed to appear in a report. This is shown as percent cutoff on reports. Default value is 1.0. cum_percent_cutoff The value of the cumulative percentage at which HP Caliper stops reporting results. This is shown as cumulative percent cutoff on reports. Default value is 100. min_count Sets the minimum number of primitives to be displayed. Default value is 10.
If --group-by none is specified, then the Process Summary section will potentially have multiple entries for processes with the same basename. For more information, see “Metrics You Can Use for Report Sorting and Cutoffs” (p. 105). Example If you specify: $ caliper fprof --process-cutoff ,80,0 -w The contents of the Process Summary section is a list of processes containing: • The processes that account for 80 percent of the total IP samples of all the processes running in the system.
(You can specify the privilege level as user, kernel, or all with the --event-defaults option. The default value is all.) Every processor set (pset) is measured. The samples can be attributed to processes, or to processes and modules, or not attributed. For example: • --scope system,attr-mod Measure for system activity, and attribute samples to processes and modules within those processes whenever possible.
When --scope system is used, for most measurements, HP Caliper measures all user and kernel activity: either all user and kernel activity or individual processes or the modules of those processes. When --scope system is used, HP Caliper continues collecting data until you stop it with Ctrl-C. You can also specify the number of seconds to collect data with the -e option. For example, to create a Flat Profile (fprof) report for all activity on the system for 20 seconds: $ caliper fprof -o fprof.
--source-path-map --source-path-map pathmap1[:pathmap2:...] Specifies the path map to use for finding source files used for reporting source statements. Applies to any PMU histogram report, which is the only kind of report that references source code. Path map entries are separated by a colon (:) and applied in order until HP Caliper finds a file match. • Simple entries are prepended to file names. • You can provide substitute paths by using comma-separated entries.
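The in-order matching can be pictured with the following Python sketch. The exact rules are not fully specified in this section, so the sketch makes two loud assumptions: a simple entry is prepended to the recorded file name, and a comma-separated entry of the form old,new replaces a leading old with new. The function names candidate_paths and resolve_source and the directories /mnt/src and /work/src are hypothetical.

```python
import posixpath


def candidate_paths(recorded, path_map):
    """List the locations to try for a source file, in path-map order.

    Assumed semantics (not confirmed by the guide): 'dir' is prepended to the
    recorded name, and 'old,new' substitutes a leading 'old' with 'new'.
    """
    candidates = []
    for entry in path_map.split(":"):
        if not entry:
            continue
        if "," in entry:
            old, new = entry.split(",", 1)
            if recorded.startswith(old):
                candidates.append(new + recorded[len(old):])
        else:
            candidates.append(posixpath.join(entry, recorded.lstrip("/")))
    return candidates


def resolve_source(recorded, path_map, exists):
    """Return the first candidate for which exists() is true, or None."""
    for candidate in candidate_paths(recorded, path_map):
        if exists(candidate):
            return candidate
    return None


print(candidate_paths("/home/meagher/matmul.c", "/mnt/src:/home/meagher,/work/src"))
```

Entries are tried strictly in the order given, so a more specific substitution should be listed before a general prefix if both could match.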
This value only takes effect if the cumulative percent column is selected for the report.
min_count Sets the minimum number of functions to be displayed for all load modules. Default value is 5.
For example, if you specify the command line:
caliper fprof --summary-cutoff ,80 wordplay
The contents of the function summary section will be a list of functions containing:
• The functions that account for 80 percent of the total IP samples in the wordplay program.
NOTE: With this HP Caliper option, you must use a qualifier or an equals sign (=). You cannot use --system-usage as the option. You must specify --system-usage=. If you specify --system-usage=, then --system-usage=all is assumed. For details, see “Measuring System Usage Concurrently with Other Measurements” (page 157).
--term-display
The --term-display option is no longer supported.
--threads
--threads sum-all|all
Enables per-thread reporting. Default value is all.
For information about all the traps you can specify, see “traps Measurement Report Description” (page 225).
--user-regions
--user-regions default|rum-sum
For runs involving the PMU, specifies whether the data should be collected for the entire run (--user-regions default), or only in regions delimited by the PMU enable/disable instructions rum and sum. For more information, see “Restricting PMU Measurements to Specific Code Regions” (p. 162).
--version
See “-v or --version” (p. 53).
6 Using the HP Caliper Advisor This chapter introduces you to the HP Caliper Advisor and provides some example programs to show you how to get started using the Advisor from the command line. For information on how to use the Advisor in the HP Caliper graphical user interface (GUI), see Chapter 7 (page 85). For details about how to write rules for the Advisor, see the HP Caliper Advisor Rule Writer Guide.
Example 1 HP Caliper Advisor Report
===========================================================================
HP Caliper 4.3.0 Advisor Report for my_app
===========================================================================
Analysis Focus
Executable: /tmp/my_app
Last modified: August 15, 2004 at 03:10 PM
Processor type: Itanium2 9M
Processor speed: 1599 MHz
OS version: HP-UX 11.23
Performance Databases
/home/me/.hp_caliper_databases/cpu - March 23, 2005 at 11:17 AM
/home/me/.
Figure 13 Steps in Using the Advisor
[Figure: flowchart with the boxes Start, Build application, One or more HP Caliper performance runs, HP Caliper Advisor, Gain better understanding of application performance, Make suggested changes, Make suggested performance runs, and End.]
To use the HP Caliper Advisor, you perform these steps:
1. Build the application with an initial set of compiler/linker options.
2.
-o outputfile[,append|create]
--rule-files rulefile1[,rulefile2,...]
For these options:
--advice-classes
Specifies which classes of advice are printed. It can be all or any combination of general, cpu, memory, io, or system, separated by colons (:). The default is all.
--advice-cutoff
Specifies how much of the advice to print. All advice is sorted by its index value (the greater the index, the greater the importance). min-index specifies the lowest index value of advice to print.
As with the HP Caliper command-line options, each of the Advisor’s command-line options has a variable counterpart in the .caliperinit file that can set an option value. The variable name is the same as the option, with hyphens (-) replaced with underscores (_). A later use of the same command-line option or .caliperinit file variable overrides earlier uses.
Getting Started with the Advisor: Examples
To run the Advisor, you need to make one or more HP Caliper measurement runs on an application.
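The naming rule for .caliperinit variables and the "later use wins" rule stated above can be sketched in Python. The helper names option_to_variable and effective_settings are hypothetical, and the sketch says nothing about how HP Caliper itself parses options. It only illustrates the two rules.

```python
def option_to_variable(option):
    """Map a command-line option name to its .caliperinit variable name."""
    # Drop the leading hyphens, then replace the remaining ones with underscores.
    return option.lstrip("-").replace("-", "_")

def effective_settings(uses):
    """Apply (option, value) pairs in order; a later use replaces an earlier one."""
    settings = {}
    for option, value in uses:
        settings[option_to_variable(option)] = value
    return settings

settings = effective_settings([
    ("--advice-classes", "all"),
    ("--advice-classes", "cpu:memory"),  # later use overrides the earlier one
])
print(option_to_variable("--rule-files"))
print(settings)
```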
or: $ caliper ecount my_new_app followed by: $ caliper fprof my_new_app $ caliper dcache my_new_app Then, run the Advisor on the composite performance data: $ caliper advise Explanation of Report Output Figure 14 (page 81) shows the report output from the Advisor. The report is explained further in “How to Read an Advisor Report” (p. 82). The numbers (which are bold in the PDF version of this guide) are annotations to explain the report—they are not part of the output you receive.
processor type and speed, and operating system version.
2 Performance databases being analyzed.
3 Rule files that were used.
4 Advice section, giving performance tuning advice.
6 Second piece of advice, set off by a line of dashes (--------).
7 Cutoff settings, which specify how much of the advice to print.
This was run on an HP-UX 11i V2 September 2004 OE system. Reports run on other systems look similar, except that the specific advice given is unique to the application and the system.
------------------------------------------------------------------------------23.9 cpu Function profile 1 [cpu_fprof_1] 2 The percentage of ITLB misses (16.6%) is higher than normal. This may indicate a poor setting for the virtual memory instruction page size. 3 Try adding "+pi 4M" to the application's link command. 4 Use the following Caliper command to get a source listing of the 'hot spots' in these routines: 5 caliper report fprof 1 2 3 Description: Brief text describing what the advice is about.
• The ordering of rule files and databases on the command line makes no difference to the results produced by the Advisor. The only exception is in the case where the databases contain data from different, incompatible systems for the same executable object. • If you want to use multiple rule files, consider writing a “super” rule file that merely ‘includes’ the real rule files. If you do this, only the super rule file needs to be given on the command line.
7 Using the HP Caliper Advisor in the GUI This chapter describes how to use the HP Caliper Advisor in the HP Caliper graphical user interface (GUI). It assumes that you have some familiarity with the Advisor. For information about the HP Caliper Advisor, see Chapter 6 (page 76). For information about the HP Caliper graphical user interface (GUI), see Chapter 3 (page 31).
Figure 15 HP Caliper GUI In this screen shot of the GUI, you can see that three measurement runs have already been made: two in the Before Changes project (a CPU Cycles Run and a Data Cache Misses Run) and one in the After Changes project (a CPU Cycles Run). The application being measured is the HP C/C++ compiler, compiling the “Hello World” program. The application consists of three processes: cc, ecom, and ld. Note that these are default measurement runs.
In either case, you select an entire project or a measurement run by clicking on its name in the Projects view. You can select more than one item (on Windows) by holding the Ctrl key while selecting the additional ones. Figure 16 shows the Projects view, with a single project, Before Changes, selected. Every measurement run in the project is also selected.
Figure 17 Projects View, with a Single Measurement Run Selected
Generating Advice
The easiest step is getting the HP Caliper Advisor to analyze the selected performance data and generate advice. Figure 18 shows the GUI toolbar. The square icon with a blue checkmark inside means check the performance data. If you “hover” over the icon, the popup tooltip says Generate Advice. Simply click on the icon.
Figure 19 HP Caliper GUI Advisor Menu Generate Advice does the same thing as the toolbar icon: generate new advice from the selected performance data and display it in an Advisor view. Show Advisor View brings up the Advisor view with the advice from the last analysis run. You can use this option to retrieve the Advisor view if you previously closed it. This action also appears in the Window/Show View menu.
Figure 20 Advisor Report in the HP Caliper GUI The individual (potential) performance issues are separated by horizontal lines. The first line of each section gives five pieces of information: the name of the executable, an index value for the issue, which category or advice class (CPU, memory, I/O, and so forth) the issue falls in, a brief description of the performance issue, and the name of the Advisor rule that detected this issue.
8 Configuring HP Caliper HP Caliper gives you multiple methods for configuring how HP Caliper collects data and reports results. Specifying Option Values with a .caliperinit Initialization File If you have an initialization file (called .caliperinit), HP Caliper automatically uses it at startup for data collection or data reporting runs. Putting the options in an initialization file simplifies the command line you use. This file is not required, but can be useful.
Figure 21 .caliperinit File
********************************************************************
#Options applied to all report types.
application ='myapp'
arguments = '-myarg 2'
context_lines = 0,3
summary_cutoff = 1
detail_cutoff =5
source_path_map = '/proj/src,/net/dogbert/proj/src:/home/wilson/work'
#Report-specific options.
Configuring Data Collection HP Caliper gives you flexible control over the data you collect from your program. The types of control you have include: • Particular CPU events to measure. See “Specifying Which CPU Events to Measure” (p. 93). • Specific load modules you want to collect data for. See “Specifying Which Load Modules to Collect Data For” (p. 94). • Granularity of the information. See “Controlling Granularity of Data Collection and Reports” (p. 96). • Particular processes to measure.
HP Caliper: usage error: Ambiguous event abbreviation ("IIR") specified for "--sampling-spec". Matches IIR2 (IA64_INST_RETIRED), IIR1 (IA32_INST_RETIRED) Run caliper -h for help.
module-default all
module-include libdl.so
module-exclude
• uld.so
• dld.so
• libsin.so
You cannot override the settings for uld.so, dld.so, and libsin.so.
How to Specify Load Module Names
HP Caliper matches load module names in the following way:
• If you provide a full path for the module name, only an exact match succeeds.
• To imply all modules within a directory and its subdirectories, you provide a directory name with a trailing slash (/).
Controlling Granularity of Data Collection and Reports You can control the granularity of data collection and reports. If you want finer granularity (that is, more samples), use the -s option to lower the number of events between samples. For example, you can change the rate from the default 500,000 cycles to 250,000 cycles to get more samples. However, the increased sampling might have a negative effect on your application's performance.
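The effect of the sampling period on the number of samples is simple arithmetic, sketched below. The processor speed and the one-second interval are hypothetical, the function name expected_samples is not part of HP Caliper, and the sketch ignores any variation HP Caliper applies to the sampling period.

```python
def expected_samples(total_events, period):
    """Approximate number of samples taken when sampling every `period` events."""
    return total_events // period

# Hypothetical workload: a 1599 MHz processor that is fully busy for one second.
cycles = 1_599_000_000

default_samples = expected_samples(cycles, 500_000)  # default period
finer_samples = expected_samples(cycles, 250_000)    # halved period

print(default_samples, finer_samples)
```

Halving the period doubles the number of samples, and with it the number of sampling interruptions the application experiences.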
• The origin column, which identifies whether the process was created via a fork, vfork, or exec. • The handling column, which shows whether the process was measured, tracked, or ignored. • The exit status, which is the final exit code for the process. Figure 22 (page 97) shows an example process tree report.
-p glob1[:glob2:...] Matches the executable base name of each new process against each glob pattern. A glob pattern follows the Unix shell-style rules to expand file names. If one or more of those patterns match, the process is measured. Otherwise the process is tracked. For more information, see “Using -p some ” (p. 98). If you specify multiple -p options, the last one takes precedence. Using -p some The syntax for -p some is the most complex. -p [some:][(opt1[,opt2,...
Table 6 Process Origin Options Used with -p some
Option    Description
root      Denotes the initial root process.
fork      Matches any process created by fork of a measured or tracked parent process.
exec      Matches any process created by exec of a measured or tracked process.
The default is to match any process origin. If you specify multiple options, HP Caliper looks for matches for any of the options.
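The base name matching described for -p glob1[:glob2:...] can be approximated in Python with the standard fnmatch module. This is a sketch under assumptions: fnmatch follows shell-style rules but may not agree with HP Caliper in every detail, the function name classify_process and the /usr/bin paths are hypothetical, and the process origin options from Table 6 are not modeled.

```python
import fnmatch
import os

def classify_process(executable_path, patterns):
    """Return "measured" if the executable base name matches any glob, else "tracked"."""
    base_name = os.path.basename(executable_path)
    for pattern in patterns.split(":"):
        if fnmatch.fnmatchcase(base_name, pattern):
            return "measured"
    return "tracked"

# Hypothetical processes started by a compiler driver.
for path in ("/usr/bin/cc", "/usr/bin/ecom", "/usr/bin/ld"):
    print(path, classify_process(path, "cc:ld*"))
```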
1. Make predefined HP Caliper performance measurements using your sample data sets.
2. Compare HP Caliper results with results from previous builds to identify performance improvements or regressions.
Using HP Caliper to Generate Test Suite Reports
To automatically generate HP Caliper reports during your builds on HP-UX (for example, in test suites), use entries in your makefile such as these:
#
# Makefile rule for generating some common performance reports.
NOTE: The target process being measured does not terminate when HP Caliper detaches from it. The process keeps running (in the background if it was launched by HP Caliper). How to Display Reference Information About CPU Counters or HP Caliper Report Types You can use the caliper info command to display reference information about the CPU counters or HP Caliper report descriptions. You can specify a partial name of a CPU counter or a measurement name to get information on all items that match.
The -c and -r options are mutually exclusive. If neither is given, then -c is assumed. The output of this option comes from two text files in the HP Caliper directory. See “Specifying Which CPU Events to Measure” (p. 93). -d or --details -d all|[name][:title][:category][:description] Specifies which information fields to include in CPU counter reports. This can be any combination of name, title, category, or description separated by colons or all. The default value is name:title.
HP Caliper Environment Variables HP Caliper uses environment variables to control certain default settings. CALIPER_DATABASES Specifies the location of the databases directory. This directory is where HP Caliper creates the output database each time a data collection run is made without the use of the -d option. (If the -d option is used and no particular directory is specified, the output database is created in the current directory.
9 Controlling the Content of Reports HP Caliper allows you to control the content of reports based on the data collected. Processor Information, Run Information, and Sampling Specifications are present by default in all collection run reports. Layout of an HP Caliper Text or CSV Report HP Caliper uses a consistent layout for the sections in all of the measurement reports produced for text or CSV output.
• Report-specific information dependent on the measurement type:
◦ Event Counts for ecount
◦ Function Count Details for fcount
◦ Source Directory Summary, Source File Summary, and other function information for fcover
◦ Call Graph and Function Indexes for scgprof and cgprof
◦ Hot Call Paths, Call Graph, and Function Indexes for cstack
• Blocking Primitives Summary
• Report Help
◦ A description of how to get help in understandi
how much of the Function Details section and how much of the Function Summary section should be displayed in the report. Table 8 (page 107) shows the metrics you can use for the --sort-by metric option. The default value for sorting/cutoffs (if you don't use the --sort-by metric option) is also shown.
Table 8 Available Metrics for Report Sorting and Cutoffs
Report Name, Notes, and Available Metrics
alat
• sampled-misses (default)
branch
• target
• branch-ways
• mispredict (default)
• back-end-only-mispredict
cgprof (HP-UX only)
• call-count
• msecs-per-call
• samples (default)
• seconds
cstack
• samples (default)
• samples-running (HP-UX only)
• sampled-blocked (HP-UX only)
dcache
• avg-latency
• latency (default)
• sampled-misses
dtlb
• hpw-fills
• l2-fills
• sampled-misses (default)
• soft-fill
Table 8 Available Metrics for Report Sorting and Cutoffs (continued) Report Name Notes Available Metrics traps Default by first trap • samples Module-Centric Reports If you use the --group-by module option, HP Caliper will produce a module-centric report. In a module-centric report, there is no data about individual processes in the collection runs. Instead, all matching modules with data that can be merged are grouped together, across processes.
5.23 4.58 3.92 5.23 9.80 13.73 8 7 6 libbfd-2.15.92.0.2.so::bfd_hash_... libbfd-2.15.92.0.2.so::bfd_hash_... libc.so.6.1::__gconv_transform_u... There is no Process Summary information (even though nine processes are measured). In the Load Module Summary, all data in libc.so from all processes is merged together and presented in a single entry in the table. In the Function Summary, functions are presented across all (merged) modules. Process Summary A Process Summary shows the hottest processes.
Function Details Each instruction bundle shown in a Function Details table consists of four rows of data. The top row for the instruction bundle shows data totals for the bundle. The remaining rows show per-instruction data. The bundles shown may or may not be contiguous. You can use the -r (--report-details) option to specify whether reports should contain function source (-rs), instructions (-ri), or both (-ra). Use the --context-lines option to control how much source or disassembly, or both to display.
Branch Targets in Disassembly Listings By default, the symbols shown for branch targets in disassembly are limited to 30 characters. You can change the limit by setting the following variable in the measurement configuration file or the .caliperinit file: disasm_target_name_limit = limit Source Position Correlation In addition to printing address and function names, HP Caliper prints source position information when it is available and appropriate.
How Functions Are Named in Reports
HP Caliper attempts to print the most complete name possible for each function listed in reports. The general format for function names is:
load_module_name::function_name
For example:
libdl.so.1::libdl_init
threads::tu_thread_destroy
If the load module name is implicit from the context, then HP Caliper prints only the simple function name. Consult a linker load map or disassembly listing, or both, to determine the function.
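When post-processing report text, the load_module_name::function_name format can be split as sketched below. The function name split_report_name is hypothetical. The sketch assumes that load module names never contain "::", so it splits at the first occurrence.

```python
def split_report_name(name):
    """Split "load_module::function" into (load_module, function).

    A name without "::" is treated as a simple function name whose load
    module is implicit from the context, so the module is returned as None.
    """
    module, separator, function = name.partition("::")
    if not separator:
        return None, name
    return module, function

print(split_report_name("threads::tu_thread_destroy"))
print(split_report_name("libc.so.1::strlen"))
print(split_report_name("main"))
```

One caveat: a simple C++ name that itself contains "::" (printed without a load module) would be split incorrectly by this sketch, so it is only suitable where the context is known.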
This information is reported under Processor Information. • Processor set (pset) Every application can (potentially) be run in a different processor set, which can have unique characteristics that impact performance. For each process, HP Caliper detects and reports which processor set was used. Possibilities are: ◦ Default: No processor set was specified. ◦ Kernel: A special processor set that a few kernel processes belong to. (This appears only in system-wide measurements.
How HP Caliper Saves Data in Databases HP Caliper saves performance data for every measurement run to a database. This allows you to regenerate reports from the same performance data without having to rerun your application under HP Caliper. You have these capabilities: • You can generate a new report with different attributes from the saved data. This means that you do not have to rerun HP Caliper on the live program.
By default, HP Caliper does not generate a report file when you specify -d. However, you can generate a report file at the same time by specifying -o.
Using the caliper report Command to Create a Report from One or More Databases Use caliper report to create a single output report from one or more databases. The syntax for this command is: caliper report [report_options] [database ...] You can specify multiple databases, either individually or by using wildcards. For example, to generate a report for all databases matching dbase.* to the text file out.txt, do the following: $ caliper report -o out.txt dbase.
Example 2 Example of a caliper merge Run ================================================================================ HP Caliper 4.3.
Database: /home/sujoys/db3 Measurement scope: per-process Sampling Specification Sampling event: CPU_CYCLES Sampling period: 500000 events Sampling period variation: 25000 (5.
Example 3 Example of a caliper diff Run ================================================================================ HP Caliper 4.3.
HP Caliper supports diff reports for all measurements except the ones below: • cgprof (HP-UX only) • cpu (HP-UX only) • cstack • pmu_trace • scgprof Example of How to Use the caliper diff Command Assume these two measurement runs: $ caliper fprof -d fp1 cc himom.c $ caliper fprof -d fp2 cc -c himom.
10 Producing a Sampled Call Graph Profile Analysis HP Caliper can produce a sampled call graph profile report (using the scgprof measurement) from any compiled program. You do not need to compile your program in any special way to use this feature. The call graph is produced by sampling the processor's performance monitoring unit (PMU) to determine function calls. The call graph is not exact, because it does not show every function call, but it is statistically useful. This chapter provides an overview.
Running the HP Caliper Sampled Call Graph Profile You can start HP Caliper from the command line, a shell script, or your program's Makefile to produce a sampled call graph profile. The syntax is: caliper scgprof [caliper_options] program [program_arguments] This measurement uses the --branch-sampling-spec option to control the sampling of the branch trace buffer (BTB)/execution trace buffer (ETB), which produces the statistical call graph. For more information, see “--branch-sampling-spec” (p. 54).
Figure 25 Sampled Call Graph Text Report Example ================================================================================ HP Caliper A.4.3.
-------------------------------------------------------------------------Load Module Summary ------------------------------------------------------------------------------% Total Cumulat Secs Msecs IP % of IP in Call per Samples Total Samples Module Count Call Load Module ------------------------------------------------------------------------------59.26 59.26 32 0.01 2168 0.00 libc.so.1 38.89 98.15 21 0.01 135 0.05 wordplay 1.85 100.00 1 0.00 8 0.04 dld.
9 Function Totals -----------------------------------------0 0x0000:0 M addp4 r32=r0,r32 :1 F nop.f 0x0 :2 I addp4 r33=r0,r33 0x0010:0 M nop.m 0x0 :1 F nop.f 0x0 :2 I nop.i 0x0;; 0x0020:0 M alloc r31=ar.pfs,0,8,0,8 :1 F nop.f 0x0 :2 I and r28=0x7,r33 0x0030:0 M and r30=0xfffffffffffffff8,r33 :1 I and r29=0x7,r32 :2 B brp.loop.imp {self}+0x180,{self}+0x190;; 1 0x0040:0 M cmp.eq.unc p15=r0,r32 :1 F nop.f 0x0 :2 I shl r21=r28,3 0x0050:0 M cmp.eq.unc p14,p13=r0,r33 :1 I mov r8=r32 :2 B (p15) br.ret.dpnt.
:2 ~5,0x2400:0 :1 :2 B M M I (p4) (p5) (p5) (p20) br.cond.dpnt.many {self}+0x2db0;; ld8.acq r8=[r37] ld8 r1=[r38] mov r1=r48;; ~ ~ ~ ~ ~ ~ ~ ~ 2 ~5,0x2430:0 :1 :2 ~5,0x2440:0 :1 :2 ~301 ~2,0x2450:0 :1 :2 M nop.m 0x0 B (p6) br.cond.dpnt.many {self}+0x870 B (p2) br.cond.dpnt.many {self}+0x3390;; M (p3) mov r47=1 M nop.m 0x0 B br.
--------------------------------------------------5.56 [wordplay::alphabetic, 0x4005b80, wordplay.c] 3 ~902 Function Totals -----------------------------------------[/home/meagher/wordplay.c] (0) ~902 >{ 0 ~1,0x0000:0 M alloc r43=ar.pfs,0,13,1,0 :1 M addl r38=-48,r1 :2 I mov r33=b0 (1) ~907 > for (i = 0; i < (int) strlen (s); i++) ~3,0x0010:0 M mov r42=r1 :1 M addl r9=160,r1 :2 I mov r45=r32;; ~ ~ ~ ~ ~ ~ ~ ~ ~909 > alphstr[pos++] = s[i]; ~7,0x0060:0 M adds r44=144,r44 :1 I mov b6=r8 :2 B br.call.sptk.
/ux/libsobj_i380em/libs/libc/shared_em_32/obj/../../../../../core/libs/libc/shared_em_32/../core/stdio/fgets.c] (0) 0 ~ ~ ~ ~ ~ ~ ~ ~ (1) 1 ~96 ~1,0x0000:0 :1 :2 ~1,0x0010:0 :1 :2 > ~49 ~1,0x00f0:0 :1 :2 ~1,0x0100:0 :1 :2 ~1,0x0110:0 :1 :2 *> M M I M M I M I I M I I M M B (p6) alloc mov mov adds addl addp4 r35=ar.
12.1 libc.so.1::strcpy [5] wordplay::main [2] *ROOT* [1] ---------------------------9.9 libc.so.1::strlen [3] wordplay::main [2] *ROOT* [1] ---------------------------9.2 libc.so.1::strlen [3] wordplay::alphabetic [6] wordplay::main [2] *ROOT* [1] ---------------------------8.8 libc.so.1::strlen [3] wordplay::uppercase [4] wordplay::main [2] *ROOT* [1] ---------------------------7.4 libc.so.1::toupper [8] wordplay::uppercase [4] wordplay::main [2] *ROOT* [1] ---------------------------5.
26.98 51/189 27 wordplay::extract [7] 72.49 137/189 72 wordplay::main [2] [5] 16.7 100.00 189 libc.so.1::strcpy [5] -----------------------------------------------------------------------100.00 39/39 100 wordplay::main [2] [6] 14.7 37.70 39 wordplay::alphabetic [6] 62.30 458/1478 31 libc.so.1::strlen [3] -----------------------------------------------------------------------100.00 44/44 100 wordplay::main [2] [7] 11.8 47.09 44 wordplay::extract [7] 38.12 51/189 27 libc.so.1::strcpy [5] 14.78 87/1478 6 libc.
2 19 20 21 3 4 main memmove mmap strcmp strlen uppercase 10 23 24 5 8 memccpy __milli_rem32U _mmap_sys strcpy toupper ---------------------------------------------------------------------Diagnostic Messages ---------------------------------------------------------------------+ Note: Multiple sampling counter variations are not available on HP-UX.
Diagnostic Messages The Diagnostic Messages appear at the end of the report. gprof Fallacy and Possibly Misleading Results The HP Caliper sampled call graph report (with the scgprof measurement) and the call graph report (with the cgprof measurement) both produce gprof-like reports. Thus, both these reports might produce misleading results regarding the amount of time spent under a function.
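A small worked example shows how a gprof-like report can mislead. It assumes the fallacy referred to here is the usual gprof assumption that every call to a function costs the same, so a callee's time is divided among its callers in proportion to call counts. The caller names, call counts, and timings are hypothetical.

```python
def attribute_by_call_count(callee_seconds, calls_by_caller):
    """Divide a callee's total time among callers in proportion to call counts."""
    total_calls = sum(calls_by_caller.values())
    return {
        caller: callee_seconds * calls / total_calls
        for caller, calls in calls_by_caller.items()
    }

# Hypothetical truth: the callee is cheap when called from one caller and
# expensive when called from the other.
actual_seconds = {"cheap_caller": 1.0, "expensive_caller": 9.0}
call_counts = {"cheap_caller": 90, "expensive_caller": 10}

estimated = attribute_by_call_count(sum(actual_seconds.values()), call_counts)
for caller in call_counts:
    print(caller, "actual:", actual_seconds[caller], "estimated:", estimated[caller])
```

In this example the proportional estimate assigns most of the callee's time to the caller that actually used the least of it, which is the kind of misleading result to watch for.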
11 Producing a Sampled Call Stack Profile Analysis
HP Caliper can produce a sampled call stack profile report (using the cstack measurement) from any compiled program. You do not need to compile your program in any special way to use this feature. HP Caliper periodically samples the application program counter and the call stack of each of its threads, and then creates a call stack profile of the program's execution.
Figure 26 Call Stack Profile Text Report Example ================================================================================ HP Caliper A.4.4.
--------------------------------------------------------------------------------------------------------57.14 57.14 28 0 28 0 9 libpthread.so.1 40.82 97.96 20 0 20 0 0 libc.so.1 2.04 100.00 1 1 0 0 0 enh_thr_mutex1 --------------------------------------------------------------------------------------------------------100.00 100.
---------------------------------------------40.8 0.0 40.8 libc.so.1::__sigtimedwait_sys [8] libc.so.1::sigtimedwait [5] libc.so.1::_sleep [4] enh_thr_mutex1::foo [7] enh_thr_mutex1::start_routine [3] libpthread.so.1::__pthread_bound_body [2] ---------------------------------------------38.8 0.0 38.8 libpthread.so.1::___lwp_wait_sys [10] libpthread.so.1::_lwp_wait [11] libpthread.so.1::__vp_join [9] libpthread.so.1::pthread_join [12] enh_thr_mutex1::main [13] dld.
100.00 enh_thr_mutex1::start_routine [3] 0.00 libpthread.so.1::pthread_mutex_lock [16] 100.00 libpthread.so.1::*unnamed@0x404(1670-5b70)* [14] -------------------------------------------------------------------100.00 libpthread.so.1::_lwp_mutex_lock [15] [17] 18.4 0.0 18.4 100.00 libpthread.so.1::__lwp_mutex_lock_sys [17] -------------------------------------------------------------------[16] 18.4 0.0 18.
Block Hits Hits Name Hits Only Only ---------------------------------------------95.0 0.0 95.0 libpthread.so.1::___lwp_wait_sys [3] libpthread.so.1::_lwp_wait [4] libpthread.so.1::__vp_join [5] libpthread.so.1::pthread_join [6] enh_thr_mutex1::main [7] dld.so::main_opd_entry [1] ---------------------------------------------5.0 5.0 0.0 enh_thr_mutex1::main [7] dld.so::main_opd_entry [1] ---------------------------------------------[Minimum function entries: 5, percent cutoff: 1.
20.41 [(No source information) libc.so.1::__sigtimedwait_sys, 0x422ab40] 10 0 10 0 0 Function Totals ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------18.37 [(No source information) libpthread.so.
-------------------------------------------------------------------Function Indexes (Thread 6065598@start_routine) --------------------------------------------------Index Name Index Name --------------------------------------------------2 *ROOT* 6 foo 10 _lwp_mutex_lock 11 __lwp_mutex_lock_sys 3 __pthread_bound_body 8 pthread_mutex_lock 5 sigtimedwait 7 __sigtimedwait_sys 4 _sleep 1 start_routine 9 *unnamed@0x404(1670-5b70)* Load Module Summary (Thread 6065597@start_routine) --------------------------------
-------------------------------------------------------------------100.00 *ROOT* [7] [6] 100.0 0.0 100.0 0.00 libpthread.so.1::__pthread_bound_body [6] 100.00 enh_thr_mutex1::start_routine [5] -------------------------------------------------------------------[7] 100.0 0.0 100.0 0.00 *ROOT* [7] 100.00 libpthread.so.
Figure 27 Call Stack Profile Text Report Example for Linux ================================================================================ HP Caliper C.4.4.
-------------------------------------------------------------Function Summary (All Threads) ---------------------------------------------------------------------------------------% Total Cumulat WallSample IP % of Clock Hits Samples Total Samples Waiting Function File ---------------------------------------------------------------------------------------78.05 78.05 32 1 *kernel gateway* 21.95 100.00 9 0 enh_thr_mutex1::main enh_thr_mutex1.
[1] 100.0 0.00 *ROOT* [1] 51.22 libc.so.6.1::__clone2 [5] 48.78 enh_thr_mutex1::_start [8] -----------------------------------------------62.50 libc.so.6.1::__GC___libc_nanosleep [9] 34.38 libpthread.so.0::pthread_join [12] 3.12 libpthread.so.0::__lll_lock_wait [13] [2] 78.0 100.00 *kernel gateway* [2] -----------------------------------------------100.00 libpthread.so.0::start_thread [4] [3] 51.2 0.00 enh_thr_mutex1::start_routine [3] 95.24 enh_thr_mutex1::foo [11] 4.76 libpthread.so.
Function Details (Thread 31021@main) -----------------------------------------------------------------% Total WallSample Line| IP Clock Hits Slot| >Statement| Samples Samples Waiting Col,Offset Instruction -----------------------------------------------------------------26.83 [(No source information) *kernel gateway*, 0xa000000000000000] 11 0 Function Totals --------------------------------------------------------------------------------------------------------------------------21.
26.83 26.83 11 1 *kernel gateway* -------------------------------------------------------------26.83 26.83 11 1 Total -------------------------------------------------------------Function Summary (Thread 31024@start_routine) ---------------------------------------------------------------------------------------% Total Cumulat WallSample IP % of Clock Hits Samples Total Samples Waiting Function File ---------------------------------------------------------------------------------------26.83 26.
100.00 libc.so.6.1::__GC___libc_nanosleep [6] -----------------------------------------------100.00 enh_thr_mutex1::start_routine [2] [8] 90.9 0.00 enh_thr_mutex1::foo [8] 100.00 libc.so.6.1::sleep [7] -----------------------------------------------100.00 libpthread.so.0::pthread_mutex_lock [10] [9] 9.1 0.00 libpthread.so.0::__lll_lock_wait [9] 100.00 *kernel gateway* [1] -----------------------------------------------100.00 enh_thr_mutex1::start_routine [2] [10] 9.1 0.00 libpthread.so.
[4] 100.0 0.00 enh_thr_mutex1::foo [4] 100.00 libc.so.6.1::sleep [3] -----------------------------------------------100.00 libpthread.so.0::start_thread [6] [5] 100.0 0.00 enh_thr_mutex1::start_routine [5] 100.00 enh_thr_mutex1::foo [4] -----------------------------------------------100.00 libc.so.6.1::__clone2 [7] [6] 100.0 0.00 libpthread.so.0::start_thread [6] 100.00 enh_thr_mutex1::start_routine [5] -----------------------------------------------100.00 *ROOT* [8] [7] 100.0 0.00 libc.so.6.
Example 4 Sample cstack Report - Blocking Primitives Details Blocking Primitives Details (All Threads) -----------------------------------------------------------------------------------------------Sample Callpath Holder's % Total Sample Sample Sample Hits Index Kernel Hits Hits Hits Hits Blocking For Holder Holder Thread Waiting Waiting Spinning Blocked Primitive --For Waiter --Waiter ID -----------------------------------------------------------------------------------------------20.
Call Graph Part of the Report
This section reports the call graph produced from the call stack samples. All the call graph entries (one for each function) are reported. Each entry has one or more lines and is delimited by a line of dashes. In each entry, the primary line is the one that starts with an index number in square brackets. The preceding lines in the entry describe the callers of this function. The lines following the primary line describe the callees of this function.
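The entry layout described above (callers, then a primary line starting with a bracketed index, then callees, with entries separated by lines of dashes) can be read with a short Python sketch. The function names split_entries and split_entry are hypothetical, and the sample lines are adapted from the Linux call stack report shown earlier in this chapter. Real report lines carry more columns than this sketch inspects.

```python
import re

PRIMARY_LINE = re.compile(r"^\[(\d+)\]")

def split_entries(report_lines):
    """Group call graph lines into entries, using dash-only lines as delimiters."""
    entries, current = [], []
    for line in report_lines:
        text = line.strip()
        if text and set(text) == {"-"}:
            if current:
                entries.append(current)
            current = []
        elif text:
            current.append(text)
    if current:
        entries.append(current)
    return entries

def split_entry(entry):
    """Return (callers, primary, callees) for one call graph entry."""
    for position, line in enumerate(entry):
        if PRIMARY_LINE.match(line):
            return entry[:position], line, entry[position + 1:]
    raise ValueError("entry has no primary line")

sample = [
    "100.00 libpthread.so.0::start_thread [4]",
    "[3] 51.2 0.00 enh_thr_mutex1::start_routine [3]",
    "95.24 enh_thr_mutex1::foo [11]",
    "------------------------------------------------",
    "100.00 enh_thr_mutex1::start_routine [2]",
    "[8] 90.9 0.00 enh_thr_mutex1::foo [8]",
    "100.00 libc.so.6.1::sleep [7]",
]

for entry in split_entries(sample):
    callers, primary, callees = split_entry(entry)
    print(primary, "| callers:", len(callers), "| callees:", len(callees))
```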
libc.so.1::sigwait caliper::signal_monitor_thread_main libpthread.so.1::__pthread_bound_body ------------------------------------------------------------Hot Call Paths (Thread 859900@timers_thread_main) --------------------------------------------------------------% Total Hits In Only-Run + Run Block Index Block Hits Hits Name Hits Only Only ------------------------------------------------------------0 100.0 0.0 100.0 libc.so.1::_nanosleep_sys libc.so.
12 Performing CPU Metrics Analysis HP Caliper can measure and report per-process or system-wide metrics based on sampled CPU events. This is enabled by the cpu measurement. Specify the events and sampling period with the -m event_set and -s period options, respectively. You can measure multiple metrics in the same run. For most applications, the cpu measurement is the first measurement you should take when you begin using HP Caliper. Run this command: $ caliper cpu -o cpu.
13 HP Caliper Features Specific to HP-UX These features are available only when using HP Caliper on the HP-UX operating system: • These measurements: ◦ cgprof ◦ cpu See “Performing CPU Metrics Analysis • ◦ fcount ◦ fcover ” (p. 152). These command-line options: ◦ --bus-speed See “--bus-speed ◦ ” (p. 55). --cpu-aggregation See “--cpu-aggregation ◦ --cpu-details See “--cpu-details ◦ ” (p. 61). --exclude-idle See “--exclude-idle ◦ ” (p. 56).
Use of this option causes two different sets of memory measurements to be taken, each reported in its own table in the report: • Overall memory available (and currently in use and free) on the system • Memory currently being consumed by the process(es) being measured by a particular HP Caliper run If the HP Caliper run is made on a ccNUMA system, then the memory usage of every “logical domain” is separately measured and reported.
• --memory-usage=all:30m Causes process memory usage to be measured at the beginning, at the end, and every 30 minutes of the process's execution. • --memory-usage=end Causes process memory usage to be measured once: at the end of the process's execution. • --memory-usage= Causes process memory usage to be measured at the beginning, at the end, and every 1 second of the process's execution. (This is equivalent to --memory-usage all.
On an SMP system, there is only one entry for the single, local domain. No total is necessary for SMP systems. The fields in each entry are: Domain Id System identification number of the logical domain. On ccNUMA systems, cell local memory domains are numbered starting at 1, and the interleaved memory domain Id is −1. On SMP systems, the only domain is numbered 0. Physical Id Physical identification number of the logical domain. Cell local memory physical domains are numbered starting with 0.
The time value will generally be a multiple of the requested sample period (--memory-usage=nnn), which defaults to 1 second. “Gaps” in the time sequence of snapshots indicate a stretch of time where the process's memory usage did not change. Domain Id System identification number of the logical domain. On ccNUMA systems, cell local memory domains are numbered starting at 1, and the interleaved memory domain Id is –1. On SMP systems, the only domain is numbered 0.
Figure 29 Example System Usage Report Output System Usage - Run Status (All Threads) -------------------------------------------------------------------------------Relative -------- Time (thread secs) -------------- Percentage -------Time Running Eligible Waiting Running Eligible Waiting -------------------------------------------------------------------------------Overall 5.4534 0.0060 18.3617 22.89% 0.03% 77.
sigenable 132 1627.70 0.00000 0.00000 0.00000 0.00012 pstat 1 49.32 0.00010 0.00010 0.00010 0.00010 lwp_cond_broadcast 6 73.99 0.00000 0.00001 0.00006 0.00008 ttrace 1 49.32 0.00007 0.00007 0.00007 0.00007 open 6 295.95 0.00001 0.00001 0.00002 0.00007 ioctl 1 49.32 0.00004 0.00004 0.00004 0.00004 shmctl 2 98.65 0.00000 0.00002 0.00004 0.00004 brk 15 184.97 0.00000 0.00000 0.00000 0.00003 mpctl 16 789.19 0.00000 0.00000 0.00000 0.00003 sigaction 22 1085.13 0.00000 0.00000 0.00001 0.00003 close 10 493.24 0.
2. Run ./myprog and find the process ID of the process.
3. Specify the process you want to measure. For example:
$ caliper fprof 7654
HP Caliper remains attached to the target process until it ends or you type Ctrl-C. If you type Ctrl-C to stop HP Caliper and generate a report, HP Caliper forcibly terminates all processes that are being measured.
sampling_counter = "NO_EVENT"
If you don't change this setting, then the samples you have marked will be included with whatever sampling results HP Caliper is set to generate. You can instead run HP Caliper, specifying -s ,,NO_EVENT or -s "" on the command line.
5. Run your application under HP Caliper using that modified measurement configuration file:
$ caliper my_pmu_trace myprogram
Figure 31 (page 161) shows part of the resulting report.
Although executing those instructions does not cause an application to crash in the absence of HP Caliper, they still affect performance: executing a break instruction causes a trap to the breakpoint handler in the kernel. • The presence of trigger macros may disable some optimizations that the compiler could otherwise perform. The trigger instructions are defined so that code will not be moved around them.
for a shorter time, without having to worry about the effects caused by the startup and shutdown code. Similarly, the data collection can be restricted to the startup or shutdown phases to target those for performance improvements. NOTE: This feature is not intended to measure a small number of instructions. Enabling and disabling the PMU are not immediate operations and either operation might take a few processor cycles to be effective.
Figure 32 Restricting PMU Measurement to Specific Code #include #include
A HP Caliper Diagnostic and Warning Messages This appendix describes some diagnostic and warning messages you might receive. HP Caliper always attempts to measure everything that you request. When this is not possible, however, HP Caliper gives you diagnostic or warning messages. You can usually safely ignore these messages. Several situations can cause these messages: • A sampled address is outside the measurement context. • A function contains specialized assembly code.
Figure 33 Mispredicted Branches Example Function Details ---------------------------------------------------------------------------------------------% Total Target Line| Taken of Branch Branch Taken NTaken % Slot| >Statement| Mispr Branch Taken NTaken Mispr Mispr Mispr Col,Offset Instruction ---------------------------------------------------------------------------------------------25.00 [libc.so.1::__thread_mutex_lock, 0x40000000002123a0, wrappers1.c] 2 2 0 1 0 50.
---------------------------------------------------------------------------------------------[Minimum function entries: 0, percent cutoff: 1.00, cumulative percent cutoff: 100.00] By using a custom HP Caliper script, you can restrict the branch-trace buffer to only include branches with specific prediction results, both for target prediction and taken/not-taken prediction.
On HP-UX, sampled call graph reports require kernel patch PHKL_34020. To install this patch, see http://www.hp.com/go/hpsc for availability and download information. Email the HP Caliper team at caliper-help@cup.hp.com if you have questions about this patch.
B Descriptions of Measurement Reports This appendix contains descriptions of reports produced for each HP Caliper measurement. It shows example command lines you can use to produce the reports and describes the data available with the measurements.
Metrics for Integrity Servers Itanium 2 Systems
INST_CHKA_LDC_ALAT.ALL The number of advanced check load (chk.a) and check load (ld.c) instructions that reached retirement, including both integer and floating-point instructions.
INST_FAILED_CHKA_LDC_ALAT.ALL The number of failed advanced check load (chk.a) and check load (ld.c) instructions that reached retirement, including both integer and floating-point instructions.
ALAT_CAPACITY_MISS.ALL
Data speculation miss percentage
INST_CHKA_LDC_ALAT.INT The number of all advanced check load (chk.a) and check load (ld.c) instructions that reach retirement. Counts only retired integer instructions. INST_FAILED_CHKA_LDC_ALAT.FP The number of failed advanced check load (chk.a) and check load (ld.c) instructions that reach retirement. Counts only failed floating-point instructions. INST_FAILED_CHKA_LDC_ALAT.INT The number of failed advanced check load (chk.a) and check load (ld.c) instructions that reach retirement.
ALAT_ENTRY_REPLACED ALAT_STORE_HIT % Data speculation miss % Failed float advanced check load % Failed integer advanced check load Number of ALAT entries replaced Advanced check load per 1000 instructions retired Failed advanced check load per 1000 instructions retired % Cycles lost due to stalls (lower is better) % Core cycles due to this thread Number of Number of Percentage Percentage Percentage ALAT entry replaced. ALAT store hit . of data speculation miss. of failed float advanced check load.
Table 9 Information in alat Measurement Reports (continued) Column Description Column and line numbers are preceded by “~” when they are approximate due to optimization. >Statement | Instruction The column contains either a source statement, preceded by “>”, or a disassembled instruction. Statements that are out of order due to optimization are preceded by “*>”. How ALAT Metrics Are Obtained HP Caliper obtains ALAT metrics from the processor's performance monitoring unit (PMU).
• BR_MISPRED_DETAIL.ALL.WRONG_TARGET Number of branch mispredictions that resulted from a mismatch of the predicted and actual values of the branch target, independent of predictor. • Total Predictions Total number of branch predictions. • Percent Correct Predictions Percentage of branch predictions that predicted correctly. • Percent Wrong Paths Percentage of branch predictions that mispredicted the branch predicate.
• Percent Wrong Paths Percentage of branch predictions that mispredicted the branch predicate. • Percent Wrong Branch Targets Percentage of branch predictions that mispredicted the branch target. • Percent iprel branch Percentage of IP-relative branches among all branches. • Percent ind branch Percentage of non-return indirect branches among all branches. • Percent ret branch Percentage of return branches among all branches.
• BR_PRED_DETAIL.NON_RETIND_CORR_PRED Number of non-return indirect branch types with correctly predicted path and target. • BR_PRED_DETAIL.RETURN_WRONG_PATH Number of return branch types with mispredicted path. • BR_PRED_DETAIL.NON_RETIND_WRONG_TARGET Number of non-return indirect branch types with mispredicted target but correctly predicted path. • BR_PRED_DETAIL.RETURN_CORR_PRED Number of return branch types with correctly predicted path and target. • BR_PRED_DETAIL.
• Percent ret correct predictions Percentage of return branches that predicted correctly. • Percent ret wrong paths Percentage of return branches that mispredicted the branch predicate. • Percent ret wrong branch targets Percentage of return branches that mispredicted the branch target. branch Measurement Report Metrics See Table 10 (page 177).
Table 10 Information in branch Measurement Reports (continued) Column Description Line | Slot | Col,Offset The column contains one of these: • A source-code line number for rows showing statements • An instruction slot number for rows showing instructions not on a bundle boundary • A source-code column number followed by an offset from the beginning address of a function for rows showing instructions on a bundle boundary Column and line numbers are preceded by “~” when they are approximate due to optimiz
The times cgprof reports are affected by the instrumentation done to collect the data. The times are inaccurate in absolute terms but valid in relative terms. HP Caliper accounts for 100 percent of time under main for all applications with 0 (zero) unattributed samples.
Table 12 (page 180), Table 13 (page 180), Table 14 (page 180), and Table 15 (page 181) show more information. Table 12 Information in cgprof Measurement Report: Function Entries (Self Entries) Column Description Index Index of the function in the call graph listing, as an aid to locating it. % Total Hits In or Under Percentage of the total hits of the program accounted for by this function and its descendants.
Table 15 Information in cgprof Measurement Report: Children Listings Column Description % Func Hits In Children Number of hits due to this child entry and its descendants, expressed as a percentage of the number of hits accounted for by the self entry and its descendants. Called* Number of times the function represented by the self entry was called by this child. /Total** Number of times this child is called by all functions. % Call Total Fraction of Called/Total expressed as a percentage.
If you set the --cpu-details means option, only the MEAN and STDEV statistics are reported. This is the default. • A list of samples, with each sample showing both sampled values and statistics derived from those values. To report samples (not shown by default), specify --cpu-details means:samples or --cpu-details statistics:samples. Example Command Lines for Text Report $ caliper cpu -o cpu.
If you specify this event set and you are on an Integrity servers dual-core Itanium 2 or Itanium 9300 quad-core processor system, it is treated as if you specified -m l2dcache,l2icache.
l2dcache Provides miss rate information for the L2 data cache for Integrity servers dual-core Itanium 2 and Itanium 9300 quad-core processor systems.
l2icache Provides miss rate information for the L2 instruction cache for Integrity servers dual-core Itanium 2 and Itanium 9300 quad-core processor systems.
l3cache
overview
• l3cache
• tlb
• fp
• replay
This is the same as specifying -m stall,cpi,l1icache,l1dcache,l2icache,l2dcache,l3cache,tlb,fp,replay
queues Provides bus request queue (BRQ) metrics that may give some insight into possible performance problems related to the system bus.
stall Provides metrics on primary CPU performance limiters by breaking the CPI into seven components.
sysbus Provides metrics on system bus utilization.
threadswitch
tlb
Table 16 Information in cstack Measurement Report Fields (Flat Profile) (continued) Column Description Sample Hits Running Number of direct sample hits taken when the process was running, attributed to the given object.
Table 18 Information in cstack Measurement Report Fields (Hot Call Paths Profile) Column Description % Run + Block Hits (HP-UX only) Percentage of total sample hits directly in the call path. This represents the percentage of the total real time attributable to the call path. % Run Hits Only (HP-UX only) Percentage of run hits directly in the call path. This represents the percentage of the total real time attributable to the call path that was in a run state.
The report shows two levels of information: • Exact counts of CPU metrics summed across the entire run of an application • Sampled IPs that are associated with particular locations in the measured application. When compared with the fprof measurement, the cycles measurement provides the following two additional pieces of information when invoked with the -r all option: • Cycles Per Bundle: The average number of cycles elapsed to retire the bundle.
BE_RSE_BUBBLE.ALL Full Pipe Bubbles in Main Pipe due to RSE stalls. Percentage of cycles lost due to stalls in RSE spilling/filling registers to/from memory.
CPU_CPL_CHANGES.ALL Number of Privilege Level Changes to/from all privileges.
CPU_OP_CYCLES.ALL Number of elapsed CPU operating cycles. (Note: This event is called CPU_CYCLES on Itanium 2 systems.) When HyperThreading is on, this is the number of elapsed CPU operating cycles used by only this process's hyperthread.
CPU_OP_CYCLES.
CYC_BE_WB2_FLUSH.ANY The number of CPU cycles spent in WB2 (Write back) flushing of instructions. CYC_BE_IBD_STALL.ANY The number of CPU cycles spent in the IBD (instruction buffer and dispersal) without issuing instructions. CYC_BE_IBD_STALL.GR_LOAD This is the number of cycles lost (stall cycles) due to GR load RAW or WAW dependency condition of the instruction. CYC_BE_EXE_REPLAY.GR_LOAD_RAW This is the number of cycles lost (stall cycles) in replay due to RAW hazard in an instruction's GR load.
cycles Measurement Metrics See Table 20 (page 190). In this table, “program object” refers to any of the following: • Thread • Load module • Function • Source statement • Instruction bundle Table 20 Information in cycles Measurement Reports Column Description % Total IP Samples (ETB) Percent of the total IP samples attributable to a given program object. Cumulat % of Total Running sum of the percent of total IP samples accounted for by the given program object and those listed above it.
The list of processor metrics you can use for the sampling event are available from the file itanium2_cpu_counters.txt, located in the HP Caliper home directory in the doc/text subdirectory. The ETB collected at each sampling point can contain up to 16 IPs. By default, cycles will pick the youngest IP sample from the ETB. However, all the 16 IP entries are processed to collect the elapsed cycles (Cycles Per Bundle) information.
dcache Metrics Summed for Entire Run This section describes the metrics summed over the entire run of your application under HP Caliper. Metrics for Integrity Servers Itanium 2 Systems L1D_READS The number of data memory read references issued into memory pipeline that are serviced by the L1 data cache (only integer loads), register stack engine (RSE) loads, L1-hinted loads (L1 data cache returns data if it hits in L1 data cache but does not do a fill) and check loads (ld.c).
L1D_READ_MISSES.ALL L2D_INSERT_MISSES L2D_MISSES L2D_REFERENCES.
CYC_BE_EXE_REPLAY.GR_LOAD_WAW This is the number of cycles lost (stall cycles) in replay due to WAW hazard in an instruction's GR load. CYC_BE_DET_REPLAY.GR_LOAD This is the number of cycles lost (stall cycles) in replay due to memory loads of single cycle GR load instructions. The loads do not hit the FLD (first level data cache) and must be obtained from lower level caches or memory leading to extra cycles. DATA_REF.ANY The number of data memory references issued into memory pipeline.
• Source statement • Instruction Table 21 Information in dcache Measurement Reports Column Description % Total Dcache Latency Cycles Total cache miss latency cycles, expressed as a percent of the total cycles. Sampled Dcache Hits Total number of sampled L1 (or FLD) data cache accesses attributed to the given program object. Enabled only on Intel® Itanium® 9500 series processors.
Table 21 Information in dcache Measurement Reports (continued) Column Description Line | Slot | Col,Offset The column contains one of these: • A source-code line number for rows showing statements • An instruction slot number for rows showing instructions not on a bundle boundary • A source-code column number followed by an offset from the beginning address of a function for rows showing instructions on a bundle boundary Column and line numbers are preceded by “~” when they are approximate due to optimiz
Example 6 Example of a dcache Report for a Superdome Integrity Server Function Details --------------------------------------------------------------------------------------------------% Total Avg. ---Latency buckets as % Misses--Dcache Sampled Dcache Dcache L2 --L3-- loc loc 1 2 1&2 Line| Latency Dcache Latency Laten.
• RSE Stack - the RSE stack area • Memory mapped shared library - the data area of the shared libraries mapped to the process • Memory mapped region - all other memory mapped regions If there is more than one region of the same type, they are combined and reported as a single entry. The Data Summary report is generated per-process. For a per-thread report, use the --thread all option. For a per-module report, use the --per-module-data True option.
You can potentially get a rough estimate of the total number of data cache misses incurred by a particular instruction, for example, by doing the following:
1. Determine a scaling factor based on total misses and number of misses accounted for by sampling:
scale = total L1 misses / (total sampled misses * sampling rate)
2.
dtlb Metrics Summed for Entire Run This section describes the metrics summed over the entire run of your application under HP Caliper.
IA64_INST_RETIRED L1DTLB_TRANSFER L1D_READS L2DTLB_MISSES % of Cycles lost due to all stalls (lower is better) % of Cycles lost due to GR/load dependency stalls (lower is better) % of Cycles lost due to GR/GR dependency stalls (lower is better) % of Cycles lost due to FR/load and FR/FR dependency stalls (lower is better) Total L1 data TLB references L1 data TLB for L1D miss percentage L2 data TLB misses L2 data TLB miss percentage Percentage of L2 DTLB misses covered by the HPW Percentage of data referenc
CYC_BE_IBD_STALL.GR_LOAD Number of Backend IBD bubbles due to GR load RAW or WAW condition; starts after an EXE replay or DET replay. DTLB_HPWREQ_BLK_MISS.FAIL Number of Blocking walk missed the DTB, HPW walk failed. CYC_BE_EXE_REPLAY.GR_LOAD_RAW Number of Backend EXE replay cycles due to GR load RAW; a new instruction has a source register targeted by an outstanding load, or outstanding long latency move, or TLB related operation. IA64_INST_RETIRED Number of instructions retired. CYC_BE_EXE_REPLAY.
• Source statement • Instruction Table 22 Information in dtlb Measurement Reports Column Description % Total Percent of the total for the reported metric attributable to a given program object. The reported metric is the same as the metric HP Caliper uses for sorting, except when the sort metric is address, in which case sampled misses is used. Cumulat % of Total Running sum of the percent of total for the reported metric accounted for by the given program object and those listed above it.
More frequent sampling increases HP Caliper's perturbation of your application. In the extreme case of taking one sample for each TLB miss event, the kernel will trap on every event, making the resulting data of limited value. ecount Measurement Report Description With the ecount measurement, produced by the ecount measurement configuration file, HP Caliper measures and reports total counts of processor metrics accumulated during an application's execution under HP Caliper control.
• BE_FLUSH_BUBBLE.ALL — The number of Full Pipe Bubbles in Main Pipe due to pipeline flushes. This is the number of cycles lost (stall cycles) due to branch misprediction or exception/interruption flush. • BE_L1D_FPU_BUBBLE.L1D — The number of Full Pipe Bubbles in Main Pipe due to L1D cache. This is the number of cycles lost (stall cycles) due to L1D cache and L1/L2 DTLB. • CPU_OP_CYCLES.ALL — The number of elapsed CPU operating cycles.
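The stall events in this list are cycle counts, so one way to read them is as a share of CPU_OP_CYCLES.ALL. The following Python sketch assumes that a percentage of cycles lost is the stall-cycle count divided by the elapsed cycle count. This guide does not spell out that formula here, and the function name and sample counts are hypothetical, not HP Caliper output.

```python
def percent_of_cycles(stall_cycles, total_cycles):
    """Return stall_cycles as a percentage of total_cycles."""
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    return 100.0 * stall_cycles / total_cycles


# Hypothetical counts, keyed by event names from the list above.
counts = {
    "BE_FLUSH_BUBBLE.ALL": 1_500_000,
    "BE_L1D_FPU_BUBBLE.L1D": 4_500_000,
    "CPU_OP_CYCLES.ALL": 30_000_000,
}

total = counts["CPU_OP_CYCLES.ALL"]

# Assumption: each stall event is normalized by the elapsed cycle count.
stall_percentages = {
    name: percent_of_cycles(value, total)
    for name, value in counts.items()
    if name != "CPU_OP_CYCLES.ALL"
}

for name, pct in stall_percentages.items():
    print(f"{name}: {pct:.2f}% of cycles")
```

Normalizing by elapsed cycles makes runs of different lengths comparable, which raw counts are not.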
• MT_BE_THRSW_ACTUAL_OUT.ANY — The number of events that switched the foreground thread into a background thread (also called as switching out) or background thread switching in. • MT_BE_THRSW_ACTUAL_OUT.MLD_USE — The number of thread switches from foreground to background due to wait on middle level data cache (MLD). • CPU_OP_CYCLES.ALL — The number of elapsed CPU operating cycles.
Command-line options allow you to control how the report data are sorted.
Example Command Line for Text Report
$ caliper fcount -o reports/fcount.txt ./wordplay thequickbrownfox
Example Command Line for CSV Report
$ caliper fcount --csv reports/csvout ./wordplay thequickbrownfox
fcount Measurement Report Metrics
Table 23 (page 207) shows the information found in Function Call Count reports.
Table 25 Information in Per-Source-File fcover Measurement Reports Column Description Reached “Yes” or “No” indication of whether the named function was ever executed. Function Name of the function. The load module, main executable or shared library, containing the function precedes the function name and is separated from it by “::”. Source File [Line] Base source file name, without path information, and starting line number of the named function.
Example Command Line for CSV Report
$ caliper fprof --csv csvout ./wordplay thequickbrownfox
fprof Metrics Summed for Entire Run
This section describes the metrics summed over the entire run of your application under HP Caliper.
Metrics for Integrity Servers Itanium 2 Systems
CPU_CYCLES Number of elapsed processor cycles.
BACK_END_BUBBLE.ALL Full pipe bubbles in main pipe.
BE_EXE_BUBBLE.GRALL
% of Cycles Lost Due to Stalls
% of Cycles Stalled Due to GR/GR or GR/Load Dependency
% of Cycles lost due to frontend stalls (lower is better) % of Cycles lost due to Pipeline flush stalls (lower is better) % of Cycles lost due to data access stalls (lower is better) % of Cycles lost due to RSE stalls (lower is better) % of Cycles lost due to Scoreboard stalls (lower is better) % of Cycles lost due to register load stalls (includes FR/FR stalls) % of Cycles lost due to FR/load or FR/FR dependency stalls % of Cycles lost due to GR/load dependency stalls % of Cycles lost due to stalls in L1D
CYC_BE_DET_REPLAY.ANY CYC_BE_EXE_REPLAY.ANY CYC_BE_WB2_REPLAY.
Table 26 Information in fprof Measurement Reports Column Description % Total IP Samples Percent of the total IP samples attributable to a given program object. Cumulat % of Total Running sum of the percent of total IP samples accounted for by the given program object and those listed above it. IP Samples Total number of IP samples attributed to the given program object.
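The percentage and cumulative columns described above can be derived from the raw sample counts. The following Python sketch shows that arithmetic under two assumptions: the rows are ordered by descending sample count, and the function names and counts are hypothetical rather than taken from a real fprof report.

```python
def profile_rows(samples_by_object):
    """Build (name, percent, cumulative percent, samples) rows, largest first."""
    total = sum(samples_by_object.values())
    if total == 0:
        return []
    rows = []
    running = 0  # running sample count, kept as an integer to avoid drift
    ordered = sorted(samples_by_object.items(), key=lambda kv: kv[1], reverse=True)
    for name, samples in ordered:
        running += samples
        percent = 100.0 * samples / total
        cumulative = 100.0 * running / total
        rows.append((name, percent, cumulative, samples))
    return rows


# Hypothetical IP sample counts per function.
samples = {"func_b": 30, "func_a": 50, "func_c": 20}
rows = profile_rows(samples)

for name, percent, cumulative, count in rows:
    print(f"{percent:6.2f} {cumulative:7.2f} {count:6d}  {name}")
```

The cumulative column is computed from the running sample count rather than by adding percentages, so the last row reaches exactly 100 percent.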
icache Measurement Report Description With the icache measurement, produced by the icache measurement configuration file, HP Caliper measures and reports on instruction cache metrics. This measurement is similar to the dcache measurement.
Total L1 Instruction Cache References Sum of demand fetch reads and L1 cache line prefetch requests. Metrics for Integrity Servers Dual-Core Itanium 2 and Itanium 9300 Quad-Core Processor Systems BACK_END_BUBBLE.ALL Full Pipe Bubbles in Main Pipe due to all causes. This is the number of cycles lost (stall cycles) due to any of five possible events (FPU/L1D, RSE, EXE, branch/exception, or the front-end). BACK_END_BUBBLE.FE Full Pipe Bubbles in Main Pipe due to frontend.
L1 instruction cache misses per 1000 instructions retired L1 instruction prefetch misses per 1000 instructions retired L1 instruction demand misses per 1000 instructions retired L2 instruction cache misses per 1000 instructions retired L2 instruction prefetch misses per 1000 instructions retired L2 instruction demand misses per 1000 instructions retired Number of instructions retired per L1 instruction cache miss. Number of instructions retired per L1 instruction prefetch miss.
% L2 instruction cache miss L1 instruction cache misses per 1000 instructions retired L1 instruction prefetch misses per 1000 instructions retired L1 instruction demand misses per 1000 instructions retired L2 instruction cache misses per 1000 instructions retired L2 instruction prefetch misses per 1000 instructions retired L2 instruction demand misses per 1000 instructions retired Percentage of MLI cache misses. Number of FLI cache misses per 1000 instructions retired.
Table 27 Information in icache Measurement Reports (continued) Column Description Line | Slot | Col,Offset The column contains one of these: • A source-code line number for rows showing statements • An instruction slot number for rows showing instructions not on a bundle boundary • A source-code column number followed by an offset from the beginning address of a function for rows showing instructions on a bundle boundary Column and line numbers are preceded by “~” when they are approximate due to optimiz
Example Command Line for Text Report $ caliper itlb -o reports/itlbm.txt ./matmul Example Command Line for CSV Report $ caliper itlb --csv csvout ./matmul itlb Metrics Summed for Entire Run This section describes the metrics summed over the entire run of your application under HP Caliper. Metrics for Integrity Servers Itanium 2 Systems L1I_READS ITLB_MISSES_FETCH.L1ITLB ITLB_MISSES_FETCH.
L1ITLB_INSERTS_HPW L1I_READS % of Cycles lost due to all stalls (lower is better) % of Cycles lost due to frontend stalls (ICACHE, ITLB, and branch execution) % of Cycles lost due to instruction TLB stalls % of Cycles lost due to instruction cache stalls % of Cycles lost due to instruction access stalls (ICACHE and ITLB) % of Cycles lost due to branch execution Total L1 instruction TLB references L1 instruction TLB miss percentage L2 instruction TLB misses Percentage of L2 ITLB misses covered by the HPW L1
FLITLB_INSERT_HPW % of Cycles lost due to all stalls (lower is better) % of Cycles lost due to frontend stalls (ICACHE, ITLB, and branch execution) % of Cycles lost due to instruction TLB stalls % of Cycles lost due to instruction cache stalls % of Cycles lost due to instruction access stalls (ICACHE and ITLB) % of Cycles lost due to branch execution Total L1 instruction TLB references L1 instruction TLB miss percentage L2 instruction TLB misses L1 ITLB miss per 1000 instructions retired L2 ITLB miss per 1
Table 28 Information in itlb Measurement Reports (continued) Column Description % ITLB L2 Fill Percent of sampled instruction TLB misses that hit the L2 instruction TLB for the given program object. L2 fills are not reported for, and do not apply to, Itanium systems. % ITLB HPW Fill Percent of sampled instruction TLB misses that were handled by the HPW for the given program object. % ITLB Soft Fill Percent of sampled instruction TLB misses that were handled by software for the given program object.
pmu_trace Measurement Report Description With the pmu_trace measurement, produced by the pmu_trace measurement configuration file, HP Caliper measures traces of sampled PMU data associated with the application for each kernel thread. This data includes cache misses, TLB misses, ALAT misses, branch mispredictions, instruction addresses, and CPU events. These metrics are sampled using the processor's performance monitoring unit (PMU).
Table 29 Information in scgprof Measurement Report Fields (Flat Profile) Column Description % Total IP Samples Percent of the total IP samples attributable to a given program object. Cumulat % of Total Running sum of the percent of total IP samples accounted for by the given program object and those listed above it. IP Samples Total number of IP samples attributed to the given program object.
Table 30 Information in scgprof Measurement Report: Function Entries (Self Entries) (continued) Column Description % Call Total Does not apply. See Table 30 (page 223), Table 32 (page 224) and Table 33 (page 224). Function Name of the function. Cycle Cycle that this function is a member of, if any. Table 31 Information in scgprof Measurement Report (Hot Call Paths Profile) Column Description Total Hits In Only Percentage of total sample hits directly in the call path.
Table 33 Information in scgprof Measurement Report: Children Listings (continued) Column Description Children Name of this child function. Cycle Cycle that this child is a member of, if any. The cycle as a whole is listed with the same fields as a function entry. Beneath it are listed the members of the cycle, and their contributions to the time and call counts of the cycle.
Unimplemented data address fault Illegal operation fault Illegal dependency fault Privileged operation fault Reserved register/field fault IA32EXP IA32 Exception IACCS Instruction access bit fault IARGHT Instruction access rights fault IKEY Instruction key miss fault INT External interrupt ITLB Instruction translation lookaside buffer fault KPERM Key permission fault LPTRP Lower Privilege Transfer Trap or Unimplemented Instruction Address Trap NATC NAT Consumption fault PNotP Page Not Present fault SPECOP S
BE_EXE_BUBBLE.GRGR BE_FLUSH_BUBBLE.ALL BE_L1D_FPU_BUBBLE.ALL BE_L1D_FPU_BUBBLE.L1D BE_RSE_BUBBLE.ALL CPU_OP_CYCLES.ALL cycles) due to general register/general register or general register/load dependency. Full Pipe Bubbles in Main Pipe due to GR/GR dependency stalls. This is the number of cycles lost (stall cycles) due to GR/GR dependency stalls. Full Pipe Bubbles in Main Pipe due to pipeline flushes.
% of Cycles lost due to register dependency stalls (excludes FR/FR stalls) Percentage of cycles lost due to register dependency stalls. It excludes FR/FR dependency stalls.
% of Cycles lost due to GR/GR dependency stalls Percentage of cycles lost due to GR/GR dependency stalls.
% of Cycles lost due to FPU (floating-point unit) stalls Percentage of cycles lost due to floating-point unit stalls.
% Core cycles due to this thread
How traps Metrics Are Obtained HP Caliper obtains traps metrics using the execution trace buffer (ETB) of the performance monitoring unit (PMU). The ETB is configured to capture all changes to/from privilege level 0. HP Caliper takes samples by using the overflow of one of the PMU's event counters as a sampling trigger.
C Event Set Descriptions for CPU Metrics This appendix contains descriptions for the output of each event set available when you use the cpu measurement. NOTE: The information provided in this appendix for each report description is the same information you receive when you use the --info option to append help to the end of text reports, or when you use this command: $ caliper info -r event-set For more information, see “cpu Measurement Report Description ” (p. 181).
◦ %Indirect Branch This metric provides the percentage of Indirect branches among all branches. ◦ %Return Branch This metric provides the percentage of Return branches among all branches. • IPREL Path Statistics This metric provides path distribution and mispredict rate for both paths of a non-call IPREL branch. Unconditional IPREL branches are included, so there is a slight bias toward the taken path.
The metrics are: • Overall This metric provides the branch prediction outcome breakdown (correct, wrong path, wrong target) for all branches irrespective of the branch type. Predicated off branches that were predicted as taken will be counted as wrong path branch outcomes. ◦ Correct Percentage of correctly predicted branches (all types). ◦ Wrong Path Percentage of branches (all types) for which the target path (taken/not-taken) was predicted incorrectly.
◦ Wrong Path Percentage of Indirect branches for which the target path (taken/not-taken) was predicted incorrectly. ◦ Wrong Target Percentage of Indirect branches for which the target address was predicted incorrectly. • Return This metric provides the branch prediction outcome breakdown (correct, wrong path, wrong target) for return branches. Predicated off returns that are predicted as taken will be counted as wrong path outcomes. ◦ Weight Fraction of Return branches amongst all branch types.
• Avg Snoop Requests This is the average number of live snoop responses that reside in the snoop request queue per cycle. • C2C/Snoop This is the fraction of snoops for which the local processor detects that it has a modified version of the data that a remote processor has requested as a result of a data cache miss. It does not include implicit writebacks as a result of a modified hit on a line that is being flushed in response to an fc instruction.
Metrics Available from this Measurement The following metrics are available from this event set. These descriptions do not take into account any command-line options you might use. The metrics are: • Cycles This is the total number of CPU cycles collected during the measurement sample period. • IA64 Instr This is the total number of IA64 instructions retired during the measurement sample period.
misses and excessive speculation control and data speculation fails. An estimate of any bias introduced by these events can be developed from information available in the tlb, cspec, and dspec event sets. cpubus Event Set Available only on Itanium 2 and dual-core Itanium 2 systems.
• Snoops This is the total number of snoops per second that the local processor observes as a result of data cache misses of remote processors and local processor self snoops. • Hitm (hit a modified) This is the number of implicit writebacks sourced by the local processor in response to data misses by remote processors referencing a line that is modified in the local processor's cache. cspec Event Set The cspec event set provides information on the effectiveness of control speculation.
• Chks Failed This is the total number of failed chk.s instructions that were retired during the sample interval. • Control Speculation: ◦ Spec/Sec: Total This is the total number of control speculation events per second. ◦ Spec/Sec: Fail This is the number of control speculation fail events per second. ◦ Spec/Kinst: Total This is the total number of control speculation events per 1000 retired instructions. The instruction count includes predicated off and nop instructions.
• Explicit - Instructions not dispersed This is a count of the number of instructions that were not dispersed due to explicit stop bits. Explicit stop bits are used to separate bundles (three instructions) within a bundle group (two bundles of three instructions each) or to separate bundle groups. Explicit stop bits can also be found within bundle-specific templates that contain embedded stop bits, that is, M_II. The default mode will include all dispersal cycles.
by using the command-line option --exclude-idle True (which is the default). The effects of failed speculative operations and TLB misses cannot be directly eliminated, but you can get an estimate of the impact of events from the cspec, dspec, and tlb event sets. You can use the cpi event set to obtain the fraction of all instructions retired that have an architecturally visible result, except for predicated off branches, which are counted as useful instructions (non-taken branch) by the Itanium 2 PMU.
metric will be close to zero. High values would tend to suggest that the PBO information, used by the optimizer when creating the binary code, might have been invalid. • %ALAT Miss This is the percentage of the number of times that the ALAT does not have any information regarding a memory address (misses) out of the total number of times the ALAT is accessed. Instructions that access the ALAT include ld.a, ld.sa, ldf.a, ldf.sa, and ld.c.nc.
FPMIN FPMAX FPAMIN FPAMAX FPCMP FPCVT.
• FP Events/Sec: SIR Event: trap This is the total number of SIR true stalls (SWFA trap taken) observed per second. • FP Events/Fop: zero flush This is the number of flush to zero events that occur per floating-point operation (not per instruction). • FP Events/Fop: SIR Event: total This is the ratio of all SIR stalls to total floating-point operations (not instructions). The SIR count includes both false (stall only, no trap taken) stalls and true (SWFA trap taken) stalls.
• RSE - Misses per Sec This is the number of RSE load L1D cache misses per second. • Total - Misses per Kinst This is the total number of L1D cache misses per 1000 retired instructions, including nops, predicated off instructions, and speculative instructions/associated recovery code.
The event per kinst (event per 1000 instructions) metrics are computed using all instructions retired. This includes nops, predicated off instructions, failed speculation instructions and their associated recovery code, as well as the architecturally visible instructions. You can eliminate idle loop effects by using the command-line option --exclude-idle True (which is the default).
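The per-kinst normalization described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not HP Caliper code: the function name and the sample counts are hypothetical, and the instruction count is assumed to be the all-instructions-retired total (nops, predicated off instructions, and failed speculation included).

```python
def events_per_kinst(event_count: int, instructions_retired: int) -> float:
    """Return events per 1000 retired instructions.

    instructions_retired is assumed to be the count of ALL retired
    instructions, including nops and predicated off instructions.
    """
    if instructions_retired <= 0:
        raise ValueError("instructions_retired must be positive")
    return 1000.0 * event_count / instructions_retired


# Hypothetical sample: 4,500 cache misses over 3,000,000 retired instructions.
print(events_per_kinst(4_500, 3_000_000))  # 1.5
```

Because the denominator includes instructions with no architecturally visible result, a per-kinst value computed this way can understate the event rate per useful instruction.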
• %Miss - All This is the percentage of the total misses (instruction demand fetch misses and instruction prefetch misses) out of the total number of L1 instruction accesses (instruction demand fetch and instruction prefetch). The prefetches include both streaming and non-streaming prefetches. • %Miss - Dfetch This is the percentage of the number of demand instruction fetch misses out of the total instruction demand fetch accesses.
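As a rough sketch of how the two miss percentages above relate to raw counts, the following Python computes both from demand-fetch and prefetch totals. The function names and the sample counts are hypothetical; they are not HP Caliper identifiers or PMU event names.

```python
def pct_miss_all(dfetch_misses: int, pfetch_misses: int,
                 dfetch_accesses: int, pfetch_accesses: int) -> float:
    """%Miss - All: all misses over all L1 instruction accesses."""
    accesses = dfetch_accesses + pfetch_accesses
    if accesses == 0:
        return 0.0  # no accesses observed, so no miss rate to report
    return 100.0 * (dfetch_misses + pfetch_misses) / accesses


def pct_miss_dfetch(dfetch_misses: int, dfetch_accesses: int) -> float:
    """%Miss - Dfetch: demand fetch misses over demand fetch accesses."""
    if dfetch_accesses == 0:
        return 0.0
    return 100.0 * dfetch_misses / dfetch_accesses


# Hypothetical counts: 800 demand fetches (20 miss), 200 prefetches (30 miss).
print(pct_miss_all(20, 30, 800, 200))  # 5.0
print(pct_miss_dfetch(20, 800))        # 2.5
```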
The metrics are: • Total - Misses Per Second This is the total number of L2 cache misses per second. It includes all instruction prefetch misses, instruction demand misses, and data misses. • Pfetch - Misses Per Second This is the number of instruction line prefetch requests (streaming and non-streaming) that miss the L2 cache per second. • Dfetch - Misses Per Second This is the number of instruction line demand requests that miss the L2 cache per second.
• Instr Per Access This is the ratio of the total number of instructions retired per L2 cache access, including nops and predicated off instructions. The L2 cache accesses include RSE stores, VHPT loads, all integer and RSE loads that miss the L1 data cache, all integer stores, all floating-point loads/stores, semaphores (counted once), and instruction fetches/prefetches that miss the L1 instruction cache.
The metrics are: • Total - Misses Per Second This is the total number of L2 data cache misses per second. It includes all data load and store misses. • Load - Misses Per Second This is the number of data load requests that miss the L2 cache per second. • Store - Misses Per Second This is the number of data store requests that miss the L2 cache per second. • Writebacks Per Second This is the total number of L2 data cache writebacks (L3 hit and miss) per second.
There are a number of issues regarding L2 instruction cache access that need to be considered when interpreting L2 cache measurement results. The L2 cache will not count fetches to the second half of a line if the fetch for the first part is already counted. Secondary misses are counted as data references. Only requests that have entered the OZ queue are counted. And these instructions are not counted: FROM_CCV, SETF, PTC_G, FWB, MF, MFA, SYNCI, SYNCIA, PTCM, FC, and CC.
• Instr Per Access This is the ratio of the total number of instructions retired per L2 instruction cache access, including nops and predicated off instructions. The L2 instruction cache accesses include demand fetches and prefetches that miss the L1 instruction cache. • %Miss - Total This is the percentage of all the L2 instruction cache misses out of the total number of L2 instruction cache accesses. Accesses include instruction fetches/prefetches that miss the L1 instruction cache.
• Dfetch - Misses Per Second This is the number of instruction line demand requests that miss the L3 cache per second. • Data - Misses Per Second This is the number of data (load and store) requests that miss the L3 cache per second. This count includes writebacks from the L2 cache that miss the L3 cache. • Writebacks Per Second This is the total number of L2 cache writebacks (L3 hit and miss) per second.
Metrics Available for Intel® Itanium® 9500 series systems • Total misses per second This is the total number of L3 cache misses per second. It includes all instruction misses and data misses. • Inst - Misses Per Second This is the number of instruction requests that miss the L3 cache per second. • Data - Misses Per Second This is the number of data requests that miss the L3 cache per second.
The metrics are: • Read Rate Number of memory read requests per second. • Live Reads This is the average number of outstanding reads per cycle. This gives some idea about the memory request density. • Ave Latency - Cycle Average system memory read latency in CPU cycles. • Ave Latency - Nsec Average system memory read latency in nanoseconds. • Pftch This is the total number of cacheable instruction prefetch memory requests per 1000 retired instructions, including nops and predicated off instructions.
The queues event set provides bus request queue (BRQ) information that might give insight into possible performance problems related to the system bus. The BRQ is a centralized queueing structure that collects almost all requests from the L1 cache and then schedules those requests to the L2 cache or front side bus (FSB). High values on the available metrics will likely indicate high levels of bus utilization. This can be confirmed with the sysbus event set.
metric provides the average number of requests that are live in the OOQ per cycle. High numbers of OOQ entries indicate excessive snoop response timing. snoop Event Set Available only on Itanium 9300 quad-core processor systems. The snoop event set provides data about snoop responses. If you use this event set, the default is to make the measurements irrespective of CPU operating state (that is, user, system, or interrupt states). By default, the idle state is not included in the measurement.
• 64 Byte - Hit This is the fraction of 64-byte data snoops that hit a cache line, out of all data snoops (64-byte and 128-byte). • 64 Byte - Hitm This is the fraction of 64-byte data snoops that hit a modified cache line, out of all data snoops (64-byte and 128-byte). • 64 Byte - Impwb This is the fraction of 64-byte data snoops that are due to implicit write backs, out of all data snoops (64-byte and 128-byte).
If you use this event set, the default is to make the measurements irrespective of CPU operating state (that is, user, system, or interrupt states). By default, the idle state is not included in the measurement. You can use command-line options to limit the scope of the measurement.
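The scope-limiting options documented for the cpu event sets in this appendix (-m event_set[:all|user|kernel], --exclude-idle False, --measure-on-interrupts off|only) can be assembled programmatically. The Python below is a hedged sketch: the helper name scope_options is hypothetical, and it only builds the option list. It does not run HP Caliper, and it assumes each option and its value are passed as separate arguments.

```python
from typing import List, Optional


def scope_options(event_set: str,
                  level: Optional[str] = None,
                  include_idle: bool = False,
                  interrupts: Optional[str] = None) -> List[str]:
    """Build the scope options for one event set as an argument list.

    level:      None, "all", "user", or "kernel" (privilege level suffix).
    interrupts: None, "off" (exclude the interruption state),
                or "only" (measure only the interruption state).
    """
    if level not in (None, "all", "user", "kernel"):
        raise ValueError(f"unknown privilege level: {level!r}")
    if interrupts not in (None, "off", "only"):
        raise ValueError(f"unknown interrupt mode: {interrupts!r}")

    metric = event_set if level is None else f"{event_set}:{level}"
    options = ["-m", metric]
    if include_idle:
        # Idle is excluded by default, so only the override is emitted.
        options += ["--exclude-idle", "False"]
    if interrupts is not None:
        options += ["--measure-on-interrupts", interrupts]
    return options


print(scope_options("tlb", level="user", interrupts="off"))
```

Leaving every argument at its default yields only the -m option, which matches the default scope described above: all privilege levels, idle excluded.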
page walker (HPW) is invoked to insert the required page into the level 2 TLB, which is then forwarded to the level 1 data TLB. • L2Dtlb This counts the number of cycles stalled due to a level 2 data TLB miss during the time the HPW is actively attempting to resolve the requested TLB entry. If the entry is not in the cache, the HPW will terminate and initiate a trap to software to provide the required TLB entry. This component counts the stall component only due to the HPW providing the required TLB entry.
measurement. You can use command-line options to limit the scope of the measurement. Specifically, you can: • Limit measurement to a specific privilege level: -m event_set[:all|user|kernel] • Include idle: --exclude-idle False • Exclude the interruption state: --measure-on-interrupts off • Only measure the interruption state: --measure-on-interrupts only Metrics Available from this Measurement The following metrics are available from this event set.
• Util Data Data bus utilization gives a lower bound estimate of total data bus utilization resulting from bus transactions that result in a data transfer, that is, BRL, BRIL, BWL, and nonzero byte BRP/BWP transactions. A lower bound data bus utilization is computed as follows:
DATA BUS CYCLES/SEC = ((BRL + BRIL + BWL + IMPLICIT WB)/sec * 4.0) + ((nonzero byte BRP's/BWP's)/sec * 1.0)
DATA UTIL = 100 * (DATA BUS CYCLES/SEC) / (BUS CYCLES/SEC)
The constants (4.0 and 1.
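The lower-bound data bus utilization formula can be worked through with a short Python sketch. The function and parameter names are hypothetical and the sample rates are invented for illustration. The weights 4.0 and 1.0 are taken directly from the formula, where full-line transactions (BRL, BRIL, BWL, implicit writebacks) are weighted by 4.0 and nonzero byte BRP/BWP transactions by 1.0.

```python
def data_bus_utilization(full_line_per_sec: float,
                         partial_per_sec: float,
                         bus_cycles_per_sec: float) -> float:
    """Return the lower-bound data bus utilization as a percentage.

    full_line_per_sec: (BRL + BRIL + BWL + implicit WB) transactions/sec.
    partial_per_sec:   nonzero byte BRP/BWP transactions/sec.
    """
    if bus_cycles_per_sec <= 0:
        raise ValueError("bus_cycles_per_sec must be positive")
    data_bus_cycles_per_sec = full_line_per_sec * 4.0 + partial_per_sec * 1.0
    return 100.0 * data_bus_cycles_per_sec / bus_cycles_per_sec


# Hypothetical rates: 10M full-line and 2M partial transactions per second
# on a bus running at 200M cycles per second.
print(data_bus_utilization(10_000_000, 2_000_000, 200_000_000))  # 21.0
```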
HyperThreading (formally called Hyper-Threading Technology) provides the ability for a processor to create an additional logical processor that might allow additional efficiencies of processing. For example, a dual-core Itanium 2 processor with HyperThreading active provides four logical processors, two on each core. An Itanium 9300 quad-core processor with HyperThreading active provides eight logical processors. This allows the operating system to schedule two threads or processes simultaneously.
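The logical processor counts in the examples above follow from simple multiplication, sketched here in Python. The function name is hypothetical, and the default of two hardware threads per core is an assumption drawn from the "two on each core" example.

```python
def logical_processors(cores: int, hyperthreading: bool = True,
                       threads_per_core: int = 2) -> int:
    """Return the number of logical processors one processor exposes."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return cores * (threads_per_core if hyperthreading else 1)


print(logical_processors(2))  # dual-core with HyperThreading: 4
print(logical_processors(4))  # quad-core with HyperThreading: 8
print(logical_processors(4, hyperthreading=False))  # 4
```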
• 64–255 Percentage of thread switches that were triggered after the processor had stalled for 64 to 255 cycles. A non-zero value represents wasted processor cycles. • >=256 Percentage of thread switches that were triggered after the processor had stalled for 256 or more cycles. A non-zero value represents wasted processor cycles. • Overhead Cycles Per Sec Number of processor cycles per second consumed by the thread switching itself.
◦ unstall HPW insert or MLD return or IBQ not empty. ◦ timeslice completion of the allocated timeslice of the executing thread. • Stall Cycles spent in stalls during thread switches. tlb Event Set The tlb event set provides information related to translation lookaside buffer (TLB) misses. The Itanium 2 TLB implementation is split for instructions and data, with two levels for each. The first level only maps 4K pages. Thus, the miss rate (per sec/per kinst) might be quite high.
• D1TLB Misses Per Sec This is the number of level 1 DTLB misses per second. This level of the DTLB only operates on 4K pages. Thus, its miss rate will be high, but it is normally the case that any required translation would be provided by the level 2 DTLB in three cycles. • D2TLB Misses Per Sec This is the number of level 2 DTLB misses per second. The HPW attempts to service a miss at this level.
Glossary advance load address table (ALAT) In the Integrity servers processor family, a table that keeps track of speculative (that is, advance) loads. An excessive number of ALAT compares that result in a failed advance load (an ALAT miss) can seriously degrade performance. advice class A grouping for advice from the Advisor. Every piece of advice belongs to one of these classes: general, CPU, memory, IO, and system.
data speculation The execution of a memory load prior to a store which preceded it and which might potentially alias with it. Data speculation loads are also referred to as advance loads. See “dspec Event Set” (p. 239). databases directory The directory where output databases are created for each data collection run of HP Caliper, unless you use the -d option. By default, the databases directory is a directory called .hp_caliper_databases in your current directory.
hot spot An instruction or set of instructions that has a higher execution count than most other instructions in a program. HP Caliper Advisor A rules-based expert system that gives guidance about improving the performance of an application. See “Using the HP Caliper Advisor” (p. 76). HP Caliper option A parameter in the HP Caliper command line used to customize the performance analysis. See “HP Caliper Options” (p. 47).
measurement configuration file A file that HP Caliper uses to perform a particular measurement, such as scgprof or icache. Each measurement has a corresponding measurement configuration file. See “HP Caliper Measurement Configuration Files” (p. 42). measurement run folder In the HP Caliper GUI, a folder that contains information about the types of data available for a single measurement run. It can also contain the collection specification used to collect the data in the folder.
sampled measurement A measurement that measures your program's performance at regular intervals, based on CPU events, recording the current program location and selected performance metrics. See “Sampled Measurements” (p. 26). scgprof measurement A measurement, provided by the scgprof measurement configuration file, that measures and reports an (inexact) call graph profile, produced by sampling the performance monitoring unit (PMU) to determine function calls.
Index Symbols --[no]fold option, 61 --advice-classes option used with HP Caliper Advisor, 79 --advice-cutoff option used with HP Caliper Advisor, 79 --advice-details option used with HP Caliper Advisor, 79 --analysis-focus option used with HP Caliper Advisor, 79 --branch-sampling-spec option, 54 --bus-speed option, 55 --callpath-cutoff option, 55 --context-lines option, 56 --cpu-aggregation option, 56 --cpu-counter option used with caliper info command, 101 --cpu-details option, 56 --cpu-metrics-aggregation
-p some option syntax, 98 -r option, 51 -s option, 52 used with caliper info command, 102 -t option, 74 -w option, 53 .
Dual-core Itanium 2 processor HyperThreading information, 113 E ecount measurement report description, 204 Enabling the PMU, 162 Environment variables HP Caliper, 103 Error messages, 165 Event name abbreviation error, 93 Event name abbreviations showing, 94 Event set descriptions for cpu measurement, 230 Event sets brpath, 230 brpred, 231 c2c, 233 cpi, 234 cpubus, 236 cspec, 237 dispersal, 238 dspec, 239 fp, 241 l1dcache, 243 l1icache, 244 l2cache, 246 l2dcache, 248 l2icache, 249 l3cache, 251 memreq, 253 q
Measurement global, 26 precise, 26 sampled, 26 Measurement configuration file, 42 Measurement configuration files Overview measurement, 44 provided with HP Caliper, 42 Simultaneous fprof sampling on multiple PMU Counters, 45 using, 45 Measurement types, 44 Measurements types you can take, 26 Measuring load modules, 94 default settings for, 94 Memory usage measuring concurrently, 153 memreq event set, 253 merge command see caliper merge command Merging performance data, 115 Metrics used for sorting and cutof
Showing HP Caliper options, 28 Simultaneous fprof sampling on multiple PMU Counters, 45 snoop event set, 256 Sorting metrics used for, 105 Source line data shown in reports, 111 Source position correlation, 111 Source statements omitting from reports, 51 Source, adding to report, 24 Specifying modules, 95 Specifying option values with a .