HP Caliper User Guide Release 5.3 February 2011 HP Part Number: 5900-1558 Published: February 2011 Edition: 5.
© Copyright 2011 Hewlett-Packard Development Company, L.P. Confidential computer software. Valid license from HP required for possession, use or copying. Consistent with FAR 12.211 and 12.212, Commercial Computer Software, Computer Software Documentation, and Technical Data for Commercial Items are licensed to the U.S. Government under vendor's standard commercial license. The information contained herein is subject to change without notice.
Contents
About This Document .... 12
1 HP Caliper at a Glance .... 16
    What Is HP Caliper? .... 16
    What Does HP Caliper Run On?
    Remote GUI .... 40
4 HP Caliper Measurement Configuration Files .... 42
    Measurement Configuration Files Provided with HP Caliper .... 42
    Overview Measurement
    Example .... 59
    --etb-walkback-cycles .... 60
    --event-defaults .... 60
    Example
    --threads .... 74
    --traps-reported .... 74
    --user-regions (HP-UX only) .... 74
    --version
    HP Caliper Environment Variables .... 103
9 Controlling the Content of Reports .... 104
    Layout of an HP Caliper Text or CSV Report .... 104
    Metrics You Can Use for Report Sorting and Cutoffs .... 105
    Module-Centric Reports
    Limitations to Using cstack .... 149
    Pstack like functionality .... 149
12 Performing CPU Metrics Analysis (HP-UX only) .... 151
13 HP Caliper Features Specific to HP-UX (HP-UX only) .... 152
    Measuring Memory Usage Concurrently with Other Measurements (HP-UX only)
    cpu Measurement Report Description (HP-UX only) .... 178
    Example Command Lines for Text Report .... 179
    Example Command Line for CSV Report .... 179
    CPU Event Sets .... 179
    cstack Measurement Report Description
    Example Command Line for CSV Report .... 201
    fprof Metrics Summed for Entire Run .... 201
    Metrics for Integrity Servers Itanium 2 Systems .... 201
    Metrics for Integrity Servers Dual-Core Itanium 2 and Itanium 9300 Quad-Core Processor Systems
    Metrics Available from this Measurement .... 226
    dispersal Event Set .... 227
    Metrics Available from this Measurement .... 227
    dspec Event Set
About This Document This document describes how to use HP Caliper to measure the performance of native applications running on HP-UX and Linux Integrity servers. NOTE: For the latest version of this document, go to the HP Caliper Web site at the following URL and click on Documentation in the Product Information box: http://hp.com/go/caliper This document is sometimes updated after a release. The document publication date appears on the title page.
For information about the HP Caliper Advisor, read this chapter:
• “Using the HP Caliper Advisor” (p. 76).
For information about how to configure HP Caliper to collect data and report the results, read these chapters:
• “Configuring HP Caliper” (p. 91) describes how you can configure HP Caliper to collect data.
• “Controlling the Content of Reports” (p. 104) describes how to control the content of reports based on the data collected.
GUI item    A graphical user interface (GUI) item such as a button or menu name.
[]          The contents are optional in syntax. If the contents are a list separated by |, you must choose one of the items.
{}          The contents are required in syntax. If the contents are a list separated by |, you must choose one of the items.
...         The preceding element can be repeated an arbitrary number of times.
|           Separates items in a list of choices.
• Using HP Caliper to analyze effective floating-point load latency • Using HP Caliper with an application program to characterize the Itanium memory hierarchy • Using HP Caliper to measure performance data related to translation lookaside buffers (TLBs) You can also read these technical reports about the microarchitecture used in HP Integrity servers: • Dual-Core Update to the Intel® Itanium® 2 Processor Reference Manual for Software Development and Optimization, Document Number 308065-001.
1 HP Caliper at a Glance What Is HP Caliper? HP Caliper is a general-purpose performance analysis tool for applications on HP-UX and Linux systems running on HP Integrity Servers. HP Caliper allows you to understand the performance and execution of your application and to identify ways to improve its run-time performance. HP Caliper works with any native Integrity Server application.
Figure 1 HP Caliper Components (User Interfaces)
[Figure labels: HP Caliper CLI; Application; Performance reports; HP Caliper; HP Caliper GUI (local); X11 server; HP Caliper GUI (remote); HP Caliper database(s); Integrity Server (HP-UX or Linux); x86 desktop (Windows or Linux)]
HP Caliper selectively measures the processes, threads, and load modules of your application.
In general, HP Caliper runs do one of the following: • Collect data • Collect data and generate a report • Generate a report based on previously collected data • Analyze previously collected data For the last item above, HP Caliper provides the HP Caliper Advisor, a rules-based expert system designed to provide guidance about improving the performance of an application. Users can write their own rules to analyze applications or use the default rules provided.
Summary of HP Caliper Features HP Caliper's most important features include the following: • Performance data is automatically saved in databases, which you can use to generate reports without having to remake the measurements. Multiple databases can also be combined for aggregated results. • All reports are available in text format and comma-separated-value (CSV) format for use with spreadsheets.
2 Getting Started with the HP Caliper Command-Line Interface This chapter provides some example programs to show you how to get started using the HP Caliper command-line interface. The programs are chosen for illustration purposes and are not necessarily representative of programs you might actually want to analyze. Example: Running fprof on a Short Program, with Default Output HP Caliper provides many types of performance measurements.
Figure 2 fprof Measurement Report for matmul, with Default Report Output ================================================================================ HP Caliper 4.3.
Target Execution Time                                  10
  Real time:    0.428 seconds
  User time:    0.415 seconds
  System time:  0.008 seconds
Sampling Specification                                 11
  Number of samples:  1319
  Data sampled:       IP
Metrics Summed for Entire Run                          12
-----------------------------------------------
                        PLM
Event Name              U..K  TH      Count
-----------------------------------------------
CPU_CYCLES              x___   0  659001879
BACK_END_BUBBLE.ALL     x___   0   99866365
BE_EXE_BUBBLE.
   3          ~83  >
---------------------------------------------------
[Minimum function entries: 0, percent cutoff: 1.00, cumulative percent cutoff: 100.00]

The numbered callouts in the report identify the following:
1 HP Caliper 4.3.0: Report Summary for Flat Profile: The heading for the report, including the HP Caliper version number and the measurement (Flat Profile).
2 Collection Run 1: (Flat Profile): The heading for the run.
3 Processor Information: Information about your processor.
4 Run Information: Information about the run.
  ◦ A source-code line number for rows showing statements
  ◦ An instruction slot number for rows showing instructions not on a bundle boundary
  ◦ A source-code column number followed by an offset from the beginning address of a function for rows showing instructions on a bundle boundary
• 16 >Statement | Instruction: The column contains either a source statement, preceded by “>”, or a disassembled instruction. Statements that are out of order due to optimization are preceded by “*>”.
Figure 3 fprof Measurement Report for matmul, with IP Sample Counts for One Function Function Details --------------------------------------------------% Total Line| IP IP Slot| >Statement| Samples Samples Col,Offset Instruction --------------------------------------------------1 96.88 [matmul::main, 0x40009a0, matmul.c] 1275 ~38 Function Totals 2 -----------------------------------------3 [/home/meagher/matmul.c] 4 (32) ~16 *> mata[i][j] = matb[i][j] = (float) rand() ; ~9,0x0280:0 M nop.m 0 :1 M nop.
Types of Measurements
HP Caliper is capable of three types of performance measurement:
• A global measurement of total run metrics
• A sampled measurement based on the granularity you specify
• A precise measurement of every execution path in your code (HP-UX only)
See Table 1 (page 44).
Global Measurement
A global measurement gives you a single value for a specific metric of your program, such as total CPU time used. The only global measurement available in HP Caliper is ecount (total CPU event counts).
Precise measurements are best used for: • Identifying the most and least used functions in your program • Identifying all the branch paths executed in the program Collecting precise measurements requires more system resources than sampled measurements. Collecting precise measurements also affects the performance of the program being measured. The performance effects may vary from a few percent to 300 percent, depending on how much measurement you request.
caliper_options Parameters used to customize the performance analysis. For more information, see “HP Caliper Options” (p. 47). program The name of the executable program you want HP Caliper to measure. program_arguments Any number of arguments expected by your executable. You can use an options file to specify command-line information, including the measurement, options, program, and program arguments. See “-f or --options-file” (p. 49) for details.
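The command-line form described above (measurement, then options, then the program and its arguments) can be assembled programmatically. The following is a minimal Python sketch, not part of HP Caliper: the helper name build_caliper_command is hypothetical, and it only builds an argument list without running anything. The -o option is the report output option used elsewhere in this guide.

```python
from typing import List, Optional, Sequence


def build_caliper_command(
    measurement: str,
    program: str,
    program_arguments: Sequence[str] = (),
    caliper_options: Sequence[str] = (),
    output_file: Optional[str] = None,
) -> List[str]:
    """Return argv ordered as: caliper measurement [options] program [args]."""
    argv = ["caliper", measurement]
    if output_file is not None:
        argv += ["-o", output_file]  # -o names the report output file
    argv += list(caliper_options)
    argv.append(program)             # the program to measure follows all options
    argv += list(program_arguments)  # everything after it belongs to the program
    return argv


if __name__ == "__main__":
    print(" ".join(build_caliper_command("fprof", "./matmul", output_file="rpt.txt")))
```

Keeping the program and its arguments last mirrors the syntax shown above, so that arguments intended for the program are not mistaken for HP Caliper options.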
The first command produces a call graph by sampling. The second command (on HP-UX only) produces an exact call graph. They both produce an enhanced gprof-like output. Creating a Text Report for Analysis To save the report produced by HP Caliper to a file, specify an output file name: $ caliper measurement -o filename [caliper_options] program [program_arguments] Creating a Report Based on Your Collected Data By default, HP Caliper saves the results of a measurement to a database.
Additional HP Caliper Commands In addition to the caliper measurement command, there are three more HP Caliper commands you can use. For information about these commands, including required syntax, see the references below: • caliper info Displays reference information about the CPU counters or reports. See “How to Display Reference Information About CPU Counters or HP Caliper Report Types” (p. 101). • caliper report | merge | diff Creates a report from an HP Caliper database.
3 Getting Started with the HP Caliper GUI In addition to the command-line interface, HP Caliper supports a full-featured, intuitive graphical user interface (GUI). This chapter describes how to get started using the GUI. For information on the command-line interface, see Chapter 2 (page 20). What Is the HP Caliper GUI? The GUI has the same underlying measurement technology and capabilities as the command-line interface. With the GUI, however, you can dynamically interact with HP Caliper.
• Diagnostics view • Help view As is typical of most GUIs, the HP Caliper GUI lets you reconfigure, resize, and reposition all of the views to suit your needs. Views that are not currently needed can be closed (and reopened when needed) to make more room for others.
Each HP Caliper measurement run produces several datasets. These datasets are shown in the Projects view for each run. The figure below shows the Projects view: Figure 5 Projects View Collect View The Collect view allows you to set up and make performance measurements. It consists of a series of tabbed pages (which are not themselves views) containing all the information needed to run your application and all the measurement parameters that you can control.
Figure 6 Collect View Analyze View The Analyze view lets you explore the performance data you collect. When displayed, the Analyze view is located, by default, to the right of the Projects view, overlaying the Collect and Advisor views. Any performance data you have available for viewing is shown in the Projects view. To open the Analyze view, double-click a performance data icon of interest in the Projects view.
Figure 7 Analyze View
Advisor View
The Advisor view contains a set of suggestions for improving the performance of your application based on the data collected so far. When displayed, the Advisor view is located, by default, to the right of the Projects view, overlaying the Collect and Analyze views. To open the Advisor view, click the Generate Advice button or toolbar choice to analyze the collected data and produce advice output.
Figure 8 Advisor View Console View The Console view displays any output your application writes to standard output and standard error streams. You can also use the Console view to provide any input your application expects to read from standard input. The Console view is below the Collect view, by default, and is visible when your application is being measured.
Diagnostics View The Diagnostics view contains any warning messages that HP Caliper might generate when measuring your application or retrieving its performance data for viewing. By default, this view overlays the Console view at the bottom of the GUI window. Any errors produced will appear in popup dialogs.
Tips for Using Views All views have the following features: • Each view has its own Maximize and Minimize buttons (top right), and many views have their own pull-down menus (also top right). • Double-clicking a view's tab causes the view to take up the entire GUI window. Double-clicking a view's tab a second time returns it to its previous size and restores the previous GUI layout. This feature is particularly useful when viewing performance data.
◦ The measured application completes. ◦ All the attached processes terminate. ◦ The measurement duration you set on the Target page expires. ◦ You select the Kill/Stop button. The application program being measured will be terminated immediately if you select the Kill button. • When a measurement run completes, its performance data is automatically added to the current project within the Projects view.
Getting Help Several forms of online help are available in the GUI: • “Getting started” help Select Help→Help Contents and then choose Getting Started. • Dynamic/context help Select Help→Context-sensitive Help or use the F1 key. This help provides detailed information specific to the view that currently has focus. • Reference help Select Help→Help Contents.
You will need to copy the appropriate GUI client (in the gui_clients subdirectory) to your Windows or Linux desktop system and unpack it. Then, start the GUI from your desktop using the following executable file. Invoke it from a shell prompt or double-click it in a folder: • On Windows: Caliper.exe • On Linux: Caliper At startup, the GUI prompts you for the login information needed to connect it to the remote HP Caliper server on the Integrity system where you want to make measurements. See Figure 12.
4 HP Caliper Measurement Configuration Files Each run of HP Caliper uses a particular measurement, which you can specify in the command line. Each measurement corresponds to a particular measurement configuration file supplied by HP Caliper. The measurement configuration files contain variables that control the types of measurements performed and the content of the reports.
• dtlb
  The dtlb measurement measures and reports sampled data translation lookaside buffer (TLB) misses. See “dtlb Measurement Report Description” (p. 193).
• ecount
  The ecount measurement measures and reports total CPU event counts. See “ecount Measurement Report Description” (p. 197).
• fcount (HP-UX only)
  The fcount measurement measures and reports function call counts in a program. See “fcount Measurement Report Description” (p. 199).
Table 1 Available Measurements in Each Measurement Type
Global      Sampled       Precise (HP-UX only)
ecount      alat          cgprof
            branch        fcount
            cgprof        fcover
            cpu
            cstack
            cycles
            dcache
            dtlb
            fprof
            icache
            itlb
            pmu_trace
            scgprof
            traps
NOTE: The cgprof measurement performs both sampled and precise measurements.
The measurements in the sampled category, with the exception of cpu and pmu_trace, show results grouped by function. A report produced by any of these measurements is referred to as a PMU histogram report.
$ caliper overview -o rpt --switch-interval 3 \
    --fprof-sampling-spec 1000000 \
    --dcache-sampling-spec 20000,5%,DATA_EAR_EVENTS \
    --cstack-sampling-spec 250ms my_app
In the above example, the overview measurement runs the fprof measurement for 3 seconds, the dcache measurement for 3 seconds, and the cstack measurement for 3 seconds, and repeats this cycle until the program terminates.
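The rotation in the overview example can be pictured as a simple round-robin schedule. The Python sketch below is illustrative only: the function name overview_schedule and the idea of a fixed run time are assumptions made for the example, and the sketch does not describe how HP Caliper actually switches measurements internally.

```python
import math
from typing import List, Sequence, Tuple


def overview_schedule(
    measurements: Sequence[str], switch_interval: float, run_time: float
) -> List[Tuple[float, str]]:
    """Return (start_second, measurement) slots, rotating until run_time ends."""
    slots = math.ceil(run_time / switch_interval)
    return [
        (index * switch_interval, measurements[index % len(measurements)])
        for index in range(slots)
    ]


if __name__ == "__main__":
    # A program that runs for 10 seconds with --switch-interval 3.
    for start, name in overview_schedule(["fprof", "dcache", "cstack"], 3, 10):
        print(start, name)
```

With a 3-second interval, a short-running program may finish before every measurement has had a turn, which is worth keeping in mind when choosing the switch interval.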
You are free to rename measurement configuration files. Specifying Option Values in Measurement Configuration Files You can specify options on the command line, in a measurement configuration file, or in the .caliperinit file. See “Multiple Ways to Specify HP Caliper Option Values” (p. 47). Using the Command Line to Override Measurement Configuration File Parameters You can use the HP Caliper command line to override parameters specified in measurement configuration files.
5 HP Caliper Options This chapter describes basic information about options and presents them in alphabetical order. For a listing of the most commonly used options, see the HP Caliper Quick Start reference card. Basic Information About Options Options are used to customize the performance analysis. You can specify one or more options on the command line when you start HP Caliper. You can abbreviate options and their modifiers as long as they are unambiguous.
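The abbreviation rule can be illustrated with unique-prefix matching. This Python sketch is an assumption about how such matching is commonly done, not a description of HP Caliper's parser; the function name resolve_option is hypothetical, and the option list is a small sample taken from this chapter.

```python
from typing import Iterable


def resolve_option(abbrev: str, known_options: Iterable[str]) -> str:
    """Expand an abbreviated option name to the single option it prefixes."""
    options = list(known_options)
    if abbrev in options:  # assumption: an exact name always wins
        return abbrev
    matches = [name for name in options if name.startswith(abbrev)]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise ValueError(f"unknown option: {abbrev}")
    raise ValueError(f"ambiguous option {abbrev}: {', '.join(sorted(matches))}")


# A few option names from this chapter; the list is illustrative only.
OPTIONS = ["--threads", "--traps-reported", "--version", "--csv-file"]

if __name__ == "__main__":
    print(resolve_option("--th", OPTIONS))
```

In this sample, --th identifies --threads, but --t is rejected because it could also mean --traps-reported.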
Hierarchy for Processing an Option Value
HP Caliper uses this sequential order to process an option value:
1. Default value for an option
2. Option variable setting in the specified measurement configuration file
3. Option variable setting in the .caliperinit file, if the file exists
4. Option value from the command line
Thus:
• The command line overrides everything.
• The .caliperinit file overrides the measurement configuration file.
• The measurement configuration file overrides the default value.
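The processing order above amounts to "the last source that sets a value wins." The following Python sketch models that rule under stated assumptions: each source is represented as a plain dictionary, and the function name resolve_option_value is hypothetical. It is not HP Caliper's implementation.

```python
from typing import Any, Mapping, Optional


def resolve_option_value(
    name: str,
    default: Any,
    config_file: Optional[Mapping[str, Any]] = None,
    caliperinit: Optional[Mapping[str, Any]] = None,
    command_line: Optional[Mapping[str, Any]] = None,
) -> Any:
    """Apply sources in processing order; a later source overrides an earlier one."""
    value = default
    # Order matters: measurement configuration file, then .caliperinit, then command line.
    for source in (config_file, caliperinit, command_line):
        if source is not None and name in source:
            value = source[name]
    return value


if __name__ == "__main__":
    print(
        resolve_option_value(
            "threads",
            default="all",
            config_file={"threads": "sum-all"},
            command_line={"threads": "all"},
        )
    )
```

A source that does not mention the option leaves the value from the earlier sources untouched, which is why a missing .caliperinit file changes nothing.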
-f or --options-file -file options_file Specifies a text file containing a list of HP Caliper command-line options separated by spaces or line breaks. You can also use an options file to specify an HP Caliper measurement as well as the application to be profiled and its arguments. Any option you specify on the command line overrides the corresponding setting in the options file. HP Caliper places the contents of the options file in the position occupied by the -f option in the command line.
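The placement rule for options files can be sketched as a splice: the tokens read from the file take the position that -f occupied. The Python sketch below assumes that tokens are separated only by whitespace (no quoting rules are modeled) and takes the file reader as a parameter so that the example stays self-contained. The function name expand_options_file is hypothetical, and the sketch does not model how command-line values override settings from the file.

```python
from typing import Callable, List, Sequence


def expand_options_file(
    argv: Sequence[str], read_file: Callable[[str], str]
) -> List[str]:
    """Replace each '-f FILE' pair with the whitespace-separated tokens in FILE."""
    expanded: List[str] = []
    index = 0
    while index < len(argv):
        token = argv[index]
        if token in ("-f", "--options-file") and index + 1 < len(argv):
            # Splice the file contents in at the position of the -f option.
            expanded += read_file(argv[index + 1]).split()
            index += 2
        else:
            expanded.append(token)
            index += 1
    return expanded


if __name__ == "__main__":
    files = {"opts.txt": "fprof -o rpt\n--threads all"}  # hypothetical file contents
    print(expand_options_file(["caliper", "-f", "opts.txt", "./my_app"], files.__getitem__))
```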
If you use this option but do not specify an event, or if the option value is set to the empty string (""), then no metrics will be reported. You can use the caliper info command to list available CPU events and their descriptions. cpu_event Specifies a CPU event to measure. The name is not case-sensitive. For information about CPU events you can specify, see “Specifying Which CPU Events to Measure” (p. 93).
When you generate multiprocess reports, you can specify whether results are combined in a single report file or in individual files by process: per-process Creates individual report files for each process with program name appended to each file. shared Creates a single file containing the results for all processes. This is the default setting. unique Appends the process ID to the data file name.
-r for Function Coverage Reports -r [module][:directory][:file][:function][:unknown][none][all] Default value is module:directory:file:function:unknown. module Shows data by load module. directory Groups data by source directory. file Generates Summary Report by source file. function Shows function level detail by source file. unknown When used together with the other report options, provides additional information about functions from unknown source files in the summary and detail coverage reports.
subtracts this number from the interval to vary the sampling frequency. You can specify the actual number of events by which to vary the sampling rate, or a percentage of the count by using a percent symbol (%). For example: -s 10000,10%,CPU_CYCLES
The default value is 5 percent.
cpu_event
Specifies a CPU event to measure. The name is not case-sensitive. For information about CPU events you can specify, see “Specifying Which CPU Events to Measure” (p. 93).
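The effect of the variation value on the sampling interval can be sketched in Python. This is a rough model under explicit assumptions: the offset is drawn uniformly between minus and plus the variation, and the function name sampling_intervals is hypothetical. The distribution HP Caliper actually uses is not stated in this guide.

```python
import random
from typing import List, Optional, Union


def sampling_intervals(
    rate: int,
    variation: Union[int, str] = "5%",
    count: int = 5,
    seed: Optional[int] = None,
) -> List[int]:
    """Return 'count' sampling intervals, each jittered around 'rate'."""
    if isinstance(variation, str) and variation.endswith("%"):
        # A percentage is taken relative to the sampling rate.
        delta = int(rate * float(variation[:-1]) / 100)
    else:
        delta = int(variation)  # an absolute number of events
    rng = random.Random(seed)
    # Assumption: the offset is uniform in [-delta, +delta].
    return [rate + rng.randint(-delta, delta) for _ in range(count)]


if __name__ == "__main__":
    print(sampling_intervals(10000, "10%", count=5, seed=1))
```

Varying the interval in this way helps avoid sampling in lockstep with periodic behavior in the measured program.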
-w
Equivalent to one form of the option for system-wide measurement. The -w option is equivalent to --scope system,attr-mod, which is the default for --scope system. See “Using --scope system for System-Wide Measurements” (p. 70).
--advice-classes
Used only with the caliper advise command. See “Command Line to Invoke the Advisor” (p. 78).
--advice-cutoff
Used only with the caliper advise command. See “Command Line to Invoke the Advisor” (p. 78).
--advice-details
Used only with the caliper advise command.
processors). Do not change this default value: doing so will result in a useless call graph. threshold=int An integer value that specifies how HP Caliper counts events. The default value is zero. Do not change this default value: doing so will result in a useless call graph. privilege-level-mask=level Determines the privilege level setting for a given counter. By default, counters are measured when your application runs in user space (user).
percent_cutoff The percentage of the total for the sort metric that a given call path must exceed to appear on the report. Default value is 1.0. cum_percent_cutoff The value of the cumulative percentage at which HP Caliper stops reporting call paths. Default value is 100. min_count The minimum number of call paths to be displayed. Default value is 5. For more information, see Chapter 11 (page 132).
For more information, see “Performing CPU Metrics Analysis” (p. 151).
NOTE: This option was formerly known as --cpu-metrics-details. The former option name is still accepted by HP Caliper, but will be removed in a future release.
--csv-file
--csv-file filename[,append|create][,per-process|shared][,unique]
Generates report output in Comma Separated Values (CSV) format. You can produce a CSV report for any HP Caliper measurement.
ADDR_MATCH is the 64-bit address to match. ADDR_MASK is the 56-bit address mask to apply before matching the ADDR_MATCH bits. PROC_FLAGS is a comma-separated list of none , d, io, or iod. none indicates no constraint. d indicates data address matching only. io indicates instruction address and opcode matching. iod indicates instruction address, opcode and data address matching.
percent_cutoff The percentage of the total for the sorting/cutoff metric that a given function must exceed to appear on the report. This is shown as percent cutoff on reports. Default value is 1.0. This value only takes effect if the Percent of Total column is selected for the report. cum_percent_cutoff The value of the cumulative percentage at which HP Caliper stops reporting results. This is shown as cumulative percent cutoff on reports. Default value is 100.
--etb-walkback-cycles --etb-walkback-cycles integer Controls the number of cycles to walk back when iterating from the most recent execution trace buffer (ETB) entry to the oldest ETB entry. Use this option to change the way in which HP Caliper picks the instruction pointer (IP) samples from the 16 IP entries in an ETB sample. When iterating from the most recent entry, HP Caliper computes the cumulative elapsed cycles by adding up each entry's bubble cycles plus one cycle per entry.
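The cumulative elapsed-cycle computation described above can be written out as a short Python sketch. The function name and the list representation of ETB entries are assumptions made for illustration; the sketch covers only the running total, not how HP Caliper then selects IP samples.

```python
from typing import List, Sequence


def cumulative_elapsed_cycles(bubble_cycles: Sequence[int]) -> List[int]:
    """Running total of (bubble cycles + 1) per entry, most recent entry first."""
    totals: List[int] = []
    elapsed = 0
    for bubbles in bubble_cycles:
        elapsed += bubbles + 1  # each entry costs its bubbles plus one cycle
        totals.append(elapsed)
    return totals


if __name__ == "__main__":
    # Three ETB entries, most recent first, with 0, 2, and 5 bubble cycles.
    print(cumulative_elapsed_cycles([0, 2, 5]))
```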
--exclude-caliper
--exclude-caliper True|False
Specifies whether to exclude from measurements the activity due to the HP Caliper process. Used only with the --scope system option. The default value is True (HP Caliper process activity is excluded). See “--scope” (p. 69).
--exclude-idle
--exclude-idle True|False
Specifies whether to exclude from measurements the periods during which a given CPU is executing the idle loop. The setting takes effect for each CPU separately.
• pmu_trace • scgprof Collections with the above measurements will always be processed as if --group-by none were specified. See “Module-Centric Reports” (page 107). none Specifies that data from matching processes or modules should not have their data combined. This is used only for the caliper and caliper report commands. See “Creating Reports from Multiple Databases” (page 114). --help See “-H or --help” (p. 49).
Causes HP Caliper to collect data for inline functions. The default value is --noinlines. You can use this option with the following measurements: alat, branch, cgprof, dcache, dtlb, fprof, fcount, icache, itlb, and scgprof. For the cgprof and fcount measurements, this option should be specified during data collection as well as reporting, because the inline functions must be instrumented at collection time for the data to be available at report time.
Specifies whether to extend the callstack samples collection into kernel space. By default, both the userspace and kernelspace callstack samples will be collected. This option is used only with the cstack measurement. --latency-buckets --latency-buckets True|False Specifies whether or not the latency bucket information should appear in dcache measurement reports. This option is used only with the dcache measurement. The default value is --latency-buckets True (the information appears).
--module-default --module-default all|none Specifies the default setting for load module inclusion in the measurement. If --module-default none is set, then HP Caliper excludes all modules and only looks at the --module-include list. If --module-default all is set, then HP Caliper includes all modules and only looks at the --module-exclude list. See “Specifying Which Load Modules to Collect Data For” (p. 94). --module-exclude --module-exclude module1:module2:...
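The interaction between --module-default and the include and exclude lists can be summarized in a short Python sketch. It assumes that modules are compared by exact name, which may differ from how HP Caliper matches module names; the function name module_is_measured is hypothetical.

```python
from typing import Iterable


def module_is_measured(
    module: str,
    default: str,
    include: Iterable[str] = (),
    exclude: Iterable[str] = (),
) -> bool:
    """Decide whether a load module is measured under --module-default."""
    if default == "none":
        # Everything is excluded except modules on the include list.
        return module in set(include)
    if default == "all":
        # Everything is included except modules on the exclude list.
        return module not in set(exclude)
    raise ValueError("default must be 'all' or 'none'")


if __name__ == "__main__":
    print(module_is_measured("libc.so", "all", exclude=["libc.so"]))
```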
NOTE: In --scope system measurements on HP-UX, HP Caliper cannot locate an executable or a shared library if it is invoked using a relative path. In addition, at certain times, executables and shared libraries cannot be located even if they are specified with complete paths. This problem is due to limitations in APIs provided to collect information about executables and shared libraries associated with a process on HP-UX.
Specifies whether the target application should be blocked when the PMU sampling buffer is full. The default is TRUE (i.e., the target application will be blocked until HP Caliper has completed processing all the samples in the buffer). This option is valid only for PMU based per-process measurements on Linux. --per-module-data --per-module-data True|False Specifies that all function histograms will be reported by load module instead of the default of reporting across load modules.
cum_percent_cutoff The value of the cumulative percentage at which HP Caliper stops reporting results. This is shown as cumulative percent cutoff on reports. Default value is 100. min_count Sets the minimum number of primitives to be displayed. Default value is 10. This cutoff also controls the number of holder and waiter thread entries reported for thread synchronization primitives (HP-UX only). For more information, see Chapter 11 (page 132) --process See “-p or --process” (p. 51).
Example If you specify: $ caliper fprof --process-cutoff ,80,0 -w The contents of the Process Summary section is a list of processes containing: • The processes that account for 80 percent of the total IP samples of all the processes running in the system. • Only those processes that each account for more than two percent of total samples. Because percent_cutoff was not specified, HP Caliper used the default value, 2 percent.
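The way the three cutoff values combine in this example can be sketched in Python. This is one plausible reading of the description above, not HP Caliper's code: entries are ranked by sample count, an entry is listed while it exceeds the percent cutoff and the cumulative percentage reached so far is below the cumulative cutoff, and min_count forces a minimum number of entries. The function name apply_cutoffs and the sample data are hypothetical.

```python
from typing import Dict, List


def apply_cutoffs(
    samples: Dict[str, int],
    percent_cutoff: float = 2.0,
    cum_percent_cutoff: float = 100.0,
    min_count: int = 0,
) -> List[str]:
    """Return the names that survive the percent, cumulative, and count cutoffs."""
    total = sum(samples.values())
    ranked = sorted(samples.items(), key=lambda item: item[1], reverse=True)
    shown: List[str] = []
    cumulative = 0.0
    for name, count in ranked:
        percent = 100.0 * count / total if total else 0.0
        passes = percent > percent_cutoff and cumulative < cum_percent_cutoff
        if not passes and len(shown) >= min_count:
            break  # later entries are smaller, so they cannot pass either
        shown.append(name)
        cumulative += percent
    return shown


if __name__ == "__main__":
    data = {"a": 60, "b": 25, "c": 10, "d": 4, "e": 1}  # made-up sample counts
    print(apply_cutoffs(data, cum_percent_cutoff=80.0))
```

With the made-up counts above and a cumulative cutoff of 80, only the entries needed to reach 80 percent of the samples are listed.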
all with the --event-defaults option. The default value is all.) Every processor set (pset) is measured. The samples can be attributed to processes, or to processes and modules, or not attributed. For example: • --scope system,attr-mod Measure for system activity, and attribute samples to processes and modules within those processes whenever possible. Samples will be attributed to functions within those modules, and assembly and source listings in the Function Details sections are available.
When --scope system is used, for most measurements, HP Caliper measures all user and kernel activity and reports it for the system as a whole, for individual processes, or for the modules of those processes. When --scope system is used, HP Caliper continues collecting data until you stop it with Ctrl-C. You can also specify the number of seconds to collect data with the -e option. For example, to create a Flat Profile (fprof) report for all activity on the system for 20 seconds: $ caliper fprof -o fprof.
--source-path-map --source-path-map pathmap1[:pathmap2:...] Specifies the path map to use for finding source files used for reporting source statements. Applies to any PMU histogram report, which is the only kind of report that references source code. Path map entries are separated by a colon (:) and applied in order until HP Caliper finds a file match. • Simple entries are prepended to file names. • You can provide substitute paths by using comma-separated entries.
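The lookup order for path map entries can be sketched in Python. Two points are assumptions and are not confirmed by this guide: a comma-separated entry is treated as an "old prefix,new prefix" substitution, and a simple entry is joined in front of the recorded file name. The function name resolve_source is hypothetical, and the existence check is passed in so that the example does not touch the file system.

```python
import posixpath
from typing import Callable, Optional


def resolve_source(
    file_name: str, path_map: str, exists: Callable[[str], bool]
) -> Optional[str]:
    """Try each colon-separated entry in order; return the first path that exists."""
    for entry in path_map.split(":"):
        if "," in entry:
            # Assumption: "old,new" replaces a leading 'old' prefix with 'new'.
            old, new = entry.split(",", 1)
            if not file_name.startswith(old):
                continue
            candidate = new + file_name[len(old):]
        else:
            # Simple entry: prepend it to the recorded file name.
            candidate = posixpath.join(entry, file_name.lstrip("/"))
        if exists(candidate):
            return candidate
    return None


if __name__ == "__main__":
    available = {"/mnt/src/matmul.c"}  # hypothetical location of the sources
    print(resolve_source("/home/me/matmul.c", "/backup:/home/me,/mnt/src", available.__contains__))
```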
This value only takes effect if the cumulative percent column is selected for the report.
min_count
Sets the minimum number of functions to be displayed for all load modules. Default value is 5.
For example, if you specify the command line:
caliper fprof --summary-cutoff ,80 wordplay
The contents of the function summary section will be a list of functions containing:
• The functions that account for 80 percent of the total IP samples in the wordplay program.
--threads
--threads sum-all|all
Enables per-thread reporting. Default value is all.
all
Collect and report data per thread. For a multithreaded program, the Function Summary and the Function Details sections of reports show information across threads in addition to the per-thread Function Summary and Function Details sections.
sum-all
Collect and report data summed across all threads. sum-all measures multithreaded applications as one entity.
For more information, see “Restricting PMU Measurements to Specific Code Regions” (p. 161). --version See “-v or --version” (p. 53).
6 Using the HP Caliper Advisor This chapter introduces you to the HP Caliper Advisor and provides some example programs to show you how to get started using the Advisor from the command line. For information on how to use the Advisor in the HP Caliper graphical user interface (GUI), see Chapter 7 (page 85). For details about how to write rules for the Advisor, see the HP Caliper Advisor Rule Writer Guide.
Example 1 HP Caliper Advisor Report
===========================================================================
HP Caliper 4.3.0 Advisor Report for my_app
===========================================================================
Analysis Focus
Executable: /tmp/my_app
Last modified: August 15, 2004 at 03:10 PM
Processor type: Itanium2 9M
Processor speed: 1599 MHz
OS version: HP-UX 11.23
Performance Databases
/home/me/.hp_caliper_databases/cpu - March 23, 2005 at 11:17 AM
/home/me/.
Figure 13 Steps in Using the Advisor
[Flowchart: Start, Build application, One or more HP Caliper performance runs, HP Caliper Advisor, Gain better understanding of application performance, Make suggested changes, Make suggested performance runs, End]
To use the HP Caliper Advisor, you perform these steps:
1. Build the application with an initial set of compiler/linker options.
--analysis-focus [executable:]name|all[,[executable:]name],...
-o outputfile[,append|create]
--rule-files rulefile1[,rulefile2,...]
For these options:
--advice-classes
Specifies which classes of advice are printed. It can be all or any combination of general, cpu, memory, io, or system, separated by colons (:). The default is all.
--advice-cutoff
Specifies how much of the advice to print. All advice is sorted by its index value (the greater the index, the greater the importance).
For information about what the options mean, see “How to Read an Advisor Report” (p. 82). As with the HP Caliper command-line options, each of the Advisor’s command-line options has a variable counterpart in the .caliperinit file that can set an option value. The variable name is the same as the option, with hyphens (-) replaced with underscores (_). Later uses of the same command-line option or .caliperinit file variable override earlier uses.
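The naming rule and the "later overrides earlier" rule can be sketched in a few lines of Python. The helper names option_to_variable and resolve are hypothetical and exist only for this illustration. The option names are the Advisor options described above.

```python
def option_to_variable(option):
    """Map an option name such as '--advice-cutoff' to 'advice_cutoff'."""
    return option.lstrip("-").replace("-", "_")


def resolve(settings):
    """Collapse (variable, value) pairs in order; later pairs override earlier ones."""
    resolved = {}
    for variable, value in settings:
        resolved[variable] = value
    return resolved


variables = [option_to_variable(o) for o in
             ("--advice-classes", "--advice-cutoff", "--analysis-focus", "--rule-files")]
print(variables)

# Two settings of the same variable: the second one wins.
final = resolve([("advice_classes", "cpu"), ("advice_classes", "cpu:memory")])
print(final)
```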
$ caliper cpu my_new_app
or:
$ caliper ecount my_new_app
followed by:
$ caliper fprof my_new_app
$ caliper dcache my_new_app
Then, run the Advisor on the composite performance data:
$ caliper advise
Explanation of Report Output
Figure 14 (page 81) shows the report output from the Advisor. The report is explained further in “How to Read an Advisor Report” (p. 82).
1 Application object being analyzed, which version (when it was last modified), the processor type and speed, and operating system version.
2 Performance databases being analyzed.
3 Rule files that were used.
4 Advice section, giving performance tuning advice.
5 First piece of advice, set off by a line of dashes (--------).
6 Second piece of advice, set off by a line of dashes (--------).
7 Cutoff settings, which specify how much of the advice to print.
------------------------------------------------------------------------------Index Class Analysis ------------------------------------------------------------------------------23.9 cpu Function profile 1 [cpu_fprof_1] 2 The percentage of ITLB misses (16.6%) is higher than normal. This may indicate a poor setting for the virtual memory instruction page size. 3 Try adding "+pi 4M" to the application's link command.
• The ordering of rule files and databases on the command line makes no difference to the results produced by the Advisor. The only exception is in the case where the databases contain data from different, incompatible systems for the same executable object. • If you want to use multiple rule files, consider writing a “super” rule file that merely ‘includes’ the real rule files. If you do this, only the super rule file needs to be given on the command line.
7 Using the HP Caliper Advisor in the GUI This chapter describes how to use the HP Caliper Advisor in the HP Caliper graphical user interface (GUI). It assumes that you have some familiarity with the Advisor. For information about the HP Caliper Advisor, see Chapter 6 (page 76). For information about the HP Caliper graphical user interface (GUI), see Chapter 3 (page 31).
Figure 15 HP Caliper GUI In this screen shot of the GUI, you can see that three measurement runs have already been made: two in the Before Changes project (a CPU Cycles Run and a Data Cache Misses Run) and one in the After Changes project (a CPU Cycles Run). The application being measured is the HP C/C++ compiler, compiling the “Hello World” program. The application consists of three processes: cc, ecom, and ld. Note that these are default measurement runs.
If you have a special situation, there are two ways to select what performance data the HP Caliper Advisor analyzes: • You can select one or more projects (implying all of their measurement runs). • You can select one or more measurement runs from any project. In either case, you select an entire project or a measurement run by clicking on its name in the Projects view. You can select more than one item (on Windows) by holding the Ctrl key while selecting the additional ones.
Figure 17 Projects View, with a Single Measurement Run Selected
Generating Advice
The easiest step is getting the HP Caliper Advisor to analyze the selected performance data and generate advice. Figure 18 shows the GUI toolbar. The square icon with a blue checkmark inside means check the performance data. If you “hover” over the icon, the popup tooltip says Generate Advice. Simply click on the icon.
Figure 19 HP Caliper GUI Advisor Menu Generate Advice does the same thing as the toolbar icon: generate new advice from the selected performance data and display it in an Advisor view. Show Advisor View brings up the Advisor view with the advice from the last analysis run. You can use this option to retrieve the Advisor view if you previously closed it. This action also appears in the Window/Show View menu.
Figure 20 Advisor Report in the HP Caliper GUI The individual (potential) performance issues are separated by horizontal lines. The first line of each section gives five pieces of information: the name of the executable, an index value for the issue, which category or advice class (CPU, memory, I/O, and so forth) the issue falls in, a brief description of the performance issue, and the name of the Advisor rule that detected this issue.
8 Configuring HP Caliper HP Caliper gives you multiple methods for configuring how HP Caliper collects data and reports results. Specifying Option Values with a .caliperinit Initialization File If you have an initialization file (called .caliperinit), HP Caliper automatically uses it at startup for data collection or data reporting runs. Putting the options in an initialization file simplifies the command line you use. This file is not required, but can be useful.
Figure 21 .caliperinit File
********************************************************************
#Options applied to all report types.
application ='myapp'
arguments = '-myarg 2'
context_lines = 0,3
summary_cutoff = 1
detail_cutoff =5
source_path_map = '/proj/src,/net/dogbert/proj/src:/home/wilson/work'
#Report-specific options.
Configuring Data Collection HP Caliper gives you flexible control over the data you collect from your program. The types of control you have include: • Particular CPU events to measure. See “Specifying Which CPU Events to Measure” (p. 93). • Specific load modules you want to collect data for. See “Specifying Which Load Modules to Collect Data For” (p. 94). • Granularity of the information. See “Controlling Granularity of Data Collection and Reports” (p. 96). • Particular processes to measure.
$ caliper fprof -s ,,IIR vand
HP Caliper: usage error: Ambiguous event abbreviation ("IIR") specified for "--sampling-spec".
Matches IIR2 (IA64_INST_RETIRED), IIR1 (IA32_INST_RETIRED)
Run caliper -h for help.
Default Settings for Load Module Data Collection
HP Caliper uses these default settings:
module-default
all
module-include
libdl.so
module-exclude
• uld.so
• dld.so
• libsin.so
You cannot override the settings for uld.so, dld.so, and libsin.so.
How to Specify Load Module Names
HP Caliper matches load module names in the following way:
• If you provide a full path for the module name, only an exact match succeeds.
Controlling Granularity of Data Collection and Reports You can control the granularity of data collection and reports. If you want finer granularity (that is, more samples), use the -s option to lower the number of events between samples. For example, you can change the rate from the default 500,000 cycles to 250,000 cycles to get more samples. However, the increased sampling might have a negative effect on your application's performance.
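As a rough worked example of how the sampling period affects the number of samples, the following Python sketch divides the cycles executed by the period. The clock rate and run time are hypothetical values chosen for easy arithmetic, and the function name expected_samples is an assumption. Real sample counts vary with the sampling period variation and with how busy the CPU actually is.

```python
def expected_samples(clock_hz, busy_seconds, period_cycles):
    """Approximate sample count: cycles executed divided by cycles per sample."""
    return (clock_hz * busy_seconds) // period_cycles


CLOCK_HZ = 1_600_000_000   # hypothetical 1.6 GHz processor
BUSY_SECONDS = 10          # hypothetical CPU-bound run time

default_rate = expected_samples(CLOCK_HZ, BUSY_SECONDS, 500_000)
finer_rate = expected_samples(CLOCK_HZ, BUSY_SECONDS, 250_000)
print(default_rate, finer_rate)
```

Halving the period doubles the approximate number of samples, which is also why the measurement overhead grows.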
• The origin column, which identifies whether the process was created via a fork, vfork, or exec. • The handling column, which shows whether the process was measured, tracked, or ignored. • The exit status, which is the final exit code for the process. Figure 22 (page 97) shows an example process tree report.
-p [some:][(opt1,...)]pattern
For simple uses:
-p glob1[:glob2:...]
Matches the executable base name of each new process against each glob pattern. A glob pattern follows the Unix shell-style rules to expand file names. If one or more of those patterns match, the process is measured. Otherwise, the process is tracked. For more information, see “Using -p some” (p. 98). If you specify multiple -p options, the last one takes precedence.
Using -p some
The syntax for -p some is the most complex.
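The simple form of -p can be approximated in Python with the standard fnmatch module. This is a sketch of the matching rule described above, not HP Caliper's code: the function name classify_process is hypothetical, and fnmatch only approximates Unix shell glob rules.

```python
import fnmatch
import os


def classify_process(executable_path, patterns):
    """Return 'measured' if the base name matches any glob pattern, else 'tracked'.

    patterns is a colon-separated list, as in -p glob1[:glob2:...].
    """
    base_name = os.path.basename(executable_path)
    for pattern in patterns.split(":"):
        if fnmatch.fnmatchcase(base_name, pattern):
            return "measured"
    return "tracked"


print(classify_process("/opt/bin/ecom", "ecom:ld"))  # base name matches 'ecom'
print(classify_process("/usr/bin/cc", "ecom:ld"))    # no pattern matches
print(classify_process("/usr/bin/ld", "l*"))         # wildcard match
```

Only the base name is compared, so the directory part of the executable path does not affect the result.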
Table 5 Name Source Options Used with -p some (continued)
Option          Description
arg1 or argv1   The name is argument 1 of the process, or "" (empty string) if there is no such argument.
The last option specified takes precedence.
Table 6 Process Origin Options Used with -p some
Option          Description
root            Denotes the initial root process.
fork            Matches any process created by fork of a measured or tracked parent process.
exec            Matches any process created by exec of a measured or tracked process.
Using HP Caliper in Your Build Process
You can integrate HP Caliper into your build process by including the HP Caliper commands in your makefile.
Using HP Caliper in Testing and Quality Assurance
Use these steps with new makefile targets for testing and quality assurance builds:
1. Make predefined HP Caliper performance measurements using your sample data sets.
2. Compare HP Caliper results with results from previous builds to identify performance improvements or regressions.
To specify the length of time before detaching and reporting, use the -e seconds option. The value of seconds represents the number of real-time seconds HP Caliper is attached to the process. The exact placement of -e seconds is not significant. Some example command lines are:
$ caliper fprof -e 15 /usr/bin/ls -R
$ caliper fprof -e 20 7654
NOTE: The target process being measured does not terminate when HP Caliper detaches from it.
-c or --cpu-counter -c counter_name|keyword|all Specifies what kind of information about the CPU counters should be output. You can specify a partial name. Use all for information on all CPU counters. The -c and -r options are mutually exclusive. If neither is given, then -c is assumed. The output of this option comes from two text files in the HP Caliper directory. See “Specifying Which CPU Events to Measure” (p. 93).
To get all of the descriptive information on the BACK_END_BUBBLE.ALL processor event, use:
$ caliper info -d all back_end_bubble.all
To get information on the branch report, use:
$ caliper info -r branch
HP Caliper Environment Variables
HP Caliper uses environment variables to control certain default settings.
CALIPER_DATABASES
Specifies the location of the databases directory.
9 Controlling the Content of Reports HP Caliper allows you to control the content of reports based on the data collected. Processor Information, Run Information, and Sampling Specifications are present by default in all collection run reports. Layout of an HP Caliper Text or CSV Report HP Caliper uses a consistent layout for the sections in all of the measurement reports produced for text or CSV output.
• ◦ Call Graph and Function Indexes for scgprof and cgprof ◦ Hot Call Paths, Call Graph, and Function Indexes for cstack Blocking Primitives Summary ◦ • Report Help ◦ • Hot Call Paths, Call Graph, and Function Indexes for cstack A description of how to get help in understanding the report Diagnostic Messages (possibly) See Table 7. Table 7 Information in HP Caliper Reports Specific to Particular Types of Reports cgprof (HP-UX only) cstack scgprof These reports are unique.
Table 8 Available Metrics for Report Sorting and Cutoffs
Report Name, Notes, Available Metrics
alat
• sampled-misses (default)
branch
• target
• branch-ways
• mispredict (default)
• back-end-only-mispredict
cgprof (HP-UX only)
• call-count
• msecs-per-call
• samples (default)
• seconds
cstack
• samples (default)
• samples-running (HP-UX only)
• sampled-blocked (HP-UX only)
dcache
• avg-latency
• latency (default)
• sampled-misses
dtlb
• hpw-fills
• l2-fills
• sampled-misses (default)
• soft-fill
Table 8 Available Metrics for Report Sorting and Cutoffs (continued)
Report Name, Notes, Available Metrics
scgprof
• call-count
• msecs-per-call
• samples (default)
• seconds
traps (Notes: Default by first trap)
• samples
Module-Centric Reports
If you use the --group-by module option, HP Caliper will produce a module-centric report. In a module-centric report, there is no data about individual processes in the collection runs.
Function Summary ------------------------------------------------------------------------% Total Cumulat IP % of IP Samples Total Samples Function File ------------------------------------------------------------------------5.23 5.23 8 libbfd-2.15.92.0.2.so::bfd_hash_... 4.58 9.80 7 libbfd-2.15.92.0.2.so::bfd_hash_... 3.92 13.73 6 libc.so.6.1::__gconv_transform_u... There is no Process Summary information (even though nine processes are measured). In the Load Module Summary, all data in libc.
0.00 100.00 +0 ecom (first set - second set) --------------------------------------------- Function Details Each instruction bundle shown in a Function Details table consists of four rows of data. The top row for the instruction bundle shows data totals for the bundle. The remaining rows show per-instruction data. The bundles shown may or may not be contiguous. You can use the -r (--report-details) option to specify whether reports should contain function source (-rs), instructions (-ri), or both (-ra).
When a branch target is a stub, the target is simply shown as *STUB@address*. Branch Targets in Disassembly Listings By default, the symbols shown for branch targets in disassembly are limited to 30 characters. You can change the limit by setting the following variable in the measurement configuration file or the .
How Functions Are Named in Reports
HP Caliper attempts to print the most complete name possible for each function listed in reports. The general format for function names is:
load_module_name::function_name
For example:
libdl.so.1::libdl_init
threads::tu_thread_destroy
If the load module name is implicit from the context, then HP Caliper prints only the simple function name. Consult a linker load map or disassembly listing, or both, to determine the function.
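When post-processing report text, the load_module_name::function_name format can be split with a short Python helper. This is a sketch only: the function name split_function_name is hypothetical, and it assumes the first "::" separates the load module from the function, which would misread a bare C++ qualified name printed without its load module.

```python
def split_function_name(name):
    """Split 'load_module::function' into (load_module, function).

    Returns (None, name) when no load module prefix is present.
    Assumption: the first '::' is the module separator.
    """
    module, separator, function = name.partition("::")
    if not separator:
        return None, name
    return module, function


print(split_function_name("threads::tu_thread_destroy"))
print(split_function_name("libc.so.1::strlen"))
print(split_function_name("main"))
```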
This information is reported under Processor Information. • Processor set (pset) Every application can (potentially) be run in a different processor set, which can have unique characteristics that impact performance. For each process, HP Caliper detects and reports which processor set was used. Possibilities are: ◦ Default: No processor set was specified. ◦ Kernel: A special processor set that a few kernel processes belong to. (This appears only in system-wide measurements.
How HP Caliper Saves Data in Databases HP Caliper saves performance data for every measurement run to a database. This allows you to regenerate reports from the same performance data without having to rerun your application under HP Caliper. You have these capabilities: • You can generate a new report with different attributes from the saved data. This means that you do not have to rerun HP Caliper on the live program.
By default, HP Caliper does not generate a report file when you specify -d. However, you can generate a report file at the same time by specifying -o.
and database(s) is one of these: • [database ... ] (for caliper report) • [database1 database2 ... ] (for caliper merge) • database2 database1 (for caliper diff) Using the caliper report Command to Create a Report from One or More Databases Use caliper report to create a single output report from one or more databases. The syntax for this command is: caliper report [report_options] [database ...] You can specify multiple databases, either individually or by using wildcards.
Example 2 Example of a caliper merge Run ================================================================================ HP Caliper 4.3.
Database: /home/sujoys/db3 Measurement scope: per-process Sampling Specification Sampling event: CPU_CYCLES Sampling period: 500000 events Sampling period variation: 25000 (5.
Example 3 Example of a caliper diff Run ================================================================================ HP Caliper 4.3.
HP Caliper supports diff reports for all measurements except the ones below:
• cgprof (HP-UX only)
• cpu (HP-UX only)
• cstack
• pmu_trace
• scgprof
Example of How to Use the caliper diff Command
Assume these two measurement runs:
$ caliper fprof -d fp1 cc himom.c
$ caliper fprof -d fp2 cc -c himom.
10 Producing a Sampled Call Graph Profile Analysis HP Caliper can produce a sampled call graph profile report (using the scgprof measurement) from any compiled program. You do not need to compile your program in any special way to use this feature. The call graph is produced by sampling the processor's performance monitoring unit (PMU) to determine function calls. The call graph is not exact, because it does not show every function call, but it is statistically useful. This chapter provides an overview.
Running the HP Caliper Sampled Call Graph Profile You can start HP Caliper from the command line, a shell script, or your program's Makefile to produce a sampled call graph profile. The syntax is: caliper scgprof [caliper_options] program [program_arguments] This measurement uses the --branch-sampling-spec option to control the sampling of the branch trace buffer (BTB)/execution trace buffer (ETB), which produces the statistical call graph. For more information, see “--branch-sampling-spec” (p. 54).
Figure 25 Sampled Call Graph Text Report Example ================================================================================ HP Caliper A.4.3.
-------------------------------------------------------------------------Load Module Summary ------------------------------------------------------------------------------% Total Cumulat Secs Msecs IP % of IP in Call per Samples Total Samples Module Count Call Load Module ------------------------------------------------------------------------------59.26 59.26 32 0.01 2168 0.00 libc.so.1 38.89 98.15 21 0.01 135 0.05 wordplay 1.85 100.00 1 0.00 8 0.04 dld.
9 Function Totals -----------------------------------------0 0x0000:0 M addp4 r32=r0,r32 :1 F nop.f 0x0 :2 I addp4 r33=r0,r33 0x0010:0 M nop.m 0x0 :1 F nop.f 0x0 :2 I nop.i 0x0;; 0x0020:0 M alloc r31=ar.pfs,0,8,0,8 :1 F nop.f 0x0 :2 I and r28=0x7,r33 0x0030:0 M and r30=0xfffffffffffffff8,r33 :1 I and r29=0x7,r32 :2 B brp.loop.imp {self}+0x180,{self}+0x190;; 1 0x0040:0 M cmp.eq.unc p15=r0,r32 :1 F nop.f 0x0 :2 I shl r21=r28,3 0x0050:0 M cmp.eq.unc p14,p13=r0,r33 :1 I mov r8=r32 :2 B (p15) br.ret.dpnt.
:2 ~5,0x2400:0 :1 :2 B M M I (p4) (p5) (p5) (p20) br.cond.dpnt.many {self}+0x2db0;; ld8.acq r8=[r37] ld8 r1=[r38] mov r1=r48;; ~ ~ ~ ~ ~ ~ ~ ~ 2 ~5,0x2430:0 :1 :2 ~5,0x2440:0 :1 :2 ~301 ~2,0x2450:0 :1 :2 M nop.m 0x0 B (p6) br.cond.dpnt.many {self}+0x870 B (p2) br.cond.dpnt.many {self}+0x3390;; M (p3) mov r47=1 M nop.m 0x0 B br.
--------------------------------------------------5.56 [wordplay::alphabetic, 0x4005b80, wordplay.c] 3 ~902 Function Totals -----------------------------------------[/home/meagher/wordplay.c] (0) ~902 >{ 0 ~1,0x0000:0 M alloc r43=ar.pfs,0,13,1,0 :1 M addl r38=-48,r1 :2 I mov r33=b0 (1) ~907 > for (i = 0; i < (int) strlen (s); i++) ~3,0x0010:0 M mov r42=r1 :1 M addl r9=160,r1 :2 I mov r45=r32;; ~ ~ ~ ~ ~ ~ ~ ~ ~909 > alphstr[pos++] = s[i]; ~7,0x0060:0 M adds r44=144,r44 :1 I mov b6=r8 :2 B br.call.sptk.
/ux/libsobj_i380em/libs/libc/shared_em_32/obj/../../../../../core/libs/libc/shared_em_32/../core/stdio/fgets.c] (0) 0 ~ ~ ~ ~ ~ ~ ~ ~ (1) 1 ~96 ~1,0x0000:0 :1 :2 ~1,0x0010:0 :1 :2 > ~49 ~1,0x00f0:0 :1 :2 ~1,0x0100:0 :1 :2 ~1,0x0110:0 :1 :2 *> M M I M M I M I I M I I M M B (p6) alloc mov mov adds addl addp4 r35=ar.
12.1 libc.so.1::strcpy [5] wordplay::main [2] *ROOT* [1] ---------------------------9.9 libc.so.1::strlen [3] wordplay::main [2] *ROOT* [1] ---------------------------9.2 libc.so.1::strlen [3] wordplay::alphabetic [6] wordplay::main [2] *ROOT* [1] ---------------------------8.8 libc.so.1::strlen [3] wordplay::uppercase [4] wordplay::main [2] *ROOT* [1] ---------------------------7.4 libc.so.1::toupper [8] wordplay::uppercase [4] wordplay::main [2] *ROOT* [1] ---------------------------5.
26.98 51/189 27 wordplay::extract [7] 72.49 137/189 72 wordplay::main [2] [5] 16.7 100.00 189 libc.so.1::strcpy [5] -----------------------------------------------------------------------100.00 39/39 100 wordplay::main [2] [6] 14.7 37.70 39 wordplay::alphabetic [6] 62.30 458/1478 31 libc.so.1::strlen [3] -----------------------------------------------------------------------100.00 44/44 100 wordplay::main [2] [7] 11.8 47.09 44 wordplay::extract [7] 38.12 51/189 27 libc.so.1::strcpy [5] 14.78 87/1478 6 libc.
2 19 20 21 3 4 main memmove mmap strcmp strlen uppercase 10 23 24 5 8 memccpy __milli_rem32U _mmap_sys strcpy toupper ---------------------------------------------------------------------Diagnostic Messages ---------------------------------------------------------------------+ Note: Multiple sampling counter variations are not available on HP-UX.
Diagnostic Messages The Diagnostic Messages appear at the end of the report. gprof Fallacy and Possibly Misleading Results The HP Caliper sampled call graph report (with the scgprof measurement) and the call graph report (with the cgprof measurement) both produce gprof-like reports. Thus, both these reports might produce misleading results regarding the amount of time spent under a function.
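The gprof fallacy can be illustrated with a small Python calculation. A gprof-style report divides a callee's time among its callers in proportion to call counts, which implicitly assumes that every call costs the same. The numbers and the names attribute_to_callers, caller_a, and caller_b are hypothetical, and the sketch shows the general gprof-style attribution rather than HP Caliper's exact computation.

```python
def attribute_to_callers(callee_seconds, calls_by_caller):
    """Split a callee's total time among callers in proportion to call counts."""
    total_calls = sum(calls_by_caller.values())
    return {caller: callee_seconds * calls / total_calls
            for caller, calls in calls_by_caller.items()}


# Hypothetical case: the callee ran for 10 seconds in total.
# caller_a made 90 cheap calls (really about 1 second in total);
# caller_b made 10 expensive calls (really about 9 seconds in total).
attributed = attribute_to_callers(10.0, {"caller_a": 90, "caller_b": 10})
print(attributed)
```

Here the call-count rule charges 9 seconds to caller_a and 1 second to caller_b, the reverse of where the time actually went. This is the kind of misleading result the section describes.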
11 Producing a Sampled Call Stack Profile Analysis HP Caliper can produce a sampled call stack profile report (using the cstack measurement) from any compiled program. You do not need to compile your program in any special way to use this feature. HP Caliper periodically samples the application program counter and each of its thread's call stacks and then creates a call stack profile of the program's execution.
Figure 26 Call Stack Profile Text Report Example ================================================================================ HP Caliper A.4.4.
--------------------------------------------------------------------------------------------------------57.14 57.14 28 0 28 0 9 libpthread.so.1 40.82 97.96 20 0 20 0 0 libc.so.1 2.04 100.00 1 1 0 0 0 enh_thr_mutex1 --------------------------------------------------------------------------------------------------------100.00 100.
---------------------------------------------40.8 0.0 40.8 libc.so.1::__sigtimedwait_sys [8] libc.so.1::sigtimedwait [5] libc.so.1::_sleep [4] enh_thr_mutex1::foo [7] enh_thr_mutex1::start_routine [3] libpthread.so.1::__pthread_bound_body [2] ---------------------------------------------38.8 0.0 38.8 libpthread.so.1::___lwp_wait_sys [10] libpthread.so.1::_lwp_wait [11] libpthread.so.1::__vp_join [9] libpthread.so.1::pthread_join [12] enh_thr_mutex1::main [13] dld.
100.00 enh_thr_mutex1::start_routine [3] 0.00 libpthread.so.1::pthread_mutex_lock [16] 100.00 libpthread.so.1::*unnamed@0x404(1670-5b70)* [14] -------------------------------------------------------------------100.00 libpthread.so.1::_lwp_mutex_lock [15] [17] 18.4 0.0 18.4 100.00 libpthread.so.1::__lwp_mutex_lock_sys [17] -------------------------------------------------------------------[16] 18.4 0.0 18.
Block Hits Hits Name Hits Only Only ---------------------------------------------95.0 0.0 95.0 libpthread.so.1::___lwp_wait_sys [3] libpthread.so.1::_lwp_wait [4] libpthread.so.1::__vp_join [5] libpthread.so.1::pthread_join [6] enh_thr_mutex1::main [7] dld.so::main_opd_entry [1] ---------------------------------------------5.0 5.0 0.0 enh_thr_mutex1::main [7] dld.so::main_opd_entry [1] ---------------------------------------------[Minimum function entries: 5, percent cutoff: 1.
20.41 [(No source information) libc.so.1::__sigtimedwait_sys, 0x422ab40] 10 0 10 0 0 Function Totals ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------18.37 [(No source information) libpthread.so.
-------------------------------------------------------------------Function Indexes (Thread 6065598@start_routine) --------------------------------------------------Index Name Index Name --------------------------------------------------2 *ROOT* 6 foo 10 _lwp_mutex_lock 11 __lwp_mutex_lock_sys 3 __pthread_bound_body 8 pthread_mutex_lock 5 sigtimedwait 7 __sigtimedwait_sys 4 _sleep 1 start_routine 9 *unnamed@0x404(1670-5b70)* Load Module Summary (Thread 6065597@start_routine) --------------------------------
-------------------------------------------------------------------100.00 *ROOT* [7] [6] 100.0 0.0 100.0 0.00 libpthread.so.1::__pthread_bound_body [6] 100.00 enh_thr_mutex1::start_routine [5] -------------------------------------------------------------------[7] 100.0 0.0 100.0 0.00 *ROOT* [7] 100.00 libpthread.so.
Figure 27 Call Stack Profile Text Report Example for Linux ================================================================================ HP Caliper C.4.4.
-------------------------------------------------------------Function Summary (All Threads) ---------------------------------------------------------------------------------------% Total Cumulat WallSample IP % of Clock Hits Samples Total Samples Waiting Function File ---------------------------------------------------------------------------------------78.05 78.05 32 1 *kernel gateway* 21.95 100.00 9 0 enh_thr_mutex1::main enh_thr_mutex1.
[1] 100.0 0.00 *ROOT* [1] 51.22 libc.so.6.1::__clone2 [5] 48.78 enh_thr_mutex1::_start [8] -----------------------------------------------62.50 libc.so.6.1::__GC___libc_nanosleep [9] 34.38 libpthread.so.0::pthread_join [12] 3.12 libpthread.so.0::__lll_lock_wait [13] [2] 78.0 100.00 *kernel gateway* [2] -----------------------------------------------100.00 libpthread.so.0::start_thread [4] [3] 51.2 0.00 enh_thr_mutex1::start_routine [3] 95.24 enh_thr_mutex1::foo [11] 4.76 libpthread.so.
Function Details (Thread 31021@main) -----------------------------------------------------------------% Total WallSample Line| IP Clock Hits Slot| >Statement| Samples Samples Waiting Col,Offset Instruction -----------------------------------------------------------------26.83 [(No source information) *kernel gateway*, 0xa000000000000000] 11 0 Function Totals --------------------------------------------------------------------------------------------------------------------------21.
26.83 26.83 11 1 *kernel gateway* -------------------------------------------------------------26.83 26.83 11 1 Total -------------------------------------------------------------Function Summary (Thread 31024@start_routine) ---------------------------------------------------------------------------------------% Total Cumulat WallSample IP % of Clock Hits Samples Total Samples Waiting Function File ---------------------------------------------------------------------------------------26.83 26.
100.00 libc.so.6.1::__GC___libc_nanosleep [6] -----------------------------------------------100.00 enh_thr_mutex1::start_routine [2] [8] 90.9 0.00 enh_thr_mutex1::foo [8] 100.00 libc.so.6.1::sleep [7] -----------------------------------------------100.00 libpthread.so.0::pthread_mutex_lock [10] [9] 9.1 0.00 libpthread.so.0::__lll_lock_wait [9] 100.00 *kernel gateway* [1] -----------------------------------------------100.00 enh_thr_mutex1::start_routine [2] [10] 9.1 0.00 libpthread.so.
[4] 100.0 0.00 enh_thr_mutex1::foo [4] 100.00 libc.so.6.1::sleep [3] -----------------------------------------------100.00 libpthread.so.0::start_thread [6] [5] 100.0 0.00 enh_thr_mutex1::start_routine [5] 100.00 enh_thr_mutex1::foo [4] -----------------------------------------------100.00 libc.so.6.1::__clone2 [7] [6] 100.0 0.00 libpthread.so.0::start_thread [6] 100.00 enh_thr_mutex1::start_routine [5] -----------------------------------------------100.00 *ROOT* [8] [7] 100.0 0.00 libc.so.6.
Example 4 Sample cstack Report - Blocking Primitives Details Blocking Primitives Details (All Threads) -----------------------------------------------------------------------------------------------Sample Callpath Holder's % Total Sample Sample Sample Hits Index Kernel Hits Hits Hits Hits Blocking For Holder Holder Thread Waiting Waiting Spinning Blocked Primitive --For Waiter --Waiter ID -----------------------------------------------------------------------------------------------20.
Call Graph Part of the Report
This section reports the call graph produced from the call stack samples. All the call graph entries, one for each function, are reported. Each entry has one or more lines and is delimited by a line of dashes. In each entry, the primary line is the one that starts with an index number in square brackets. The preceding lines in the entry describe the callers of this function. The lines following the primary line describe the callees of this function.
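The entry layout described above can be sketched as a small Python parser. The function name split_entry is hypothetical. The sample entry is taken from the call stack report example in this chapter, with line breaks restored by assumption. The parser relies only on the rule that the primary line starts with an index number in square brackets.

```python
import re

# A primary line begins with an index number in square brackets, for example "[6]".
PRIMARY_LINE = re.compile(r"^\[(\d+)\]")


def split_entry(lines):
    """Split one call graph entry into (callers, primary_line, callees)."""
    stripped = [line.strip() for line in lines]
    for position, line in enumerate(stripped):
        if PRIMARY_LINE.match(line):
            return stripped[:position], line, stripped[position + 1:]
    raise ValueError("no primary line found in entry")


entry = [
    "100.00 *ROOT* [7]",
    "[6] 100.0 0.0 100.0 0.00 libpthread.so.1::__pthread_bound_body [6]",
    "100.00 enh_thr_mutex1::start_routine [5]",
]
callers, primary, callees = split_entry(entry)
print(callers)
print(primary)
print(callees)
```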
0 100.0 0.0 100.0 libc.so.1::__sigwait_sys libc.so.1::sigwait caliper::signal_monitor_thread_main libpthread.so.1::__pthread_bound_body ------------------------------------------------------------Hot Call Paths (Thread 859900@timers_thread_main) --------------------------------------------------------------% Total Hits In Only-Run + Run Block Index Block Hits Hits Name Hits Only Only ------------------------------------------------------------0 100.0 0.0 100.0 libc.so.1::_nanosleep_sys libc.so.
12 Performing CPU Metrics Analysis HP Caliper can measure and report per-process or system-wide metrics based on sampled CPU events. This is enabled by the cpu measurement. Specify the events and sampling period with the -m event_set and -s period options, respectively. You can measure multiple metrics in the same run. For most applications, the cpu measurement is the first measurement you should take when you begin using HP Caliper. Run this command: $ caliper cpu -o cpu.
13 HP Caliper Features Specific to HP-UX These features are available only when using HP Caliper on the HP-UX operating system: • These measurements: ◦ cgprof ◦ cpu See “Performing CPU Metrics Analysis • ◦ fcount ◦ fcover ” (p. 151). These command-line options: ◦ --bus-speed See “--bus-speed ◦ ” (p. 55). --cpu-aggregation See “--cpu-aggregation ◦ --cpu-details See “--cpu-details ◦ ” (p. 61). --exclude-idle See “--exclude-idle ◦ ” (p. 56).
If the HP Caliper run is made on a ccNUMA system, then the memory usage of every “logical domain” is separately measured and reported. If on an SMP system, then only the single, “local domain” is measured and reported. The system memory usage measurement is always taken if the --memory-usage= option is used. The measurement is made only once at the beginning of the HP Caliper run and the same data is reported for each process in a multiprocess run.
• --memory-usage= Causes process memory usage to be measured at the beginning, at the end, and every 1 second of the process's execution. (This is equivalent to --memory-usage all.) • --memory-usage=15s Causes no process memory usage measurement to be taken. Although it does specify a sampling rate, it does not request that “timed” measurements be made. The system memory measurement is still taken.
Physical Id Physical identification number of the logical domain. Cell local memory physical domains are numbered starting with 0. The interleaved memory physical domain is −1. Type Indicates whether this logical domain is physically configured as cell local memory (CLM) or interleaved (ILV) memory. # CPUs For each logical domain, indicates the number of CPUs that are sharing it. Used Pages Number of memory pages currently in use. Free Pages Current number of unused memory pages.
Domain Id System identification number of the logical domain. On ccNUMA systems, cell local memory domains are numbered starting at 1, and the interleaved memory domain Id is –1. On SMP systems, the only domain is numbered 0. Shared Pages Number of shared resident memory pages currently in use by this process. This is typically executable code for the program, shared libraries, and mmap'd (shared) regions.
Figure 29 Example System Usage Report Output System Usage - Run Status (All Threads) -------------------------------------------------------------------------------Relative -------- Time (thread secs) -------------- Percentage -------Time Running Eligible Waiting Running Eligible Waiting -------------------------------------------------------------------------------Overall 5.4534 0.0060 18.3617 22.89% 0.03% 77.
sigenable 132 1627.70 0.00000 0.00000 0.00000 0.00012 pstat 1 49.32 0.00010 0.00010 0.00010 0.00010 lwp_cond_broadcast 6 73.99 0.00000 0.00001 0.00006 0.00008 ttrace 1 49.32 0.00007 0.00007 0.00007 0.00007 open 6 295.95 0.00001 0.00001 0.00002 0.00007 ioctl 1 49.32 0.00004 0.00004 0.00004 0.00004 shmctl 2 98.65 0.00000 0.00002 0.00004 0.00004 brk 15 184.97 0.00000 0.00000 0.00000 0.00003 mpctl 16 789.19 0.00000 0.00000 0.00000 0.00003 sigaction 22 1085.13 0.00000 0.00000 0.00001 0.00003 close 10 493.24 0.
2. Run ./myprog and find the process ID of the process.
3. Specify the process you want to measure. For example:
$ caliper fprof 7654
HP Caliper remains attached to the target process until it ends or you type Ctrl-C. If you type Ctrl-C to stop HP Caliper and generate a report, HP Caliper forcibly terminates all processes that are being measured.
sampling_counter = “NO_EVENT” If you don't change this setting, then the samples you have marked will be included with whatever sampling results HP Caliper is set to generate. You can instead run HP Caliper, specifying -s ,,NO_EVENT or -s "" on the command line. 5. Run your application under HP Caliper using that modified measurement configuration file: $ caliper my_pmu_trace myprogram Figure 31 (page 160) shows part of the resulting report.
This prevents the compiler from reordering statements while optimizing code, so the measured program's performance may be worse than it would be otherwise. For example, with sample points inside a loop, loop-invariant promotion or other loop transformations could become illegal or less effective. For sample points placed at the entrance and exit of functions, this could affect performance if the function is inlined.
NOTE: This feature is not intended to measure a small number of instructions. Enabling and disabling the PMU are not immediate operations and either operation might take a few processor cycles to be effective. Processor events occurring during those transitions might or might not be measured. Avoid using measurement windows so small that those uncertainties will significantly affect the reported numbers. To use this feature: 1.
Figure 32 Restricting PMU Measurement to Specific Code #include #include
A HP Caliper Diagnostic and Warning Messages This appendix describes some diagnostic and warning messages you might receive. HP Caliper always attempts to measure everything that you request. When this is not possible, however, HP Caliper gives you diagnostic or warning messages. You can usually safely ignore these messages. Several situations can cause these messages: • A sampled address is outside the measurement context. • A function contains specialized assembly code.
Figure 33 Mispredicted Branches Example Function Details ---------------------------------------------------------------------------------------------% Total Target Line| Taken of Branch Branch Taken NTaken % Slot| >Statement| Mispr Branch Taken NTaken Mispr Mispr Mispr Col,Offset Instruction ---------------------------------------------------------------------------------------------25.00 [libc.so.1::__thread_mutex_lock, 0x40000000002123a0, wrappers1.c] 2 2 0 1 0 50.
---------------------------------------------------------------------------------------------[Minimum function entries: 0, percent cutoff: 1.00, cumulative percent cutoff: 100.00] By using a custom HP Caliper script, you can restrict the branch-trace buffer to only include branches with specific prediction results, both for target prediction and taken/not-taken prediction.
On HP-UX, sampled call graph reports require kernel patch PHKL_34020. To install this patch, check the HP IT Resource Center for availability and download information. Email the HP Caliper team at caliper-help@cup.hp.com if you have questions about this patch.
B Descriptions of Measurement Reports This appendix contains descriptions of reports produced for each HP Caliper measurement. It shows example command lines you can use to produce the reports and describes the data available with the measurements.
alat Metrics Summed for Entire Run This section describes the metrics summed over the entire run of your application under HP Caliper. Metrics for Integrity Servers Itanium 2 Systems INST_CHKA_LDC_ALAT.ALL INST_FAILED_CHKA_LDC_ALAT.ALL ALAT_CAPACITY_MISS.ALL Data speculation miss percentage The number of advance check load (chk.a) and check load (ld.c) instructions that reached retirement, including both integer and floating-point instructions. The number of failed advance check load(chk.
INST_CHKA_LDC_ALAT.FP The number of all advanced check load (chk.a) and check load (ld.c) instructions that reach retirement. Counts only retired floating-point instructions. INST_CHKA_LDC_ALAT.INT The number of all advanced check load (chk.a) and check load (ld.c) instructions that reach retirement. Counts only retired integer instructions. INST_FAILED_CHKA_LDC_ALAT.FP The number of failed advanced check load (chk.a) and check load (ld.c) instructions that reach retirement.
Table 9 Information in alat Measurement Reports
Column Description
% Total Sampled ALAT Misses Percent of the total for the metric attributable to a given program object. The metric is the same as the one HP Caliper uses for sorting, except when the sort metric is address, in which case sampled misses is used.
Cumulat % of Total A running sum of the percent of total for the metric accounted for by the given program object and those listed above it.
Command-line options allow you to control the amount of data reported, how the data are sorted, and the number of statements and instructions reported for each sampled program location. Example Command Line for Text Report $ caliper branch -o brp.txt ./wordplay thequickbrownfox Example Command Line for CSV Report $ caliper branch --csv csvout ./wordplay thequickbrownfox branch Metrics Summed for Entire Run This section describes the metrics summed over the entire run of your application under HP Caliper.
• BR_MISPRED_DETAIL.NRETIND.WRONG_PATH Number of non-return indirect branches mispredicted due to wrong branch direction. • BR_MISPRED_DETAIL.NRETIND.WRONG_TARGET Number of non-return indirect branches mispredicted due to wrong target for taken branch. • BR_MISPRED_DETAIL.RETURN.CORRECT_PRED Number of correctly predicted (outcome and target) return branches retired. • BR_MISPRED_DETAIL.RETURN.WRONG_PATH Number of return branches mispredicted due to wrong branch direction. • BR_MISPRED_DETAIL.
• Percent ret correct predictions Percentage of return branches that were predicted correctly. • Percent ret wrong paths Percentage of return branches for which the branch predicate was mispredicted. • Percent ret wrong branch targets Percentage of return branches for which the branch target was mispredicted. • % of cycles lost due to branch misprediction or exception/interruption flush Percentage of cycles lost due to either an exception/interruption or a branch misprediction flush.
Table 10 Information in branch Measurement Reports (continued) Column Description Load Module Shared library or the main executable. Function Routine from your application. File Source file associated with a function.
HP Caliper Call Graph Profile Results Accuracy The HP Caliper call graph profile report directly measures call graph data. The numbers of calls are derived by counting, not sampling. They are completely accurate and will not vary from run to run if your program is deterministic. The HP Caliper call graph profile report's sampling of IP data is statistical, so you should expect a small variation (less than +/- 5%) in the timing data that HP Caliper collects for different runs of your application.
Table 11 Information in cgprof Measurement Report Fields (Flat Profile) (continued) Column Description Line | Slot | Col,Offset The column contains one of these: • A source-code line number for rows showing statements • An instruction slot number for rows showing instructions not on a bundle boundary • A source-code column number followed by an offset from the beginning address of a function for rows showing instructions on a bundle boundary Column and line numbers are preceded by “~” when they are appro
Table 14 Information in cgprof Measurement Report: Parent Listings (continued) Column Description Parents Name of this parent function. Cycle Cycle that this parent is a member of, if any. *This field is omitted for parents, or children, in the same cycle as the function. If the function, or child, is a member of a cycle, the propagated times and propagation denominator represent the self time and descendant time of the cycle as a whole.
◦ MINIMUM — Minimum value across all samples. ◦ LOW90 — Lowest value in the 90% confidence interval. This indicates that, statistically, 90% of the time, the mean value will be higher than or equal to this value. ◦ HIGH90 — Highest value in the 90% confidence interval. This indicates that, statistically, 90% of the time, the mean value will be lower than or equal to this value. ◦ MAXIMUM — Maximum value across all samples.
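The four summary values above can be illustrated with a minimal sketch. It assumes a two-sided 90% interval around the mean using the normal approximation (z = 1.645); this guide does not state how HP Caliper computes LOW90 and HIGH90, so the method and the function name `summarize` are illustrative assumptions only.

```python
import math
import statistics


def summarize(samples):
    """Return MINIMUM, LOW90, HIGH90, and MAXIMUM for a list of sample values.

    Assumption: a two-sided 90% confidence interval for the mean, using the
    normal approximation (z = 1.645). HP Caliper's exact method may differ.
    """
    n = len(samples)
    mean = statistics.mean(samples)
    # Standard error of the mean; zero when only one sample is available.
    sem = statistics.stdev(samples) / math.sqrt(n) if n > 1 else 0.0
    half_width = 1.645 * sem
    return {
        "MINIMUM": min(samples),
        "LOW90": mean - half_width,
        "HIGH90": mean + half_width,
        "MAXIMUM": max(samples),
    }


# Hypothetical per-sample metric values.
stats = summarize([10.0, 12.0, 11.0, 13.0, 9.0])
print(stats)
```

A narrow gap between LOW90 and HIGH90 suggests the sampled mean is stable; a wide gap suggests that more samples are needed.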
dspec Provides metrics on the effectiveness of data speculation.
fp Provides information relating to floating-point operation density, execution rate, and flush/trap events density.
l1dcache Provides miss rate information for the L1 data cache.
l1icache Provides miss and prefetch usage information for the L1 instruction cache.
l2cache Provides miss rate information for the L2 unified cache.
l2dcache l2icache l3cache overview
stall Provides metrics on primary CPU performance limiters by breaking the CPI into seven components.
sysbus Provides metrics on system bus utilization. If you specify the sysbus event set, you must use the --bus-speed option to provide bus speed in MHz. For example: --bus-speed 200.
threadswitch Provides data about the effect of HyperThreading on the measured processes for Integrity servers dual-core Itanium 2 and Itanium 9300 quad-core processor systems.
tlb
Table 16 Information in cstack Measurement Report Fields (Flat Profile) (continued) Column Description Wall-clock Samples Total number of direct sample hits attributed to the given object. (Linux only) Sample Hits Waiting Number of sample hits taken when a thread was waiting (divided into blocked and spinning) on a thread level blocking primitive (mutex, read/write lock, or condition variable) or process (HP-UX only) level blocking primitive (semaphore, message queue, socket, pipe, file descriptor).
Table 19 Information in cstack Measurement Report Fields (Call Graph Profile) Column Description Index Index of the function in the call graph listing, as an aid to locating it. % Total Hits In/Under Run + Block Hits Percentage of the total sample hits in or under function; run and blocked hits combined. (HP-UX only) % Total Hits In/Under Run Hits Only Percentage of the run sample hits in or under function.
The report shows measured data by thread, load module, function, source statement, and instruction bundle. Command-line options allow you to control the amount of data reported, how the data is sorted, and the number of statements and instructions reported for each sampled program location. Example Command Line for Text Report $ caliper cycles -ra -o reports/sample.txt ./wordplay thequickbrownfox Example Command Line for CSV Report $ caliper cycles --csv csvout .
% of Cycles lost due to Pipeline flush stalls (lower is better) % of Cycles lost due to data access stalls (lower is better) % of Cycles lost due to RSE stalls (lower is better) % of Cycles lost due to Scoreboard stalls (lower is better) % of Cycles lost due to register load stalls (includes FR/FR stalls) % of Cycles lost due to FR/load or FR/FR dependency stalls % of Cycles lost due to GR/load dependency stalls % of Cycles lost due to stalls in L1D cache and L1/L2 DTLB % of Cycles lost due to register depe
Table 20 Information in cycles Measurement Reports (continued) Column Description Function Routine from your application. File Source file associated with a function. Cycles Per Bundle The average number of cycles elapsed to retire the bundle. If there are no stalls, it should take exactly one cycle to retire a bundle. If the Cycles Per Bundle value is greater than 1, the excess over 1 is the average number of stall cycles seen on that bundle.
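As a worked illustration of the Cycles Per Bundle column, the following sketch derives the value and the implied stall cycles from two counts. The function names and the input numbers are hypothetical; only the relationship (one cycle per bundle when there are no stalls) comes from the description above.

```python
def cycles_per_bundle(cycles, bundles_retired):
    """Average number of cycles elapsed per retired bundle."""
    if bundles_retired <= 0:
        raise ValueError("bundles_retired must be positive")
    return cycles / bundles_retired


def stall_cycles_per_bundle(cpb):
    """Average stall cycles per bundle, given that a stall-free bundle takes 1 cycle."""
    return max(0.0, cpb - 1.0)


# Hypothetical counts for one bundle location.
cpb = cycles_per_bundle(3500, 1000)
stalls = stall_cycles_per_bundle(cpb)
print(cpb, stalls)
```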
The sampled metrics also provide detailed latency information by dividing the misses into eight latency buckets based on latency cycles. Each bucket reports the percentage of misses that fall within its latency range. A latency bucket is a grouping of latency data associated with data accesses serviced by a particular level of CPU cache or by system memory. Each latency bucket belongs to one of the following categories: L2 cache access, L3 cache access, or memory access.
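The bucketing can be sketched as follows. The bucket headings (7, 14, 64, 150, 250, 350, 450, >) are taken from the rx4640 report in Example 5; treating each heading as an inclusive upper bound in cycles is an assumption, and the function name `bucket_percentages` is hypothetical.

```python
from bisect import bisect_left
from collections import Counter

# Bucket headings from Example 5 (rx4640). Assumption: each value is an
# inclusive upper bound in cycles, and ">" collects everything above 450.
UPPER_BOUNDS = [7, 14, 64, 150, 250, 350, 450]
LABELS = ["7", "14", "64", "150", "250", "350", "450", ">"]


def bucket_percentages(latencies):
    """Return the percentage of sampled misses that fall into each bucket."""
    if not latencies:
        return {label: 0.0 for label in LABELS}
    counts = Counter(LABELS[bisect_left(UPPER_BOUNDS, lat)] for lat in latencies)
    total = len(latencies)
    return {label: 100.0 * counts[label] / total for label in LABELS}


# Hypothetical sampled miss latencies, in cycles.
result = bucket_percentages([5, 7, 10, 100, 500])
print(result)
```

Other processors may use different bucket boundaries, so the thresholds here should be read from the headings of the report being analyzed.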
DATA_REFERENCES The number of data memory references issued into memory pipeline. Includes check loads, non-uncacheable accesses, RSE operations, semaphores, and floating-point memory references. The count includes wrong path operations but excludes predicated off operations. This event does not include VHPT memory references. L1 Data Cache Miss Percentage Percentage of L1 data cache reads that are misses. Percent of Data References Accessing Percentage of data references that access the L1 data cache.
L2D_REFERENCES.
Table 21 Information in dcache Measurement Reports (continued) Column Description Latency Buckets as The latency data is reported under eight different buckets: three for cache information and five for % Misses memory information. The top row(s) of the heading specifies the names of the cache level (such as L2 or L3) and system memory names. For example, in Example 5, cache levels L2 and L3 are shown and the system memory is shown as simply Memory (spanning five buckets).
Example 5 Example of a dcache Report for an rx4640 Integrity server Function Details --------------------------------------------------------------------------------------------------% Total Avg. ---Latency buckets as % Misses-Dcache Sampled Dcache Dcache L2 --L3-- ------Memory------Line| Latency Dcache Latency Laten.
Data Summary --------------------------------------------------------------------------------------------------------------% Total Avg. ---Latency buckets as % Misses-Dcache Cumulat Sampled Dcache Dcache L2 --L3-- ------Memory------Latency % of Dcache Latency Laten. Cycles Total Misses Cycles Cycles 7 14 64 150 250 350 450 > Data Entry --------------------------------------------------------------------------------------------------------------66.82 7.72 66.82 74.54 42 10 580 67 13.8 6.
misses a particular instruction incurred, but, instead, as an indication of which instructions incur the most data cache misses. You can potentially get a rough estimate of the total number of data cache misses incurred by a particular instruction, for example, by doing the following: 1. Determine a scaling factor based on total misses and number of misses accounted for by sampling: scale = total L1 misses / (total sampled misses * sampling rate) 2.
Example Command Line for CSV Report $ caliper dtlb --csv csvout ./wordplay thequickbrownfox dtlb Metrics Summed for Entire Run This section describes the metrics summed over the entire run of your application under HP Caliper.
DTLB_INSERTS_HPW IA64_INST_RETIRED L1DTLB_TRANSFER L1D_READS L2DTLB_MISSES % of Cycles lost due to all stalls (lower is better) % of Cycles lost due to GR/load dependency stalls (lower is better) % of Cycles lost due to GR/GR dependency stalls (lower is better) % of Cycles lost due to FR/load and FR/FR dependency stalls (lower is better) Total L1 data TLB references L1 data TLB for L1D miss percentage L2 data TLB misses L2 data TLB miss percentage Percentage of L2 DTLB misses covered by the HPW Percentage
• Function • Source statement • Instruction
Table 22 Information in dtlb Measurement Reports
Column Description
% Total Percent of the total for the metric attributable to a given program object. The metric is the same as the one HP Caliper uses for sorting, except when the sort metric is address, in which case sampled misses is used.
Cumulat % of Total Running sum of the percent of total for the metric accounted for by the given program object and those listed above it.
hierarchy that satisfied the miss, L2 data TLB, HPW, or software. You can override the value in the measurement configuration file by using the -s option. More frequent sampling increases HP Caliper's perturbation of your application. In the extreme case of taking one sample for each TLB miss event, the kernel will trap on every event, making the resulting data of limited value.
Metrics for Integrity Servers Dual-Core Itanium 2 and Itanium 9300 Quad-Core Processor Systems The following CPU events are directly measured: • BACK_END_BUBBLE.ALL — The number of cycles when the back end of the pipeline was stalled. This is the number of cycles lost (stall cycles) due to any of five possible events (FPU/L1D, RSE, EXE, branch/exception, or the front end). • BE_EXE_BUBBLE.GRALL — The number of Full Pipe Bubbles in Main Pipe due to GR/GR or GR/load dependency stalls.
• % of thread switches due to L3 misses — The hardware thread switches can happen due to various reasons including L3 cache misses and timer events. This metric provides the percentage of thread switches due to L3 cache misses. • % Core cycles due to this thread — This indicates the percentage of available processor cycles that the measured process consumed. The other processor cycles were consumed by other process(es) running in the core's other hyperthread or were lost to HyperThreading overhead.
Table 24 Information in fcover Measurement Reports Column Description Reached Number of functions in the given source file that executed at least once. Unreached Number of functions in the given source file that did not execute. %Unreached Percentage of functions in the given source file that were never executed. Load Module Load module name in Load Module Summary. Source Directory Source directory path in Source Directory Summary. Source File Source-file path in Source File Summary.
The report contains two levels of information: • Exact counts of CPU metrics summed across the entire run of an application • Sampled IPs that are associated with particular locations in the application The default for the fprof measurement is to take a sample every 500,000 +/- 25,000 CPU cycles. (CPU_CYCLES is the event.) You use the -s (--sampling-spec) option to change both the event being sampled and the interval.
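To give a feel for the default sampling interval, the sketch below simulates where samples would fall over a run of a given length. It assumes the +/- 25,000 cycle variation is drawn uniformly for each interval, which this guide does not state; the function name `sample_points` and the run length are hypothetical.

```python
import random


def sample_points(total_cycles, rate=500_000, variation=25_000, seed=0):
    """Return the cycle counts at which samples would be taken.

    Assumption: each interval is drawn uniformly from
    [rate - variation, rate + variation]; the real distribution is not
    documented here.
    """
    rng = random.Random(seed)
    points = []
    position = 0
    while True:
        position += rng.randint(rate - variation, rate + variation)
        if position > total_cycles:
            break
        points.append(position)
    return points


# A hypothetical run of 10 million CPU cycles yields roughly 20 samples.
points = sample_points(10_000_000)
print(len(points))
```

Because the sample count scales with run length divided by the interval, short runs produce few samples and correspondingly coarse profiles.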
BE_L1D_FPU_BUBBLE.L1D Full Pipe Bubbles in Main Pipe due to L1D cache. This is the number of cycles lost (stall cycles) due to L1D cache and L1/L2 DTLB.
BE_RSE_BUBBLE.ALL Full Pipe Bubbles in Main Pipe due to RSE stalls. Percentage of cycles lost due to stalls in RSE spilling/filling registers to/from memory.
CPU_CPL_CHANGES.ALL Number of Privilege Level Changes to/from all privileges.
CPU_OP_CYCLES.ALL Number of elapsed CPU operating cycles. (Note: This event is called CPU_CYCLES on Itanium 2 systems.
In this table, “program object” refers to any of the following: • Thread • Load module • Function • Source statement • Instruction bundle Table 26 Information in fprof Measurement Reports Column Description % Total IP Samples Percent of the total IP samples attributable to a given program object. Cumulat % of Total Running sum of the percent of total IP samples accounted for by the given program object and those listed above it.
that caused the PMU overflow will have occurred some number of cycles, typically in the low tens, before the address being sampled. Thus, the address recorded might or might not point to the instruction causing the event, depending on pipeline stalls. The latency between the event triggering the sample and the actual sample is not a problem if you are using fprof to find hot spots in your application.
L1I_PREFETCHES Provides information about the number of issued L1 cache line prefetch requests (64 bytes/line). The reported number includes streaming and non-streaming prefetches. Hits and misses in L1 instruction cache are both included.
L1 Instruction Cache Read Misses Number of L1 instruction cache read misses.
L1 Instruction Cache Demand Miss Percentage Percentage of demand fetch reads that missed.
Total L1 Instruction Cache References Sum of demand fetch reads and L1 cache line prefetch requests.
L1 instruction prefetch miss percentage Percentage of L1 instruction prefetches that are misses.
L1 instruction demand miss percentage Percentage of L1 instruction demand fetches that are misses.
L2 instruction demand misses Number of L2 instruction demand fetch misses.
L2 instruction prefetch misses Number of L2 instruction prefetch misses.
L2 instruction cache miss percentage Percentage of L2 instruction demand fetches and prefetches that are misses.
Table 27 Information in icache Measurement Reports (continued) Column Description Load Module Shared library or the main executable. Function Routine from your application. File Source file associated with a function.
itlb Measurement Report Description With the itlb measurement, produced by the itlb measurement configuration file, HP Caliper measures and reports two levels of information: • Exact counts of instruction translation lookaside buffer (TLB) metrics summed across the entire run of an application • Sampled instruction TLB metrics that are associated with particular locations in the application The report shows measured data by thread, load module, function, statement, and cache line.
BE_LOST_BW_DUE_TO_FE.IMISS BE_LOST_BW_DUE_TO_FE.TLBMISS CPU_OP_CYCLES.ALL IA64_INST_RETIRED ITLB_MISSES_FETCH.L1ITLB ITLB_MISSES_FETCH.
• Function • Statement • Cache line
Table 28 Information in itlb Measurement Reports
Column Description
% Total Percent of the total for the metric attributable to a given program object. The metric is the same as the one HP Caliper uses for sorting, except when the sort metric is address, in which case sampled misses is used.
Cumulat % of Total Running sum of the percent of total for the metric accounted for by the given program object and those listed above it.
Non-contiguous cache lines are separated by a row of tildes (“~ ~ ~ ~”). How Instruction TLB Metrics Are Obtained HP Caliper obtains instruction TLB metrics from the processor's performance monitoring unit (PMU). Exact counts are obtained from the PMU's performance monitor configuration (PMC)/performance monitor data (PMD) register pairs. Sampled instruction TLB metrics are obtained from the PMU's instruction event address register (I-EAR).
scgprof Measurement Report Metrics (Flat Profile) Table 29 (page 212) shows the information found in Sampled Call Graph Profile reports. In this table, “program object” refers to any of the following: • Load module • Function • Source statement • Instruction bundle The numbers of calls in the descriptions are smaller than the actual calls made by the application, because the call graph is produced by sampling.
Table 30 Information in scgprof Measurement Report: Function Entries (Self Entries) (continued) Column Description % Func Hits In Func Number of hits due to this function, expressed as a percentage of the number of hits accounted for by this function and its descendants. Called Number of times this function is called, other than recursive calls. If this is a cycle entry, this means the number of times the members of this cycle are called from functions that are not members of this cycle.
Table 33 Information in scgprof Measurement Report: Children Listings Column Description % Func Hits In Children Number of hits due to this child entry and its descendants, expressed as a percentage of the number of hits accounted for by the self entry and its descendants. Called* Number of times the function represented by the self entry was called by this child. If this column contains a hyphen (-), this means that there is at least one call, but the exact number of calls is unknown.
DARGHT Data access rights fault
DBIT Data dirty bit fault
DEBUG Debug fault
DFPREG Disabled floating-point register fault
DKEY Data key miss fault
DNTLB Data nested translation lookaside buffer fault
DTLB Data translation lookaside buffer fault
FPFLT Floating-point Fault
FPTRP Floating-point Trap
GEXCP General exception:
  Unimplemented data address fault
  Illegal operation fault
  Illegal dependency fault
  Privileged operation fault
  Reserved register/field fault
IA32EXP IA32 Exception
IACCS Instruction access
traps Metrics Summed for Entire Run This section describes the metrics summed over the entire run of your application under HP Caliper. BACK_END_BUBBLE.FE Full pipe bubbles in main pipe due to front end. This is the number of cycles lost (stall cycles) due to instruction cache, ITLB, and branch execution stalls. BE_EXE_BUBBLE.ALL Full pipe bubbles in main pipe due to execution unit stalls. This is the number of cycles lost (stall cycles) due to stalls caused by the execution unit. BE_EXE_BUBBLE.
% of Cycles lost due to data access stalls (includes FR/FR stalls) % of Cycles lost due to RSE stalls % of Cycles lost due to Scoreboard stalls (excludes FR/FR stalls) % of Cycles lost due to register load stalls (includes FR/FR stalls) % of Cycles lost due to FR/load or FR/FR dependency stalls % of Cycles lost due to GR/load dependency stalls % of Cycles lost due to stalls in L1D cache and L1/L2 DTLB % of Cycles lost due to register dependency stalls (excludes FR/FR stalls) % of Cycles lost due to GR/GR de
Table 34 Information in traps Measurement Reports (continued) Column Description File Source file associated with a function.
C Event Set Descriptions for CPU Metrics This appendix contains descriptions for the output of each event set available when you use the cpu measurement. NOTE: The information provided in this appendix for each report description is the same information you receive when you use the --info option to append help to the end of text reports, or when you use this command: $ caliper info -r event-set For more information, see “cpu Measurement Report Description ” (p. 178).
◦ %Indirect Branch This metric provides the percentage of Indirect branches among all branches.
◦ %Return Branch This metric provides the percentage of Return branches among all branches.
• IPREL Path Statistics This metric provides path distribution and mispredict rate for both paths of a non-call IPREL branch. Unconditional IPREL branches are included, so there is a slight bias toward the taken path.
The metrics are:
• Overall This metric provides the branch prediction outcome breakdown (correct, wrong path, wrong target) for all branches irrespective of the branch type. Predicated off branches that were predicted as taken will be counted as wrong path branch outcomes.
◦ Correct Percentage of correctly predicted branches (all types).
◦ Wrong Path Percentage of branches (all types) for which the target path (taken/not-taken) was predicted incorrectly.
◦ Wrong Path Percentage of Indirect branches for which the target path (taken/not-taken) was predicted incorrectly.
◦ Wrong Target Percentage of Indirect branches for which the target address was predicted incorrectly.
• Return This metric provides the branch prediction outcome breakdown (correct, wrong path, wrong target) for return branches. Predicated off returns that are predicted as taken will be counted as wrong path outcomes.
◦ Weight Fraction of Return branches amongst all branch types.
• Avg Snoop Requests This is the average number of live snoop responses that reside in the snoop request queue per cycle. • C2C/Snoop This is the fraction of snoops for which the local processor detects that it holds a modified version of the data that a remote processor has requested as a result of a data cache miss. It does not include implicit writebacks as a result of a modified hit on a line that is being flushed in response to an fc instruction.
Metrics Available from this Measurement The following metrics are available from this event set. These descriptions do not take into account any command-line options you might use. The metrics are: • Cycles This is the total number of CPU cycles collected during the measurement sample period. • IA64 Instr This is the total number of IA64 instructions retired during the measurement sample period.
misses and excessive control speculation and data speculation failures. An estimate of any bias introduced by these events can be developed from information available in the tlb, cspec, and dspec event sets. cpubus Event Set Available only on Itanium 2 and dual-core Itanium 2 systems.
• Snoops This is the total number of snoops per second that the local processor observes as a result of data cache misses of remote processors and local processor self snoops. • Hitm (hit a modified) This is the number of implicit writebacks sourced by the local processor in response to data misses by remote processors referencing a line that is modified in the local processor's cache. cspec Event Set The cspec event set provides information on the effectiveness of control speculation.
• Chks Failed This is the total number of failed chk.s instructions that were retired during the sample interval. • Control Speculation: ◦ Spec/Sec: Total This is the total number of control speculation events per second. ◦ Spec/Sec: Fail This is the number of control speculation fail events per second. ◦ Spec/Kinst: Total This is the total number of control speculation events per 1000 retired instructions. The instruction count includes predicated off and nop instructions.
• Explicit - Instructions not dispersed This is a count of the number of instructions that were not dispersed due to explicit stop bits. Explicit stop bits are used to separate bundles (three instructions) within a bundle group (two bundles of three instructions each) or to separate bundle groups. Explicit stop bits can also be found within bundle-specific templates that contain embedded stop bits, that is, M_II. The default mode will include all dispersal cycles.
recovery code as well as the architecturally visible instruction. You can eliminate the effects of idle loops by using the command-line option --exclude-idle True (which is the default). The effects of failed speculative operations and TLB misses cannot be directly eliminated, but you can get an estimate of the impact of these events from the cspec, dspec, and tlb event sets.
metric will be close to zero. High values would tend to suggest that the PBO information, used by the optimizer when creating the binary code, might have been invalid. • %ALAT Miss This is the percentage of the number of times that the ALAT does not have any information regarding a memory address (misses) out of the total number of times the ALAT is accessed. Instructions that access the ALAT include ld.a, ld.sa, ldf.a, ldf.sa, and ld.c.nc.
FPMIN FPMAX FPAMIN FPAMAX FPCMP FPCVT.
• FP Events/Sec: SIR Event: stall This is the total number of SIR false stalls (stall only, no trap taken) observed per second. • FP Events/Sec: SIR Event: trap This is the total number of SIR true stalls (SWFA trap taken) observed per second. • FP Events/Fop: zero flush This is the number of flush to zero events that occur per floating-point operation (not per instruction). • FP Events/Fop: SIR Event: total This is the ratio of all SIR stalls to total floating-point operations (not instructions).
The metrics are: • Total - Misses per Sec This is the total number of L1D cache misses per second. • NON RSE - Misses per Sec This is the number of non-RSE L1D cache misses per second. • RSE - Misses per Sec This is the number of RSE load L1D cache misses per second. • Total - Misses per Kinst This is the total number of L1D cache misses per 1000 retired instructions, including nops, predicated off instructions, and speculative instructions/associated recovery code.
measurement. You can use command-line options to limit the scope of the measurement. Specifically, you can:
• Limit measurement to a specific privilege level: -m event_set[:all|user|kernel]
• Include idle: --exclude-idle False
• Exclude the interruption state: --measure-on-interrupts off
• Only measure the interruption state: --measure-on-interrupts only
The event per kinst (event per 1000 instructions) metrics are computed using all instructions retired.
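The scope options listed above can be combined. The sketch below assembles them into an argument list; the helper itself is hypothetical, and only the option spellings (-m, --exclude-idle, --measure-on-interrupts) come from this guide. How the list is joined with the rest of a caliper command line is not shown here and depends on the measurement being run.

```python
def scope_options(event_set, level="all", include_idle=False, interrupts=None):
    """Build the scope-limiting options for one event set.

    level      -- "all", "user", or "kernel" (privilege level suffix for -m)
    interrupts -- None (leave default), "off", or "only"
    """
    if level not in ("all", "user", "kernel"):
        raise ValueError("level must be 'all', 'user', or 'kernel'")
    if interrupts not in (None, "off", "only"):
        raise ValueError("interrupts must be None, 'off', or 'only'")

    opts = ["-m", f"{event_set}:{level}"]
    if include_idle:
        # Idle is excluded by default; passing False includes it.
        opts += ["--exclude-idle", "False"]
    if interrupts is not None:
        opts += ["--measure-on-interrupts", interrupts]
    return opts


# Example: user-level l1dcache events only, interruption state excluded.
example = scope_options("l1dcache", level="user", interrupts="off")
```

Validating the values up front keeps a mistyped privilege level from reaching the command line.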
(McKinley, Madison, and Deerfield), this should be approximately equal to the L1I cache fill rate. • %ISB Line Usage This is the percentage of ISB lines that are actually delivered to the L1I cache. For the Itanium 2 family of processors (McKinley, Madison, and Deerfield), this fraction will be at or slightly less than 100%.
Metrics Available from this Measurement The following metrics are available from this event set. These descriptions do not take into account any command-line options you might use. The metrics are: • Total - Misses Per Second This is the total number of L2 cache misses per second. It includes all instruction prefetch misses, instruction demand misses, and data misses.
• Writebacks Per Kinst This is the total number of L2 cache writebacks (L3 hit and miss) per 1000 retired instructions, including nops and predicated off instructions. • Instr Per Access This is the total number of instructions retired per L2 cache access, including nops and predicated off instructions.
The metrics are: • Total - Misses Per Second This is the total number of L2 data cache misses per second. It includes all data load and store misses. • Load - Misses Per Second This is the number of data load requests that miss the L2 cache per second. • Store - Misses Per Second This is the number of data store requests that miss the L2 cache per second. • Writebacks Per Second This is the total number of L2 data cache writebacks (L3 hit and miss) per second.
The L2 instruction cache metrics include miss information for instruction prefetch requests and instruction demand requests. There are a number of issues regarding L2 instruction cache access that need to be considered when interpreting L2 cache measurement results. The L2 cache will not count fetches to the second half of a line if the fetch for the first part is already counted. Secondary misses are counted as data references. Only requests that have entered the OZ queue are counted.
• Instr Per Access This is the total number of instructions retired per L2 instruction cache access, including nops and predicated off instructions. The L2 instruction cache accesses include demand fetches and prefetches that miss the L1 instruction cache. • %Miss - Total This is the percentage of all the L2 instruction cache misses out of the total number of L2 instruction cache accesses. Accesses include instruction fetches/prefetches that miss the L1 instruction cache.
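As a worked illustration of the two metrics above, the following sketch derives them from three raw counts. The function name and inputs are hypothetical; the counts are assumed to follow the definitions given here (accesses are fetches and prefetches that miss the L1 instruction cache).

```python
def l2i_summary(instr_retired, l2i_accesses, l2i_misses):
    """Compute Instr Per Access and %Miss - Total from raw counts.

    instr_retired is assumed to include nops and predicated off
    instructions.
    """
    if l2i_accesses <= 0:
        raise ValueError("l2i_accesses must be positive")
    return {
        "instr_per_access": instr_retired / l2i_accesses,
        "pct_miss_total": 100.0 * l2i_misses / l2i_accesses,
    }


# Example with made-up counts: 2 million instructions, 40,000 L2
# instruction cache accesses, 1,000 of which missed.
summary = l2i_summary(instr_retired=2_000_000,
                      l2i_accesses=40_000,
                      l2i_misses=1_000)
```

A high Instr Per Access value means the L1 instruction cache is absorbing most fetches, so a poor %Miss figure matters less in that case.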
• Dfetch - Misses Per Second This is the number of instruction line demand requests that miss the L3 cache per second. • Data - Misses Per Second This is the number of data (load and store) requests that miss the L3 cache per second. This count includes writebacks from the L2 cache that miss the L3 cache. • Writebacks Per Second This is the total number of L2 cache writebacks (L3 hit and miss) per second.
memreq Event Set Available only on Itanium 9300 quad-core processor systems. The memreq event set provides data about memory read latency and cacheable and uncacheable memory requests. If you use this event set, the default is to make the measurements irrespective of CPU operating state (that is, user, system, or interrupt states). By default, the idle state is not included in the measurement. You can use command-line options to limit the scope of the measurement.
• WB - 64 This is the total number of cacheable 64-byte write backs per 1000 retired instructions, including nops and predicated off instructions. • Instr This is the total number of uncacheable instruction (prefetch and demand) fetches per 1000 retired instructions, including nops and predicated off instructions. • Load This is the total number of uncacheable loads per 1000 retired instructions, including nops and predicated off instructions.
and values approaching the depth of the BRQ queue (16) indicate a system under considerable stress. • AVG BRQ Latency This is the average number of cycles that a request resides in the BRQ. It is also useful for interpreting system loading. Large values (> 20 cycles) indicate that the processor is being delayed during bus request arbitration, probably due to excessive bus utilization by a priority agent (I/O).
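The interpretation guidance above can be expressed as a simple check. In this sketch the queue depth (16) and the latency limit (20 cycles) come from the text; the 0.75 fraction used to decide what counts as "approaching" the queue depth is an assumption, as are the function and parameter names.

```python
BRQ_DEPTH = 16              # queue depth stated in the text
LATENCY_LIMIT_CYCLES = 20   # "large" latency threshold stated in the text
NEAR_FULL_FRACTION = 0.75   # assumption: what counts as "approaching" the depth


def brq_warnings(avg_occupancy, avg_latency_cycles):
    """Return a list of warnings for the given BRQ averages (hypothetical inputs)."""
    warnings = []
    if avg_occupancy >= NEAR_FULL_FRACTION * BRQ_DEPTH:
        warnings.append("BRQ occupancy near queue depth: "
                        "system under considerable stress")
    if avg_latency_cycles > LATENCY_LIMIT_CYCLES:
        warnings.append("high BRQ latency: possible bus arbitration delay "
                        "from a priority agent (I/O)")
    return warnings


# Example: a lightly loaded system and a heavily loaded one.
light = brq_warnings(avg_occupancy=4, avg_latency_cycles=10)
heavy = brq_warnings(avg_occupancy=14, avg_latency_cycles=25)
```

Tune the near-full fraction to your own workloads; the guide gives no exact cutoff for occupancy.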
• 128 Byte - Miss This is the fraction of 128-byte data snoops that miss, out of all data snoops (64-byte and 128-byte). • 128 Byte - Hit This is the fraction of 128-byte data snoops that hit a cache line, out of all data snoops (64-byte and 128-byte). • 128 Byte - Hitm This is the fraction of 128-byte data snoops that hit a modified cache line, out of all data snoops (64-byte and 128-byte).
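Each of these fractions uses the same denominator: all data snoops, both 64-byte and 128-byte. The sketch below makes that explicit. The names are hypothetical, and the total is passed in separately so that no assumption is made about whether the miss, hit, and hitm counts overlap.

```python
def snoop_fractions_128(miss_128, hit_128, hitm_128, total_data_snoops):
    """Express 128-byte snoop outcomes as fractions of all data snoops.

    total_data_snoops is the count of 64-byte plus 128-byte data snoops.
    """
    if total_data_snoops <= 0:
        raise ValueError("total_data_snoops must be positive")
    return {
        "128_byte_miss": miss_128 / total_data_snoops,
        "128_byte_hit": hit_128 / total_data_snoops,
        "128_byte_hitm": hitm_128 / total_data_snoops,
    }


# Example with made-up counts: 200 data snoops in total.
fractions = snoop_fractions_128(miss_128=10, hit_128=30, hitm_128=5,
                                total_data_snoops=200)
```

Because the denominator includes 64-byte snoops, the three 128-byte fractions do not need to sum to 1.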
• Exclude the interruption state: --measure-on-interrupts off
• Only measure the interruption state: --measure-on-interrupts only
Metrics Available from this Measurement
The following metrics are available from this event set. These descriptions do not take into account any command-line options you might use. The metrics are:
• Raw CPI The raw CPI is computed using all instructions retired. This includes nops and predicated off instructions.
the HPW will terminate and initiate a trap to software to provide the required TLB entry. This component counts the stall component only due to the HPW providing the required TLB entry. Time spent in the software trap handler is not counted in this component. • Dcache This counts the number of cycles stalled due to data cache misses at any level of the cache hierarchy (L1, L2, L3). Due to event limitations, it is not possible to distinguish between freg-freg and freg-load dependencies.
The average memory read latency on the dual-core Itanium 2 processor will appear greater than on previous Itanium 2 processors. This is because the reported latency also includes the latency that the arbiter adds to both the outbound request and inbound data transfer. • Avg Outstand Average number of outstanding reads per cycle gives some idea of the memory request density, that is, the probability of one or more memory requests per cycle.
• BWL Bus Writeback Line is used when a dirty cache line is replaced as a consequence of servicing a BRL or BRIL bus transaction. • BRC This is the number of current memory read transactions on the bus. • BIL Bus Invalidate Line is used to cause lines to be flushed from the cache. Since Itanium 2 does not implement the BIL optimization, this can only be generated by the fc (flush cache) instruction.
The metrics are: • TS Per Sec Number of thread switches each second. • TS Per Kinst Number of thread switches every 1000 instructions. • L3miss Percentage of all thread switches that were caused by a miss in the Level 3 cache. A large value indicates a “good” use of HyperThreading: while this process is waiting on memory, another process can execute. • Timer Percentage of all thread switches due to the “fair share” timer.
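The thread switch metrics above can be derived from per-cause switch counts. The following sketch shows one way to do so; the function name, the dictionary keys, and the "other" category are hypothetical, and the counts are assumed to be mutually exclusive.

```python
def thread_switch_metrics(switches_by_cause, instr_retired, elapsed_sec):
    """Compute TS Per Sec, TS Per Kinst, and per-cause percentages."""
    if elapsed_sec <= 0 or instr_retired <= 0:
        raise ValueError("elapsed_sec and instr_retired must be positive")
    total = sum(switches_by_cause.values())
    # With no switches at all, every cause accounts for 0% of them.
    pct = {cause: (100.0 * n / total if total else 0.0)
           for cause, n in switches_by_cause.items()}
    return {
        "ts_per_sec": total / elapsed_sec,
        "ts_per_kinst": total * 1000.0 / instr_retired,
        "pct_by_cause": pct,
    }


# Example with made-up counts over a 2-second run.
ts = thread_switch_metrics({"l3miss": 600, "timer": 300, "other": 100},
                           instr_retired=5_000_000, elapsed_sec=2.0)
```

In this example 60% of switches come from L3 misses, which the L3miss description treats as a good use of HyperThreading.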
The Itanium 2 TLB implementation is split for instructions and data, with two levels for each. The first level only maps 4K pages. Thus, the miss rate (per sec/per kinst) might be quite high. The second level supports large pages and is backed up by hardware that automatically inserts the required translation if it is found to be the head element on the page table list. The insertion hardware does not traverse the page table list.
• I1TLB Misses Per Kinst This is the number of level 1 ITLB misses per 1000 retired instructions. The retired instruction count includes predicated off and nop instructions. Level 1 ITLB misses are normally satisfied by the level 2 ITLB. • I2TLB Misses Per Kinst This is the number of level 2 ITLB misses per 1000 retired instructions. The retired instruction count includes predicated off and nop instructions. The HPW first attempts to satisfy misses at this level of the ITLB.
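The order in which a translation request falls through the TLB levels can be pictured with a toy model. This sketch is only an illustration of the sequence described in this section (level 1, then level 2, then the HPW, then a software trap). The sets and the function name are hypothetical, and details such as page sizes are ignored.

```python
def resolve_translation(page, l1_tlb, l2_tlb, page_table_heads):
    """Return which stage supplies the translation for `page` (toy model).

    l1_tlb, l2_tlb   -- sets of pages currently held in each TLB level
    page_table_heads -- pages whose entry is the head element of its
                        page table list (the only case the HPW handles)
    """
    if page in l1_tlb:
        return "L1 TLB"
    if page in l2_tlb:
        return "L2 TLB"       # counted as a level 1 miss
    if page in page_table_heads:
        return "HPW"          # counted as a level 2 miss
    # The HPW does not traverse the list, so software must supply the entry.
    return "software trap"


# Example: four pages, each resolved at a different stage.
stages = [resolve_translation(p, l1_tlb={"a"}, l2_tlb={"a", "b"},
                              page_table_heads={"c"})
          for p in ("a", "b", "c", "d")]
```

Each stage further down the chain is more expensive, which is why a high level 1 miss rate can be tolerable while level 2 misses that reach the software trap handler are not.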
Glossary advance load address table (ALAT) In the Integrity servers processor family, a table that keeps track of speculative (that is, advance) loads. An excessive number of ALAT compares that result in a failed advance load (an ALAT miss) can seriously degrade performance. advice class A grouping for advice from the Advisor. Every piece of advice belongs to one of these classes: general, CPU, memory, IO, and system.
data speculation The execution of a memory load prior to a store which preceded it and which might potentially alias with it. Data speculation loads are also referred to as advance loads. See “dspec Event Set” (p. 228). databases directory The directory where output databases are created for each data collection run of HP Caliper, unless you use the -d option. By default, the databases directory is a directory called .hp_caliper_databases in your current directory.
hot spot An instruction or set of instructions that has a higher execution count than most other instructions in a program. HP Caliper Advisor A rules-based expert system that gives guidance about improving the performance of an application. See “Using the HP Caliper Advisor” (p. 76). HP Caliper option A parameter in the HP Caliper command line used to customize the performance analysis. See “HP Caliper Options” (p. 47).
measurement configuration file A file that HP Caliper uses to perform a particular measurement, such as scgprof or icache. Each measurement has a corresponding measurement configuration file. See “HP Caliper Measurement Configuration Files” (p. 42). measurement run folder In the HP Caliper GUI, a folder that contains information about the types of data available for a single measurement run. It can also contain the collection specification used to collect the data in the folder.
sampled measurement A measurement that measures your program's performance at regular intervals, based on CPU events, recording the current program location and selected performance metrics. See “Sampled Measurements” (p. 26). scgprof measurement A measurement, provided by the scgprof measurement configuration file, that measures and reports an (inexact) call graph profile, produced by sampling the performance monitoring unit (PMU) to determine function calls.
Index Symbols --[no]fold option, 61 --advice-classes option used with HP Caliper Advisor, 79 --advice-cutoff option used with HP Caliper Advisor, 79 --advice-details option used with HP Caliper Advisor, 79 --analysis-focus option used with HP Caliper Advisor, 79 --branch-sampling-spec option, 54 --bus-speed option, 55 --callpath-cutoff option, 55 --context-lines option, 56 --cpu-aggregation option, 56 --cpu-counter option used with caliper info command, 102 --cpu-details option, 56 --cpu-metrics-aggregation
-p some option syntax, 98 -r option, 51 -s option, 52 used with caliper info command, 102 -t option, 74 -w option, 54 .
Dual-core Itanium 2 processor HyperThreading information, 112 E ecount measurement report description, 197 Enabling the PMU, 161 Environment variables HP Caliper, 103 Error messages, 164 Event name abbreviation error, 93 Event name abbreviations showing, 94 Event set descriptions for cpu measurement, 219 Event sets brpath, 219 brpred, 220 c2c, 222 cpi, 223 cpubus, 225 cspec, 226 dispersal, 227 dspec, 228 fp, 230 l1dcache, 232 l1icache, 233 l2cache, 235 l2dcache, 237 l2icache, 238 l3cache, 240 memreq, 242 q
Measurement global, 26 precise, 26 sampled, 26 Measurement configuration file, 42 Measurement configuration files Overview measurement, 44 provided with HP Caliper, 42 Simultaneous fprof sampling on multiple PMU Counters, 45 using, 45 Measurement types, 44 Measurements types you can take, 26 Measuring load modules, 94 default settings for, 95 Memory usage measuring concurrently, 152 memreq event set, 242 merge command see caliper merge command Merging performance data, 114 Metrics used for sorting and cutof
Showing HP Caliper options, 28 Simultaneous fprof sampling on multiple PMU Counters, 45 snoop event set, 244 Sorting metrics used for, 105 Source line data shown in reports, 110 Source position correlation, 110 Source statements omitting from reports, 51 Source, adding to report, 24 Specifying modules, 95 Specifying option values with a .