HP Compilers for HP Integrity Servers HP Part Number: 5900-1863 Published: September, 2011 Edition: 3
© Copyright 2002-2011 Hewlett-Packard Development Company, L.P. Legal Notices The information in this document is subject to change without notice. Hewlett-Packard makes no warranty of any kind with regard to this manual, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose.
Contents
HP compilers for HP Integrity servers ..... 4
Understanding HP compilers ..... 4
Optimizing for Integrity servers ..... 5
Predication ..... 5
Control speculation
HP compilers for HP Integrity servers
This document provides a technical overview of the key features of HP compilers for HP Integrity servers running the HP-UX 11i v3 operating system.

Understanding HP compilers
HP Integrity servers use the Intel® Itanium® architecture, co-developed by HP and Intel, which uses Explicitly Parallel Instruction Computing (EPIC).
extensive use of sophisticated Itanium processor family features such as predication, speculation, and data prefetching.

Figure 1 Internal structure of the HP compilers

Optimizing for Integrity servers
The Intel Itanium architecture seeks to reduce execution time by maximizing instruction-level parallelism, that is, the concurrent execution of multiple instructions.
Example 1 Using predication
    if (a == 0) {
        x = 5;
    } else {
        x = *p;
    }

The compiler can use predication to transform control dependencies on branch instructions into data dependencies on compare instructions.

Example 2 Code from Example 1 generated using branches
         cmp.ne.unc p1,p0 = a,0
    (p1) br         L1 ;;
         mov        x = 5
         br         L2 ;;
    L1:  ld         x = [p]
    L2:

The assignment to x that is executed is control dependent on the predicate (p1) in the first branch instruction.
Example 3 Code from Example 1 generated using predication
         cmp.ne.unc p1,p2 = a,0 ;;
    (p2) mov        x = 5
    (p1) ld         x = [p]

In Example 3 (page 7), all branches have been eliminated and the assignments to x are now data-dependent upon the compare that defines the qualifying predicate.

Control speculation
Control speculation is the execution of an instruction before the execution of all of the conditions controlling its execution.
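The definition of control speculation above can be modeled in plain C. The following is only a sketch under stated assumptions: the spec_reg type and the functions speculative_load and speculated are hypothetical names, a NULL pointer stands in for any address that would fault, and the assert in the recovery path stands in for the fault that a real non-speculative load would raise. The point is that the load runs before the condition is tested, and a deferred fault only matters if the result turns out to be needed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a register that carries a deferred-fault (NaT) bit. */
typedef struct {
    int  value;
    bool nat;
} spec_reg;

/* Models a speculative load: it never faults. In this model a NULL address
   stands in for any faulting address and only sets the NaT bit. */
static spec_reg speculative_load(const int *p)
{
    spec_reg r = { 0, true };
    if (p != NULL) {
        r.value = *p;
        r.nat = false;
    }
    return r;
}

/* Source form: if (condition) b = *p + 2; else b = fallback; */
static int speculated(const int *p, bool condition, int fallback)
{
    spec_reg t1 = speculative_load(p);  /* load hoisted above the test */
    int b = t1.value + 2;               /* speculative use of the loaded value */

    if (!condition)
        return fallback;                /* result not needed, nothing to check */

    if (t1.nat) {
        /* Recovery path: redo the load without speculation. In this model
           the assert plays the role of the real fault. */
        assert(p != NULL);
        b = *p + 2;
    }
    return b;
}
```

Calling speculated(NULL, false, -1) returns the fallback without touching the bad address, which is the behavior that makes hoisting the load safe.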
Example 5 Code from Example 4 using control speculation
         ld.s       t1 = [p] ;;
         add        b = t1,2
         cmp.ne.unc p1,p0 = condition,0 ;;
    (p1) chk.s      t1, L2
    L1:
         ...
    L2:  ld         t1 = [p] ;;
         add        b = t1,2
         br         L1

If the NaT bit on register t1 is set, the chk.s instruction branches to the recovery code located at L2. Recovery code reloads t1 without speculation, then recomputes the result in b. (p1) is a predicate used to determine whether the result is needed.
compiler to trigger execution of a recovery code sequence when an address conflict is discovered during runtime. The compiler utilizes the advanced check (chk.a) instruction that checks to see if there have been any conflicting writes to the address accessed by the advanced load. If such a conflict occurs, the advanced check branches to compiler-generated recovery code where the load is re-executed to ensure the correct value.

Example 7 Generated code for Example 6 using data speculation
ld.a add st chk.
What’s new in HP compiler A.06.26
Following are the changes in HP aC++/HP C compiler version A.06.
compliance with ISO/IEC 1990 and on 11i Version 3 (11.31) Integrity systems, the UNIX 2003 standard, which includes ISO/IEC 9899:1999; the HP Fortran compiler adheres to ISO/IEC 1539-1:1997; the HP aC++ compiler is largely compliant with the ISO/IEC 14882 standard for the C++ language (including the C++ standard library). Starting with A.06.25, the aC++ compiler also supports certain language features of the C++11 standard, enabled using the -Ax command-line option. For more information, see the HP aC++/HP C A.
the newer 64-bit data model where longs and pointers are 64 bits wide. The traditional 32-bit data model is appropriate for many legacy applications that may not be 64-bit clean. Many other compilers require the application to comply with the 64-bit data model, which usually requires a separate 64-bit migration step for legacy applications. To extend the lifetime of new applications for Integrity servers, HP compilers provide several code scheduling options.
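To illustrate what "64-bit clean" means in the data-model discussion above, here is a minimal sketch. The function names are hypothetical and the example is generic C, not taken from HP documentation. The first function shows a legacy idiom that assumes a pointer fits in an unsigned int; the other two round-trip the pointer through uintptr_t, which is sized to hold a pointer in either data model.

```c
#include <assert.h>
#include <stdint.h>

/* Legacy idiom, not 64-bit clean: assumes a pointer fits in an unsigned int.
   Where pointers are 64 bits and int is 32 bits, the upper half is lost. */
static unsigned int pointer_to_uint_legacy(const void *p)
{
    return (unsigned int)(uintptr_t)p;
}

/* Clean idiom: uintptr_t can hold any object pointer value. */
static uintptr_t pointer_to_integer(const void *p)
{
    return (uintptr_t)p;
}

static const void *integer_to_pointer(uintptr_t bits)
{
    return (const void *)bits;
}
```

Code that only uses the second pair compiles unchanged under both data models; code that relies on the first idiom is the kind that needs a migration step.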
reasonable constraints on compile time, innermost loops are subject to software pipelining. The software pipeliner takes advantage of the special branches and rotating registers provided in the architecture to generate software pipelined loops with little or no code expansion, even in the presence of control flow and non-counted loops (see “Reference 5” (page 35)).

Profile-based optimization
HP is a leader in the delivery of profile-based optimization (PBO) (see “Reference 2” (page 35)).
instrumentation code to collect edge weights, data access address strides, and loop iteration counts. When the binary is subsequently run, in addition to the profile data collected by the instrumented code, HP Caliper samples load data cache profile information using the performance monitor unit (PMU). All collected profile information is written into a data file, which is used by the compiler for the subsequent +Oprofile=use build.
Example 8 Typical use of the ESTIMATED_FREQUENCY pragma
    if (condition) {
    #pragma ESTIMATED_FREQUENCY 0.99
        ...
        for (...) {
    #pragma ESTIMATED_FREQUENCY 4.0
            ...
        }
    } else {
        ...
    }

In Example 8 (page 15), the code in the then clause of the if statement is expected to execute 99% of the time (implying that the else clause is executed 1% of the time). The loop is expected to execute four iterations, on average.
allowing the high-level optimizer to transform the indirect call into a test and a direct call. The inliner framework has been designed to scale to very large applications. It uses a novel and fast underlying algorithm and employs an elaborate set of heuristics to guide its inlining decisions. The inlining engine is also employed at +O2 for intra-module inlining. At this optimization level the inliner uses tuned-down heuristics to guarantee fast compile times.
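The "test and a direct call" transformation mentioned at the start of this passage can be pictured at source level. This is a hand-written sketch, not compiler output: handler_fn, fast_path, slow_path, and both call_ functions are hypothetical names, and it assumes the likely target of the indirect call is already known (how the compiler learns it is not shown in this excerpt).

```c
#include <assert.h>

typedef int (*handler_fn)(int);

static int fast_path(int x) { return x + 1; }
static int slow_path(int x) { return x * 2; }

/* As written by the programmer: always an indirect call. */
static int call_indirect(handler_fn fn, int x)
{
    return fn(x);
}

/* After the transformation: compare against the likely target, then make a
   direct call that the inliner can see through. Other targets still take
   the original indirect call. */
static int call_promoted(handler_fn fn, int x)
{
    if (fn == fast_path)
        return fast_path(x);
    return fn(x);
}
```

Both versions return the same value for every target; the promoted form only changes which calls are visible to the inliner.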
• Recognition of global, static, and local variables that are assigned but never used allows the optimizer to remove dead code (which may result in additional dead variables).
• Detection of global variables that are referenced only within a module allows the high-level optimizer to convert the symbol to a private symbol, guaranteeing that it can only be accessed from within this module. This gives the low-level optimizer greater freedom in optimizing references to that variable.
compiler and handles many additional cases; the +Oinfo option can be used to determine whether this optimization has been performed. Although full interprocedural optimizations are only available in the presence of -ipo or +O4, the compiler performs “lightweight” interprocedural optimization at +O2 and above. This phase can improve performance of applications with frequent use of static variables and functions.
The loop optimizer also performs some new optimizations:
• Automatic parallelization. This optimization allows applications to exploit otherwise idle resources on multicore or multiprocessor systems by automatically transforming serial loops into multithreaded parallel code. When the +Oautopar option is used at optimization levels three (+O3) and above, the compiler automatically parallelizes those loops that are deemed safe and profitable by the loop transformer.
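As a rough illustration of what "safe" can mean for automatic parallelization, the sketch below contrasts a loop whose iterations are independent with one that has a loop-carried dependence. The function names are hypothetical, and whether the HP loop transformer actually parallelizes either loop is an assumption that this excerpt does not confirm.

```c
#include <assert.h>
#include <stddef.h>

/* Independent iterations: each dst[i] depends only on src[i], and restrict
   promises the arrays do not overlap, so iterations could run in any order
   or on different threads. */
static void scale(double *restrict dst, const double *restrict src,
                  size_t n, double k)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = k * src[i];
}

/* Loop-carried dependence: iteration i reads the value written by
   iteration i - 1, so the iterations cannot simply be split across threads. */
static void prefix_sum(double *a, size_t n)
{
    for (size_t i = 1; i < n; i++)
        a[i] += a[i - 1];
}
```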
The high-level scalar optimizer performs expression simplification and canonicalization, SSA-based dead code removal, copy propagation, constant propagation, and register promotion, as well as control flow optimizations and basic block cloning. The interprocedural optimization framework (enabled with -ipo at optimization level +O2 or higher) has been designed to scale to very large applications.
However, in general, we would expect +O4 to be no worse than 2x slower than +O2 (depending on the application’s build mechanics and the build machines).

Precise floating-point control
HP compilers are designed to provide complete developer access to the uniquely powerful floating-point features of the architecture. These features enable HP compiler-generated floating-point code and the math library to be both highly accurate and well optimized under default and general compiler options.
In addition, the HP C compiler provides a choice of three decimal floating-point evaluation methods, indicated by the -fpevaldec={_Decimal32|_Decimal64|_Decimal128} option, analogous to their binary floating-point counterparts.
• _Decimal32, the default, evaluates decimal floating operations and constants to their semantic type.
• _Decimal64 evaluates _Decimal32 operations and constants to the wider range and precision of _Decimal64, and other operations and constants to their semantic type.
+Onosumreduction option will disallow the sum reduction optimization under any setting of +Ofltacc.
• The +Ocxlimitedrange option indicates that complex multiply, divide, and cabs operations are not required to satisfy C99 infinity properties, and makes the extended and long double versions more likely to encounter undue overflow or underflow. This functionality can also be chosen with #pragma STDC CX_LIMITED_RANGE ON in the source code at the desired scope, and is implied by +Ofltacc=relaxed.
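To make the "C99 infinity properties" in the +Ocxlimitedrange item concrete, the sketch below applies the textbook complex multiply formula with no special-case handling. The cplx type and mul_limited_range function are hypothetical; the code the HP compiler actually generates under this option is not shown in this document. C99 Annex G treats a complex value with an infinite part as an infinity and expects its product with a finite nonzero value to be an infinity, whereas the plain formula produces NaNs for such inputs.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical helper type so the arithmetic is fully explicit. */
typedef struct {
    double re;
    double im;
} cplx;

/* Textbook formula (a+bi)(c+di) = (ac - bd) + (ad + bc)i, with no recovery
   step for infinite or NaN operands. */
static cplx mul_limited_range(cplx x, cplx y)
{
    cplx r;
    r.re = x.re * y.re - x.im * y.im;
    r.im = x.re * y.im + x.im * y.re;
    return r;
}
```

For ordinary finite operands the result is the usual product, for example (1+2i)(3+4i) = -5+10i. For (inf + inf i) times (1 + 0i), the terms inf * 0 evaluate to NaN, so both parts of the result are NaN rather than infinite.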
to instruction-level events. Global measurements report total values for critical performance elements such as cache and TLB misses, branch mispredictions, pipeline stalls, instructions executed, and so on. Global measurements are a quick way to find performance problems. Sampled measurements report the same performance metrics as global, but they are sampled during application runtime and correlated to program locations.
to create applications which are suitable for any Itanium processor, using the +DS{blended|itanium2|montecito|poulson|native} compiler option.
• The default option +DSblended specifies code scheduling that runs reasonably well on all implementations.
• The +DSpoulson, +DSitanium2, and +DSmontecito options select code optimized for these processors.
• The +DSmontecito option also selects code optimized for the Montvale and Tukwila implementations.
You can specify large pages using the linker +pd and +pi options, or using the chatr(1) command. It is often worth testing a wide range of page sizes, as application performance can vary unpredictably.

Describing application characteristics
HP compilers support several options and function attributes that describe the coding style used by the application. These options allow the compiler to make assumptions about the behavior of the application.
Many wrappers around malloc() obey these rules.
• The non_exposing attribute indicates that the given function does not cause any address it can derive from any of its formal parameters to become visible after a call to the function returns. An address becomes visible if the function returns a value from which it can be directly derived, or if the function stores it in a memory location that is visible (can be referenced directly or indirectly) after the call to the function returns.
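The definition of non_exposing above can be checked against a few small functions. This sketch deliberately omits the attribute syntax itself, since this excerpt does not show it; the function and variable names are hypothetical. Only the first function meets the stated rule, and the other two each break it in one of the two ways the definition lists.

```c
#include <assert.h>
#include <stddef.h>

/* A memory location that remains visible after any call returns. */
static const int *last_seen;

/* Meets the rule: reads through the pointer and returns only a computed
   value, so no address derived from `a` outlives the call. */
static int sum(const int *a, size_t n)
{
    int total = 0;
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}

/* Breaks the rule: the return value is an address derived from `a`. */
static const int *max_element(const int *a, size_t n)
{
    const int *best = a;
    assert(n > 0);
    for (size_t i = 1; i < n; i++)
        if (a[i] > *best)
            best = &a[i];
    return best;
}

/* Breaks the rule: stores the address where it stays visible afterward. */
static void remember(const int *a)
{
    last_seen = a;
}
```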
Tuning with profile-based optimization
Profile-based optimization (PBO) is likely to be worthwhile if:
• The application contains a large number of control-flow branches.
• The application contains a large number of indirect branches (for example, C++ virtual function calls) and the -ipo option is used.
• HP Caliper data indicates high branch misprediction rates, high numbers of case statement layout optimization opportunities, or large numbers of if-convert opportunities for hot branches.
The general optimization strategies described so far can be applied without loss of floating-point quality. Here are some additional suggestions:
• Specific optimizations involving math library functions are done only if the source file includes the math headers, such as <math.h>, that declare the function.
• +Osumreduction can provide a performance gain when the application contains sum reductions that do not require strict ordering of their partial sums, but cannot use +Ofltacc=relaxed.
• +FPD or a call to fesetflushtozero(1) is suitable when the application is tolerant of zero being delivered in lieu of denormal result values. Flush-to-zero mode can significantly speed up some computations with the float type.
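Why the ordering of partial sums matters for +Osumreduction can be seen with four float values. The sketch below is a hand-written illustration, not the compiler's transformation: the function names are hypothetical, and the volatile accumulators are an assumption added only to force each partial sum to be rounded to float on every step, whatever the platform's intermediate precision.

```c
#include <assert.h>
#include <stddef.h>

/* Strict left-to-right summation, as the source code is written. */
static float sum_in_order(const float *a, size_t n)
{
    volatile float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s = s + a[i];
    return s;
}

/* Reordered summation: even and odd elements go to separate partial sums
   that are combined at the end, as a reduction split in two might do. */
static float sum_two_partials(const float *a, size_t n)
{
    volatile float s0 = 0.0f;
    volatile float s1 = 0.0f;
    size_t i;
    for (i = 0; i + 1 < n; i += 2) {
        s0 = s0 + a[i];
        s1 = s1 + a[i + 1];
    }
    if (i < n)
        s0 = s0 + a[i];
    return s0 + s1;
}
```

For the input {1e8f, 1.0f, -1e8f, 1.0f}, the in-order sum is 1.0 because 1e8f + 1.0f rounds back to 1e8f, while the two-partial version computes (1e8f - 1e8f) + (1.0f + 1.0f) = 2.0. An application that depends on the first result needs strict ordering; one that accepts either can allow the reordering.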
The option +Ofaster is an alias for +Ofast +O4 and is therefore ideally suited for cross-module optimizations. Because both +Ofast and +Ofaster imply +Ofltacc=relaxed, they are not appropriate on their own for tuning floating-point code that requires more rigorous floating-point behavior. However, they can be made appropriate by taking advantage of the compiler’s general left-to-right option processing.
in incorrect pointer accesses. Use the +Ocross_region_addressing option to prevent this problem when an application cannot be rewritten to avoid such pointer arithmetic.
• The option +Oinitcheck directs the compiler to detect and initialize all uninitialized variables. Application code that contains uninitialized variables can show unexpected behavior after optimization.
Table 3 Options and pragmas for troubleshooting problems in optimized code

Option            Purpose
+Oinitcheck       Initializes all potentially-uninitialized variables with zero.
+Oparminit        Initializes to zero any unspecified function parameters at call sites, to avoid NaT values.
+Ofltacc=strict   Prevents all value-changing optimizations, even contractions.
http://www.hp.com/go/hpcaliper
Information about HP Wildebeest Debugger (WDB) is available at: http://www.hp.
References
1. Peter Markstein, Itanium processor family and Elementary Functions, Speed and Precision, Hewlett-Packard Professional Books, Prentice Hall, Inc., 2000.
2. Pettis, K. and Hansen, R.C., Profile Guided Code Positioning, Proceedings of the SIGPLAN ’90 Conference on Programming Language Design and Implementation, SIGPLAN Notices, Vol 25, No. 6, June 1990.
3. R. Ju, K. Nomura, U. Mahadevan, and L-C.
16. aC++ Version 6 Features to Improve Developer Productivity, http://www.hp.com/go/hpux-C-Integrity-docs.
17. Inline assembly for Itanium®-based HP-UX, http://h21007.www2.hp.com/portal/site/dspp/menuitem.863c3e4cbcdc3f3515b49c108973a801/?ciid=4308e2f5bde02110e2f5bde02110275d6e10RCRD, 2011.
18. aC++ standard conformance and compatibility changes, http://h21007.www2.hp.com/portal/site/dspp/menuitem.863c3e4cbcdc3f3515b49c108973a801/?ciid=2708d7c682f02110d7c682f02110275d6e10RCRD, 2004.
19.