HP Compilers for HP Integrity Servers (September 2011)

reasonable constraints on compile time, innermost loops are subject to software pipelining.
The software pipeliner takes advantage of the special branches and rotating registers
provided in the architecture to generate software pipelined loops with little or no code
expansion, even in the presence of control flow and non-counted loops (see “Reference
5” (page 35)).
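As a rough illustration of what software pipelining does, the C sketch below overlaps the load for one iteration with the multiply and store of the previous one. It is hand-written source, not compiler output, and the function names (scale_simple, scale_pipelined, check_pipelining) are hypothetical. The explicit prologue and epilogue are the kind of code expansion that the rotating registers and special branches are meant to avoid.

```c
#include <assert.h>
#include <stddef.h>

/* Straightforward loop: each iteration loads, multiplies, then stores. */
void scale_simple(const int *src, int *dst, size_t n, int k)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * k;
}

/* Hand-pipelined version with two stages (illustrative only).
 * Stage 1 (load) of iteration i+1 overlaps stage 2 (multiply/store)
 * of iteration i. Assumes src and dst do not partially overlap. */
void scale_pipelined(const int *src, int *dst, size_t n, int k)
{
    if (n == 0)
        return;

    int loaded = src[0];                  /* prologue: fill the pipeline */

    for (size_t i = 0; i + 1 < n; i++) {  /* kernel: steady state */
        int next = src[i + 1];            /* stage 1 of iteration i+1 */
        dst[i] = loaded * k;              /* stage 2 of iteration i */
        loaded = next;                    /* stands in for register rotation */
    }

    dst[n - 1] = loaded * k;              /* epilogue: drain the pipeline */
}

/* Self-check: both versions must produce identical results. */
void check_pipelining(void)
{
    const int src[] = {1, 2, 3, 4, 5};
    int a[5] = {0};
    int b[5] = {0};

    scale_simple(src, a, 5, 3);
    scale_pipelined(src, b, 5, 3);
    for (size_t i = 0; i < 5; i++)
        assert(a[i] == b[i]);
}
```

In this sketch the copy `loaded = next` and the separate prologue and epilogue are written out by hand; on hardware with rotating registers and loop branches, a compiler can typically express the same overlap in the kernel alone.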
Profile-based optimization
HP is a leader in the delivery of profile-based optimization (PBO) (see “Reference
2” (page 35)). PBO data provides the compiler with branch-taken and routine execution
frequency information as an additional guide to optimization. In addition, it provides
the compiler with data access address strides and data cache miss information, used to
guide data cache optimization and scheduling. It also provides the compiler with loop
iteration counts, used to guide loop optimization. PBO can provide as much as a 30%
performance improvement over +O2 optimization by tuning applications according to
their typical execution characteristics. The performance impact of PBO is even higher on
Itanium-based systems than on traditional RISC systems because the architecture provides
a larger number of mechanisms to increase instruction level parallelism based on
application behavior.
Many compiler optimizations are enhanced by knowledge of the execution behavior of
the application.
• Certain optimizing transformations are performed on code regions. Profile data
  helps these transformations select target regions to minimize region crossings within
  high frequency execution paths.
• Selection of instructions within a region to speculate or predicate is more effective
  when the compiler has more accurate information on relative execution frequencies.
• High-level optimizations such as loop optimization and procedure inlining can greatly
  benefit from profile data to select particularly hot loops and call sites for optimization.
• The optimizer can insert more efficient prefetches for linked-list recurrences if the
  PBO data indicates that the accesses have a regular stride.
• Cache utilization is enhanced by ordering global and static variables within the
  data segment such that frequently accessed variables are placed close together.
• For loops that commonly iterate only a few times, as indicated by the loop iteration
  count PBO data, the optimizer can “peel” off that number of iterations into straight-line
  code. This can improve instruction level parallelism by allowing greater scheduling
  freedom for the peeled instructions.
• Scheduling is enhanced by accounting for data cache misses on integer accesses
  and either reordering the loads or scheduling uses farther away.
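
The loop peeling described above can be illustrated with a small hand-written C sketch. It assumes, hypothetically, that profile data shows the loop usually runs two iterations; the function names (sum_loop, sum_peeled, check_peeling) are invented for this example, and the code shows the shape of the transformation rather than actual compiler output.

```c
#include <assert.h>
#include <stddef.h>

/* Original form: a simple reduction loop. Assume (hypothetically) that
 * the loop iteration count profile shows n is usually 2. */
long sum_loop(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Hand-peeled equivalent: the first two iterations become straight-line
 * code, and a remainder loop handles the uncommon longer case. */
long sum_peeled(const int *a, size_t n)
{
    long s = 0;

    if (n == 0)
        return s;
    s += a[0];                          /* peeled iteration 1 */

    if (n == 1)
        return s;
    s += a[1];                          /* peeled iteration 2 */

    for (size_t i = 2; i < n; i++)      /* remainder loop, rarely entered */
        s += a[i];
    return s;
}

/* Self-check: both forms must agree for every trip count tested. */
void check_peeling(void)
{
    const int data[] = {3, 4, 5, 6};
    for (size_t n = 0; n <= 4; n++)
        assert(sum_loop(data, n) == sum_peeled(data, n));
}
```

In the common two-iteration case the peeled version executes no loop back-edge at all, which is what gives a scheduler more freedom to overlap the peeled instructions with surrounding code.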
On Integrity servers, the two-step PBO process can be done through the
+Oprofile=collect build, followed by the +Oprofile=use build, similar to the
process on PA-RISC systems. The first step of the build (+Oprofile=collect) inserts