HP Compilers for HP Integrity Servers (September 2011)

reasonable constraints on compile time, innermost loops are subject to software pipelining. The software pipeliner takes advantage of the special branches and rotating registers provided in the architecture to generate software pipelined loops with little or no code expansion, even in the presence of control flow and non-counted loops (see “Reference 5” (page 35)).
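As an illustration of what software pipelining does, the following C sketch overlaps three stages (load, multiply, store) of neighboring iterations by hand. The function names scale_plain and scale_pipelined are hypothetical and chosen only for this example. It is not the code the HP compiler emits. The explicit prologue and epilogue written out here are the kind of code expansion that rotating registers and the special loop branches are meant to avoid.

```c
#include <stddef.h>

/* Straightforward loop: each iteration loads, multiplies, then stores. */
void scale_plain(const int *src, int *dst, size_t n, int k)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * k;
}

/* Hand-pipelined equivalent (illustrative only). In the kernel, the store
 * for iteration i, the multiply for iteration i+1, and the load for
 * iteration i+2 do not depend on each other, so a wide machine could
 * issue them together. */
void scale_pipelined(const int *src, int *dst, size_t n, int k)
{
    if (n < 2) {                       /* too short to pipeline */
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i] * k;
        return;
    }

    /* Prologue: fill the pipeline. */
    int loaded  = src[0];              /* load for iteration 0 */
    int product = loaded * k;          /* multiply for iteration 0 */
    loaded = src[1];                   /* load for iteration 1 */

    /* Kernel: one stage from each of three different iterations. */
    size_t i;
    for (i = 0; i + 2 < n; i++) {
        dst[i]  = product;             /* store, iteration i */
        product = loaded * k;          /* multiply, iteration i+1 */
        loaded  = src[i + 2];          /* load, iteration i+2 */
    }

    /* Epilogue: drain the two iterations still in flight (i == n - 2). */
    dst[i]     = product;
    dst[i + 1] = loaded * k;
}
```

Both functions produce the same results for any n. The pipelined form only changes the order in which the work of neighboring iterations is interleaved.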

Profile-based optimization

HP is a leader in the delivery of profile-based optimization (PBO) (see “Reference 2” (page 35)). PBO data provides the compiler with branch-taken and routine execution frequency information as an additional guide to optimization. In addition, it provides the compiler with data access address strides and data cache miss information, used to guide data cache optimization and scheduling. It also provides the compiler with loop iteration counts, used to guide loop optimization. PBO can provide as much as a 30% performance improvement over +O2 optimization by tuning applications according to their typical execution characteristics. The performance impact of PBO is even higher on Itanium-based systems than on traditional RISC systems because the architecture provides a larger number of mechanisms to increase instruction level parallelism based on application behavior.
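To make the kinds of profile data concrete, the following C sketch records routine execution counts, branch-taken counts, and loop iteration counts by hand. It is only a conceptual illustration. The struct profile layout and the function names are hypothetical, and they do not represent the instrumentation the compiler inserts or the format in which PBO data is stored.

```c
#include <stddef.h>

/* Hypothetical record of the kinds of data described above. */
struct profile {
    unsigned long calls;         /* routine execution frequency */
    unsigned long branch_taken;  /* times the branch condition was true */
    unsigned long branch_total;  /* times the branch was reached */
    unsigned long loop_iters;    /* loop iterations summed over all calls */
};

/* Counts negative values in v, updating the hand-written counters. */
size_t count_negative(const int *v, size_t n, struct profile *p)
{
    size_t neg = 0;
    p->calls++;
    for (size_t i = 0; i < n; i++) {
        p->loop_iters++;
        p->branch_total++;
        if (v[i] < 0) {
            p->branch_taken++;
            neg++;
        }
    }
    return neg;
}

/* Average loop trip count per call, the figure a loop optimizer
 * would care about; returns 0.0 if the routine was never called. */
double average_trip_count(const struct profile *p)
{
    return p->calls ? (double)p->loop_iters / (double)p->calls : 0.0;
}
```

After a representative run, the ratio branch_taken / branch_total estimates how often the branch is taken, and average_trip_count estimates the typical loop length.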

Many compiler optimizations are enhanced by knowledge of the execution behavior of the application.

• Certain optimizing transformations are performed on code regions. Profile data helps these transformations select target regions to minimize region crossings within high-frequency execution paths.
• Selection of instructions within a region to speculate or predicate is more effective when the compiler has more accurate information on relative execution frequencies.
• High-level optimizations such as loop optimization and procedure inlining can greatly benefit from profile data to select particularly hot loops and call sites for optimization.
• The optimizer can insert more efficient prefetches for linked-list recurrences if the PBO data indicates that the accesses have a regular stride.
• Cache utilization is enhanced by ordering global and static variables within the data segment such that frequently accessed variables are placed close together.
• For loops that commonly iterate only a few times, as indicated by the loop iteration count PBO data, the optimizer can “peel” off that number of iterations into straight-line code. This can improve instruction-level parallelism by allowing greater scheduling freedom for the peeled instructions.
• Scheduling is enhanced by accounting for data cache misses on integer accesses and either reordering the loads or scheduling uses farther away.
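The loop peeling item in the list above can be sketched in C as follows. The example assumes, purely for illustration, that profile data showed the loop usually runs two iterations or fewer. The function names sum_plain and sum_peeled are hypothetical, and the transformation the compiler actually applies may differ.

```c
#include <stddef.h>

/* Original loop: sums the first n values of v. */
long sum_plain(const int *v, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Peeled version, assuming the profile indicated n is usually <= 2.
 * The first two iterations become guarded straight-line code, which
 * the scheduler can interleave freely with surrounding instructions.
 * The remainder loop handles the uncommon longer cases. */
long sum_peeled(const int *v, size_t n)
{
    long s = 0;
    if (n > 0) s += v[0];              /* peeled iteration 0 */
    if (n > 1) s += v[1];              /* peeled iteration 1 */
    for (size_t i = 2; i < n; i++)     /* remainder loop */
        s += v[i];
    return s;
}
```

For the common short trip counts, sum_peeled never enters the loop at all, so no loop-control overhead is paid on the hot path.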

On Integrity servers, the two-step PBO process can be done through the +Oprofile=collect build, followed by the +Oprofile=use build, similar to the process on PA-RISC systems. The first step of the build (+Oprofile=collect) inserts
