HP Compilers for HP Integrity Servers (September 2011)

• +Osumreduction can provide a performance gain when the application contains sum reductions that do not require strict ordering of their partial sums, but the application as a whole cannot use +Ofltacc=relaxed.

• +FPD or a call to fesetflushtozero(1) is suitable when the application is tolerant of zero being delivered in lieu of denormal result values. Flush-to-zero mode can significantly speed up some computations with the float type.
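
The reordering that the +Osumreduction bullet refers to can be illustrated by hand. The following is a minimal sketch in portable C, not the code the compiler actually generates: the function names and the choice of four partial sums are assumptions made for illustration. The first function adds the elements in strict source order; the second keeps four independent partial sums and combines them at the end, which removes the dependency between successive additions.

```c
#include <assert.h>
#include <stddef.h>

/* Strictly ordered sum: every addition depends on the previous one. */
double sum_strict(const double *x, size_t n)
{
    assert(x != NULL || n == 0);
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += x[i];
    return s;
}

/* Reassociated sum (hypothetical hand-written form): four independent
 * partial sums are accumulated and combined only at the end. */
double sum_partial4(const double *x, size_t n)
{
    assert(x != NULL || n == 0);
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;

    /* Main loop: four additions per iteration with no mutual dependency. */
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }

    /* Remainder loop: the last n % 4 elements go into the first partial sum. */
    for (; i < n; ++i)
        s0 += x[i];

    return (s0 + s1) + (s2 + s3);
}
```

For inputs whose sums are exactly representable, both functions return the same value. For general inputs the two results can differ in the low-order bits because floating-point addition is not associative, which is why this transformation is only appropriate when strict ordering of the partial sums is not required.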

The following techniques can provide significant performance gains when algorithms can be redesigned or re-implemented in new code:

• Arithmetic on the 80-bit extended type is essentially as fast as float or double arithmetic. An extended math function typically runs at about 0.7 times the speed of the corresponding double function. There may be an overall performance gain if the extra precision and range allow the removal of branches to special code for handling rounding errors, underflow, and overflow. In addition, the extra range and precision of the extended type can result in simpler, more robust application code that is easier to maintain. Even the 128-bit long double (quad) type, whose functions are typically about 0.25 times as fast as the corresponding routines for the extended type, can be considered in high-performance code where extreme precision is needed locally.

• Replace portions of the implementation with inline assembly.
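
The point about extra range removing special-case branches can be sketched as follows. This is an illustrative example in portable C, not code from the HP documentation. It assumes that long double has a wider exponent range than double, and it uses long double as a stand-in for the wider type; on HP Integrity systems the 80-bit type discussed above is the extended type, while long double is the 128-bit quad type. The function names are hypothetical.

```c
#include <assert.h>
#include <float.h>

/* Computes a * b / c entirely in double. The intermediate product a * b
 * can overflow even when the final quotient is representable, so robust
 * double-only code would need extra scaling branches. */
double scaled_product_double(double a, double b, double c)
{
    assert(c != 0.0);
    return a * b / c;
}

/* Computes the same expression with a wider intermediate type.
 * Assumption: long double has more exponent range than double, so the
 * intermediate product does not overflow for any finite double inputs
 * and no scaling branch is needed. */
double scaled_product_wide(double a, double b, double c)
{
    assert(c != 0.0);
    long double p = (long double)a * (long double)b;
    return (double)(p / (long double)c);
}
```

With a = b = 1e200 and c = 1e300, the exact product 1e400 lies outside the range of double, but the wide version can still deliver a result close to 1e100 when the assumption about long double holds. That assumption can be checked at compile time by comparing LDBL_MAX_EXP with DBL_MAX_EXP from <float.h>.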

Allowing optimization flexibility

The compiler options +Ofast and +Ofaster direct the compiler to use typical collections of aggressive optimization options that are safe for most applications. While the features included in +Ofast and +Ofaster may evolve from release to release, +Ofast currently implies the following:

• +O2 requests level two optimization.

• +Onolimit allows full optimization of large procedures, possibly at the expense of longer compile time.

• +Ofltacc=relaxed (see “Precise floating-point control” (page 21)).

• +FPD enables the flush-to-zero rounding mode on the hardware.

• +DSnative directs code scheduling specialized for the type of system on which compilation is taking place (see “Scheduling for the processor” (page 24)).

• -Wl,+pi,1M and -Wl,+pd,1M cause the application to use 1 MB virtual memory page sizes for instructions and data, respectively.

• -Wl,+mergeseg causes the dynamic loader to merge the data segments of shared libraries loaded at runtime, which allows the kernel to use larger page table entries. Note that this option increases the Resident Set Size (RSS) and may degrade the performance of short-lived programs.
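
Because +Ofast implies +FPD, one way to judge whether it is safe for a given application is to check whether the application's data ever contains denormal values, since those are the values that flush-to-zero mode replaces with zero. The following is a minimal sketch in standard C and is not part of the HP toolchain; the function name is hypothetical. It uses the standard fpclassify macro from <math.h> to count denormal (subnormal) elements in an array of float results.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Counts the elements of x[0..n-1] that are denormal (subnormal).
 * A nonzero count means flush-to-zero mode could change this data. */
size_t count_subnormal_floats(const float *x, size_t n)
{
    assert(x != NULL || n == 0);
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        if (fpclassify(x[i]) == FP_SUBNORMAL)
            ++count;
    }
    return count;
}
```

Such a check would be run on representative intermediate or final results in a build that does not enable flush-to-zero mode. A count of zero on representative data suggests, but does not prove, that the application is tolerant of +FPD.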
