HP Compilers for HP Integrity Servers (September 2011)

+Osumreduction can provide a performance gain when the application contains
sum reductions that do not require strict ordering of their partial sums, but the
application as a whole cannot use +Ofltacc=relaxed.
+FPD or a call to fesetflushtozero(1) is suitable when the application
tolerates zero being delivered in place of denormal result values. Flush-to-zero mode
can significantly speed up some computations with the float type.
The following techniques can provide significant performance gains when algorithms
can be redesigned or re-implemented in new code:
The 80-bit extended type arithmetic is essentially as fast as float or double.
The speed of an extended math function is typically about 0.7 times that of the
corresponding double function. There may be an overall performance gain if the
extra precision and range allow the removal of branches to special code for handling
rounding errors and underflow and overflow conditions. In addition, the extra range
and precision of the extended type can result in simpler, more robust application
code that is easier to maintain. Even the 128-bit long double (quad) type, whose
functions typically run at roughly 0.25 times the speed of the corresponding
routines for extended types, can be considered in high-performance code where
extreme precision is needed locally.
Replace portions of the implementation with inline assembly.
Allowing optimization flexibility
The compiler options +Ofast and +Ofaster direct the compiler to use typical collections
of aggressive optimization options that are safe for most applications. While the features
included in +Ofast and +Ofaster may evolve from release to release, +Ofast currently
implies the following:
+O2 requests level two optimization.
+Onolimit allows full optimization of large procedures, possibly at the expense
of longer compile time.
+Ofltacc=relaxed (see “Precise floating-point control” (page 21)).
+FPD enables the flush-to-zero rounding mode on the hardware.
+DSnative directs code scheduling specialized for the type of system on which
compilation is taking place (see “Scheduling for the processor” (page 24)).
-Wl,+pi,1M and -Wl,+pd,1M cause the application to use 1 MB instruction
and data virtual memory page sizes, respectively.
-Wl,+mergeseg causes the dynamic loader to merge the data segments of shared
libraries loaded at runtime, which allows the kernel to use larger size page table
entries. Note that this option increases the Resident Set Size (RSS) and may
degrade the performance of short-lived programs.