Optimizing Serviceguard Failover Time, Version A.11.19 and later, April 2009

Linux, a journaled file system (such as Reiser FS and ext3 FS) is faster than a non-journaled file system
(ext2).
Adding or removing IP addresses takes some time and affects failover time. On HP-UX 11i, it takes at
least one second longer to add an IPv6 address if you enable duplicate address detection (DAD). For
more information, see “IPv6 Relocatable Address and Duplicate Address Detection” in the Managing
Serviceguard manual for your version of Serviceguard, available from the HP Technical
Documentation site at www.docs.hp.com/hpux/ha → Serviceguard.
Make your package configuration as efficient as possible. The time needed to start and stop services
adds to the total failover time. Streamline any customer-defined functions to help save time.
System restart options
Generally, shutdown(1M) is the preferable option for restarting the system. System shutdown halts all
user applications and invokes cmhaltnode to halt the Serviceguard cluster on the system. System
buffers are also flushed to disk, so almost all data is written to disk.
System restart with the reboot(1M) command has a big impact on the Serviceguard component of
failover time when CVM or CFS is used. To optimize this failover time, make sure that Serviceguard is
cleanly halted on a node before rebooting it by using cmhaltcl (to halt the entire cluster);
cmhaltnode (to halt just this node); or the shutdown(1M) command (which will perform a
cmhaltnode before rebooting the node).
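The ordering described above (halt Serviceguard on the node first, then reboot) can be sketched as a small Python wrapper. This is only an illustrative sketch, not an HP-supplied procedure: the function name restart_node_cleanly, the node name, and the dry_run switch are hypothetical, and the -f option passed to cmhaltnode is an assumption that should be checked against the cmhaltnode(1M) manpage for your Serviceguard version.

```python
import subprocess


def restart_node_cleanly(node_name, dry_run=True):
    """Halt Serviceguard on a node, then reboot it.

    Returns the commands in the order they are (or would be) run.
    With dry_run=True nothing is executed; the commands are only printed.
    """
    commands = [
        # Step 1: halt Serviceguard on this node before rebooting.
        # The -f option is an assumption; verify it in cmhaltnode(1M).
        ["cmhaltnode", "-f", node_name],
        # Step 2: reboot only after the node has left the cluster cleanly.
        ["reboot"],
    ]
    for cmd in commands:
        if dry_run:
            print("would run:", " ".join(cmd))
        else:
            # check=True raises if the command fails, so a failed
            # cmhaltnode stops the sequence before the reboot.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # "node1" is a placeholder node name.
    restart_node_cleanly("node1", dry_run=True)
```

Because each step is run with check=True, the reboot is never issued if the halt fails. Where shutdown(1M) is used instead of reboot(1M), the explicit halt step is not needed, since shutdown performs the cmhaltnode itself.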
Applications
There is no single solution to optimizing the efficiency of the many different applications. However,
here are some general tips.
Failed-over applications may spend a long time recovering data. See if you can reduce this time.
Consider contacting the application vendor and the systems integrator for specific tuning tips.
If a database management system is used, consider implementing Oracle RAC. RAC significantly
reduces resource recovery after the failure of a RAC instance, so it helps reduce application recovery
time. To use Oracle RAC in a Serviceguard environment, additionally configure Serviceguard
Extension for RAC.
Conclusion
The total failover time in a Serviceguard cluster depends on many different factors and the interactions
between them. Serviceguard users can optimize failover time to help maximize the time applications
are available.
Optimizing failover time means finding a balance between waiting longer than needed to act and
acting too hastily. Less time spent in the failover process means less time packages are unavailable. If
a cluster detects failures quickly, it can re-form and restart its applications quickly. But if the timing is
set too aggressively, it can result in unnecessary, and possibly repeated, failovers, ultimately
decreasing overall availability.
The total time for failover includes a Serviceguard component and an application-dependent
component. Over-emphasis on optimizing only one component can result in higher unavailability.
To help reduce the time for the Serviceguard component, optimize the setting of MEMBER_TIMEOUT.
To reduce the time for application startup and recovery, fine-tune the parameters in the package
configuration file and follow the recommendations in each application’s documentation.
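To see which of the two components deserves attention first, it can help to write the decomposition down explicitly. The following minimal Python sketch simply adds the two components and reports each one's share of the total. The function name failover_breakdown and the timings of 20 and 60 seconds are hypothetical placeholders, not measurements from any cluster; substitute values observed in your own failover tests.

```python
def failover_breakdown(serviceguard_s, application_s):
    """Split total failover time into its two components.

    serviceguard_s: seconds spent in the Serviceguard component
    application_s:  seconds spent in application startup and recovery
    """
    total = serviceguard_s + application_s
    if total <= 0:
        raise ValueError("total failover time must be positive")
    return {
        "total_s": total,
        "serviceguard_share": serviceguard_s / total,
        "application_share": application_s / total,
    }


# Hypothetical timings, for illustration only.
breakdown = failover_breakdown(serviceguard_s=20.0, application_s=60.0)
print(breakdown)
```

With these placeholder values the application component accounts for three quarters of the total, so further tuning of the Serviceguard component alone could remove at most a quarter of the failover time.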