Optimizing Serviceguard Failover Time, Version A.11.19 and later, April 2009

Linux, a journaled file system (such as Reiser FS and ext3 FS) is faster than a non-journaled file system

(ext2).

Adding or removing IP addresses takes some time and affects failover time. On HP-UX 11i, it takes at

least one second longer to add an IPv6 address if you enable duplicate address detection (DAD). For

more information, see “IPv6 Relocatable Address and Duplicate Address Detection” in the Managing

Serviceguard manual for your version of Serviceguard, available from the HP Technical

Documentation site at www.docs.hp.com/hpux/ha → Serviceguard.
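
As a rough illustration of the DAD cost described above, the following Python sketch estimates a lower bound on the extra time spent adding IPv6 addresses. It assumes the addresses are added one after another and that the one-second figure applies to each address; the function name and parameters are hypothetical and are not part of Serviceguard.

```python
def extra_dad_seconds(ipv6_addresses: int, dad_enabled: bool,
                      per_address_s: float = 1.0) -> float:
    """Lower-bound estimate of the extra time DAD adds to address setup.

    Assumption: addresses are added serially, so the per-address delay
    (at least one second on HP-UX 11i, per the text) simply accumulates.
    """
    if ipv6_addresses < 0:
        raise ValueError("ipv6_addresses must not be negative")
    if not dad_enabled:
        return 0.0
    return ipv6_addresses * per_address_s


# Hypothetical package with four relocatable IPv6 addresses.
print(extra_dad_seconds(4, dad_enabled=True))   # 4.0
print(extra_dad_seconds(4, dad_enabled=False))  # 0.0
```

The result is only a floor: the text says "at least one second", so measured times may be higher.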

Make your package configuration as efficient as possible. The time needed to start and stop services

adds to the total failover time. Streamline any customer-defined functions to help save time.

System restart options

Generally, shutdown(1M) is the preferable option for restarting the system. System shutdown halts all

user applications and invokes cmhaltnode to halt the Serviceguard cluster on the system. System

buffers are also flushed to disk, so almost all data is written to disk.

Restarting the system with the reboot(1M) command has a large impact on the Serviceguard component of

failover time when CVM or CFS is used. To optimize this failover time, make sure that Serviceguard is

cleanly halted on a node before rebooting it, by using cmhaltcl (to halt the entire cluster),

cmhaltnode (to halt just this node), or the shutdown(1M) command (which performs a

cmhaltnode before rebooting the node).
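
The three clean-restart options above can be sketched as a small Python helper that only builds and prints the command sequence; it does not run anything. The helper name and the scope labels are hypothetical, and the flags shown (-f for cmhaltnode and cmhaltcl, -r -y 0 for shutdown) are assumptions that should be checked against the manual pages for your release.

```python
from typing import List


def restart_plan(scope: str) -> List[List[str]]:
    """Return the commands for a clean restart, in the order to run them.

    Flags are assumptions for illustration; verify them in the man pages.
    """
    plans = {
        # shutdown(1M) performs a cmhaltnode itself before rebooting.
        "shutdown": [["shutdown", "-r", "-y", "0"]],
        # Halt Serviceguard on this node only, then reboot.
        "node": [["cmhaltnode", "-f"], ["reboot"]],
        # Halt the entire cluster, then reboot this node.
        "cluster": [["cmhaltcl", "-f"], ["reboot"]],
    }
    if scope not in plans:
        raise ValueError("scope must be one of: " + ", ".join(sorted(plans)))
    return plans[scope]


# Dry run: print the plan instead of executing it.
for command in restart_plan("node"):
    print(" ".join(command))
```

Printing the plan rather than executing it keeps the sketch safe to run on any system; the point is only that reboot is never issued before Serviceguard has been halted.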

Applications

There is no single way to optimize the many different applications that run in a cluster. However,

here are some general tips.

Failed-over applications may spend a long time recovering data. See if you can reduce this time.

Consider contacting the application vendor and the systems integrator for specific tuning tips.

If a database management system is used, consider implementing Oracle RAC. RAC significantly

reduces resource recovery after the failure of a RAC instance, so it helps reduce application recovery

time. To use Oracle RAC in a Serviceguard environment, additionally configure Serviceguard

Extension for RAC.

Conclusion

The total failover time in a Serviceguard cluster depends on many different factors and the interactions

between them. Serviceguard users can optimize failover time to help maximize the time applications

are available.

Optimizing failover time means finding a balance between waiting longer than needed to act and

acting too hastily. Less time spent in the failover process means less time packages are unavailable. If

a cluster detects failures quickly, it can re-form and restart its applications quickly. But if the timing is

set too aggressively, it can result in unnecessary—and possibly repeated—failovers, ultimately

decreasing overall availability.

The total time for failover includes a Serviceguard component and an application-dependent

component. Over-emphasis on optimizing only one component can result in higher unavailability.
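
The balance described above can be made concrete with a small Python model. It treats the failover time as the sum of the Serviceguard component and the application-dependent component, and multiplies it by the number of failovers per year, including unnecessary ones. Every number below is hypothetical and chosen only to illustrate the trade-off; none of them is a measurement or a recommended setting.

```python
from dataclasses import dataclass


@dataclass
class FailoverProfile:
    sg_seconds: float                # Serviceguard component of one failover
    app_seconds: float               # application-dependent component
    real_failures_per_year: float    # failovers caused by genuine failures
    false_failovers_per_year: float  # unnecessary failovers from hasty timing

    @property
    def failover_seconds(self) -> float:
        # Total time for one failover is the sum of both components.
        return self.sg_seconds + self.app_seconds

    @property
    def downtime_seconds_per_year(self) -> float:
        events = self.real_failures_per_year + self.false_failovers_per_year
        return events * self.failover_seconds


# Hypothetical settings: the aggressive one cuts the Serviceguard component
# but provokes three unnecessary failovers per year.
conservative = FailoverProfile(30.0, 120.0, 2.0, 0.0)
aggressive = FailoverProfile(10.0, 120.0, 2.0, 3.0)

print(conservative.downtime_seconds_per_year)  # 300.0
print(aggressive.downtime_seconds_per_year)    # 650.0
```

In this made-up case the faster setting more than doubles yearly downtime, because the application component is paid again on every unnecessary failover.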

To help reduce the time for the Serviceguard component, optimize the setting of MEMBER_TIMEOUT.

To reduce the time for application startup and recovery, fine-tune the parameters in the package

configuration file and follow the recommendations in each application’s documentation.