Managing HP Serviceguard for Linux, Tenth Edition, September 2012

A Designing Highly Available Cluster Applications
This appendix describes how to create or port applications for high availability, with
emphasis on the following topics:
Automating Application Operation
Controlling the Speed of Application Failover
Designing Applications to Run on Multiple Systems
Restoring Client Connections
Handling Application Failures
Minimizing Planned Downtime
Designing for high availability means reducing both the unplanned and the planned
downtime that users experience. Unplanned downtime includes unscheduled events
such as power outages, system failures, network failures, disk crashes, or application
failures. Planned downtime includes scheduled events such as backups, system
upgrades to new OS revisions, or hardware replacements.
Two key strategies should be kept in mind:
1. Design the application to handle a system reboot or panic. If you are modifying an
existing application for a highly available environment, determine what currently
happens to the application after a system panic. In a highly available environment
there should be defined (and scripted) procedures for restarting the application.
Procedures for starting and stopping the application should be automatic, with no
user intervention required.
2. The application should not use system-specific information if doing so would
prevent it from failing over to another system and running properly. For example:
The application should not refer to uname() or gethostname().
The application should not refer to the SPU ID.
The application should not refer to the MAC (link-level) address.
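One way to follow the second strategy is to take the application's identity from package-level configuration instead of asking the node for its name. The following is a minimal sketch in Python, not a Serviceguard interface: the function name and the variable names APP_SERVICE_NAME and APP_SERVICE_ADDR are hypothetical, and the sketch assumes the package's startup procedure exports them identically on every node.

```python
import os


def service_identity(env=None):
    """Return the (name, address) pair the application should advertise.

    Both values come from package-level settings rather than from the node,
    so they are the same on whichever system currently runs the application.
    APP_SERVICE_NAME and APP_SERVICE_ADDR are hypothetical variable names.
    """
    env = os.environ if env is None else env
    try:
        return env["APP_SERVICE_NAME"], env["APP_SERVICE_ADDR"]
    except KeyError as missing:
        # Fail loudly instead of falling back to socket.gethostname(),
        # which would tie the application to the node it happens to be on.
        raise RuntimeError(f"missing package setting: {missing}") from None
```

Because the lookup never consults the local hostname, the same call returns the same answer before and after a failover, provided the settings travel with the application.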
Automating Application Operation
Can the application be started and stopped automatically, or does it require operator
intervention?
This section describes how to automate application operations to avoid the need for user
intervention. One of the first rules of high availability is to avoid manual intervention. If
it takes a user at a terminal, console, or GUI to enter commands to bring up a
subsystem, the user becomes a key part of the system. It may take hours before a user
can get to a system console to do the necessary work. The hardware in question may
be located in a remote area where no trained users are available, the systems may be
located in a secure datacenter, or, during off hours, someone may have to connect via modem.
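The rule against manual intervention can be illustrated with a small wrapper that runs a start or stop command with no terminal attached and a fixed time limit. This is a sketch under stated assumptions, not part of Serviceguard: the function name run_unattended, the 60-second default, and the path in the usage note are all hypothetical.

```python
import subprocess
import sys


def run_unattended(command, timeout_s=60):
    """Run one start or stop command with no operator input available.

    Returns True if the command exits with status 0, and False if it
    fails, cannot be launched, or exceeds the time limit.
    """
    try:
        result = subprocess.run(
            command,
            stdin=subprocess.DEVNULL,  # a prompt reads EOF instead of waiting for a user
            timeout=timeout_s,         # a hung command is killed, not left waiting
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        print(f"unattended command failed: {exc}", file=sys.stderr)
        return False
    return result.returncode == 0
```

A caller might invoke it as run_unattended(["/opt/myapp/bin/start"]), where the path is only a placeholder. A command that fails or hangs in this setting is one that still depends on a user and is a candidate for rework.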