Managing HP Serviceguard for Linux Ninth Edition, April 2009

A Designing Highly Available Cluster Applications
This appendix describes how to create or port applications for high availability, with
emphasis on the following topics:
Automating Application Operation
Controlling the Speed of Application Failover
Designing Applications to Run on Multiple Systems
Restoring Client Connections
Handling Application Failures
Minimizing Planned Downtime
Designing for high availability means reducing the amount of unplanned and planned
downtime that users will experience. Unplanned downtime includes unscheduled
events such as power outages, system failures, network failures, disk crashes, or
application failures. Planned downtime includes scheduled events such as backups,
system upgrades to new OS revisions, or hardware replacements.
Two key strategies should be kept in mind:
1. Design the application to handle a system reboot or panic. If you are modifying
an existing application for a highly available environment, determine what happens
currently with the application after a system panic. In a highly available
environment there should be defined (and scripted) procedures for restarting the
application. Procedures for starting and stopping the application should be
automatic, with no user intervention required.
2. The application should not use any system-specific information such as the
following if such use would prevent it from failing over to another system and
running properly:
The application should not refer to uname() or gethostname().
The application should not refer to the SPU ID.
The application should not refer to the MAC (link-level) address.
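As a minimal sketch of the second strategy, the application can take its identity from a setting supplied at startup instead of asking the operating system which node it is on. The variable name APP_SERVICE_NAME and both function names are hypothetical and are not part of Serviceguard; the sketch assumes that whatever script starts the application exports the variable with the same value on every node.

```python
import os
import socket


def service_identity(env=None):
    """Return the name the application uses to identify itself.

    APP_SERVICE_NAME is a hypothetical variable assumed to be set by the
    script that starts the application. Its value is the same on every
    node, so it stays valid after a failover.
    """
    env = os.environ if env is None else env
    name = env.get("APP_SERVICE_NAME")
    if not name:
        # Fail loudly instead of falling back to gethostname(), which
        # would tie the application to the node it is currently on.
        raise RuntimeError("APP_SERVICE_NAME is not set")
    return name


def node_name_for_logging():
    """Return the physical node name, for diagnostic messages only."""
    return socket.gethostname()
```

Using the node name in log messages is harmless because nothing depends on its value. The problem arises only when the node name, SPU ID, or MAC address is used for licensing, file paths, or client-visible addresses.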
Automating Application Operation
Can the application be started and stopped automatically or does it require operator
intervention?
This section describes how to automate application operations to avoid the need for
user intervention. One of the first rules of high availability is to avoid manual
intervention. If it takes a user at a terminal, console, or GUI to enter commands
to bring up a subsystem, the user becomes a key part of the system. It may take hours
before a user can get to a system console to do the necessary work. The hardware in
question may be located in a far-off area where no trained users are available, the
systems may be located in a secure datacenter, or during off-hours someone may have
to connect via modem.
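One way to picture operation without user intervention is a wrapper that runs the start or stop command with standard input closed and reports the outcome only through its return code. The following is a sketch under stated assumptions, not a Serviceguard interface: the names run_action and COMMANDS, the return-code convention, and the placeholder commands are all hypothetical.

```python
import subprocess
import sys

# Hypothetical placeholder commands; a real application would list its
# own start and stop programs here.
COMMANDS = {
    "start": [sys.executable, "-c", "print('application started')"],
    "stop": [sys.executable, "-c", "print('application stopped')"],
}


def run_action(action, commands, timeout=300):
    """Run the command for 'start' or 'stop' without prompting anyone.

    Returns 0 on success, 1 if the command fails or times out, and 2 if
    the action is unknown, so a calling script can react to the result.
    """
    if action not in commands:
        return 2
    try:
        # stdin is closed so the command cannot block waiting for input.
        result = subprocess.run(
            commands[action],
            stdin=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return 1
    return 0 if result.returncode == 0 else 1
```

Calling run_action("start", COMMANDS) returns 0 when the placeholder command exits cleanly. Closing standard input and enforcing a timeout means a command that tries to prompt an operator fails quickly instead of hanging until someone reaches a console.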