Managing HP Serviceguard A.11.20.10 for Linux, December 2012

A Designing Highly Available Cluster Applications
This appendix describes how to create or port applications for high availability, with emphasis on
the following topics:
Automating Application Operation
Controlling the Speed of Application Failover
Designing Applications to Run on Multiple Systems
Restoring Client Connections
Handling Application Failures
Minimizing Planned Downtime
Designing for high availability means reducing both the unplanned and the planned downtime
that users experience. Unplanned downtime results from unscheduled events such as power
outages, system failures, network failures, disk crashes, or application failures. Planned downtime
results from scheduled events such as backups, system upgrades to new OS revisions, or
hardware replacements.
Two key strategies should be kept in mind:
1. Design the application to handle a system reboot or panic. If you are modifying an existing
application for a highly available environment, determine what currently happens to the
application after a system panic. A highly available environment should have defined
(and scripted) procedures for restarting the application. Procedures for starting and stopping
the application should be automatic, with no user intervention required.
2. The application should not depend on system-specific information if doing so would
prevent it from failing over to another system and running properly. In particular:
The application should not refer to uname() or gethostname().
The application should not refer to the SPU ID.
The application should not refer to the MAC (link-level) address.
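As an illustration of the second strategy, the following Python sketch takes the application's
identity from configuration instead of from the node it happens to be running on. The variable
name APP_SERVICE_NAME, the function names, and the lock file layout are assumptions made for
this example; they are not part of Serviceguard.

```python
import os


def service_identity(env=None):
    """Return the name the application uses for itself.

    APP_SERVICE_NAME is a hypothetical variable that the start procedure
    would set. It names the service, not the node, so its value is the
    same on every system the application can fail over to.
    """
    env = os.environ if env is None else env
    name = env.get("APP_SERVICE_NAME")
    if not name:
        # Fail loudly instead of silently falling back to the host name,
        # which would differ after a failover.
        raise RuntimeError("APP_SERVICE_NAME is not set")
    return name


def lock_file_path(env=None, base_dir="/var/run"):
    """Build a path keyed by the service name rather than the host name."""
    return os.path.join(base_dir, service_identity(env) + ".lock")
```

Because nothing in the sketch calls gethostname() or uname(), the values it produces do not
change when the application moves to another system.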
A.1 Automating Application Operation
Can the application be started and stopped automatically or does it require operator intervention?
This section describes how to automate application operations to avoid the need for user intervention.
One of the first rules of high availability is to avoid manual intervention. If a user at a
terminal, console, or GUI must enter commands to bring up a subsystem, that user becomes
a key part of the system. It may take hours before a user can reach a system console to do the
necessary work: the hardware may be located in a remote area where no trained users
are available, the systems may be in a secure datacenter, or, during off hours, someone may
have to connect via modem.
There are two principles to keep in mind for automating application relocation:
Insulate users from outages.
Applications must have defined startup and shutdown procedures.
You need to know what currently happens when the system your application is running on
is rebooted, and whether the application's response must change for high
availability.
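A defined startup or shutdown procedure can be as simple as an ordered list of commands that
runs without reading from a terminal and reports success or failure through its return value.
The following Python sketch shows one way to do this. The step lists are placeholders that only
print a message, and the names START_STEPS, STOP_STEPS, and run_procedure are hypothetical;
substitute the real commands for your application.

```python
import subprocess
import sys

# Placeholder steps standing in for real start and stop commands (hypothetical).
START_STEPS = [
    [sys.executable, "-c", "print('start database')"],
    [sys.executable, "-c", "print('start application server')"],
]
# Shutdown runs in the reverse order of startup.
STOP_STEPS = [
    [sys.executable, "-c", "print('stop application server')"],
    [sys.executable, "-c", "print('stop database')"],
]


def run_procedure(steps, timeout_s=60):
    """Run each step in order, unattended.

    Returns 0 if every step succeeds, or 1 as soon as one step fails,
    cannot be launched, or exceeds the timeout.
    """
    for step in steps:
        try:
            # stdin is detached so that no step can wait for operator input.
            result = subprocess.run(step, stdin=subprocess.DEVNULL,
                                    timeout=timeout_s)
        except (OSError, subprocess.TimeoutExpired):
            return 1
        if result.returncode != 0:
            return 1
    return 0
```

The timeout keeps a hung step from blocking the procedure indefinitely, and stopping at the
first failure avoids starting later components on top of one that did not come up.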
A.1.1 Insulate Users from Outages
Wherever possible, insulate your end users from outages. Issues include the following:
Do not require user intervention to reconnect when a connection is lost due to a failed server.
Where possible, warn users of slight delays due to a failover in progress.
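Both points can be handled in the client. The following Python sketch retries a lost connection
automatically and tells the user that a short delay is expected. The function name
call_with_reconnect and its parameters are assumptions made for this example, and the sketch
assumes that the request can safely be sent again after a reconnect.

```python
import time


def call_with_reconnect(connect, request, attempts=5, delay_s=2.0,
                        notify=print, sleep=time.sleep):
    """Run request(conn), reconnecting without user action if the server is lost.

    connect() must return a new connection; request(conn) performs the work.
    Both are supplied by the caller. notify() receives the message shown to
    the user while a retry is pending.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            conn = connect()
            return request(conn)
        except ConnectionError as exc:
            last_error = exc
            if attempt < attempts:
                # Warn the user about the delay instead of asking them to act.
                notify(f"Server unavailable, retrying in {delay_s:g} s "
                       f"(attempt {attempt} of {attempts})")
                sleep(delay_s)
    # Every attempt failed: report the last error to the caller.
    raise last_error
```

Choose the number of attempts and the delay so that the total retry time is longer than the
failover time you expect for the application.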