Managing HP Serviceguard for Linux, Sixth Edition, August 2006

Designing Highly Available Cluster Applications
Automating Application Operation
Appendix B 299
Define Application Startup and Shutdown
Applications must be restartable without manual intervention. If the
application requires a switch to be flipped on a piece of hardware, then
automated restart is impossible. Procedures for application startup,
shutdown and monitoring must be created so that the HA software can
perform these functions automatically.
To ensure automated response, there should be defined procedures for
starting up the application and stopping the application. In Serviceguard
these procedures are placed in the package control script. These
procedures must check for errors and return status to the HA control
software. The startup and shutdown should be command-line driven and
not interactive unless all of the answers can be predetermined and
scripted.
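The following is a minimal sketch of command-line-driven startup and
shutdown procedures that check for errors and return a status. It is not
taken from the Serviceguard control script template. The function names
start_app and stop_app, the daemon path, and the pid-file location are
all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical paths; substitute the real application's values.
APP_CMD="${APP_CMD:-/opt/myapp/bin/myappd}"
APP_PIDFILE="${APP_PIDFILE:-/var/run/myapp.pid}"

# Start the application non-interactively.
# Returns 0 if the process is still alive one second after launch.
start_app() {
    if [ ! -x "$APP_CMD" ]; then
        echo "start_app: $APP_CMD is missing or not executable" >&2
        return 1
    fi
    # Detach from the terminal so no prompt can block the startup.
    "$APP_CMD" </dev/null >/dev/null 2>&1 &
    echo $! > "$APP_PIDFILE" || return 1
    sleep 1
    kill -0 "$(cat "$APP_PIDFILE")" 2>/dev/null
}

# Stop the application: ask politely first, force after about 10 seconds.
stop_app() {
    [ -f "$APP_PIDFILE" ] || return 0      # nothing recorded as running
    pid=$(cat "$APP_PIDFILE")
    kill "$pid" 2>/dev/null
    i=0
    while kill -0 "$pid" 2>/dev/null; do
        i=$((i + 1))
        [ "$i" -ge 10 ] && { kill -9 "$pid" 2>/dev/null; break; }
        sleep 1
    done
    rm -f "$APP_PIDFILE"
}
```

A package control script could call these functions and pass their exit
status back to the HA software, so that a failed startup is reported
rather than silently ignored.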
In an HA failover environment, the HA software restarts the application on
a surviving system in the cluster that has the necessary resources, such as
access to the necessary disk drives. The application must be restartable
in two aspects:
• It must be able to restart and recover on the backup system (or on
  the same system if the application restart option is chosen).
• It must be able to restart if it fails during the startup and the cause
  of the failure is resolved.
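One common obstacle to restarting after a failed startup is state left
behind by the earlier attempt. The sketch below assumes the application
records its process ID in a pid file. The path and the function name
clear_stale_pidfile are hypothetical. It removes the file only when no
live process owns it.

```shell
#!/bin/sh
# Hypothetical pid-file location; substitute the real one.
APP_PIDFILE="${APP_PIDFILE:-/var/run/myapp.pid}"

# Returns 0 if startup may proceed, 1 if a live instance still exists.
clear_stale_pidfile() {
    [ -f "$APP_PIDFILE" ] || return 0          # nothing left behind
    old_pid=$(cat "$APP_PIDFILE" 2>/dev/null)
    # kill -0 sends no signal; it only tests whether the process exists
    # and can be signalled by the current user.
    if [ -n "$old_pid" ] && kill -0 "$old_pid" 2>/dev/null; then
        echo "clear_stale_pidfile: instance $old_pid is still running" >&2
        return 1
    fi
    rm -f "$APP_PIDFILE"                       # leftover from a dead process
}
```

Calling such a check at the top of the startup procedure lets the same
procedure run unchanged on the original system or on the backup system.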
Application administrators need to learn to start up and shut down
applications using the appropriate HA commands. Inadvertently
shutting down the application directly will initiate an unwanted failover.
Application administrators also need to be careful that they don't
accidentally shut down a production instance of an application rather than
a test instance in a development environment.
A mechanism to monitor whether the application is active is necessary so
that the HA software knows when the application has failed. This may
be as simple as a script that issues the command ps -ef | grep xxx for
all the processes belonging to the application.
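A monitor of this kind might look like the sketch below. The process
pattern, the polling interval, and the function names are assumptions.
The extra grep -v grep step keeps the grep command itself from being
counted as a match.

```shell
#!/bin/sh
# Hypothetical process pattern and polling interval (seconds).
APP_PATTERN="${APP_PATTERN:-myappd}"
CHECK_INTERVAL="${CHECK_INTERVAL:-5}"

# Returns 0 if at least one process matches the pattern given as $1.
app_is_running() {
    ps -ef | grep -- "$1" | grep -v grep >/dev/null
}

# Poll until the application disappears, then return non-zero so the
# caller can treat the exit as a failure of the application.
monitor_app() {
    while app_is_running "$APP_PATTERN"; do
        sleep "$CHECK_INTERVAL"
    done
    echo "monitor_app: no process matching '$APP_PATTERN'" >&2
    return 1
}
```

A check based only on the process list shows that the process exists,
not that it is doing useful work. An application-level probe, where one
is available, gives a stronger signal.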
To reduce the impact on users, the application should not simply abort in
case of error, since aborting would cause an unneeded failover to a
backup system. Applications should determine the exact error and take
specific action to recover from the error rather than, for example,
aborting upon receipt of any error.
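As an illustration of acting on the specific error instead of aborting,
the sketch below retries a command only when it reports a transient
failure. The convention that exit status 75 (EX_TEMPFAIL in sysexits.h)
means "temporary, try again" is an assumption of this example, as is the
function name retry_transient. A real application would map its own
error codes.

```shell
#!/bin/sh
# Usage: retry_transient MAX_TRIES DELAY_SECONDS COMMAND [ARGS...]
# Assumed convention: status 75 is transient; any other non-zero
# status is treated as permanent and returned at once.
retry_transient() {
    max=$1
    delay=$2
    shift 2
    try=1
    while :; do
        "$@" && return 0
        rc=$?
        # Give up on permanent errors or when the attempts are used up.
        if [ "$rc" -ne 75 ] || [ "$try" -ge "$max" ]; then
            return "$rc"
        fi
        try=$((try + 1))
        sleep "$delay"
    done
}
```

With this approach a brief outage of a dependency leads to a short
delay rather than an exit that the HA software would interpret as a
failure requiring failover.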