Managing HP Serviceguard for Linux, Sixth Edition, August 2006

Designing Highly Available Cluster Applications
Appendix B
Handling Application Failures
What happens if part or all of an application fails?
All of the preceding sections have assumed the failure in question was
not a failure of the application, but of another component of the cluster.
This section deals specifically with application problems. For instance,
software bugs may cause an application to fail, or system resource issues
(such as low swap space or memory) may cause an application to die. It
describes how to design your application to recover after these types of
failures.
Create Applications to be Failure Tolerant
An application should tolerate the failure of a single component. Many
applications have multiple processes running on a single node. If one
process fails, what happens to the other processes? Do they also fail? Can
the failed process be restarted on the same node without affecting the
remaining pieces of the application?
Ideally, if one process fails, the other processes can wait a period of time
for that component to come back online. This is true whether the
component is on the same system or a remote system. The failed
component can be restarted automatically on the same system, rejoin
the waiting processes, and continue. This type of failure can be
detected, and the component restarted, within a few seconds, so the end
user would never know a failure occurred.
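The restart-and-wait behavior described above can be sketched in Python. This is a minimal illustration and not Serviceguard's own mechanism: the names supervise and wait_for_component, the restart budget, and the polling intervals are all assumptions made for the example.

```python
import subprocess
import time


def supervise(cmd, max_restarts=3, backoff_seconds=0.1):
    """Run cmd and restart it on the same node each time it exits abnormally.

    Returns the number of restarts performed once cmd exits with status 0.
    Raises RuntimeError when the (hypothetical) restart budget is used up,
    so that a higher-level handler can take over.
    """
    restarts = 0
    while True:
        proc = subprocess.Popen(cmd)
        exit_code = proc.wait()          # blocks until the component exits
        if exit_code == 0:
            return restarts              # clean exit: nothing to recover
        if restarts >= max_restarts:
            raise RuntimeError(f"{cmd!r} failed {restarts + 1} times; giving up")
        restarts += 1
        time.sleep(backoff_seconds)      # brief pause before restarting


def wait_for_component(is_available, timeout_seconds=10.0, poll_seconds=0.2):
    """Poll is_available() until it returns True or the timeout elapses.

    is_available is a caller-supplied check (an assumption of this sketch),
    for example an attempt to connect to the restarted component.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        if is_available():
            return True
        if time.monotonic() >= deadline:
            return False                 # component did not come back in time
        time.sleep(poll_seconds)
```

In this sketch, the surviving processes would call wait_for_component instead of exiting when a peer disappears, while a supervisor restarts the failed peer. The timeout should be longer than the expected restart time of the component.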
Another alternative is for the failure of one component to still allow
the other components to be brought down cleanly. If a database SQL server
fails, it should still be possible to bring the database down cleanly so that
no database recovery is necessary.
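A clean shutdown of the surviving components might look like the following Python sketch. It assumes each component runs as a child process and handles the termination request by flushing its state before exiting. The function name shut_down_cleanly, the grace period, and the status strings are hypothetical.

```python
import subprocess


def shut_down_cleanly(procs, grace_seconds=5.0):
    """Ask every still-running process to stop, escalating only if needed.

    Returns one status per process, in the same order as procs:
    "already exited", "stopped in time", or "killed".
    """
    statuses = []
    for proc in procs:
        if proc.poll() is None:
            proc.terminate()             # SIGTERM on POSIX: request an orderly stop
            statuses.append(None)        # outcome not known yet
        else:
            statuses.append("already exited")

    for i, proc in enumerate(procs):
        if statuses[i] is not None:
            continue
        try:
            proc.wait(timeout=grace_seconds)
            statuses[i] = "stopped in time"
        except subprocess.TimeoutExpired:
            proc.kill()                  # last resort; recovery may then be needed
            proc.wait()
            statuses[i] = "killed"
    return statuses
```

A "killed" status is the case to avoid: a component that had to be killed did not get the chance to shut down in an orderly way, so recovery work may be required when it is next started.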
The worst case is for a failure of one component to cause the entire
system to fail. If one component fails and all other components need to be
restarted, the downtime will be high.