Managing HP Serviceguard for Linux, Sixth Edition, August 2006

Appendix B: Designing Highly Available Cluster Applications

Handling Application Failures

What happens if part or all of an application fails?

The preceding sections all assumed that the failure in question was not a
failure of the application, but of another component of the cluster. This
section deals specifically with application problems. For instance,
software bugs may cause an application to fail, or system resource issues
(such as low swap or memory space) may cause an application to die. This
section describes how to design your application to recover after these
types of failures.

Create Applications to be Failure Tolerant

An application should tolerate the failure of a single component. Many
applications have multiple processes running on a single node. If one
process fails, what happens to the other processes? Do they also fail? Can
the failed process be restarted on the same node without affecting the
remaining pieces of the application?

Ideally, if one process fails, the other processes can wait a period of
time for that component to come back online. This is true whether the
component is on the same system or a remote system. The failed component
can be restarted automatically on the same system, rejoin the waiting
processes, and continue. This type of failure can be detected, and the
component restarted, within a few seconds, so the end user would never
know a failure occurred.
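
The restart-and-wait behavior described above can be sketched roughly as
follows. This is a minimal illustration in Python, not Serviceguard
functionality: the function names supervise and wait_for_component, their
parameters, and the default values are all hypothetical, and a real
application would choose its own restart limits and availability checks.

```python
import subprocess
import time


def supervise(command, max_restarts=3, restart_delay=0.1):
    """Run `command`, restarting it on the same node if it exits abnormally.

    Hypothetical helper: returns the number of restarts performed, or
    raises RuntimeError once the component has failed more often than
    `max_restarts` allows.
    """
    restarts = 0
    while True:
        exit_code = subprocess.call(command)  # blocks until the child exits
        if exit_code == 0:
            return restarts  # clean exit, nothing left to restart
        if restarts >= max_restarts:
            raise RuntimeError(
                f"component failed {restarts + 1} times; giving up")
        restarts += 1
        time.sleep(restart_delay)  # brief pause before restarting


def wait_for_component(is_available, timeout=10.0, poll_interval=0.5):
    """Let a peer process wait for a failed component to come back online.

    `is_available` is any caller-supplied check (assumed here); it is
    polled until it returns True or `timeout` seconds have elapsed.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_available():
            return True
        time.sleep(poll_interval)
    return is_available()  # one final check at the deadline
```

In this sketch the surviving processes would call wait_for_component with
a bounded timeout instead of failing immediately, while a supervisor
calls supervise to bring the failed process back on the same system.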

Alternatively, the failure of one component should still allow the other
components to be brought down cleanly. For example, if a database SQL
server fails, it should still be possible to bring the database down
cleanly so that no database recovery is necessary.
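
One way to picture this second approach is a parent process that watches
its components and, when any one of them exits, asks the rest to stop in
an orderly way. The sketch below is only an assumption-laden example in
Python: run_until_first_failure, shut_down_cleanly, and the grace_period
parameter are hypothetical names, and it assumes each component performs
its own cleanup when it receives a termination request.

```python
import subprocess
import time


def run_until_first_failure(processes, poll_interval=0.1):
    """Block until any process in `processes` has exited; return it."""
    while True:
        for proc in processes:
            if proc.poll() is not None:
                return proc
        time.sleep(poll_interval)


def shut_down_cleanly(processes, grace_period=5.0):
    """Ask every still-running process to stop, killing it only on timeout.

    Returns a dict mapping each pid to "already exited", "stopped"
    (exited within the grace period), or "killed" (had to be forced).
    """
    results = {}
    running = []
    for proc in processes:
        if proc.poll() is None:
            proc.terminate()  # SIGTERM on POSIX: a request, not a forced kill
            running.append(proc)
        else:
            results[proc.pid] = "already exited"

    # All components share one grace period rather than one each.
    deadline = time.monotonic() + grace_period
    for proc in running:
        remaining = max(0.0, deadline - time.monotonic())
        try:
            proc.wait(timeout=remaining)
            results[proc.pid] = "stopped"
        except subprocess.TimeoutExpired:
            proc.kill()  # last resort; recovery may be needed afterwards
            proc.wait()
            results[proc.pid] = "killed"
    return results
```

A "killed" result in this sketch corresponds to the situation the text
warns about: the component could not be brought down cleanly, so recovery
work may be required before it can be restarted.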

The worst case is for the failure of one component to cause the entire
system to fail. If one component fails and all other components must be
restarted, the downtime will be high.