Managing HP Serviceguard for Linux Ninth Edition, April 2009

restart the server locally. This will keep the client from having to switch to the
second server in the event of an application failure.
Use a transaction processing monitor or message queueing software to increase
robustness.
Use transaction processing monitors such as Tuxedo or DCE/Encina, which provide
an interface between the server and the client. Transaction processing monitors
(TPMs) can be useful in creating a more highly available application. Transactions
can be queued such that the client does not detect a server failure. Many TPMs
provide for the optional automatic rerouting to alternate servers or for the automatic
retry of a transaction. TPMs also provide for ensuring the reliable completion of
transactions, although they are not the only mechanism for doing this. After the
server is back online, the transaction monitor reconnects to the new server and
resumes routing transactions to it.
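
The retry and rerouting behavior described above can be illustrated with a minimal
sketch. It is not the interface of Tuxedo, Encina, or any other TPM; the names
submit_with_failover and ServerUnavailable are hypothetical, and each server is
assumed to be a callable that either returns a result or raises ServerUnavailable.

```python
import time


class ServerUnavailable(Exception):
    """Raised by a (hypothetical) server stub when the server cannot be reached."""


def submit_with_failover(transaction, servers, retries_per_server=2, delay=0.0):
    """Send a transaction, retrying each server before rerouting to the next.

    `servers` is an ordered list of callables (an assumption of this sketch).
    The client only sees an error if every server rejects every attempt.
    """
    for send in servers:
        for _attempt in range(retries_per_server):
            try:
                return send(transaction)
            except ServerUnavailable:
                time.sleep(delay)  # brief pause before the next attempt
    raise ServerUnavailable("no server accepted the transaction")
```

A caller would pass the primary server first and any alternates after it. A real TPM
would also have to ensure that a retried transaction is not applied twice, which this
sketch does not attempt.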
Queue Up Requests
As an alternative to using a TPM, queue up requests when the server is unavailable.
Rather than notifying the user that the server is unavailable, the application queues
the request and transmits it later, when the server becomes available again. Message
queueing software ensures that messages of any kind, not necessarily just
transactions, are delivered and acknowledged.
Message queueing is useful only when the user does not need or expect a response
confirming that the request has been completed (i.e., the application is not interactive).
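
As an illustration of queueing requests while the server is unavailable, the following
is a minimal in-memory sketch. The class RequestQueue and its send callable are
hypothetical; send is assumed to return True once the server acknowledges a request
and to raise ConnectionError when the server cannot be reached. Message queueing
software would normally also keep the queue on persistent storage, which is omitted here.

```python
from collections import deque


class RequestQueue:
    """Hold requests while the server is down and deliver them in order later."""

    def __init__(self, send):
        self._send = send        # hypothetical callable: True means acknowledged
        self._pending = deque()  # requests not yet acknowledged by the server

    def submit(self, request):
        """Queue the request and try to deliver; never reports failure to the user."""
        self._pending.append(request)
        self.flush()

    def flush(self):
        """Deliver queued requests oldest first; return how many remain queued."""
        while self._pending:
            try:
                acknowledged = self._send(self._pending[0])
            except ConnectionError:
                acknowledged = False
            if not acknowledged:
                break  # server still unavailable; keep the request for later
            self._pending.popleft()  # remove only after acknowledgement
        return len(self._pending)
```

A request is removed from the queue only after it has been acknowledged, so a server
failure in the middle of delivery leaves the request queued for the next flush.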
Handling Application Failures
What happens if part or all of an application fails?
All of the preceding sections have assumed the failure in question was not a failure of
the application, but of another component of the cluster. This section deals specifically
with application problems. For instance, software bugs may cause an application to
fail, or system resource issues (such as low swap/memory space) may cause an
application to die. This section describes how to design your application to recover
from these types of failures.
Create Applications to be Failure Tolerant
An application should tolerate the failure of a single component. Many applications
have multiple processes running on a single node. If one process fails, what happens
to the other processes? Do they also fail? Can the failed process be restarted on the
same node without affecting the remaining pieces of the application?
Ideally, if one process fails, the other processes can wait a period of time for that
component to come back online. This is true whether the component is on the same
system or a remote system. The failed component can be restarted automatically on
the same system, rejoin the waiting processes, and continue. This type of failure
can be detected, and the component restarted, within a few seconds, so the end user
would never know a failure occurred.
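
The local restart described above can be sketched as a small supervisor loop. This is
an illustration only and not a Serviceguard facility; the function name supervise and
the max_restarts limit are assumptions of the sketch, and a nonzero exit status is
assumed to indicate a failure.

```python
import subprocess


def supervise(command, max_restarts=3):
    """Run `command`, restarting it on the same system each time it fails.

    Returns the number of restarts that were needed before a clean exit.
    Raises RuntimeError if the component keeps failing, so that the caller
    can escalate instead of restarting forever.
    """
    restarts = 0
    while True:
        process = subprocess.Popen(command)
        process.wait()  # block until the component exits
        if process.returncode == 0:
            return restarts
        if restarts >= max_restarts:
            raise RuntimeError("component failed repeatedly; giving up")
        restarts += 1  # failure detected: start the component again locally
```

Limiting the number of restarts is a design choice: a component that fails
immediately on every start usually indicates a problem that a local restart cannot fix.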