Managing HP Serviceguard for Linux Ninth Edition, April 2009

restart the server locally. This will keep the client from having to switch to the

second server in the event of an application failure.

• Use a transaction processing monitor or message queueing software to increase

robustness.

Use transaction processing monitors such as Tuxedo or DCE/Encina, which provide

an interface between the server and the client. Transaction processing monitors

(TPMs) can be useful in creating a more highly available application. Transactions

can be queued such that the client does not detect a server failure. Many TPMs

provide for the optional automatic rerouting to alternate servers or for the automatic

retry of a transaction. TPMs also provide for ensuring the reliable completion of

transactions, although they are not the only mechanism for doing this. After the

server is back online, the transaction monitor reconnects to the new server and

continues routing transactions to it.

• Queue Up Requests

As an alternative to using a TPM, queue up requests when the server is unavailable.

Rather than notifying the user when a server is unavailable, the user request is

queued up and transmitted later when the server becomes available again. Message

queueing software ensures that messages of any kind, not necessarily just

transactions, are delivered and acknowledged.

Message queueing is useful only when the user does not need or expect a response

confirming that the request has been completed (i.e., the application is not interactive).
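
The request-queueing idea above can be sketched in a few lines of Python. This is a minimal illustration, not part of Serviceguard or of any message queueing product: the class name QueueingClient and the send callable are hypothetical, and the sketch assumes that send raises ConnectionError while the server is unavailable and that a normal return counts as an acknowledgement. The queue here lives in memory only; real message queueing software would normally keep pending messages in persistent storage so they survive a client restart.

```python
from collections import deque
from typing import Callable, Deque


class QueueingClient:
    """Queue requests while the server is unreachable and resend them later.

    `send` is a hypothetical callable that delivers one request and raises
    ConnectionError when the server is unavailable.
    """

    def __init__(self, send: Callable[[str], None]) -> None:
        self._send = send
        self._pending: Deque[str] = deque()

    def submit(self, request: str) -> bool:
        """Accept a request; return True if nothing is left waiting."""
        # New requests go behind older ones so delivery order is preserved.
        self._pending.append(request)
        return self.flush() == 0

    def flush(self) -> int:
        """Try to deliver queued requests in order; return how many remain."""
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                break  # server still down; keep the request queued
            self._pending.popleft()  # drop only after a successful send
        return len(self._pending)
```

A caller would invoke flush() periodically, or whenever it learns that the server is available again. Because a request is removed only after send returns, a failure in the middle of a send can lead to the same request being delivered twice, so the server side should tolerate duplicates.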

Handling Application Failures

What happens if part or all of an application fails?

All of the preceding sections have assumed the failure in question was not a failure of

the application, but of another component of the cluster. This section deals specifically

with application problems. For instance, software bugs may cause an application to

fail, or system resource issues (such as low swap/memory space) may cause an

application to die. This section deals with how to design your application to recover

after these types of failures.

Create Applications to be Failure Tolerant

An application should tolerate the failure of a single component. Many applications

have multiple processes running on a single node. If one process fails, what happens

to the other processes? Do they also fail? Can the failed process be restarted on the

same node without affecting the remaining pieces of the application?

Ideally, if one process fails, the other processes can wait a period of time for that

component to come back online. This is true whether the component is on the same

system or a remote system. The failed component can be restarted automatically on

the same system, rejoin the waiting processes, and continue. This type of failure

can be detected and the component restarted within a few seconds, so the end user would never know

a failure occurred.
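
As an illustration of restarting a failed component on the same system, the following Python sketch runs a command and starts it again if it exits with a non-zero status. It is only a sketch under stated assumptions: the function name supervise and its parameters max_restarts and delay are hypothetical, a non-zero exit status is assumed to mean failure, and nothing here represents how Serviceguard itself monitors or restarts services.

```python
import subprocess
import sys
import time
from typing import List, Tuple


def supervise(cmd: List[str], max_restarts: int = 3,
              delay: float = 1.0) -> Tuple[int, int]:
    """Run cmd, restarting it locally when it exits with a non-zero status.

    Returns (last_exit_code, restarts_performed). The restart limit keeps a
    component that fails immediately from being restarted forever.
    """
    restarts = 0
    while True:
        # Blocks until the child process exits, then reports its status.
        exit_code = subprocess.run(cmd).returncode
        if exit_code == 0 or restarts >= max_restarts:
            return exit_code, restarts
        restarts += 1
        time.sleep(delay)  # brief pause before the local restart


if __name__ == "__main__":
    # Hypothetical component that always fails, to show the restart limit.
    failing = [sys.executable, "-c", "raise SystemExit(1)"]
    print(supervise(failing, max_restarts=2, delay=0.1))
```

The other processes of the application still need their own logic for waiting until the restarted component is available again, for example by retrying their connection for a limited period.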

300 Designing Highly Available Cluster Applications