Managing HP Serviceguard A.11.20.10 for Linux, December 2012

give up after 2 minutes and go for coffee and don't come back for 28 minutes, the perceived

downtime is actually 30 minutes, not 5. Factors to consider are the number of reconnection attempts

to make, the frequency of reconnection attempts, and whether or not to notify the user of connection

loss.

There are a number of strategies to use for client reconnection:

• Design clients which continue to try to reconnect to their failed server.

Put the work into the client application rather than relying on the user to reconnect. If the server

is back up and running in 5 minutes, and the client is continually retrying, then after 5 minutes,

the client application will reestablish the link with the server and either restart or continue the

transaction. No intervention from the user is required.

• Design clients to reconnect to a different server.

If you have a server design which includes multiple active servers, the client could connect to

the second server, and the user would only experience a brief delay.

The problem with this design is knowing when the client should switch to the second server.

How long should a client keep retrying the first server before giving up and going to the second server?

There are no definitive answers for this. The answer depends on the design of the server

application. If the application can be restarted on the same node after a failure (see “Handling

Application Failures” following), the retry to the current server should continue for the amount

of time it takes to restart the server locally. This will keep the client from having to switch to

the second server in the event of an application failure.

• Use a transaction processing monitor or message queueing software to increase robustness.

Use transaction processing monitors such as Tuxedo or DCE/Encina, which provide an interface

between the server and the client. Transaction processing monitors (TPMs) can be useful in

creating a more highly available application. Transactions can be queued such that the client

does not detect a server failure. Many TPMs provide for the optional automatic rerouting to

alternate servers or for the automatic retry of a transaction. TPMs also provide for ensuring

the reliable completion of transactions, although they are not the only mechanism for doing

this. After the server is back online, the transaction monitor reconnects to the new server and

continues routing transactions to it.

• Queue up requests.

As an alternative to using a TPM, queue up requests when the server is unavailable. Rather

than notifying the user when a server is unavailable, the application queues the user request and

transmits it later when the server becomes available again. Message queueing software

ensures that messages of any kind, not necessarily just transactions, are delivered and

acknowledged.

Message queueing is useful only when the user does not need or expect a response confirming that the

request has been completed (that is, the application is not interactive).
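
The first two strategies above (retry the failed server, then switch to a different one) can be sketched in a few lines of Python. This is only an illustration, not part of Serviceguard: the function name connect_with_failover, its parameters, and the connect callable are hypothetical, and the sketch assumes that a failed connection attempt raises ConnectionError. The retry_window parameter stands for the time it takes to restart the server application locally, and retry_interval is the frequency of reconnection attempts.

```python
import time
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def connect_with_failover(
    servers: Sequence[str],
    connect: Callable[[str], T],
    retry_window: float = 300.0,
    retry_interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Try each server in order, retrying each for up to retry_window seconds.

    All names here are hypothetical. `connect` is supplied by the caller and
    is assumed to raise ConnectionError when the server cannot be reached.
    """
    last_error: Optional[Exception] = None
    for server in servers:
        # Give the current server time to be restarted locally before
        # switching to the next one in the list.
        deadline = clock() + retry_window
        while True:
            try:
                return connect(server)
            except ConnectionError as exc:
                last_error = exc
            # Stop retrying this server if the next attempt would fall
            # after the end of its retry window.
            if clock() + retry_interval > deadline:
                break
            sleep(retry_interval)
    # Every server was tried for its full window without success.
    raise ConnectionError(f"no server reachable: {list(servers)}") from last_error
```

A client might call connect_with_failover(["primary", "standby"], open_session), where open_session is the application's own connection routine. Setting retry_window close to the local restart time keeps the client from switching servers during an ordinary application restart, and the user is notified only after every server has been tried.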

A.5 Handling Application Failures

What happens if part or all of an application fails?

All of the preceding sections have assumed the failure in question was not a failure of the

application, but of another component of the cluster. This section deals specifically with application

problems. For instance, software bugs may cause an application to fail, or system resource issues

(such as low swap/memory space) may cause an application to die. This section deals with how

to design your application to recover after these types of failures.

A.5.1 Create Applications to be Failure Tolerant

An application should tolerate the failure of a single component. Many applications have multiple

processes running on a single node. If one process fails, what happens to the other processes? Do

they also fail? Can the failed process be restarted on the same node without affecting the remaining

pieces of the application?
