Managing HP Serviceguard for Linux, Tenth Edition, September 2012

Avoid File Locking

In an NFS environment, applications should avoid file-locking mechanisms when the file to be locked is on an NFS server. More generally, an application should avoid file locking on both local and remote systems. If local file locking is used and the system fails, the system acting as the backup will have no knowledge of the locks held by the failed system. This may or may not cause problems when the application restarts.

Remote file locking is the worse of the two situations, since the system doing the locking may be the system that fails. In that case the lock might never be released, and other parts of the application will be unable to access the locked data. In an NFS environment, file locking can cause long delays if an NFS client system fails, and might even delay the failover itself.

Restoring Client Connections

How does a client reconnect to the server after a failure?

It is important to write client applications to specifically differentiate between the loss of a connection to the server and other application-oriented errors that might be returned. The application should take special action in case of connection loss.
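
As a rough illustration of this distinction, the following Python sketch separates connection-loss errors from other errors. The names classify_error, handle_error, and CONNECTION_LOSS_ERRORS are hypothetical and not part of Serviceguard, and the choice of which exception types count as connection loss is an assumption that depends on the client's protocol.

```python
import socket

# Exception types treated as "the connection to the server is gone".
# ConnectionError covers reset, refused, aborted, and broken-pipe errors;
# TimeoutError and socket.timeout cover a server that stopped answering;
# EOFError is an assumption for protocols that report a closed stream that way.
CONNECTION_LOSS_ERRORS = (ConnectionError, TimeoutError, socket.timeout, EOFError)


def classify_error(exc):
    """Return "connection_lost" or "application_error" for an exception."""
    if isinstance(exc, CONNECTION_LOSS_ERRORS):
        return "connection_lost"
    return "application_error"


def handle_error(exc, on_connection_lost, on_application_error):
    """Run the special connection-loss action, or the normal error handling."""
    if classify_error(exc) == "connection_lost":
        return on_connection_lost(exc)
    return on_application_error(exc)
```

A client would typically pass a reconnection routine as on_connection_lost and its ordinary error reporting as on_application_error.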

One question to consider is how a client knows, after a failure, when to reconnect to the newly started server. The typical scenario is that the client must simply restart the session or log in again. However, this method is not very automated. For example, a well-tuned hardware and application system may fail over in 5 minutes. But if users, after experiencing no response during the failure, give up after 2 minutes, go for coffee, and don't come back for 28 minutes, the perceived downtime is actually 30 minutes, not 5. Factors to consider are the number of reconnection attempts to make, the frequency of reconnection attempts, and whether or not to notify the user of connection loss.
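
The three factors just listed (number of attempts, frequency of attempts, and user notification) can be sketched as parameters of a retry loop. This is a minimal Python sketch under stated assumptions: reconnect, connect, and notify are hypothetical names, connect is assumed to raise ConnectionError or TimeoutError while the server is down, and the default values are illustrative only.

```python
import time


def reconnect(connect, max_attempts=10, interval_seconds=30.0,
              notify=None, sleep=time.sleep):
    """Call connect() until it succeeds or max_attempts is reached.

    connect: hypothetical callable that returns a session object, or raises
             ConnectionError / TimeoutError while the server is unavailable.
    notify:  optional callable (attempt, max_attempts, exc) used to tell
             the user that the connection is down.
    sleep:   injectable so that callers and tests can avoid real waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except (ConnectionError, TimeoutError) as exc:
            if notify is not None:
                notify(attempt, max_attempts, exc)
            # Wait before the next attempt, but not after the final one.
            if attempt < max_attempts:
                sleep(interval_seconds)
    raise ConnectionError(f"server unreachable after {max_attempts} attempts")
```

In practice, max_attempts and interval_seconds would be chosen so that the total retry time comfortably exceeds the expected failover time of the cluster.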

There are a number of strategies to use for client reconnection:

• Design clients which continue to try to reconnect to their failed server.

Put the work into the client application rather than relying on the user to reconnect. If the server is back up and running in 5 minutes, and the client is continually retrying, then after 5 minutes, the client application will reestablish the link with the server and either restart or continue the transaction. No intervention from the user is required.

• Design clients to reconnect to a different server.

If you have a server design which includes multiple active servers, the client could connect to the second server, and the user would only experience a brief delay.

The problem with this design is knowing when the client should switch to the second server. How long should a client keep retrying the first server before giving up and going to the second server? There are no definitive answers for this. The answer depends on the design of the server application. If the application can be restarted on the same node after a failure (see “Handling Application Failures” following), the retry to the current server should continue for the amount of time it takes to restart the
