Managing HP Serviceguard for Linux, Tenth Edition, September 2012

Avoid File Locking
In an NFS environment, applications should avoid file-locking mechanisms when the
file to be locked resides on an NFS server. More generally, an application should avoid
file locking on both local and remote systems. If local file locking is employed and the system fails,
the system acting as the backup system will not have any knowledge of the locks
maintained by the failed system. This may or may not cause problems when the
application restarts.
Remote file locking is the worse of the two situations, since the system doing the locking
may be the system that fails. Then, the lock might never be released, and other parts of
the application will be unable to access that data. In an NFS environment, file locking
can cause long delays in case of NFS client system failure and might even delay the
failover itself.

Restoring Client Connections
How does a client reconnect to the server after a failure?
It is important to write client applications to specifically differentiate between the loss of
a connection to the server and other application-oriented errors that might be returned.
The application should take special action in case of connection loss.
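
As a minimal sketch of this distinction (not taken from the Serviceguard documentation), a Python client might map socket-level failures to one exception type and server-reported errors to another. The names ConnectionLost, ApplicationError, and send_request, and the convention that an error reply starts with b"ERR", are assumptions made for illustration only.

```python
class ConnectionLost(Exception):
    """Hypothetical: the link to the server is gone; the caller should reconnect."""


class ApplicationError(Exception):
    """Hypothetical: the server answered, but reported an application-level error."""


def send_request(sock, payload: bytes) -> bytes:
    """Send one request and classify any failure.

    `sock` is assumed to behave like a connected socket.socket.
    """
    try:
        sock.sendall(payload)
        reply = sock.recv(4096)
    except OSError as exc:
        # Resets, broken pipes, and timeouts are all OSError subclasses.
        raise ConnectionLost(str(exc)) from exc
    if not reply:
        # An empty read means the peer closed the connection.
        raise ConnectionLost("server closed the connection")
    if reply.startswith(b"ERR"):
        # Assumed protocol convention: error replies begin with b"ERR".
        raise ApplicationError(reply.decode(errors="replace"))
    return reply
```

With this split, the caller can handle ConnectionLost by entering its reconnection logic, while ApplicationError is reported to the user in the usual way.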
One question to consider is how a client knows after a failure when to reconnect to the
newly started server. Typically, the user must simply restart the session or log in
again. However, this method is not automated. For example, a well-tuned
hardware and application system may fail over in 5 minutes. But if users, after
experiencing no response during the failure, give up after 2 minutes and go for coffee
and don't come back for 28 minutes, the perceived downtime is actually 30 minutes,
not 5. Factors to consider are the number of reconnection attempts to make, the frequency
of reconnection attempts, and whether or not to notify the user of connection loss.
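
The three factors above can be sketched as parameters of a retry loop. This is an illustrative Python sketch under stated assumptions, not code from Serviceguard: the function name reconnect, its parameters, and the default values (30 attempts, 10 seconds apart, roughly five minutes of retrying) are all hypothetical, and `connect` stands for whatever routine the client uses to open a session.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def reconnect(connect: Callable[[], T],
              max_attempts: int = 30,
              delay_seconds: float = 10.0,
              notify: Optional[Callable[[int, Exception], None]] = None,
              sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `connect` until it succeeds or `max_attempts` is reached.

    max_attempts  : number of reconnection attempts to make
    delay_seconds : pause between attempts (the retry frequency)
    notify        : optional callback to tell the user about each failure
    sleep         : injectable for testing; defaults to time.sleep
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except OSError as exc:  # connection-level failure, so retry
            last_error = exc
            if notify is not None:
                notify(attempt, exc)
            if attempt < max_attempts:
                sleep(delay_seconds)
    raise ConnectionError(f"gave up after {max_attempts} attempts") from last_error
```

A caller might pass a `notify` callback that shows a "reconnecting" message, so that users know the application is still trying rather than walking away.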
There are a number of strategies to use for client reconnection:
Design clients which continue to try to reconnect to their failed server.
Put the work into the client application rather than relying on the user to reconnect.
If the server is back up and running in 5 minutes, and the client is continually retrying,
then after 5 minutes, the client application will reestablish the link with the server
and either restart or continue the transaction. No intervention from the user is
required.
Design clients to reconnect to a different server.
If you have a server design which includes multiple active servers, the client could
connect to the second server, and the user would only experience a brief delay.
The problem with this design is knowing when the client should switch to the second
server. How long does a client retry to the first server before giving up and going
to the second server? There are no definitive answers for this. The answer depends
on the design of the server application. If the application can be restarted on the
same node after a failure (see “Handling Application Failures” following), the retry
to the current server should continue for the amount of time it takes to restart the