Managing HP Serviceguard for Linux Ninth Edition, April 2009

Avoid File Locking
In an NFS environment, applications should avoid file-locking mechanisms when the
file to be locked resides on an NFS server. More generally, applications should avoid
file locking on both local and remote systems. If local file locking is employed and the
system fails, the system acting as the backup system will not have any knowledge of
the locks maintained by the failed system. This may or may not cause problems when
the application restarts.
Remote file locking is the worse of the two situations, since the system doing the locking
may be the system that fails. Then, the lock might never be released, and other parts
of the application will be unable to access that data. In an NFS environment, file locking
can cause long delays in case of NFS client system failure and might even delay the
failover itself.
Restoring Client Connections
How does a client reconnect to the server after a failure?
It is important to write client applications to specifically differentiate between the loss
of a connection to the server and other application-oriented errors that might be
returned. The application should take special action in case of connection loss.
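As an illustration of the distinction described above, the following minimal Python sketch separates connection loss from errors reported by the application. The exception classes, the send_request function, and the convention that an application error reply begins with "ERR" are all hypothetical and are not part of Serviceguard or of any particular server protocol.

```python
class ConnectionLost(Exception):
    """Hypothetical: the link to the server was lost."""


class ApplicationError(Exception):
    """Hypothetical: the server answered, but reported an application error."""


def send_request(sock, payload: bytes) -> bytes:
    """Send one request and return the reply, classifying any failure."""
    try:
        sock.sendall(payload)
        reply = sock.recv(4096)
    except OSError as exc:
        # Resets, broken pipes, and timeouts are all subclasses of OSError.
        raise ConnectionLost(str(exc)) from exc
    if not reply:
        # An empty read means the peer closed the connection.
        raise ConnectionLost("server closed the connection")
    if reply.startswith(b"ERR"):
        # Assumed wire convention for application-level errors.
        raise ApplicationError(reply.decode(errors="replace"))
    return reply
```

A caller can then catch ConnectionLost to start its reconnection logic, and handle ApplicationError as an ordinary error that does not require a new connection.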
One question to consider is how a client knows after a failure when to reconnect to the
newly started server. Typically, the user must simply restart the session or log in
again. However, this approach is not automated. For example, a
well-tuned hardware and application system may fail over in 5 minutes. But if users,
after experiencing no response during the failure, give up after 2 minutes and go for
coffee and don't come back for 28 minutes, the perceived downtime is actually 30
minutes, not 5. Factors to consider are the number of reconnection attempts to make,
the frequency of reconnection attempts, and whether or not to notify the user of
connection loss.
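The three factors just listed (number of attempts, frequency of attempts, and user notification) can be sketched as parameters of a retry loop. This Python example is only an illustration: the reconnect function and its parameter names are hypothetical, and the defaults (30 attempts, 10 seconds apart) are assumptions chosen to span roughly the 5-minute failover used in the example above.

```python
import socket
import time


def reconnect(host, port, max_attempts=30, interval_s=10.0,
              notify=print, connect=socket.create_connection,
              sleep=time.sleep):
    """Try to re-establish a connection; return it, or None on giving up."""
    # Tell the user what is happening rather than appearing to hang.
    notify(f"Connection to {host}:{port} lost; retrying")
    for attempt in range(1, max_attempts + 1):
        try:
            conn = connect((host, port), timeout=5.0)
        except OSError:
            # Server not reachable yet; wait before the next attempt.
            if attempt < max_attempts:
                sleep(interval_s)
            continue
        notify(f"Reconnected after {attempt} attempt(s)")
        return conn
    notify(f"Giving up after {max_attempts} attempts")
    return None
```

The connect and sleep parameters exist only so that the loop can be exercised without a real network. In normal use the caller would pass just the host and port, and then either restart or continue the interrupted transaction once a connection is returned.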
There are a number of strategies to use for client reconnection:
Design clients which continue to try to reconnect to their failed server.
Put the work into the client application rather than relying on the user to reconnect.
If the server is back up and running in 5 minutes, and the client is continually
retrying, then after 5 minutes, the client application will reestablish the link with
the server and either restart or continue the transaction. No intervention from the
user is required.
Design clients to reconnect to a different server.
If you have a server design which includes multiple active servers, the client could
connect to the second server, and the user would only experience a brief delay.
The problem with this design is knowing when the client should switch to the
second server. How long does a client retry to the first server before giving up and
going to the second server? There are no definitive answers for this. The answer
depends on the design of the server application. If the application can be restarted
on the same node after a failure (see “Handling Application Failures” following),
the retry to the current server should continue for the amount of time it takes to