Managing HP Serviceguard for Linux, Sixth Edition, August 2006

Appendix B: Designing Highly Available Cluster Applications
Restoring Client Connections
How does a client reconnect to the server after a failure?
Write client applications so that they explicitly distinguish the loss
of a connection to the server from other application-oriented errors
that might be returned. The application should take special action
when the connection is lost.
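As a minimal sketch of this distinction, the following Python fragment separates connection loss from errors reported by the application. The names ApplicationError, CONNECTION_LOSS, classify_failure, and run_request are hypothetical and are not part of Serviceguard; the choice of exception types treated as connection loss is an assumption that a real client would adapt to its own protocol.

```python
import socket


class ApplicationError(Exception):
    """Hypothetical error reported by the server application itself."""


# Exception types treated here (by assumption) as loss of the connection.
# ConnectionError covers reset, refused, aborted, and broken-pipe conditions.
# TimeoutError and socket.timeout cover socket timeouts; EOFError stands in
# for a stream that the peer closed unexpectedly.
CONNECTION_LOSS = (ConnectionError, TimeoutError, socket.timeout, EOFError)


def classify_failure(exc):
    """Return 'connection-lost' or 'application-error' for an exception."""
    if isinstance(exc, CONNECTION_LOSS):
        return "connection-lost"
    return "application-error"


def run_request(send_request, on_connection_lost):
    """Run one request; hand connection loss to a dedicated handler.

    Any other exception, including ApplicationError, propagates to the
    caller so that it can be handled as an ordinary application error.
    """
    try:
        return send_request()
    except CONNECTION_LOSS as exc:
        return on_connection_lost(exc)
```

Keeping the connection-loss types in one tuple means the special action for connection loss is defined in one place, and ordinary application errors never trigger a reconnect by accident.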
One question to consider is how a client knows, after a failure, when to
reconnect to the newly started server. In the typical scenario, the
user must simply restart the session or log in again. However, this
method is not automated, so recovery time depends on how quickly the
user notices and reacts. For example, a well-tuned hardware and
application system may fail over in 5 minutes. But if users, after
experiencing no response during the failure, give up after 2 minutes,
go for coffee, and do not come back for 28 minutes, the perceived
downtime is actually 30 minutes, not 5. Factors to consider are the
number of reconnection attempts to make, the frequency of reconnection
attempts, and whether or not to notify the user of connection loss.
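The three factors just listed can be made explicit parameters of the client's reconnection logic. The sketch below is illustrative only: the function reconnect and its parameters max_attempts, interval_seconds, notify_user, and sleep are hypothetical names, and the default values (40 attempts, 15 seconds apart, just under ten minutes of retrying) are placeholders rather than recommendations.

```python
import time


def reconnect(connect, max_attempts=40, interval_seconds=15.0,
              notify_user=None, sleep=time.sleep):
    """Call connect() until it succeeds or max_attempts is reached.

    connect          -- callable that returns a connection or raises OSError
    max_attempts     -- number of reconnection attempts to make
    interval_seconds -- delay between attempts (their frequency)
    notify_user      -- optional callable that receives a status message
    sleep            -- injectable delay function, useful for testing

    Returns the connection object, or None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except OSError as exc:
            # ConnectionError and its subclasses derive from OSError.
            if notify_user is not None:
                notify_user(
                    f"Connection lost; attempt {attempt} of "
                    f"{max_attempts} failed: {exc}"
                )
            if attempt < max_attempts:
                sleep(interval_seconds)
    return None
```

A client might call it as reconnect(lambda: socket.create_connection((host, port), timeout=5)), where host and port are whatever address the application already uses. Passing notify_user=None keeps the retries silent; passing a callback lets the application tell the user that the connection was lost while it keeps retrying.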
There are a number of strategies to use for client reconnection:
• Design clients which continue to try to reconnect to their failed
server.
Put the work into the client application rather than relying on the
user to reconnect. If the server is back up and running in 5 minutes,
and the client is continually retrying, then after 5 minutes, the client
application will reestablish the link with the server and either
restart or continue the transaction. No intervention from the user is
required.
• Design clients to reconnect to a different server.
If you have a server design which includes multiple active servers,
the client could connect to the second server, and the user would only
experience a brief delay.
The problem with this design is knowing when the client should
switch to the second server. How long should a client keep retrying the
first server before giving up and going to the second server? There is
no definitive answer. The answer depends on the design of the
server application. If the application can be restarted on the same
node after a failure (see “Handling Application Failures” following),