Availability Guide for Application Design

Overview of Server and Network Fault Tolerance
Availability Guide for Application Design, 525637-004
Additional Availability Problems in Client/Server Networks
application, it is necessary that all components connecting the user to the server are
available.
Measuring Downtime of a Client/Server Application
Client/server designs further complicate the way downtime must be measured. A
transient system error in a workstation is clearly a problem for the user of that
workstation: the application is unavailable to that user, but other users are not affected.
A transient error in the server, however, is more serious because potentially thousands
of users could be depending on its services.
In a client/server application, it therefore makes sense to measure downtime as the
number of minutes the application is unavailable multiplied by the number of affected
users. If the transient error in the workstation makes the application unavailable to one
user for 5 minutes, then it counts as 5 user-minutes of downtime. If the problem on the
server makes the application unavailable for 15 minutes to 100 users, then it counts as
1500 user-minutes of downtime.
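
The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not part of the guide: the names Outage, user_minutes, and total_user_minutes are hypothetical, and the two sample outages are the ones described in the preceding paragraph.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outage:
    """One outage: how long it lasted and how many users it affected."""
    minutes: float
    affected_users: int

    def user_minutes(self) -> float:
        # Downtime weighted by the number of users who lost the application.
        return self.minutes * self.affected_users


def total_user_minutes(outages):
    """Sum the user-minutes of downtime over a collection of outages."""
    return sum(o.user_minutes() for o in outages)


# The two examples from the text (hypothetical variable names).
workstation_error = Outage(minutes=5, affected_users=1)
server_problem = Outage(minutes=15, affected_users=100)

print(workstation_error.user_minutes())  # 5
print(server_problem.user_minutes())     # 1500
print(total_user_minutes([workstation_error, server_problem]))  # 1505
```

Weighting by affected users is what makes the server outage count 300 times as heavily as the workstation error, even though it lasted only three times as long.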
Where the Problems Occur
Research has established that, when commodity servers are used, defects in the server
are responsible for about 60 percent of all user downtime. The network is responsible for
about 10 percent. The remaining 30 percent of user downtime is divided between the
client and environmental causes. That servers are the primary cause of end-user
outages is no surprise, because any server problem is magnified by the number of users
who depend on that service.
Propagating Failures
Propagating failures are a major problem in networks. Research has shown that
about one-third of all outage minutes of a client/server application result from errors
that started in one module and spread to other modules before they could be
isolated.
The following list shows some typical causes:
- A distributed database is corrupt or frozen.
- A broadcast storm occurs. The practice of broadcasting special messages to all
  network hosts has been overused, and such overuse has the potential to disable the
  network. Broadcast storms are usually caused by software errors.
- A babbling node transmits random, meaningless packets onto the network.
  Babbling nodes are often caused by a defective LAN card.
- A router algorithm conflict causes two nodes to send packets back and forth
  to each other. Each node calculates the shortest route to the packet’s
  destination as being through the other node.
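
The last cause in the list, a routing conflict between two nodes, can be illustrated with a small simulation. This is a sketch under stated assumptions rather than a description of any particular router: the forward function, the next-hop table layout, and the node names A, B, and C are all hypothetical, and the hop limit (ttl) is an assumption borrowed from IP-style networks, not something the guide specifies.

```python
def forward(packet_dest, start_node, next_hop, ttl):
    """Follow next-hop tables until delivery, a dead end, or hop-limit expiry.

    next_hop maps node -> {destination: neighbor to forward to}.
    Returns (outcome, path), where outcome is "delivered", "no route",
    or "ttl expired".
    """
    node = start_node
    path = [node]
    while node != packet_dest:
        if ttl == 0:
            # Assumed hop limit: without it the packet would bounce forever.
            return "ttl expired", path
        hop = next_hop.get(node, {}).get(packet_dest)
        if hop is None:
            return "no route", path
        node = hop
        path.append(node)
        ttl -= 1
    return "delivered", path


# Hypothetical tables: A and B each believe the shortest route to C
# runs through the other node.
conflicting = {"A": {"C": "B"}, "B": {"C": "A"}}
outcome, path = forward("C", "A", conflicting, ttl=6)
print(outcome, path)
# ttl expired ['A', 'B', 'A', 'B', 'A', 'B', 'A']

# Consistent tables for comparison: B forwards directly to C.
consistent = {"A": {"C": "B"}, "B": {"C": "C"}}
print(forward("C", "A", consistent, ttl=6))
# ('delivered', ['A', 'B', 'C'])
```

In the conflicting case the packet never reaches its destination and consumes capacity on the link between the two nodes on every hop until it is discarded.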