Availability Guide for Application Design

Overview of Server and Network Fault Tolerance

Availability Guide for Application Design—525637-004


Additional Availability Problems in Client/Server Networks

application, it is necessary that all components connecting the user to the server are available.

Measuring Downtime of a Client/Server Application

Client/server designs also complicate the way downtime must be measured. A transient system error in a workstation is clearly a problem for the user of that workstation: the application is unavailable to that user, but other users are not affected. A transient error in the server, however, is more serious because potentially thousands of users could be depending on its services.

In a client/server application, it therefore makes sense to measure downtime as the number of minutes the application is unavailable multiplied by the number of affected users. If the transient error in the workstation makes the application unavailable to one user for 5 minutes, then it counts as 5 user-minutes of downtime. If the problem on the server makes the application unavailable for 15 minutes to 100 users, then it counts as 1500 user-minutes of downtime.
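
The user-minutes measure described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic. The `Outage` class and the `total_user_minutes` function are hypothetical names introduced here and are not part of any product or of this guide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outage:
    """One outage (hypothetical record): its duration and the users it affected."""
    minutes: float
    affected_users: int

    def user_minutes(self) -> float:
        # Downtime = minutes unavailable multiplied by number of affected users.
        return self.minutes * self.affected_users


def total_user_minutes(outages):
    """Sum the user-minutes of downtime over a collection of outages."""
    return sum(o.user_minutes() for o in outages)


# The two examples from the text.
workstation = Outage(minutes=5, affected_users=1)    # 5 user-minutes
server = Outage(minutes=15, affected_users=100)      # 1500 user-minutes

print(workstation.user_minutes(), server.user_minutes())
print(total_user_minutes([workstation, server]))
```

Weighting each outage by the number of affected users makes a short server outage count for far more than a longer outage on a single workstation, which is the point the text makes.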

Where the Problems Occur

Research has established that, where commodity servers are used, defects in the server are responsible for about 60 percent of all user downtime. The network is responsible for about 10 percent. The remaining 30 percent of user downtime is divided between the client and environmental causes. That servers are the primary cause of end-user outages is no surprise, because any server problem is magnified by the number of users who depend on that service.

Propagating Failures

A major problem in networks is that of propagating failures. Research has shown that about one-third of all outage minutes of a client/server application result from errors that started in one module and spread to other modules before they could be isolated.

The following list shows some typical causes:

• A distributed database is corrupt or frozen.

• A broadcast storm occurs. The practice of broadcasting special messages to all network hosts has been overused. Such overuse has the potential to disable the network. Broadcast storms are usually caused by software errors.

• A babbling node transmits random, meaningless packets onto the network. Babbling nodes are often caused by a defective LAN card.

• A router algorithm conflict causes two nodes to send packets back and forth between each other. Each node calculates the shortest route to the packet’s destination as being through the other node.
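
The last cause in the list, a routing conflict between two nodes, can be illustrated with a small simulation. This is a minimal sketch under stated assumptions: the node names `A`, `B`, and `C`, the `next_hop` tables, and the `forward` function are all hypothetical. The hop limit (`ttl`) is also an assumption added here so that the loop terminates, and the text above does not describe it.

```python
def forward(next_hop, start, destination, ttl=8):
    """Follow next-hop tables from start until delivery or the hop limit.

    Returns (delivered, path), where path lists every node visited.
    """
    path = [start]
    node = start
    while node != destination and ttl > 0:
        # Each node forwards to whatever it believes is the best next hop.
        node = next_hop[node][destination]
        path.append(node)
        ttl -= 1
    return node == destination, path


# Hypothetical conflicting tables: A and B each calculate that the
# shortest route to C goes through the other node.
next_hop = {
    "A": {"C": "B"},
    "B": {"C": "A"},
}

delivered, path = forward(next_hop, "A", "C", ttl=4)
print(delivered, path)
```

With these tables the packet bounces between A and B and never reaches C. It stops only because the assumed hop limit runs out.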