Availability Guide for Application Design

Instrumenting an Application for Availability
Availability Guide for Application Design 525637-004
How Does Instrumentation Improve Availability?
The key to reducing application downtime through instrumentation is understanding
what happens when a problem occurs and what must happen before the problem
is fixed and the application is back online. Figure 8-2 shows the steps involved.
As shown in the figure, if a problem occurs and takes the application offline, the
following tasks must finish before the application comes back online:
1. Detect the failure.
Some mechanism must be in place to detect and report the failure to a human or
automated operator.
2. Analyze the failure.
The failure must be analyzed to determine exactly what went wrong and the
circumstances under which the failure occurred.
3. Resolve the failure.
Find out what needs to be done to fix the problem and ensure that it does not
happen again.
4. Recover from the failure.
Implement the solution determined in Step 3 to recover from the failure.
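The four steps above can be read as consecutive intervals whose durations add up to the total outage. The following minimal sketch illustrates that arithmetic. The class name Outage, its field names, and all of the durations are hypothetical values chosen for illustration; they are not measurements from this guide.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    """Durations, in minutes, of the four phases of one outage (hypothetical)."""
    detection: float
    analysis: float
    resolution: float
    recovery: float

    def total(self) -> float:
        # The application stays offline until all four phases have finished.
        return self.detection + self.analysis + self.resolution + self.recovery

    def longest_phase(self) -> str:
        # Name of the phase that contributes most to the outage.
        phases = {
            "detection": self.detection,
            "analysis": self.analysis,
            "resolution": self.resolution,
            "recovery": self.recovery,
        }
        return max(phases, key=phases.get)


# Assumed figures: instrumentation shortens detection and analysis only.
manual = Outage(detection=20, analysis=45, resolution=30, recovery=15)
instrumented = Outage(detection=1, analysis=10, resolution=30, recovery=15)

print(manual.total(), manual.longest_phase())
print(instrumented.total(), instrumented.longest_phase())
```

Because the phases run one after another, shortening any single phase shortens the outage by the same amount, so the longest phase is usually the first candidate for automation.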
Any of these phases can account for a significant part of the outage time for
the application. You therefore need to do one of the following:
• Avoid the occurrence of the problem.
• If the problem cannot be avoided, use automated procedures wherever possible to
  detect, analyze, and correct the problem in the shortest possible time.
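As one way to picture an automated procedure of this kind, the sketch below runs a single supervision pass: it detects a failure with a health check, records what it saw, attempts recovery a bounded number of times, and escalates to an operator if recovery does not succeed. The names supervise, FakeService, and max_attempts are hypothetical and exist only for this illustration; the sketch does not use any subsystem or API described in this guide.

```python
from typing import Callable, List


def supervise(check: Callable[[], bool],
              restart: Callable[[], None],
              log: List[str],
              max_attempts: int = 3) -> bool:
    """Run one supervision pass; return True if the service ends up healthy."""
    if check():
        return True  # Nothing to detect.
    # Detection: record the failure so it can be analyzed later.
    log.append("detected: health check failed")
    for attempt in range(1, max_attempts + 1):
        # Recovery: apply the predefined corrective action, then verify it.
        restart()
        log.append(f"recovery attempt {attempt}")
        if check():
            log.append("recovered")
            return True
    # Automated recovery did not work; hand the problem to a human operator.
    log.append("escalate: automated recovery failed, notify operator")
    return False


class FakeService:
    """Stand-in for a real application process (hypothetical)."""

    def __init__(self) -> None:
        self.up = False  # Start in the failed state.

    def healthy(self) -> bool:
        return self.up

    def restart(self) -> None:
        self.up = True


events: List[str] = []
service = FakeService()
ok = supervise(service.healthy, service.restart, events)
print(ok, events)
```

Bounding the number of attempts keeps a failed recovery from looping indefinitely, and the recorded events give a human operator a starting point for analysis if escalation is needed.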
Figure 8-2. Failure Detection, Analysis, Resolution, and Recovery
[Figure: timeline of an outage. The application is up until a failure occurs, is down through the Failure Detection, Failure Analysis, Failure Resolution, and Failure Recovery phases, and is up again once the failure is solved.]