Availability Guide for Problem Management
Preventing Unplanned Outages
Availability Guide for Problem Management–125509
Preventing Problems From Becoming Outages
In most computer environments, the first goal of problem management is to reduce or
eliminate problems that can escalate into unplanned outages. Tandem systems are
designed to survive any single component failure, but not all double component failures.
Thus, if a single failure (for example, that of a power supply or processor) is not
detected and repaired in a timely manner, the system and applications are vulnerable to a
second failure of a related type.
One of the most important problem prevention tasks in a Tandem environment is to react
promptly to single failures, which are inconsequential on their own, before a second
failure turns them into catastrophic double failures. Problem prevention is thus a way to
accomplish the first goal of problem management: reducing or eliminating unplanned outages.
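The single-failure reasoning above can be sketched in a few lines of Python. This is only
an illustration under stated assumptions, not a Tandem tool: the RedundantPair class, the
status names, and the component names are all hypothetical. The sketch classifies each
redundant pair as healthy, exposed (one member has failed, so a second related failure
would cause an outage), or down.

```python
from dataclasses import dataclass


@dataclass
class RedundantPair:
    """A hypothetical pair of components that back each other up."""
    name: str
    primary_up: bool
    backup_up: bool

    def status(self) -> str:
        # Count how many members of the pair are still running.
        up = int(self.primary_up) + int(self.backup_up)
        if up == 2:
            return "healthy"
        if up == 1:
            return "exposed"  # survives now, but not a second related failure
        return "down"


def pairs_needing_repair(pairs):
    """Return the names of pairs running on a single surviving member."""
    return [p.name for p in pairs if p.status() == "exposed"]


# Illustrative data only; these component names are invented for the sketch.
pairs = [
    RedundantPair("power-supply", primary_up=True, backup_up=False),
    RedundantPair("processor", primary_up=True, backup_up=True),
]
print(pairs_needing_repair(pairs))
```

A pair reported as exposed is the case described above: the system is still running, but
it should be repaired promptly because it can no longer tolerate a related failure.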
Why Is Problem Prevention Important?
Problem prevention is important because of the possibility of substantial delays
whenever a system experiences a problem. Once a problem has occurred, it takes time to
recognize and log the problem, and to get someone to work on it. It also takes time to
collect the necessary data, analyze the problem, verify the cause, and fix the problem.
Additional time might be used to test and evaluate the fix and to put the system back into
operation. Because solving problems means a loss of availability, the best kind of
problem management is problem prevention.
Goals and Strategies
The major goals of problem prevention are:
• Predicting potential problems before they occur
• Preventing potential problems from becoming unplanned outages
• Preparing for any problems that might occur, so as to reduce their impact
Using Tandem’s problem management tools, you can implement the following strategies
to predict, prevent, and prepare for many unplanned outages in your system
environment.
Predicting Potential Problems
Two important strategies for predicting potential problems before they occur are:
• Managing system and application messages to ensure that:
    • Operators are quickly notified of error conditions, state changes, and threshold
      limits that have been exceeded, before they escalate into unplanned outages.
    • Messages are logged and provide a chronological list of events to aid in problem
      diagnosis and resolution.
    • There is a single source of information for both system and application events.
  Section 4, “Monitoring Event Messages,” provides more information on this topic.
• Monitoring critical objects in your system environment.
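As an illustration of the first strategy, managing messages, the following Python sketch
shows one way the three message goals could fit together: a single log that accepts both
system and application events, keeps them in chronological order, and returns an alert
when a reported value exceeds its threshold. It is a minimal sketch, not a description of
any Tandem product. The Event and EventLog classes, their field names, and the sample
events are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Event:
    """A hypothetical event message from the system or an application."""
    timestamp: datetime
    source: str                   # "system" or "application"
    text: str
    value: float = 0.0            # measured value, if the event carries one
    limit: float = float("inf")   # threshold; infinite means "no threshold"

    def exceeds_threshold(self) -> bool:
        return self.value > self.limit


class EventLog:
    """Single source of information for system and application events."""

    def __init__(self):
        self.events = []

    def record(self, event):
        """Log the event; return an alert string if its threshold is exceeded."""
        self.events.append(event)
        # Keep the log chronological even when events arrive out of order.
        self.events.sort(key=lambda e: e.timestamp)
        if event.exceeds_threshold():
            return f"ALERT [{event.source}] {event.text}"
        return None


# Illustrative events only; the names, values, and limits are invented.
log = EventLog()
first = log.record(
    Event(datetime(2000, 1, 1, 10, 5), "application", "queue depth", 120, 100))
second = log.record(
    Event(datetime(2000, 1, 1, 10, 0), "system", "disk usage percent", 70, 90))
```

Here the first call returns an alert because 120 exceeds the limit of 100, the second
returns None, and log.events lists the system event first because it carries the earlier
timestamp.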