Availability Guide for Problem Management
Introduction to Problem Management
Availability Guide for Problem Management–125509
• The famous nine-hour breakdown of a long-distance telephone network in early 
1990 dramatized the vulnerability of complex computer systems everywhere. The 
breakdown ultimately cost the company some $60 to $75 million in lost revenues, 
averaging $130,000 per minute.
• After a bomb exploded in the New York World Trade Center in 1993, one of the 
banks in the building estimated lost revenues of $20 million per day, or $2,500 per 
minute.
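The per-minute figure quoted for the telephone network outage can be checked with simple arithmetic. The following Python sketch assumes that the loss accrued evenly over the full nine hours (540 minutes); the function and variable names are illustrative only.

```python
# Illustrative arithmetic only; the dollar figures come from the telephone
# network example in the text. Assumes the loss accrued evenly over the outage.
OUTAGE_MINUTES = 9 * 60  # nine-hour outage


def cost_per_minute(total_loss_dollars: float, outage_minutes: int) -> float:
    """Return the average revenue lost per minute of outage."""
    return total_loss_dollars / outage_minutes


# Per-minute cost at the low and high ends of the estimated total loss.
low = cost_per_minute(60_000_000, OUTAGE_MINUTES)
high = cost_per_minute(75_000_000, OUTAGE_MINUTES)

# Total loss implied by the quoted average of $130,000 per minute.
implied_total = 130_000 * OUTAGE_MINUTES

print(f"Per-minute range: ${low:,.0f} to ${high:,.0f}")
print(f"Total implied by $130,000 per minute: ${implied_total:,}")
```

Under this assumption, the quoted average falls inside the range implied by the $60 to $75 million estimate.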
What Is Problem Management?
Problem management is a disciplined approach to managing and administering the 
problem environment. Problem management includes monitoring, detecting, analyzing, 
escalating, working around, and resolving problems in an online environment. Problem 
management tasks include:
• Detecting, isolating, and analyzing problems
• Resolving problems and analyzing their causes
• Recovering from problems
• Establishing problem-prevention techniques
This manual describes these tasks and provides guidelines for implementing them. It 
also describes the Tandem tools available to help you manage problems and to increase 
the availability of your online environment.
What Are the Goals of Problem Management?
The first goal of problem management is to reduce or eliminate problems that may 
escalate into unplanned outages. This can be done by predicting and preventing 
problems before they occur and by ensuring fault tolerance.
The second goal of problem management is to quickly recover from problems that do 
result in unplanned outages.
Reducing or Eliminating Problems
In most computer environments, the first goal of operations management is to minimize 
the number of unplanned outages per year. Tandem systems are designed to survive 
single component failures, but not all double component failures. Thus, one of the most 
important tasks in a Tandem environment is to react promptly to inconsequential single 
failures before they become catastrophic double failures. 
Section 2, “Preventing Unplanned Outages,” describes ways to prevent common causes 
of unplanned outages.
Another way to prevent unplanned outages is to ensure your systems and applications 
are fault tolerant by using process pairs, independent multiple processors, and mirrored 
disk drives. Ensuring that your system is configured for fault tolerance reduces the 
likelihood that a single component failure will contribute to an unplanned outage.
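As a rough illustration of reacting to single failures before they become double failures, the following Python sketch classifies redundant component pairs by how many halves are still running. The RedundantPair class, the status names, and the component names are hypothetical and are not part of any Tandem tool; a real environment would obtain component status from its own monitoring facilities.

```python
from dataclasses import dataclass


@dataclass
class RedundantPair:
    """Hypothetical model of a mirrored disk, process pair, or similar."""
    name: str
    primary_up: bool
    backup_up: bool


def classify(pair: RedundantPair) -> str:
    """Classify a pair as 'ok', 'exposed' (single failure), or 'outage'."""
    if pair.primary_up and pair.backup_up:
        return "ok"
    if pair.primary_up or pair.backup_up:
        # Still running, but one more failure would cause an outage.
        return "exposed"
    return "outage"


def needs_attention(pairs):
    """Return (pair, status) tuples for every pair that is not 'ok',
    with outages listed before exposed pairs."""
    order = {"outage": 0, "exposed": 1}
    flagged = [(p, classify(p)) for p in pairs if classify(p) != "ok"]
    return sorted(flagged, key=lambda item: order[item[1]])


# Example with made-up component names and states.
pairs = [
    RedundantPair("mirrored-disk-1", primary_up=True, backup_up=True),
    RedundantPair("mirrored-disk-2", primary_up=True, backup_up=False),
    RedundantPair("process-pair-A", primary_up=False, backup_up=False),
]

for pair, status in needs_attention(pairs):
    print(f"{pair.name}: {status}")
```

The "exposed" state is the one to watch: the service is still running, so the failure looks inconsequential, but the pair no longer tolerates a second failure.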