Availability Guide for Problem Management

Introduction to Problem Management
Availability Guide for Problem Management 125509
The famous nine-hour breakdown of a long-distance telephone network in early
1990 dramatized the vulnerability of complex computer systems everywhere. The
breakdown ultimately cost the company some $60 to $75 million in lost revenues,
averaging $130,000 per minute.
After a bomb exploded in the New York World Trade Center in 1993, one of the
banks in the building estimated lost revenues of $20 million per day, or $2,500 per
minute.
What Is Problem Management?
Problem management is a disciplined approach to managing and administering the
problem environment. Problem management includes monitoring, detecting, analyzing,
escalating, working around, and resolving problems in an online environment. Problem
management tasks include:
• Detecting, isolating, and analyzing problems
• Resolving problems and analyzing their causes
• Recovering from problems
• Establishing problem-prevention techniques
This manual describes these tasks and provides guidelines for implementing them. It
also describes the Tandem tools available to help you manage problems and to increase
the availability of your online environment.
What Are the Goals of Problem Management?
The first goal of problem management is to reduce or eliminate problems that may
escalate into unplanned outages. This can be done by predicting and preventing
problems before they occur and by ensuring fault tolerance.
The second goal of problem management is to quickly recover from problems that do
result in unplanned outages.
Reducing or Eliminating Problems
In most computer environments, the first goal of operations management is to minimize
the number of unplanned outages per year. Tandem systems are designed to survive
single component failures, but not all double component failures. Thus, one of the most
important tasks in a Tandem environment is to react promptly to inconsequential single
failures before they become catastrophic double failures.
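The value of reacting promptly to a single failure can be illustrated with a rough
calculation. The sketch below is not a Tandem tool or a Tandem formula. It assumes,
purely for illustration, that component failures are independent and exponentially
distributed. The function name, the MTBF figure, and the repair times are all
hypothetical.

```python
import math


def double_failure_probability(mtbf_hours: float, exposure_hours: float) -> float:
    """Chance that the surviving component also fails before the first is repaired.

    Assumption (not from the manual): failures are independent and
    exponentially distributed with the given mean time between failures.
    """
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    if exposure_hours < 0:
        raise ValueError("exposure_hours must not be negative")
    return 1.0 - math.exp(-exposure_hours / mtbf_hours)


# Hypothetical MTBF for the surviving half of a redundant pair.
MTBF_HOURS = 50_000.0

# Compare a prompt repair with a single failure that is left unattended.
for label, exposure in [("repaired within 4 hours", 4.0),
                        ("left unrepaired for 30 days", 720.0)]:
    risk = double_failure_probability(MTBF_HOURS, exposure)
    print(f"{label}: double-failure risk {risk:.4%}")
```

Under these assumptions, the risk of a double failure grows with the time the first
failure is left unrepaired, which is why a single failure that appears inconsequential
still deserves prompt attention.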
Section 2, “Preventing Unplanned Outages,” describes ways to prevent common causes
of unplanned outages.
Another way to prevent unplanned outages is to ensure that your systems and applications
are fault tolerant by using process pairs, multiple independent processors, and mirrored
disk drives. Configuring your system for fault tolerance reduces the
likelihood that a single component failure will contribute to an unplanned outage.
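The idea behind a process pair can be sketched as a toy model: a primary process does
the work and copies each state change to a backup, which takes over if the primary
fails. The sketch below only illustrates the concept. It does not represent the Tandem
implementation or any Tandem programming interface, and every class and method name in
it is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Hypothetical stand-in for one member of a process pair."""
    name: str
    alive: bool = True
    state: dict = field(default_factory=dict)


class ProcessPair:
    """Toy model: the primary checkpoints every update to the backup."""

    def __init__(self) -> None:
        self.primary = Process("primary")
        self.backup = Process("backup")

    def _take_over(self) -> None:
        # The backup becomes the primary; with no live backup, nothing survives.
        if not self.backup.alive:
            raise RuntimeError("double failure: no surviving process")
        self.primary, self.backup = self.backup, self.primary

    def update(self, key: str, value: int) -> None:
        if not self.primary.alive:
            self._take_over()
        self.primary.state[key] = value
        if self.backup.alive:
            self.backup.state[key] = value  # checkpoint to the backup

    def read(self, key: str) -> int:
        if not self.primary.alive:
            self._take_over()
        return self.primary.state[key]


pair = ProcessPair()
pair.update("balance", 100)
pair.primary.alive = False               # simulate a single failure
print(pair.read("balance"))              # served by the former backup
print(pair.primary.name)                 # shows which process is now primary
```

In this model a single failure is survived because the backup already holds the
checkpointed state. A second failure before the failed process is restored raises an
error, mirroring the point that fault tolerance covers single component failures but
not all double failures.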