Availability Guide for Problem Management

Introduction to Problem Management

Availability Guide for Problem Management–125509

• The famous nine-hour breakdown of a long-distance telephone network in early

1990 dramatized the vulnerability of complex computer systems everywhere. The

breakdown ultimately cost the company some $60 to $75 million in lost revenues,

averaging $130,000 per minute.

• After a bomb exploded in the New York World Trade Center in 1993, one of the

banks in the building estimated lost revenues of $3.6 million per day, or $2,500 per

minute.

What Is Problem Management?

Problem management is a disciplined approach to managing and administering the

problem environment. Problem management includes monitoring, detecting, analyzing,

escalating, working around, and resolving problems in an online environment. Problem

management tasks include

• Detecting, isolating, and analyzing problems

• Resolving problems and analyzing their causes

• Recovering from problems

• Establishing problem-prevention techniques

This manual describes these tasks and provides guidelines for implementing them. It

also describes the Tandem tools available to help you manage problems and to increase

the availability of your online environment.
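
As a rough illustration of how the tasks listed above might be tracked for a single problem, the following Python sketch models a problem record that moves through a fixed series of stages. The class names, stage names, and stage ordering are assumptions made for this example; they are not part of any Tandem tool.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Stage(IntEnum):
    """Hypothetical stage names; the ordering is an assumption."""
    DETECTED = 1
    ISOLATED = 2
    ANALYZED = 3
    RECOVERED = 4
    RESOLVED = 5


@dataclass
class ProblemRecord:
    """One tracked problem and the stages it has passed through."""
    summary: str
    stage: Stage = Stage.DETECTED
    history: list = field(init=False)

    def __post_init__(self) -> None:
        # The history starts with whatever stage the record was opened in.
        self.history = [self.stage]

    def advance(self, new_stage: Stage) -> None:
        # Only forward movement is allowed; stages may be skipped.
        if new_stage <= self.stage:
            raise ValueError(
                f"cannot move from {self.stage.name} to {new_stage.name}"
            )
        self.stage = new_stage
        self.history.append(new_stage)

    @property
    def is_open(self) -> bool:
        # A problem stays open until it reaches the final stage.
        return self.stage is not Stage.RESOLVED
```

A record created with ProblemRecord("disk path down") starts in the DETECTED stage and remains open until advance(Stage.RESOLVED) is called.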

What Are the Goals of Problem Management?

The first goal of problem management is to reduce or eliminate problems that may

escalate into unplanned outages. This can be done by predicting and preventing

problems before they occur and by ensuring fault tolerance.

The second goal of problem management is to quickly recover from problems that do

result in unplanned outages.

Reducing or Eliminating Problems

In most computer environments, the first goal of operations management is to minimize

the number of unplanned outages per year. Tandem systems are designed to survive

single component failures, but not all double component failures. Thus, one of the most

important tasks in a Tandem environment is to react promptly to inconsequential single

failures before they become catastrophic double failures.
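
The idea of catching a single failure before it becomes a double failure can be sketched in Python as follows. This is a minimal illustration only: the RedundantPair class, the component names, and the status flags are hypothetical and do not represent the output of any Tandem subsystem.

```python
from dataclasses import dataclass
from enum import Enum


class Risk(Enum):
    HEALTHY = "healthy"   # both members of the pair are up
    EXPOSED = "exposed"   # one member is down; service continues without a spare
    OUTAGE = "outage"     # both members are down; a double failure


@dataclass(frozen=True)
class RedundantPair:
    """A hypothetical redundant component pair, such as a mirrored disk volume."""
    name: str
    primary_up: bool
    backup_up: bool

    def risk(self) -> Risk:
        members_up = int(self.primary_up) + int(self.backup_up)
        if members_up == 2:
            return Risk.HEALTHY
        if members_up == 1:
            return Risk.EXPOSED
        return Risk.OUTAGE


def pairs_needing_attention(pairs):
    """Return the names of pairs that are one failure away from an outage."""
    return [p.name for p in pairs if p.risk() is Risk.EXPOSED]
```

Pairs reported by pairs_needing_attention are still running, so the failure is inconsequential for the moment, but a second failure in the same pair would cause an outage. These are the pairs to repair first.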

Section 2, “Preventing Unplanned Outages,” describes ways to prevent common causes

of unplanned outages.

Another way to prevent unplanned outages is to ensure your systems and applications

are fault tolerant by using process pairs, independent multiple processors, and mirrored

disk drives. Ensuring that your system is configured for fault tolerance reduces the

likelihood that a single component failure will contribute to an unplanned outage.
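
To illustrate the process-pair idea in general terms, the following Python sketch models a primary process that checkpoints its state to a backup, which takes over if the primary fails. It is a toy model under stated assumptions: the ProcessPair class and its methods are invented for this example and are not a Tandem programming interface.

```python
import copy


class ProcessPair:
    """Toy model of a primary/backup process pair (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self._active = {}       # working state of the process currently serving
        self._checkpoint = {}   # last state copied to the backup
        self.takeovers = 0

    def apply(self, key: str, value: int) -> None:
        # The serving process changes its own working state.
        self._active[key] = value

    def checkpoint(self) -> None:
        # Copy the working state to the backup so that it can resume from here.
        self._checkpoint = copy.deepcopy(self._active)

    def fail_primary(self) -> None:
        # The primary is lost; the backup resumes from the last checkpoint.
        # Work applied after that checkpoint is not recovered.
        self._active = copy.deepcopy(self._checkpoint)
        self.takeovers += 1

    def read(self, key: str) -> int:
        return self._active[key]
```

In this model, state that was checkpointed before fail_primary is called survives the takeover, and later uncheckpointed changes are lost. The sketch shows why a single process failure need not interrupt service when a backup holds a recent copy of the state.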