Availability Guide for Problem Management

Preventing Unplanned Outages

Availability Guide for Problem Management–125509

Preventing Problems From Becoming Outages

In most computer environments, the first goal of problem management is to reduce or
eliminate problems that can escalate into unplanned outages. Tandem systems are
designed to survive any single component failure, but not all double component failures.
Thus, if a single failure (for example, that of a power supply or processor) is not
detected and repaired in a timely manner, the system and applications are vulnerable to a
second failure of a related type.

One of the most important problem prevention tasks in a Tandem environment is to react
promptly to inconsequential single failures before they become catastrophic double
failures. Problem prevention is a way to accomplish this first goal of problem
management: reducing or eliminating unplanned outages.
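
The single-failure versus double-failure reasoning above can be illustrated with a
minimal Python sketch. It is not a Tandem tool or API: the Component class, the group
names, and the status labels are hypothetical, chosen only to show how a redundancy
group that has survived one failure can be flagged for prompt repair before a second
failure turns it into an outage.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str   # hypothetical label, e.g. "PSU-0"
    group: str  # hypothetical redundancy group, e.g. "power"
    up: bool    # True while the component is operational


def classify_groups(components):
    """Return {group: status} where status is 'ok', 'exposed', or 'outage'."""
    members = defaultdict(list)
    for c in components:
        members[c.group].append(c.up)

    status = {}
    for group, states in members.items():
        if all(states):
            status[group] = "ok"        # every member works: full redundancy
        elif any(states):
            status[group] = "exposed"   # survived a failure: redundancy reduced
        else:
            status[group] = "outage"    # no working member is left
    return status


def needs_prompt_repair(components):
    """List the groups that are still running but have lost redundancy."""
    statuses = classify_groups(components)
    return sorted(g for g, s in statuses.items() if s == "exposed")
```

Under these assumptions, a power group with one failed and one working supply is
reported as "exposed": the system is still running, but it is the condition that
calls for a prompt repair.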

Why Is Problem Prevention Important?

Problem prevention is important because of the possibility of substantial delays
whenever a system experiences a problem. Once a problem has occurred, it takes time to
recognize and log the problem, and to get someone to work on it. It also takes time to
collect the necessary data, analyze the problem, verify the cause, and fix the problem.
Additional time might be used to test and evaluate the fix and to put the system back into
operation. Because solving problems means a loss of availability, the best kind of
problem management is problem prevention.

Goals and Strategies

The major goals of problem prevention are:

• Predicting potential problems before they occur
• Preventing potential problems from becoming unplanned outages
• Preparing for any problems that might occur, so as to reduce their impact

Using Tandem’s problem management tools, you can implement the following strategies
to predict, prevent, and prepare for many unplanned outages in your system
environment.

Predicting Potential Problems

Two important strategies for predicting potential problems before they occur are:

• Managing system and application messages to ensure that:
    • Operators are quickly notified of error conditions, state changes, and threshold
      limits that have been exceeded, before they escalate into unplanned outages.
    • Messages are logged and provide a chronological list of events to aid in problem
      diagnosis and resolution.
    • There is a single source of information for both system and application events.
  Section 4, “Monitoring Event Messages,” provides more information on this topic.
• Monitoring critical objects in your system environment.
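
As a rough illustration of the message-management goals listed above, the following
Python sketch merges system and application events into one chronological log and
picks out the events whose value exceeds its threshold. It does not use or model any
Tandem event-management interface: the Event fields, the function names, and the
sample subjects are all hypothetical.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: float  # hypothetical: seconds since an arbitrary start
    source: str       # hypothetical: "system" or "application"
    subject: str      # hypothetical name of the monitored object
    value: float      # measured value reported by the event
    threshold: float  # limit above which operators should be notified


def merge_events(*streams):
    """Merge time-ordered event streams into one chronological list.

    Each input stream must already be sorted by timestamp.
    """
    return list(heapq.merge(*streams, key=lambda e: e.timestamp))


def exceeded(events):
    """Return the events whose value is above its threshold, in log order."""
    return [e for e in events if e.value > e.threshold]
```

For example, merging a system stream and an application stream yields a single
time-ordered log, and passing that log to exceeded() gives the events an operator
would need to see first.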