Introduction to NonStop Operations Management 125507

6 Problem Management
Overview
No matter how well managed your system is, errors and problems can occur. Because a
problem can mean the loss of availability, your staff needs to know how to report and
resolve the problem. If your staff cannot resolve the problem, it must know how to
escalate the problem so that recovery can still occur.
This section describes “problem management” and provides suggestions, guidelines, and
tools for administering problems in an operations environment. The section ends with a
checklist that summarizes the main points of problem management.
What Is Problem Management?
Problem management involves managing and administering the problem environment. It
includes the capabilities to monitor, detect, analyze, escalate, work around, and resolve
problems in an online environment.
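The activities named in this definition can be pictured as stages that a problem record passes through. The following is a minimal Python sketch, not a Tandem tool: the `Stage` and `ProblemRecord` names, the fields, and the sample problem are all hypothetical and exist only to illustrate the idea.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    # Activities named in the definition above, in the order listed.
    MONITOR = 1
    DETECT = 2
    ANALYZE = 3
    ESCALATE = 4
    WORK_AROUND = 5
    RESOLVE = 6


@dataclass
class ProblemRecord:
    """Hypothetical problem record; the field names are illustrative only."""
    summary: str
    stage: Stage = Stage.DETECT
    history: list = field(default_factory=list)

    def advance(self, new_stage: Stage, note: str) -> None:
        # Keep every transition so the handling of the problem can be reviewed later.
        self.history.append((self.stage, new_stage, note))
        self.stage = new_stage

    @property
    def is_open(self) -> bool:
        # A problem stays open until it reaches the RESOLVE stage.
        return self.stage is not Stage.RESOLVE


# Hypothetical example: an operator detects a problem, cannot fix it, and escalates.
record = ProblemRecord("Disk volume not responding")
record.advance(Stage.ANALYZE, "Operator checked the event log")
record.advance(Stage.ESCALATE, "Operator could not resolve; passed to support")
record.advance(Stage.RESOLVE, "Support replaced the faulty component")
```

Keeping the transition history with the record is one possible design choice; it gives staff a trail to review when the same problem recurs.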
The Goals of Problem Management
The goal of problem management is to reduce or eliminate problems. This can be done
by:
• Predicting and then preventing problems before they occur
• Quickly recovering from problems that do occur by using a systematic approach to
  resolving problems
Predicting, preventing, and recovering from problems are described later in this section.
Common Problems in an Operations Environment
Many problems result in unplanned outages. An unplanned outage is a period during which
the application or system is unavailable to the end user because of a problem
situation such as faulty hardware, operator error, or a disaster.
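Because an unplanned outage is defined as a period of time, it can be totaled and compared with the reporting period. The following Python sketch shows one way to do that. The function names, the 30-day period, and the sample outage times are assumptions made for illustration, and the sketch assumes that the recorded outages do not overlap.

```python
from datetime import datetime, timedelta


def unplanned_outage_time(outages):
    """Sum the duration of each (start, end) outage; assumes no overlaps."""
    total = timedelta()
    for start, end in outages:
        if end < start:
            raise ValueError("outage ends before it starts")
        total += end - start
    return total


def availability(period, outage_total):
    """Fraction of the period during which the system was available."""
    if period <= timedelta():
        raise ValueError("period must be positive")
    return 1 - outage_total / period


# Hypothetical data: two unplanned outages in a 30-day reporting period.
outages = [
    (datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 10, 30)),   # 90 minutes
    (datetime(2024, 1, 20, 14, 0), datetime(2024, 1, 20, 14, 30)),  # 30 minutes
]
total = unplanned_outage_time(outages)
ratio = availability(timedelta(days=30), total)
print(f"Unplanned outage time: {total}, availability: {ratio:.4%}")
```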
Tandem categorizes the causes of unplanned outages into four unplanned outage classes.
Table 6-1 defines the four outage classes.
Note. Participating in application design reviews can help your staff eliminate potential
problem areas and ensure that errors and recovery procedures are documented and
understandable. Building quality into an application reduces the chance of problems once the
application is installed. Section 11, “Application Management,” provides suggestions for
participating in design reviews.
The Availability Guide for Problem Management defines problem management in detail,
providing information on how to predict, prevent, and recover from problems; which problem
management tools to use; and how the tools fit together.