Availability Guide for Problem Management
Introduction to Problem Management
Availability Guide for Problem Management–125509
Section 7, “Auditing Systems for Fault Tolerance,” describes the fault-tolerant features of
the Tandem architecture that allow Tandem systems to tolerate single points of failure in
hardware and software.
Recovering Quickly From Problems That Do Occur
Despite the best planning and prevention, unplanned outages can still occur. When an
outage does occur, you need to follow a systematic approach to resolving the problem
and recovering quickly. Systematic problem solving consists of the following steps:
1. Detecting and isolating the problem
2. Gathering the facts and reporting the problem
3. Identifying the cause, and developing and implementing a solution
4. Escalating the problem, if necessary
5. Reviewing the problem (focusing on prevention)
Section 3, “Recovering From Unplanned Outages,” describes these steps in detail and
describes how to get your system or application back online after an unplanned outage.
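The five steps above can be sketched as a simple problem record that enforces their order. This is only an illustrative sketch: the names Step, ProblemRecord, complete, and is_closed are hypothetical and do not correspond to any Tandem tool or interface, and the rule that only escalation may be skipped is an assumption drawn from the phrase "if necessary" in step 4.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Step(IntEnum):
    """The five problem-solving steps, in order (names are illustrative)."""
    DETECT_AND_ISOLATE = 1
    GATHER_AND_REPORT = 2
    IDENTIFY_AND_SOLVE = 3
    ESCALATE = 4  # assumed optional: performed only if necessary
    REVIEW = 5


@dataclass
class ProblemRecord:
    """Hypothetical record of one problem; not a Tandem data structure."""
    summary: str
    completed: list = field(default_factory=list)  # (Step, note) pairs

    def complete(self, step: Step, note: str) -> None:
        """Record a step, refusing to skip a mandatory earlier step."""
        done = {s for s, _ in self.completed}
        for earlier in Step:  # iterates in definition order
            if earlier >= step:
                break
            if earlier is Step.ESCALATE:
                continue  # escalation may be skipped
            if earlier not in done:
                raise ValueError(
                    f"{earlier.name} must be completed before {step.name}"
                )
        self.completed.append((step, note))

    def is_closed(self) -> bool:
        """Treat the problem as closed once the review has been recorded."""
        return any(s is Step.REVIEW for s, _ in self.completed)
```

In this sketch, calling complete(Step.REVIEW, ...) before the first three steps are recorded raises ValueError, which mirrors the intent that the review follows detection, reporting, and resolution rather than replacing them.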
Focusing on Problems That Can Cause Unplanned Outages
This manual focuses on problems that can cause unplanned outages. Section 2,
“Preventing Unplanned Outages,” describes ways to prevent unplanned outages. Other
sections describe strategies for preventing problems from becoming unplanned outages
that affect your system’s availability, as follows:
• Section 4, “Monitoring Event Messages,” describes ways to predict, prevent, and
  detect problems by effectively managing system and application messages.
• Section 5, “Monitoring Objects,” describes how to predict, prevent, and detect
  problems by monitoring objects critical to maintaining system and application
  availability.
• Section 6, “Automating Operations and Recovery Procedures,” describes how to
  predict, prevent, detect, and recover from problems by automating operations and
  recovery procedures.
• Section 8, “Planning for Disasters,” describes ways to help you prevent, prepare for,
  and recover from a disaster.
“About This Manual” lists sources for problem-solving information, such as other
manuals, Tandem’s professional services, and classes offered by the Tandem Education
Group.