Introduction to NonStop Operations Management

Problem Management
Introduction to NonStop Operations Management 125507
Step 1—Detecting and Isolating the Problem
To detect problems quickly, operators need timely indications that a problem exists.
Some of the same techniques used to predict and prevent problems also help determine
whether a problem exists:
- Monitoring hardware and software.
- Monitoring system and application software message logs.
- Using Tandem Service Management (TSM) tools, including the TSM EMS Event
  Viewer. TSM uses expert systems technology to detect, analyze, diagnose, and
  archive hardware problems as they occur, often detecting failures before they
  affect system performance.
- Automating monitoring tasks and recovery procedures.
- Receiving reports from users indicating that a problem exists.
To ensure that problems are detected as quickly as possible, establish procedures for
monitoring the system and logs, and for receiving information from users. For guidelines
to help you develop monitoring procedures, refer to The Availability Guide for Problem
Management.
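As one illustration of automating a monitoring task, the sketch below scans message-log lines for alert keywords and collects the matching lines for an operator to review. It is a minimal example under stated assumptions: the log is plain text, the keyword list (CRITICAL, ERROR, DOWN) is hypothetical, and the sample lines are invented. It does not use or represent the EMS event format or any TSM interface.

```python
import re
from dataclasses import dataclass

# Hypothetical alert keywords; each site would choose its own list.
ALERT_PATTERN = re.compile(r"\b(CRITICAL|ERROR|DOWN)\b")


@dataclass
class Alert:
    line_number: int  # 1-based position of the line in the log
    keyword: str      # first alert keyword found on the line
    text: str         # the log line with surrounding whitespace removed


def scan_log(lines):
    """Return an Alert for every line that contains an alert keyword."""
    alerts = []
    for number, line in enumerate(lines, start=1):
        match = ALERT_PATTERN.search(line)
        if match:
            alerts.append(Alert(number, match.group(1), line.strip()))
    return alerts


# Invented sample lines, used only to exercise the function.
sample = [
    "09:00:01 INFO  backup completed",
    "09:02:17 ERROR disk path failure",
    "09:02:40 INFO  operator logged on",
    "09:05:03 CRITICAL processor 2 DOWN",
]

for alert in scan_log(sample):
    print(alert.line_number, alert.keyword, alert.text)
```

A scheduled job could run a scan like this against each new portion of a log and notify the operator on duty, so that detection does not depend on someone reading the log by hand.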
Step 2—Gathering the Facts and Reporting the Problem
After a problem is detected, it is usually reported. Consider establishing procedures for
reporting problems. Established procedures help you track:
- Each problem that occurs
- How the problem was resolved
- Who resolved the problem and when
- Recurring problems
- How long it took to resolve the problem
- Whether a problem can be prevented or recovery procedures for that problem can be
  automated
If all problems are logged, your staff can generate weekly or monthly summaries that
allow you to evaluate system and staff performance and focus on problem areas.
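The tracking items listed above map naturally onto a simple problem-log record. The following sketch shows one possible shape for such a record and a summary function of the kind a weekly or monthly report might use. All field names, the `summarize` function, and the sample entries are hypothetical and chosen for illustration; they are not taken from any Tandem tool.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ProblemRecord:
    description: str       # what went wrong
    reported_at: datetime  # when the problem was reported
    resolved_at: datetime  # when the problem was resolved
    resolved_by: str       # who resolved the problem
    resolution: str        # how the problem was resolved
    preventable: bool      # could be prevented, or recovery could be automated

    @property
    def hours_to_resolve(self):
        """Elapsed time from report to resolution, in hours."""
        return (self.resolved_at - self.reported_at).total_seconds() / 3600


def summarize(records):
    """Summarize a reporting period: totals, recurring problems, mean time."""
    counts = Counter(r.description for r in records)
    mean_hours = (
        sum(r.hours_to_resolve for r in records) / len(records) if records else 0.0
    )
    return {
        "total": len(records),
        "recurring": sorted(d for d, n in counts.items() if n > 1),
        "mean_hours_to_resolve": mean_hours,
        "automation_candidates": sorted(
            {r.description for r in records if r.preventable}
        ),
    }


# Invented sample entries for one reporting period.
log = [
    ProblemRecord("printer queue stalled", datetime(2024, 3, 4, 9, 0),
                  datetime(2024, 3, 4, 10, 0), "operator A", "restarted queue", True),
    ProblemRecord("tape mount timeout", datetime(2024, 3, 6, 14, 0),
                  datetime(2024, 3, 6, 17, 0), "operator B", "replaced tape", False),
    ProblemRecord("printer queue stalled", datetime(2024, 3, 11, 8, 0),
                  datetime(2024, 3, 11, 10, 0), "operator A", "restarted queue", True),
]

print(summarize(log))
```

Matching recurring problems by identical description text is a simplification; a real log would more likely use a problem category or code so that differently worded reports of the same fault are counted together.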