Availability Guide for Problem Management
Recovering From Unplanned Outages
Availability Guide for Problem Management–125509
Step 1—Detecting and Isolating the Problem
To respond to problems quickly, operations personnel must be aware that a problem
exists. Active system monitoring can help reduce the time needed to detect and resolve
problems. Tandem provides a number of tools to help you perform the three basic tasks
of problem detection and isolation:
• Monitoring Messages
• Monitoring Objects
• Monitoring Performance
Monitoring Messages
Monitoring system and application event messages, which advise you about the health
and status of your system, is critical to achieving high availability in your online
environment. Section 4, “Monitoring Event Messages,” provides more detailed
information on this topic.
Monitoring hardware messages lets you detect single-component failures early, before
they can become multiple-component failures and cause a serious outage.
Tools for Monitoring Messages
Monitor system and application event messages using:
• Tandem Service Management package (TSM) EMS Event Viewer
• CA-Unicenter for Tandem Event Management function
• Open Notification Service (ONS)
Monitor hardware messages using:
• Tandem Service Management package (TSM)
TSM EMS Event Viewer
The TSM EMS Event Viewer assists you in performing many of the tasks associated
with viewing and monitoring various event logs. Features include:
• Serial, summary, expanded, and detailed views of events
• Multiple event sources, including merged and imported logs
• Flexible event selection/exclusion (including by subsystem, device, priority)
• Flexible event display (single line, full detail, token selection, color/emphasis)
• Filter manipulation functions to refine and focus a view of the events
• Ability to define and save views
The TSM EMS Event Viewer is launched from within the TSM application.
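The merging and selection/exclusion features listed above can be pictured with a
short sketch. This is not the TSM EMS Event Viewer interface or the EMS event
format. The Event record, its field names, the functions merge_logs and
select_events, and the sample subsystem and device values are all hypothetical.
The sketch also assumes that each log is already ordered by time and that a
higher priority number means a more urgent event.

```python
from dataclasses import dataclass
from heapq import merge
from typing import Iterable, Iterator, Optional, Set


@dataclass(frozen=True)
class Event:
    """Hypothetical event record; field names are illustrative, not the EMS format."""
    timestamp: float
    subsystem: str
    device: str
    priority: int  # assumption: a higher number means a more urgent event
    text: str


def merge_logs(*logs: Iterable[Event]) -> Iterator[Event]:
    """Combine several logs, each already sorted by time, into one time-ordered stream."""
    return merge(*logs, key=lambda e: e.timestamp)


def select_events(
    events: Iterable[Event],
    include_subsystems: Optional[Set[str]] = None,
    exclude_devices: Optional[Set[str]] = None,
    min_priority: int = 0,
) -> Iterator[Event]:
    """Yield only the events that pass the selection and exclusion criteria."""
    for e in events:
        if include_subsystems is not None and e.subsystem not in include_subsystems:
            continue  # selection by subsystem
        if exclude_devices and e.device in exclude_devices:
            continue  # exclusion by device
        if e.priority < min_priority:
            continue  # selection by priority
        yield e


# Made-up sample data standing in for two event sources.
log_a = [
    Event(1.0, "DISK", "$DATA1", 5, "path down"),
    Event(4.0, "TMF", "$TMP", 1, "routine status"),
]
log_b = [
    Event(2.0, "DISK", "$DATA2", 2, "routine status"),
    Event(3.0, "DISK", "$DATA1", 8, "second path down"),
]

merged = list(merge_logs(log_a, log_b))
urgent_disk = list(select_events(merged, include_subsystems={"DISK"}, min_priority=5))
without_data1 = list(select_events(merged, exclude_devices={"$DATA1"}))

for e in urgent_disk:
    print(e.timestamp, e.subsystem, e.device, e.priority, e.text)
```

Keeping the criteria as plain function arguments mirrors the idea of a saved view:
the same set of arguments can be stored and reapplied to a different log.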