Availability Guide for Problem Management 125509

Monitoring Event Messages
Monitoring a running network or system. You can use your own management
application to recognize situations that need attention as they arise. Depending on
the problem and the sophistication of the application, either the operator or the
application can then resolve the problem through the appropriate command-response
interface.
Managing operator tasks. You can use a management application, or a distributor
that routes event messages to a display device, to select and display action event
messages—those requesting operator attention. If operators see an integrated picture
of the required tasks, they can direct their attention to accomplishing the tasks, not
to finding out what they are. This will help the operators maintain an efficient
processing environment. Other opportunities for managing operator tasks include:
• Reformatting system event messages to make it easier for the operator to respond to them
• Automating operations and thus eliminating some of the operator’s tasks
• Distributing event classes to specific operators
• Using multiple monitors to better organize the operations staff and their workload
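The idea of selecting action event messages and distributing event classes to specific operators can be sketched in a few lines of Python. This is only an illustration, not an EMS interface: the `Event` record, the `ROUTES` table, the console names, and the sample messages are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    event_class: str       # hypothetical class label, e.g. "DISK" or "NET"
    text: str              # message text shown to the operator
    action_required: bool  # True for events that request operator attention


# Hypothetical routing table: event class -> operator console name.
ROUTES = {"DISK": "storage-ops", "NET": "network-ops"}
DEFAULT_CONSOLE = "general-ops"


def route_action_events(events, routes=ROUTES, default=DEFAULT_CONSOLE):
    """Group action events by the console responsible for their class."""
    queues = defaultdict(list)
    for event in events:
        if not event.action_required:
            continue  # informational events are not displayed
        queues[routes.get(event.event_class, default)].append(event)
    return dict(queues)


# Made-up sample events for illustration only.
events = [
    Event("DISK", "Volume nearly full", True),
    Event("NET", "Line up", False),
    Event("NET", "Line down", True),
    Event("SPOOL", "Printer out of paper", True),
]
queues = route_action_events(events)
for console, items in sorted(queues.items()):
    print(console, [e.text for e in items])
```

Each operator then sees only the tasks assigned to that console, and event classes without an explicit route fall through to a default console rather than being lost.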
Analyzing problems. To determine what went wrong and why, it is often necessary
to retrace a series of events leading up to the problem condition. Log files, filters,
and other EMS components can help you or your Tandem representative in this task
by providing the historical record and the tools needed to sift through the problem.
Detecting potential problems. You can write a management application to analyze
collector log files for repeated warnings or persistent minor problems that, singly,
would not require a recovery action but, taken together, might disclose a correctable
condition.
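As a rough sketch of this kind of analysis, the following Python fragment counts repeated warnings in a list of log records. It assumes the records have already been extracted from a log file into `(severity, source, message)` tuples; that layout, the severity labels, the threshold, and the sample data are assumptions made for the example and do not reflect the actual log file format.

```python
from collections import Counter


def find_repeated_warnings(records, threshold=3):
    """Return {(source, message): count} for warnings seen at least `threshold` times."""
    counts = Counter(
        (source, message)
        for severity, source, message in records
        if severity == "WARNING"
    )
    return {key: n for key, n in counts.items() if n >= threshold}


# Made-up records standing in for entries read from a log file.
log_records = [
    ("WARNING", "DISK-1", "retry on read"),
    ("INFO", "DISK-1", "backup complete"),
    ("WARNING", "DISK-1", "retry on read"),
    ("WARNING", "LINE-2", "CRC error"),
    ("WARNING", "DISK-1", "retry on read"),
]
repeated = find_repeated_warnings(log_records)
print(repeated)
```

No single warning here would call for a recovery action, but the repeated read retries from one source stand out once the records are counted together.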
Getting Control of System Event Message Management
Managing system event messages effectively to predict, prevent, and detect unplanned
outages is essential if you hope to meet the service-level objectives required by your end
users.
The steps necessary to control your system event messages include:
1. Analyzing system event messages
2. Filtering system event messages
3. Writing operations and recovery procedures
4. Automating operations and recovery procedures
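To show how these steps relate to one another, here is a minimal Python sketch. Everything in it is hypothetical: the event dictionaries, the severity labels, the event identifiers, the `PROCEDURES` table, and the `restart_process` stand-in are invented for illustration and are not part of any product interface.

```python
def passes_filter(event):
    """Step 2: keep only events that matter to operations (hypothetical rule)."""
    return event["severity"] in {"WARNING", "CRITICAL"}


def restart_process(event):
    """Step 4: stand-in for an automated recovery action."""
    return f"restarted {event['source']}"


# Step 3: written procedures, keyed by a hypothetical event identifier.
PROCEDURES = {
    "PROCESS-DOWN": {
        "manual": "Restart the process and check its log.",
        "automated": restart_process,
    },
    "DISK-FULL": {
        "manual": "Purge old files or add space.",
        "automated": None,  # not yet automated
    },
}


def handle(event):
    """Filter the event, then run its automated procedure or hand it to the operator."""
    if not passes_filter(event):
        return "ignored"
    procedure = PROCEDURES.get(event["id"])
    if procedure is None:
        # No procedure exists yet; this event needs analysis (step 1).
        return "escalate: no procedure written"
    if procedure["automated"] is not None:
        return procedure["automated"](event)
    return "operator: " + procedure["manual"]


# Made-up events for illustration only.
sample_events = [
    {"id": "PROCESS-DOWN", "severity": "CRITICAL", "source": "server-a"},
    {"id": "DISK-FULL", "severity": "WARNING", "source": "disk-1"},
    {"id": "LOGON", "severity": "INFO", "source": "terminal-7"},
    {"id": "UNKNOWN", "severity": "CRITICAL", "source": "line-2"},
]
results = [handle(e) for e in sample_events]
for line in results:
    print(line)
```

In this sketch, an event that passes the filter but has no written procedure is escalated, which is a signal to return to the analysis step for that message.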