Availability Guide for Problem Management
Automating Operations and Recovery Procedures
Availability Guide for Problem Management–125509
Ensure That Messages Are Being Managed Efficiently
Managing system Event Management Service (EMS) event messages is an important
part of your automation strategy because it allows operators to be notified quickly of
error conditions, state changes, and threshold limits that have been exceeded. Critical
events can be highlighted on the system console. In some instances, problems can be
prevented if operators read and react quickly to system event messages. For example,
if one processor goes down because of a problem with a communications line and the
operator does not respond to the event, a later problem with a second processor can
bring down the whole system. Some programs can perform automatic recovery in
response to events.
Managing your application event messages is important because it helps reduce
information overload. Operators can be notified quickly of error conditions and state
changes that can affect the availability of your applications, and they can focus on and
react to only the critical events. Critical events can be highlighted on the system console.
The operator can receive an online description of the problem and the recommended
procedures for handling the problem. Some programs can perform automatic recovery in
response to events. Managing application event messages also provides a chronological
list of events to aid in problem detection and resolution.
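
The filtering and ordering described above can be sketched in a few lines. The following Python example is an illustration only and does not use the EMS programmatic interface; the Event record, the severity names, and the for_console and render helpers are hypothetical names introduced for this sketch.

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical severity levels; real EMS events carry their own attributes.
CRITICAL, ACTION, INFO = "critical", "action", "info"


@dataclass(frozen=True)
class Event:
    timestamp: datetime
    source: str    # hypothetical: name of the subsystem or application
    severity: str
    text: str


def chronological(events):
    """Return the events ordered by time, oldest first."""
    return sorted(events, key=lambda e: e.timestamp)


def for_console(events, show=(CRITICAL,)):
    """Keep only the severities operators should see, oldest first."""
    return [e for e in chronological(events) if e.severity in show]


def render(event):
    """Prefix critical events with a marker so they stand out."""
    marker = "***" if event.severity == CRITICAL else "   "
    return f"{marker} {event.timestamp:%H:%M:%S} {event.source}: {event.text}"


events = [
    Event(datetime(2024, 1, 1, 10, 5), "APP1", INFO, "Batch started"),
    Event(datetime(2024, 1, 1, 10, 1), "LINE3", CRITICAL, "Communications line down"),
    Event(datetime(2024, 1, 1, 10, 7), "CPU1", CRITICAL, "Processor down"),
]
console_lines = [render(e) for e in for_console(events)]
```

In this sketch the informational event is kept out of the console view, while chronological(events) still returns the full list in time order for later problem analysis.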
Section 4, “Monitoring Event Messages,” provides more information on this topic.
Ensure That Important Objects Are Being Monitored
When something goes wrong with a system or application, an EMS event message is
displayed on the operations console. Because objects are interdependent, it is not
uncommon for other objects to be affected when something goes wrong (thus generating
associated event messages). You must then determine the scope of the problem and the
root cause before taking any action.
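
One way to reason about scope and root cause is to record which objects depend on which others. The following Python sketch assumes a hand-maintained dependency map; the object names and both functions are hypothetical and are not part of any monitoring product. A failing object is treated as a root-cause candidate when none of the objects it depends on is also failing.

```python
def root_cause_candidates(depends_on, failing):
    """Return failing objects whose own dependencies are all healthy."""
    failing = set(failing)
    return sorted(
        obj for obj in failing
        if not any(dep in failing for dep in depends_on.get(obj, ()))
    )


def affected_scope(depends_on, root):
    """Return every object that depends, directly or indirectly, on root."""
    scope, frontier = set(), [root]
    while frontier:
        current = frontier.pop()
        for obj, deps in depends_on.items():
            if current in deps and obj not in scope:
                scope.add(obj)
                frontier.append(obj)
    return sorted(scope)


# Hypothetical dependency map: each object lists the objects it relies on.
depends_on = {
    "APP1": ["FILE_A", "LINE3"],
    "FILE_A": ["DISK1"],
    "LINE3": [],
    "DISK1": [],
}
failing = ["APP1", "FILE_A", "DISK1"]

candidates = root_cause_candidates(depends_on, failing)
scope = affected_scope(depends_on, "DISK1")
```

Here three objects report events, but only DISK1 has no failing dependency of its own, so it is the place to start investigating.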
Object monitoring is important to your automation strategy because it allows you to take
a proactive approach to eliminating outages. For example, if a disk or file becomes
full and operators are unaware of the disk-full or file-full condition, a serious problem
can occur. Problem conditions should be detected before they have a negative impact on
system or application availability. By becoming aware of potential problems rather than
waiting for problems to occur, you may be able to prevent unplanned outages in your
environment. Object monitoring allows you to react promptly to single failures before
they become catastrophic double failures.
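
The disk-full and file-full example can be illustrated with a simple threshold check. The following Python sketch assumes that usage figures are collected by some other means; the 80 percent and 95 percent thresholds, the object names, and the function names are assumptions chosen for illustration, not recommended values.

```python
def classify(used, capacity, warn=0.80, critical=0.95):
    """Classify one object's usage as 'ok', 'warning', or 'critical'."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    ratio = used / capacity
    if ratio >= critical:
        return "critical"
    if ratio >= warn:
        return "warning"
    return "ok"


def check_objects(readings, warn=0.80, critical=0.95):
    """Return (name, state, percent used) for every object that is not ok."""
    alerts = []
    for name, (used, capacity) in readings.items():
        state = classify(used, capacity, warn, critical)
        if state != "ok":
            alerts.append((name, state, round(100 * used / capacity)))
    return alerts


# Hypothetical readings: object name -> (units used, total capacity).
readings = {
    "DISK1": (96, 100),
    "FILE_A": (850, 1000),
    "FILE_B": (10, 1000),
}
alerts = check_objects(readings)
```

The warning state is what makes this approach proactive: FILE_A is reported at 85 percent, before it reaches the file-full condition.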
Section 5, “Monitoring Objects,” provides more information on this topic.