Availability Guide for Problem Management
Monitoring Event Messages
Availability Guide for Problem Management–125509
Step 1—Analyzing System Event Messages
For each subsystem, analyze the event messages, select the important ones, and estimate
their severity. An important message is any message that reports an occurrence that
might affect the availability of the system or network. Severity levels might be defined
as shown in the Severity Level table later in this section.
Because EMS event messages report on a wide range of occurrences and conditions,
they are divided into three classes:
• Information events
• Action events
• Critical events
Information Events
Information events report changes in the status of a process or device that require no
further action by an operator or system management application.
Action Events
A subsystem reports an action event when a condition arises that the subsystem cannot
resolve without operator intervention. For example, a subsystem might report an action
event when it cannot proceed until a tape is mounted or a printer ribbon is replaced.
Critical Events
A subsystem designates an event as critical when the consequences of the event might
be severe. The subsystem identifies potentially critical situations and lets you (with the
help of any programmatic tools you select) make the final determination. Subsystems
typically identify the following events as critical:
• Potential or actual loss of data
• Loss of a major subsystem function
• Loss of fault tolerance, such as loss of a redundant resource or loss of a failure-recovery function
• Loss of subsystem integrity (for example, an unrecoverable internal error in a subsystem)
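The three event classes described above can be modeled in a few lines. The following is a minimal Python sketch, not part of EMS itself: the `EventMessage` record, its `needs_operator` and `flagged_critical` fields, and the `classify` function are hypothetical names introduced for illustration, and the rule that a critical flag takes precedence over an action flag is an assumption of this sketch.

```python
from dataclasses import dataclass
from enum import Enum


class EventClass(Enum):
    """The three classes of event messages named in this section."""
    INFORMATION = "information"
    ACTION = "action"
    CRITICAL = "critical"


@dataclass(frozen=True)
class EventMessage:
    # Hypothetical record: these fields are illustrative, not real EMS token names.
    subsystem: str
    text: str
    needs_operator: bool = False    # subsystem cannot proceed without intervention
    flagged_critical: bool = False  # subsystem judged the consequences potentially severe


def classify(event: EventMessage) -> EventClass:
    """Assign an event to one class; critical wins over action (an assumption)."""
    if event.flagged_critical:
        return EventClass.CRITICAL
    if event.needs_operator:
        return EventClass.ACTION
    return EventClass.INFORMATION


if __name__ == "__main__":
    samples = [
        EventMessage("DISK", "Volume state changed to UP"),
        EventMessage("TAPE", "Waiting for tape mount", needs_operator=True),
        EventMessage("DISK", "Mirror half lost", flagged_critical=True),
    ]
    for sample in samples:
        print(sample.subsystem, sample.text, "->", classify(sample).value)
```

Because the subsystem only identifies potentially critical situations, a tool built on a sketch like this would treat the `CRITICAL` class as a candidate for review rather than a final verdict.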
Severity Level   Meaning
Warning          A potential problem has been detected.
Minor            An unexpected state change has occurred, but the resource is still available.
Major            The affected resource has lost fault tolerance.
Critical         The affected resource is unavailable.
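
The severity levels in the table above could be encoded so that a monitoring tool can compare and sort them. The sketch below is one possible encoding in Python, under stated assumptions: the numeric ranks, the `estimate_severity` function, and its boolean parameters are hypothetical and only restate the table's wording; they are not an EMS interface.

```python
from enum import IntEnum
from typing import Optional


class Severity(IntEnum):
    """Severity levels from the table; the numeric ranks are an assumption."""
    WARNING = 1   # a potential problem has been detected
    MINOR = 2     # unexpected state change, resource still available
    MAJOR = 3     # resource has lost fault tolerance
    CRITICAL = 4  # resource is unavailable


def estimate_severity(
    available: bool,
    fault_tolerant: bool,
    unexpected_state_change: bool = False,
    potential_problem: bool = False,
) -> Optional[Severity]:
    """Return the highest severity that applies, or None if nothing applies."""
    if not available:
        return Severity.CRITICAL
    if not fault_tolerant:
        return Severity.MAJOR
    if unexpected_state_change:
        return Severity.MINOR
    if potential_problem:
        return Severity.WARNING
    return None


if __name__ == "__main__":
    # A resource that is still up but has lost its redundant path.
    level = estimate_severity(available=True, fault_tolerant=False)
    print(level.name if level else "no severity assigned")
```

Using an ordered enumeration lets a filter such as `level >= Severity.MAJOR` select only the messages that warrant immediate attention.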