EMS Manual
Standard Events
EMS Manual—426909-005
Reactive Problem Management Functions
2. Problem diagnosis. An operator should be able to determine the specific cause of a
problem and the action required to resolve it.
3. Problem bypass and recovery. An operator should be able to bypass a problem, if
necessary, until the problem can be resolved. The decision to bypass is a tradeoff
between the cost to the enterprise of losing the failed component and the cost of
providing the bypass capability.
4. Problem resolution. An operator should be able to initiate the action necessary to
repair or replace a failed component.
5. Problem tracking and control. An operator should be able to track a problem from
its detection through its final resolution. A database of previous problems helps
correlate incidents to their underlying root-cause problems and thereby helps
provide a timely recovery. The correlation mechanism should be able to determine
whether a problem is new or the recurrence of a known problem (problem
rediscovery).
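The rediscovery check in item 5 can be sketched as a lookup in a problem database keyed by a problem signature. This is a minimal illustration only: the `ProblemLog` class, the `report` method, and the choice of (object type, object name, cause) as the signature are assumptions made for this example, not part of EMS.

```python
from dataclasses import dataclass


@dataclass
class ProblemRecord:
    """One known problem (hypothetical structure, not an EMS definition)."""
    signature: tuple      # (object_type, object_name, cause)
    occurrences: int = 1
    resolved: bool = False


class ProblemLog:
    """In-memory stand-in for a database of previous problems."""

    def __init__(self):
        self._records = {}

    def report(self, object_type, object_name, cause):
        """Record an incident; return (record, is_new)."""
        signature = (object_type, object_name, cause)
        record = self._records.get(signature)
        if record is None:
            # No matching signature: treat the incident as a new problem.
            record = ProblemRecord(signature)
            self._records[signature] = record
            return record, True
        # Matching signature: a known problem has recurred (rediscovery).
        record.occurrences += 1
        record.resolved = False
        return record, False
```

A real correlation mechanism would probably need a looser match than exact equality on the signature, but the new-versus-recurring decision has the same shape.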
After a problem occurs, many Object Unavailable events are usually reported by
subsystems and applications that are affected, directly or indirectly, by the problem.
The problem management application has the difficult tasks of isolating the event that
contains the actual cause of the problem from events that describe the effects of the
problem, and of determining whether the event is reporting a new problem or the
recurrence of a known problem. The next subsection describes how to perform these
tasks using the Object Unavailable event.
Identifying the Actual Cause of the Problem
The Object Unavailable event contains a token called change reason that indicates
why the object went out of service. Possible values for this token are:
• Normally terminated. The object stopped normally.
• Operator initiated. The operator took the object down.
• System initiated and error is within subsystem itself. The object failed because of
  an internal error.
• System initiated and error is due to failure in underlying service. The object failed
  because an underlying service on which this object depended failed.
• Unknown. The problem is unknown (avoid using this value).
The first three values identify the actual cause of the outage, so operators do not
have to look for the failure in other subsystems or applications.
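The distinction between the change reason values can be sketched as a small classification. The `ChangeReason` enumeration and the `identifies_actual_cause` function are hypothetical names chosen for this example; they do not reproduce the EMS token or value names.

```python
from enum import Enum


class ChangeReason(Enum):
    """Illustrative stand-ins for the five change reason values."""
    NORMALLY_TERMINATED = "normally terminated"
    OPERATOR_INITIATED = "operator initiated"
    INTERNAL_ERROR = "system initiated, error within subsystem"
    UNDERLYING_SERVICE = "system initiated, failure in underlying service"
    UNKNOWN = "unknown"


# The first three values already name the actual cause of the outage.
SELF_EXPLAINED = frozenset({
    ChangeReason.NORMALLY_TERMINATED,
    ChangeReason.OPERATOR_INITIATED,
    ChangeReason.INTERNAL_ERROR,
})


def identifies_actual_cause(reason):
    """Return True if no other subsystem or application needs to be examined."""
    return reason in SELF_EXPLAINED
```

In this sketch both remaining values return False: an underlying-service failure points at another object, and an unknown reason identifies no cause at all.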
For a failure in an underlying service, the event contains the type and name of the
underlying object that failed. The problem management application can therefore
indicate that the object failed as a result of the failure of another object, possibly
an object in another subsystem or application.
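Following the underlying-object information from event to event leads from an effect back to its cause. The sketch below assumes that events have already been decoded into dictionaries and indexed by object name; the keys `object_name`, `change_reason`, and `underlying_name`, the constant `UNDERLYING_SERVICE`, and the function `find_root_cause` are all hypothetical names for this illustration.

```python
# Hypothetical marker for the "failure in underlying service" change reason.
UNDERLYING_SERVICE = "underlying service"


def find_root_cause(events, object_name):
    """Follow underlying-object links from object_name toward the cause.

    events maps an object name to its most recent Object Unavailable event,
    represented here as a dict. Returns the deepest event reached, or None
    if there is no event for object_name or the links form a cycle.
    """
    seen = set()
    event = events.get(object_name)
    while event is not None and event["object_name"] not in seen:
        seen.add(event["object_name"])
        if event["change_reason"] != UNDERLYING_SERVICE:
            # This event names its own cause; the search ends here.
            return event
        underlying = events.get(event.get("underlying_name"))
        if underlying is None:
            # The underlying object reported nothing; this is the best lead.
            return event
        event = underlying
    return None
```

For example, if an application failed because of a line, and the line failed because of a disk with an internal error, the search that starts at the application ends at the disk event.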