EMS Manual

Standard Events

EMS Manual—426909-005

9-6

Reactive Problem Management Functions

2. Problem diagnosis. An operator should be able to detect the specific cause of a problem and the action required to resolve it.

3. Problem bypass and recovery. An operator should be able to bypass a problem, if necessary, until the problem can be resolved. The decision whether to bypass or not is a tradeoff between the cost to the enterprise due to the loss of a failed component and the cost in providing the bypass capability.

4. Problem resolution. An operator should be able to initiate the action necessary to repair or replace a failed component.

5. Problem tracking and control. An operator should be able to track a problem from its detection through its final resolution. A database of previous problems helps correlate incidents to their underlying root-cause problems and thereby helps provide a timely recovery. The correlation mechanism should be able to determine whether a problem is new or the recurrence of a known problem (problem rediscovery).
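
The correlation step in item 5 can be illustrated with a minimal sketch. It assumes that a problem is identified by the type and name of the failed object plus its cause, and it uses an in-memory dictionary in place of a real problem database. The class names, field names, and the example object name are all hypothetical and are not part of EMS.

```python
from dataclasses import dataclass, field


@dataclass
class ProblemRecord:
    """One tracked problem; all field names here are hypothetical."""
    signature: tuple          # (object_type, object_name, cause)
    status: str = "open"      # "open" until final resolution, then "resolved"
    occurrences: int = 1      # number of incidents correlated to this problem


@dataclass
class ProblemDatabase:
    """In-memory stand-in for a database of previous problems."""
    records: dict = field(default_factory=dict)

    def report(self, object_type, object_name, cause):
        """Correlate an incident to a problem; return (record, is_new)."""
        signature = (object_type, object_name, cause)
        record = self.records.get(signature)
        if record is None:
            # No earlier problem matches: this is a new problem.
            record = ProblemRecord(signature)
            self.records[signature] = record
            return record, True
        # A known problem has occurred again (problem rediscovery).
        record.occurrences += 1
        record.status = "open"
        return record, False

    def resolve(self, record):
        """Mark a problem as having reached its final resolution."""
        record.status = "resolved"


db = ProblemDatabase()
first, is_new = db.report("DISK", "$DATA1", "internal error")
db.resolve(first)
again, is_new_again = db.report("DISK", "$DATA1", "internal error")
```

In this sketch the second report returns the same record with `is_new_again` set to `False`, which is how a recurrence is distinguished from a new problem. A real application would probably need a looser matching rule than exact equality of the three fields.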

After a problem occurs, many Object Unavailable events are usually reported by subsystems and applications that are affected, directly or indirectly, by the problem. The problem management application has the difficult tasks of isolating the event that contains the actual cause of the problem from events that describe the effects of the problem, and of determining whether the event is reporting a new problem or the recurrence of a known problem. The next subsection describes how to perform these tasks using the Object Unavailable event.

Identifying the Actual Cause of the Problem

The Object Unavailable event contains a token called change reason that indicates why the object went out of service. Possible values for this field are:

• Normally terminated. The object stopped normally.

• Operator initiated. The operator took the object down.

• System initiated and error is within subsystem itself. The object failed because of an internal error.

• System initiated and error is due to failure in underlying service. The object failed because an underlying service on which this object depended failed.

• Unknown. The problem is unknown (avoid using this value).

The first three values identify the actual cause of the failure, so operators do not have to look for the failure in other subsystems or applications.

For a failure in an underlying service, the event contains the type and name of the underlying object that failed. The problem management application likely indicates that the reporting object failed as a result of the failure of that other object, which is possibly an object in another subsystem or application.
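
As a rough illustration of how an application might use the change reason to separate the actual cause from its effects, the sketch below follows underlying-object references from one failed object to the next until it reaches an event whose change reason names its own cause. The enum member names, the `events` mapping, and the object names are hypothetical; actual EMS token names and values are not shown here.

```python
from enum import Enum


class ChangeReason(Enum):
    """Change reason values listed above; the member names are hypothetical."""
    NORMALLY_TERMINATED = "normally terminated"
    OPERATOR_INITIATED = "operator initiated"
    INTERNAL_ERROR = "system initiated, error within subsystem itself"
    UNDERLYING_FAILURE = "system initiated, failure in underlying service"
    UNKNOWN = "unknown"


def find_actual_cause(object_name, events):
    """Follow underlying-object references to the event with the actual cause.

    `events` maps an object name to a (ChangeReason, underlying_name) pair;
    underlying_name is None unless the reason is UNDERLYING_FAILURE.
    Returns (object_name, reason) for the object where the chain ends, or
    (object_name, None) if no event was reported for that object.
    """
    seen = set()
    current = object_name
    while current not in seen:
        seen.add(current)
        if current not in events:
            return current, None
        reason, underlying = events[current]
        if reason is not ChangeReason.UNDERLYING_FAILURE or underlying is None:
            # This event describes its own cause; stop searching here.
            return current, reason
        current = underlying
    # The references form a loop, so no actual cause can be isolated.
    return current, ChangeReason.UNKNOWN


# Hypothetical events: an application fails because its line failed,
# and the line fails because its controller had an internal error.
events = {
    "$APP": (ChangeReason.UNDERLYING_FAILURE, "$LINE1"),
    "$LINE1": (ChangeReason.UNDERLYING_FAILURE, "$CTRL"),
    "$CTRL": (ChangeReason.INTERNAL_ERROR, None),
}
cause = find_actual_cause("$APP", events)
```

Here `cause` is the pair `("$CTRL", ChangeReason.INTERNAL_ERROR)`: the two events with an underlying-service reason describe effects, and only the controller event describes the actual cause. The loop check is a defensive assumption, added in case events refer to one another in a cycle.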