EMS Manual

Standard Events

EMS Manual—426909-005

Proactive Problem Management Functions

Proactive problem management deals with problems that might occur but have not yet
occurred. It involves predicting, from received EMS events, whether to take action to
prevent an object from becoming unavailable or from performing at less than full
capacity. The Object Monitoring Facility (OMF) provides some of these functions.

Transient Faults

Transient faults are faults from which the system recovered automatically, such as
correctable memory errors, retryable controller errors, and line or network resets. If
these faults persist, they could lead to the loss of a system resource. Report the
Transient Fault event when an object encounters a transient fault. To avoid flooding
the EMS collector, do not report the event for every fault when the faults occur within
a very short time interval; instead, report it only after every few occurrences. If the
transient fault occurs continuously, the subsystem or application should consider the
fault permanent and take the object out of service; in this case, it should report an
Object Unavailable event.
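
The reporting policy described above can be sketched as follows. This is a minimal illustration in Python, not code from the EMS programming interface: the class name, the event-name strings, and all three thresholds (`report_every`, `permanent_after`, `window_seconds`) are hypothetical, since the manual does not specify what "every few occurrences" or "continuously" mean numerically.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TransientFaultReporter:
    """Decides which event, if any, to report for each transient fault.

    All thresholds are illustrative assumptions, not values from the manual.
    """
    report_every: int = 5          # report one event per this many faults
    permanent_after: int = 20      # faults inside the window treated as "continuous"
    window_seconds: float = 60.0   # sliding window used for the permanence check
    _count: int = field(default=0, init=False)
    _recent: deque = field(default_factory=deque, init=False)

    def record_fault(self, now: float) -> Optional[str]:
        """Record one fault at time `now` (seconds) and return the event to report."""
        # Keep only the faults that fall inside the sliding window.
        self._recent.append(now)
        while now - self._recent[0] > self.window_seconds:
            self._recent.popleft()

        # Too many faults in a short time: treat the fault as permanent.
        # The caller is expected to take the object out of service.
        if len(self._recent) >= self.permanent_after:
            return "OBJECT_UNAVAILABLE"

        # Otherwise report only every Nth occurrence to avoid flooding the collector.
        self._count += 1
        if self._count % self.report_every == 0:
            return "TRANSIENT_FAULT"
        return None
```

With `report_every=3` and `permanent_after=5`, three widely spaced faults produce a single Transient Fault event, while a fifth fault inside one window produces the Object Unavailable decision instead.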

Use of System Resources

The usage level of an object or resource can indicate a gradual degradation in the
availability of the object (for example, the use of a communication line is approaching
its theoretical limit), or it can signal the impending loss of an object (for example, a
critical file is 80 percent full). In general, any object that is critical to the operation
of a subsystem or application should be monitored, and the Usage Threshold event
should be reported when the usage level of the object exceeds the configured level.
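
As a rough sketch of the threshold check, the Python fragment below compares a sampled usage level against a configured level. The class and method names are hypothetical. Reporting only when the level first crosses the threshold, rather than on every sample above it, is an assumption made here to limit repeated events; the manual only states that the event is reported when the configured level is exceeded.

```python
class UsageThresholdMonitor:
    """Tracks one object's usage level against a configured threshold."""

    def __init__(self, threshold_percent: float) -> None:
        self.threshold_percent = threshold_percent
        self._above = False  # True while the last sample exceeded the threshold

    def update(self, usage_percent: float) -> bool:
        """Return True when a Usage Threshold event should be reported."""
        exceeded = usage_percent > self.threshold_percent
        # Assumption: report on the upward crossing only, not on every sample.
        should_report = exceeded and not self._above
        self._above = exceeded
        return should_report
```

For a critical file configured at 80 percent, samples of 50, 80, 81, and 90 percent would yield one report, at the 81 percent sample.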

Usually, the subsystems and applications that control critical objects should monitor
them and report the Usage Threshold events. Certain resources, however, are better
monitored and reported outside the subsystems and applications that control or use
them. These are usually system-wide resources used by many subsystems and
applications.

The resources that subsystems and applications should monitor are:



• Data communication line utilization: the percentage of the theoretical capacity of
  the line that is currently being used. The subsystem or application that controls
  the line divides the throughput by the theoretical line speed (both in bytes per
  second). Throughput is obtained by dividing the number of bytes of data sent
  during a time period by the length of that period.

• Internal buffer usage: the percentage of the buffer pool that is currently being
  used. The space currently in use (in bytes or other units) is divided by the total
  space in the pool.

• Task queue length: the number of requests waiting for service in the subsystem or
  application. A counter is incremented whenever a request is added to the service
  queue and decremented whenever a request is removed from the queue and
  serviced.
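
The three measurements above amount to simple arithmetic. The following Python sketch shows one way to compute them; the function, class, and parameter names are hypothetical and chosen only for illustration.

```python
def line_utilization_percent(bytes_sent: int,
                             interval_seconds: float,
                             line_speed_bytes_per_second: float) -> float:
    """Percentage of the theoretical line capacity currently in use."""
    if interval_seconds <= 0 or line_speed_bytes_per_second <= 0:
        raise ValueError("interval and line speed must be positive")
    # Throughput: bytes sent during the period divided by the period length.
    throughput = bytes_sent / interval_seconds
    return 100.0 * throughput / line_speed_bytes_per_second


def buffer_usage_percent(used_units: int, total_units: int) -> float:
    """Percentage of the buffer pool currently in use (any consistent unit)."""
    if total_units <= 0:
        raise ValueError("pool size must be positive")
    return 100.0 * used_units / total_units


class TaskQueueCounter:
    """Counts requests waiting for service."""

    def __init__(self) -> None:
        self.length = 0

    def request_added(self) -> None:
        # A request was placed on the service queue.
        self.length += 1

    def request_serviced(self) -> None:
        # A request was removed from the queue and serviced.
        if self.length == 0:
            raise ValueError("no request is waiting")
        self.length -= 1
```

For example, 600,000 bytes sent in 60 seconds over a line rated at 12,500 bytes per second gives a throughput of 10,000 bytes per second and a utilization of 80 percent.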