EMS Manual

Standard Events
EMS Manual 426909-005
Proactive Problem Management Functions
Proactive problem management deals with problems that might occur but have not yet
occurred. This involves predicting, from received EMS events, whether to take
actions to prevent an object from becoming unavailable or performing at less than full
capacity. The Object Monitoring Facility (OMF) provides some of these functions.
Transient Faults
Transient faults are faults that the system recovered from automatically, such as
correctable memory errors, retryable controller errors, and line or network resets.
If they persist, these faults could lead to the loss of a system resource. Report
the Transient Fault event when an object encounters a transient fault. To avoid
flooding the EMS collector, do not report the event for every fault when faults
recur within a very short time interval; instead, report it only once every few
occurrences. If the transient fault occurs continuously, the subsystem or
application should consider the fault permanent and take the object out of service; in
this case, it should report an Object Unavailable event.
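The reporting policy described above can be sketched as follows. This is only an illustration in Python, not part of EMS: the class name, the `report_every`, `permanent_count`, and `window_seconds` parameters, and their default values are all assumptions, and the method returns the name of the event to report instead of calling any real EMS procedure.

```python
import time
from collections import deque


class TransientFaultTracker:
    """Throttle Transient Fault reports for one object (hypothetical sketch)."""

    def __init__(self, report_every=5, permanent_count=20,
                 window_seconds=60.0, clock=time.monotonic):
        # All thresholds are assumed values; the manual gives no numbers.
        self.report_every = report_every        # report once per N faults
        self.permanent_count = permanent_count  # faults in window => permanent
        self.window_seconds = window_seconds    # length of observation window
        self.clock = clock                      # injectable for testing
        self.recent = deque()                   # fault timestamps in the window
        self.since_last_report = 0
        self.in_service = True

    def record_fault(self):
        """Return the name of the event to report, or None to stay silent."""
        if not self.in_service:
            return None
        now = self.clock()
        self.recent.append(now)
        # Forget faults that have aged out of the observation window.
        while self.recent and now - self.recent[0] > self.window_seconds:
            self.recent.popleft()
        if len(self.recent) >= self.permanent_count:
            # Fault is treated as permanent: take the object out of service.
            self.in_service = False
            return "Object Unavailable"
        self.since_last_report += 1
        if self.since_last_report >= self.report_every:
            self.since_last_report = 0
            return "Transient Fault"
        return None
```

A caller would invoke `record_fault()` each time the object recovers from a fault and generate the named event only when the return value is not `None`. Suitable thresholds depend on the object being monitored.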
Use of System Resources
The usage level of an object or resource can indicate a gradual degradation in the availability
of the object (for example, the use of a communication line is reaching its theoretical
limit), or it can signal the impending loss of an object (for example, a critical file is 80
percent full). In general, any object that is critical to the operation of a subsystem or
application should be monitored, and the Usage Threshold event should be reported
when the usage level of the object exceeds the configured level.
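As a minimal sketch of the threshold check, assuming usage is expressed as a percentage of capacity: the function name `check_usage` and its parameters are hypothetical, and the code only decides whether a Usage Threshold event is due; it does not generate one.

```python
def check_usage(used, capacity, threshold_percent):
    """Return (usage_percent, exceeded) for one monitored object.

    `used` and `capacity` must be in the same units (for example, bytes).
    `threshold_percent` stands in for the configured level (assumed name).
    """
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    usage_percent = 100.0 * used / capacity
    # The event is due only when usage exceeds the configured level.
    return usage_percent, usage_percent > threshold_percent
```

For the critical-file example above, `check_usage(800, 1000, 75.0)` reports 80 percent usage and indicates that the threshold has been exceeded.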
Usually, the subsystems and applications that control critical objects should monitor them and
report the Usage Threshold events. Certain resources, however, are better
monitored and reported outside the subsystems and applications that control or use
them. These are usually system-wide resources used by many subsystems
and applications.
The resources that subsystems and applications should monitor are:
• Data communication line utilization: the percentage of the theoretical
  capacity of the line that is currently being used. The subsystem or application that
  controls the line divides the throughput by the theoretical line speed (both in
  bytes per second). Throughput is obtained by dividing the number of
  bytes of data sent during a time period by the length of that period.
• Internal buffer usage: the percentage of the buffer pool that is currently
  being used. The currently used space (in bytes or other units) is divided by the total
  space in the pool.
• Task queue length: the number of requests waiting for service in the
  subsystem or application. A counter is incremented whenever a request is added
  to the service queue and decremented whenever a request is removed from the
  queue and serviced.
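The three measurements in the list above can be sketched as follows. The function and class names are hypothetical and chosen only for illustration; the arithmetic follows the descriptions in the list.

```python
def line_utilization_percent(bytes_sent, interval_seconds,
                             line_speed_bytes_per_second):
    """Percentage of the theoretical line capacity currently in use."""
    if interval_seconds <= 0 or line_speed_bytes_per_second <= 0:
        raise ValueError("interval and line speed must be positive")
    # Throughput: bytes sent during the period divided by the period length.
    throughput = bytes_sent / interval_seconds
    return 100.0 * throughput / line_speed_bytes_per_second


def buffer_usage_percent(used_space, total_space):
    """Percentage of the buffer pool in use (both arguments in same units)."""
    if total_space <= 0:
        raise ValueError("total_space must be positive")
    return 100.0 * used_space / total_space


class TaskQueueCounter:
    """Track the number of requests waiting for service."""

    def __init__(self):
        self.length = 0

    def request_added(self):
        # Called whenever a request is added to the service queue.
        self.length += 1

    def request_serviced(self):
        # Called whenever a request is removed from the queue and serviced.
        if self.length == 0:
            raise RuntimeError("no request is waiting")
        self.length -= 1
```

For example, 600,000 bytes sent in 60 seconds over a line rated at 20,000 bytes per second gives a throughput of 10,000 bytes per second, which is 50 percent utilization.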