Availability Guide for Application Design
Instrumenting an Application for Availability
Availability Guide for Application Design—525637-004
The Subsystem Programmatic Interface (SPI)
  applications might rely entirely on an event message log for their input, and
  an unreported state change could cause confusion.
  In addition to reporting the state change, such an event message should also
  contain the following information:
  • The name of the object changing state
  • The state the object is changing from and the state it is changing to
  • The reason for the state change
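The three items above can be pictured as fields of a single record. The following is a minimal sketch in Python, not SPI or EMS code: the class names, the state values, and the process name $SRV1 are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class ObjectState(Enum):
    # Hypothetical states; a real subsystem defines its own set.
    STARTED = "STARTED"
    STOPPED = "STOPPED"
    SUSPENDED = "SUSPENDED"


@dataclass(frozen=True)
class StateChangeEvent:
    object_name: str          # name of the object changing state
    from_state: ObjectState   # state the object is leaving
    to_state: ObjectState     # state the object is entering
    reason: str               # why the state change happened

    def text(self) -> str:
        """Render the event as one line suitable for an event log."""
        return (f"{self.object_name}: state changed from "
                f"{self.from_state.value} to {self.to_state.value} "
                f"({self.reason})")


# Illustrative values only.
event = StateChangeEvent("$SRV1", ObjectState.STARTED, ObjectState.STOPPED,
                         "operator STOP command")
print(event.text())
```

Carrying both the old and the new state in the same event lets a management application that reads only the log reconstruct the transition without keeping its own history.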
• An unexpected error condition or process failure has occurred.
  Data loss and hardware or software failures are typically reported by this
  kind of event message.
  In addition to reporting the error condition itself, these event messages should
  also contain:
  • The name of the object that experienced the error or failure
  • The trap number, termination status, or error code with appropriate detail or
    passthrough errors
  • The type or class of error; for example, file-system, PATHMON, programming
    exception, trap, and so on
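As a rough illustration of how these items fit together, the sketch below models an error event as a Python record. It is an assumption-laden example, not the EMS token layout: the class and field names are hypothetical, and the file name and error numbers are made-up sample values.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ErrorClass(Enum):
    # Classes taken from the examples in the text; a real list would be longer.
    FILE_SYSTEM = "file-system"
    PATHMON = "PATHMON"
    PROGRAMMING_EXCEPTION = "programming exception"
    TRAP = "trap"


@dataclass(frozen=True)
class ErrorEvent:
    object_name: str                         # object that experienced the error
    error_class: ErrorClass                  # type or class of error
    error_code: int                          # trap number, termination status, or error code
    detail: str = ""                         # optional explanatory detail
    passthrough_error: Optional[int] = None  # error passed up from a lower layer, if any

    def text(self) -> str:
        """Render the event as one log line, omitting fields that are not set."""
        parts = [f"{self.object_name}: {self.error_class.value} error {self.error_code}"]
        if self.detail:
            parts.append(self.detail)
        if self.passthrough_error is not None:
            parts.append(f"underlying error {self.passthrough_error}")
        return "; ".join(parts)


# Illustrative values only; the numbers carry no special meaning here.
simple = ErrorEvent("$DATA1.APP.ORDERS", ErrorClass.FILE_SYSTEM, 11, "write failed")
nested = ErrorEvent("$SRV1", ErrorClass.PATHMON, 3, passthrough_error=40)
print(simple.text())
print(nested.text())
```

Keeping the error class separate from the numeric code matters because the same number can mean different things in different subsystems.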
• An operator action is required.
  This kind of event indicates that your application cannot continue without
  operator intervention.
  In addition to indicating that intervention is required, these event messages
  should include the following information:
  • The name of the object that needs attention, such as the device or file
  • The action required; for example, mount a specific tape
  • A tag to identify this event in a subsequent message
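The tag only has to be unique among the events that are outstanding at the same time. One simple way to obtain such a tag, sketched here in Python under the assumption that a single process issues all of the events, is a running counter. The names below (including the device names and tape labels) are hypothetical.

```python
from dataclasses import dataclass
from itertools import count

# Hypothetical tag source: unique only within this process.
_next_tag = count(1)


@dataclass(frozen=True)
class OperatorActionRequired:
    object_name: str   # device or file that needs attention
    action: str        # what the operator must do
    tag: int           # identifies this event in a later completion message


def request_operator_action(object_name: str, action: str) -> OperatorActionRequired:
    """Build an operator-action-required event with a fresh tag."""
    return OperatorActionRequired(object_name, action, next(_next_tag))


# Illustrative values only.
first = request_operator_action("$TAPE0", "mount tape BACKUP01")
second = request_operator_action("$TAPE1", "mount tape BACKUP02")
print(first.tag, second.tag)
```

An application that must survive a restart would need a tag that persists or is otherwise unique across process lifetimes; the counter above does not provide that.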
• Operator intervention is no longer required.
  This kind of event indicates to the management application that the problem
  indicated by an operator-action-required message has been fixed. The
  management application can then take whatever steps are appropriate. It can,
  for example, remove the event from a list of outstanding operator actions or
  dim the entry on the console screen.
  In addition to indicating that a problem is fixed, these event messages
  typically include:
  • The name of the affected object
  • A tag to identify the original problem by matching this response with the
    event that indicated the need for operator action
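On the receiving side, a management application can use the tag as a lookup key. The following Python sketch shows one possible way to keep the list of outstanding operator actions; the class and method names are assumptions for illustration and do not correspond to any EMS or SPI procedure.

```python
from typing import Dict, List, Optional


class OutstandingActions:
    """Tracks operator-action-required events until a matching completion arrives."""

    def __init__(self) -> None:
        self._pending: Dict[int, str] = {}   # tag -> description of the required action

    def action_required(self, tag: int, object_name: str, action: str) -> None:
        """Record a new operator-action-required event under its tag."""
        self._pending[tag] = f"{object_name}: {action}"

    def action_completed(self, tag: int) -> Optional[str]:
        """Clear the entry that matches the tag; return None if the tag is unknown."""
        return self._pending.pop(tag, None)

    def pending(self) -> List[str]:
        """List the actions still waiting for an operator, in tag order."""
        return [self._pending[tag] for tag in sorted(self._pending)]


# Illustrative values only.
console = OutstandingActions()
console.action_required(1, "$TAPE0", "mount tape BACKUP01")
console.action_required(2, "$LP1", "load paper")
cleared = console.action_completed(1)
print(console.pending())
```

Returning None for an unknown tag, rather than raising an error, lets the management application tolerate a completion event whose original request it never saw, for example because it started after the request was logged.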
• An object has crossed a usage threshold.