Availability Guide for Application Design (525637-004)

Instrumenting an Application for Availability
Defining the Criteria That Indicate the Health of the Application
Finally, you must define the criteria that indicate the health of your application and
convey this information to the human or automated operator. For example, you might
keep a count of transactions or monitor queue totals or other application resources.
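As an illustration only, the following Python sketch shows one way to keep a transaction count and a queue total and summarize them for an operator. The class name HealthCounters, its fields, and the queue_limit threshold are hypothetical and are not part of any product interface.

```python
from dataclasses import dataclass


@dataclass
class HealthCounters:
    """Hypothetical health criteria: a transaction count and a queue total."""
    transactions: int = 0
    queue_depth: int = 0
    queue_limit: int = 100  # assumed threshold; tune it for your application

    def record_transaction(self) -> None:
        # Count one completed transaction.
        self.transactions += 1

    def set_queue_depth(self, depth: int) -> None:
        # Store the most recently observed queue total.
        self.queue_depth = depth

    def report(self) -> dict:
        """Summarize the counters in a form an operator or monitor can read."""
        status = "degraded" if self.queue_depth > self.queue_limit else "healthy"
        return {
            "status": status,
            "transactions": self.transactions,
            "queue_depth": self.queue_depth,
        }
```

The application would call record_transaction and set_queue_depth as it works, and send the result of report to whichever operator interface it uses.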
Who to Notify?
Depending on the error, you should inform:
The user
The system operator (human or automated)
The user and the system operator
When to Tell the User
Not all errors indicate that a problem exists with the application. For example, when
an open operation returns an error indicating that a file does not exist, the user might
simply have mistyped the file name. As another example, a read error might result from
an attempt to read a nonexistent data field. In these cases, all your application needs to
do is inform the user of the mistake. The operator does not need to know.
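The missing-file case can be sketched in Python as follows. This is a minimal illustration, not a prescribed design: the function name read_user_file and the wording of the message are assumptions.

```python
from typing import Optional, Tuple


def read_user_file(path: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (contents, user_message); exactly one of the two is None."""
    try:
        with open(path, "r", encoding="utf-8") as handle:
            return handle.read(), None
    except FileNotFoundError:
        # Probably a mistyped name: tell the user, do not alert the operator.
        return None, f"File '{path}' was not found. Check the name and try again."
```

Only the not-found case is handled here. Other errors propagate to the caller, which can then decide whether they threaten availability.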
When to Tell the Operator
Errors that indicate a threat to the availability of the application should be reported to
the human or automated operator responsible for the application. You can do this by
routing event messages to an EMS collector. The operator can then read these
messages from the corresponding consumer distributor and act upon them. Refer to
“How Does EMS Collect, Filter, and Distribute Event Messages?” on page 8-31 for
information on how to do this and for explanations of the terms “EMS collector” and
“consumer distributor.”
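The reporting flow described above can be modeled with a small Python sketch. It does not use the EMS programmatic interface. InMemoryCollector, EventMessage, and report_availability_threat are hypothetical stand-ins that show only the shape of the flow: the application sends an event, and the operator side later reads it.

```python
import queue
from dataclasses import dataclass


@dataclass
class EventMessage:
    subsystem: str
    severity: str  # illustrative values such as "critical"
    text: str


class InMemoryCollector:
    """Hypothetical stand-in for an event collector; not the EMS interface."""

    def __init__(self) -> None:
        self._events = queue.Queue()

    def send(self, event: EventMessage) -> None:
        # The application side: hand the event to the collector.
        self._events.put(event)

    def drain(self) -> list:
        """Play the operator's consumer role: return all pending events."""
        pending = []
        while not self._events.empty():
            pending.append(self._events.get())
        return pending


def report_availability_threat(collector: InMemoryCollector,
                               subsystem: str, text: str) -> None:
    # Errors that threaten availability are reported as critical events.
    collector.send(EventMessage(subsystem, "critical", text))
```

In a real application, the collector would be replaced by the event-reporting facility of the platform. The point of the sketch is that the application reports the event and leaves the response to the operator.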
At this point, it is up to the human or automated operator to resume service to the user
as quickly as possible. Depending on what the problem is, the operator might, for
example, simply restart the application, perform file recovery, force the backup to
become the primary process, or switch to stand-in processing.
In addition, the operator should use the information available to analyze the problem and
determine whether it is a new problem or a recurrence of a known problem. A log of such
problems can subsequently be used to determine the most common software faults as
an aid to continuous improvement.
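One simple way to keep such a log is sketched below in Python. The FaultLog class and the idea of a short text "signature" for each fault are assumptions made for illustration. A production log would normally be persistent and carry more detail.

```python
from collections import Counter


class FaultLog:
    """Hypothetical problem log keyed by a short fault signature."""

    def __init__(self) -> None:
        self._counts = Counter()

    def record(self, signature: str) -> bool:
        """Log one occurrence; return True if this fault was seen before."""
        is_recurrence = self._counts[signature] > 0
        self._counts[signature] += 1
        return is_recurrence

    def most_common(self, n: int = 3) -> list:
        # Return the n most frequent faults as (signature, count) pairs.
        return self._counts.most_common(n)
```

The return value of record distinguishes a new problem from a recurrence, and most_common gives the ranking that can guide improvement work.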
When to Tell the User and the System Operator
Some errors that are reported to the system operator should also be reported to the
user if the error noticeably affects the user's view of the application. For example, while
retry operations are attempting to resolve temporary problems, you should periodically