Availability Guide for Application Design

Instrumenting an Application for Availability
Availability Guide for Application Design, 525637-004
- A discussion of automating object management on a server, using the Distributed
  Systems Management (DSM) subsystem to illustrate the use of instrumentation in
  management automation; refer to Automating Object Management on page 8-15.
  This subsection provides:
  - A discussion of why you should use DSM and an overview of its architecture
  - An overview of the Subsystem Programmatic Interface (SPI), including
    discussions of the purpose and content of SPI messages
  - An overview of the DSM subsystem environment, including introductions to the
    tools and techniques you can use to instrument your application by using SPI
  - An overview of DSM management services, including how the Event
    Management Service (EMS) collects event messages, filters them, and
    distributes them for consumption by management applications
- An overview of the application’s operations environment, using examples from the
  DSM operations environment; refer to The Operations Environment on page 8-36.
Although this section focuses heavily on the use of DSM and its tools, remember that
an application needs to provide appropriate information to whatever operations
software is in use for your servers.
Design Philosophy for Error Handling
Highly available applications should check for all possible error conditions and inform
the user or the system operator as appropriate. Of course, not all error returns indicate
a problem with the application software. Most errors indicate usage errors, such as a
user entering the name of a nonexistent file or an out-of-range value, or indicate the
need for an operational response, such as mounting a tape. In these circumstances,
the application is functioning according to its specification.
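The distinction drawn above can be sketched in code. The following Python fragment is only an illustration, not part of DSM or any NonStop interface: the error codes, the ErrorClass names, and the rule that unrecognized errors and operational conditions go to the operator are all assumptions made for this example.

```python
from enum import Enum, auto


class ErrorClass(Enum):
    USAGE = auto()        # user mistake; the application is working as specified
    OPERATIONAL = auto()  # an operational response is needed, e.g. mounting a tape
    PROBLEM = auto()      # possible fault; needs recovery or diagnosis


# Hypothetical error codes, chosen only for this illustration.
ERROR_CLASSES = {
    "FILE_NOT_FOUND": ErrorClass.USAGE,
    "VALUE_OUT_OF_RANGE": ErrorClass.USAGE,
    "TAPE_NOT_MOUNTED": ErrorClass.OPERATIONAL,
}


def classify(code: str) -> ErrorClass:
    """Treat any unrecognized error code as a potential problem."""
    return ERROR_CLASSES.get(code, ErrorClass.PROBLEM)


def recipient(error_class: ErrorClass) -> str:
    """Decide who is informed (an assumed policy for this sketch)."""
    if error_class is ErrorClass.USAGE:
        return "user"
    return "operator"
```

Defaulting unknown codes to PROBLEM is a deliberately cautious choice here: an error that nobody anticipated is more likely to need diagnosis than one that was listed in advance.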
For error conditions that might indicate a problem, your application must provide
recovery wherever possible and provide adequate diagnostic information when
recovery is not possible. Ideally, your application should provide the operator with
control over the amount and kind of detail collected for problem diagnosis.
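As a rough illustration of this philosophy, the Python sketch below tries a primary operation, falls back to an alternative, and produces a diagnostic report only when neither succeeds. The function name run_with_recovery, the two detail levels, and the report layout are hypothetical; a real application would route such a report to whatever operations software is in use.

```python
from typing import Callable

# Hypothetical detail levels that an operator could select at run time.
DETAIL_LEVELS = ("summary", "detailed")


def run_with_recovery(operation: Callable[[], str],
                      fallback: Callable[[], str],
                      detail: str = "summary") -> dict:
    """Try the operation, then the fallback; report diagnostics if both fail."""
    if detail not in DETAIL_LEVELS:
        raise ValueError(f"unknown detail level: {detail}")
    failures = []
    for name, step in (("primary", operation), ("fallback", fallback)):
        try:
            return {"ok": True, "result": step(), "recovered": name == "fallback"}
        except Exception as exc:  # broad on purpose: any failure triggers recovery
            failures.append((name, exc))
    # Recovery was not possible, so build the diagnostic report.
    report = {"ok": False, "error": type(failures[-1][1]).__name__}
    if detail == "detailed":
        # Extra detail is collected only when the operator asked for it.
        report["attempts"] = [f"{name}: {exc!r}" for name, exc in failures]
    return report
```

Passing detail="detailed" adds one entry per failed attempt to the report, while the default keeps the report to a single error name, which is one simple way to give the operator control over how much is collected.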
Checking for Errors
No simple formula exists for determining the correct response that an application must
make to an error. The correct response varies greatly depending on the severity of the
error, the type of application, the type of user, and so on. Applications that require high
levels of availability, however, should try to avoid simply terminating when an error
occurs, unless some form of corruption is indicated. Your application and supporting
software should first try every means possible to keep the application running. If the