Availability Guide for Application Design

Instrumenting an Application for Availability
Availability Guide for Application Design, 525637-004
Writing Code to Handle Problem Errors
Does the program send a message to the operator screen?
Are the operators available 24 hours a day?
Does the company have a pager system for operations or support personnel?
How does that software work on the server system? (It is probably a matter of
filtering the message and routing it to an automation program to send it to the
pager broadcast system.)
Does the site have an automation program? How does that program work?
Should the program create a special message token field that is easy to search
for and that indicates “file error, restartable - operator action urgently needed”,
with the error number in a standard token field?
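The token idea in the last question can be sketched as follows. This is a minimal illustration in Python, not the format of any particular event-management product: the field names ACTION-TOKEN and ERROR-NUMBER, the token value, the process name, and the needs_paging filter are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical token value; each site would define its own convention.
URGENT_RESTARTABLE = "FILE-ERROR-RESTARTABLE-URGENT"


@dataclass(frozen=True)
class EventMessage:
    process_name: str
    action_token: str   # fixed, easily searched token
    error_number: int   # file system error, kept in its own field
    text: str           # free text for the operator

    def render(self) -> str:
        """Render the message with tokens in fixed, searchable positions."""
        return (
            f"{self.process_name} ACTION-TOKEN={self.action_token} "
            f"ERROR-NUMBER={self.error_number} {self.text}"
        )


def needs_paging(rendered: str) -> bool:
    """Filter an automation program might apply before routing to a pager."""
    return f"ACTION-TOKEN={URGENT_RESTARTABLE}" in rendered


message = EventMessage(
    process_name="$APP1",  # hypothetical process name
    action_token=URGENT_RESTARTABLE,
    error_number=66,
    text="file error, restartable - operator action urgently needed",
)
print(message.render())
```

Because the filter matches on one fixed token instead of free text, the wording of the operator message can change without breaking the routing rule.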
Similar concerns exist if a double processor failure occurs:
What file system error or system message does the program look for to identify the
error? For example, file system error 66 indicates the loss of a mirrored disk.
Does the application terminate, or should it run until the problem is solved?
Again, if the program continues to run, it must loop and wait until the processors
restart and TMF recovers the data so that the program can process data
again.
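A "loop and wait" of this kind might look like the following Python sketch. It assumes a hypothetical probe function supplied by the application, for example one that retries the failed file operation and returns True once it succeeds. The delay values are placeholders, and the sleep function is a parameter only so that the loop can be exercised without real waiting.

```python
import time
from typing import Callable


def wait_until_available(
    probe: Callable[[], bool],
    delay_seconds: float = 5.0,
    max_delay_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Block until probe() returns True; return the number of failed probes.

    The delay doubles after each failure, up to max_delay_seconds, so that a
    long outage does not turn into a tight retry loop.
    """
    failures = 0
    delay = delay_seconds
    while not probe():
        failures += 1
        sleep(delay)
        delay = min(delay * 2, max_delay_seconds)
    return failures
```

Whether an unbounded wait is acceptable is a design decision. An application that should eventually terminate would add a maximum number of attempts and issue an operator message when that limit is reached.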
If the site is automated, does the program require code that can recognize the
extent of the problem?
Recovery might require many processes to be restarted. Which processes will
move automatically to other processors on persistent process restart, and which
are process pairs in the affected processors? Process pairs will need to be
restarted after processor restart. This error handling might be coded into the
application or set as rules in an automation engine.
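As one way to picture "recognizing the extent of the problem", the following Python sketch sorts processes into recovery categories, given the set of failed processors. The ProcessInfo record, the category names, and the process names are hypothetical. The sketch also assumes that a process pair with only one member in a failed processor keeps running on the surviving member.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Set


@dataclass(frozen=True)
class ProcessInfo:
    name: str
    primary_cpu: int
    backup_cpu: Optional[int] = None  # set only for process pairs
    persistent: bool = False          # restarted automatically elsewhere


def assess(processes: Iterable[ProcessInfo], failed_cpus: Set[int]) -> Dict[str, List[str]]:
    """Group process names by the recovery action they need."""
    plan: Dict[str, List[str]] = {
        "unaffected": [],
        "pair_survives": [],
        "moves_automatically": [],
        "restart_after_processor_restart": [],
        "needs_manual_restart": [],
    }
    for p in processes:
        primary_down = p.primary_cpu in failed_cpus
        backup_down = p.backup_cpu is not None and p.backup_cpu in failed_cpus
        if not primary_down and not backup_down:
            plan["unaffected"].append(p.name)
        elif p.backup_cpu is not None:
            # A pair is lost only if both of its processors failed.
            if primary_down and backup_down:
                plan["restart_after_processor_restart"].append(p.name)
            else:
                plan["pair_survives"].append(p.name)
        elif p.persistent:
            plan["moves_automatically"].append(p.name)
        else:
            plan["needs_manual_restart"].append(p.name)
    return plan
```

The same classification could equally be expressed as rules in an automation engine rather than as application code.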
Does the program make it feasible for support people to find the problem? Will the
program:
° Issue them a log message (ideally, to both the operator console and a disk
  log)?
° Capture the file system errors that caused the problem?
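Logging to two destinations and capturing the originating error can be sketched with Python's standard logging module. In this sketch the standard error stream stands in for the operator console, and the logger name, message layout, and file name are hypothetical.

```python
import logging
import sys


def build_logger(disk_log_path: str) -> logging.Logger:
    """Return a logger that writes each message to the console and a disk log."""
    logger = logging.getLogger("app.availability")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers if called twice
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    handlers = (
        logging.StreamHandler(sys.stderr),   # stand-in for the operator console
        logging.FileHandler(disk_log_path),  # disk log for support staff
    )
    for handler in handlers:
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger


def log_file_error(logger: logging.Logger, file_name: str,
                   operation: str, error_number: int) -> None:
    """Record which file, which operation, and which file system error."""
    logger.error("file=%s operation=%s fs_error=%d",
                 file_name, operation, error_number)
```

A call such as log_file_error(logger, "$DATA1.APP.ORDERS", "READ", 66) then leaves the same record in both places, so support staff can find the original error even after the console has scrolled.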
Does the application need a status database that defines the state of the parts of
the application, which processes are still running, and which processes are dead?
If a process is still running, we need to know what indicates its condition:
° Is performance reduced?
° Are the queues in the queue file too long?
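A status database of this kind can be pictured with the following in-memory Python sketch. The record layout, the threshold values, and the process names are hypothetical. A real application would keep these records in a database and choose thresholds from its own service-level targets.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable

# Hypothetical thresholds, chosen only for illustration.
MAX_QUEUE_DEPTH = 100
MAX_RESPONSE_MS = 500.0


class Condition(Enum):
    DEAD = "dead"
    DEGRADED = "degraded"
    HEALTHY = "healthy"


@dataclass
class ProcessStatus:
    name: str
    running: bool
    queue_depth: int = 0          # entries waiting in the queue file
    avg_response_ms: float = 0.0  # recent average response time


def classify(status: ProcessStatus) -> Condition:
    """Map one status record to a condition."""
    if not status.running:
        return Condition.DEAD
    if (status.queue_depth > MAX_QUEUE_DEPTH
            or status.avg_response_ms > MAX_RESPONSE_MS):
        return Condition.DEGRADED
    return Condition.HEALTHY


def summarize(statuses: Iterable[ProcessStatus]) -> Dict[str, Condition]:
    """Return the condition of every process, keyed by process name."""
    return {s.name: classify(s) for s in statuses}


snapshot = summarize([
    ProcessStatus("$SRV1", running=True, queue_depth=12, avg_response_ms=80.0),
    ProcessStatus("$SRV2", running=True, queue_depth=450, avg_response_ms=90.0),
    ProcessStatus("$SRV3", running=False),
])
```

Here $SRV2 is still running but is reported as degraded because its queue is too long, which is the distinction the two questions above are meant to expose.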
The ability to answer these questions depends on how well the application is
instrumented.