Availability Guide for Application Design


Instrumenting an Application for Availability

Availability Guide for Application Design—525637-004


Writing Code to Handle Problem Errors

Does the program send a message to the operator screen?

Are the operators available 24 hours a day?

Does the company have a pager system for operations or support personnel?

How does that software work on the server system? (It is probably a matter of filtering the message and routing it to an automation program, which sends it to the pager broadcast system.)

• Does the site have an automation program? How does that program work?

Should the program create a special, easily searchable message token field that indicates “file error, restartable - operator action urgently needed”, with the error number in a standard token field?
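The idea of a searchable token field can be sketched as follows. This is a minimal illustration in Python, not the NonStop event interface: the token names (CONDITION, ERROR-NUMBER, and so on), the condition value, the process and file names, and the needs_page filter rule are all hypothetical. A real site would use whatever names its event subsystem and automation rules agree on.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical condition value that an automation filter can match exactly.
URGENT_RESTARTABLE = "FILE-ERROR-RESTARTABLE-URGENT"


@dataclass
class EventMessage:
    process_name: str
    file_name: str
    error_number: int  # file system error, kept in its own standard field
    condition: str = URGENT_RESTARTABLE
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_tokens(self) -> dict:
        """Return the message as a flat token map that a filter can search."""
        return {
            "CONDITION": self.condition,
            "ERROR-NUMBER": self.error_number,
            "PROCESS": self.process_name,
            "FILE": self.file_name,
            "TIME": self.timestamp,
        }


def needs_page(tokens: dict) -> bool:
    """Hypothetical filter rule: page someone only for the urgent condition."""
    return tokens.get("CONDITION") == URGENT_RESTARTABLE


# Illustrative values only; the process and file names are made up.
msg = EventMessage(
    process_name="$APP1",
    file_name="$DATA1.ORDERS.MASTER",
    error_number=66,
)
tokens = msg.to_tokens()
```

Keeping the condition and the error number in separate fields means the filter can match on the condition alone, while the error number stays available to whoever answers the page.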

Similar concerns exist if a double processor failure occurs:

• What file system error or system message does the program look for to identify the error? For example, file system error 66 indicates the loss of a mirrored disk.

• Does the application terminate, or should it run until the problem is solved?

Again, if the program continues to run, it must loop and wait until the processors restart and TMF recovers the data, so that the program can process data again.

• If the site is automated, does the program require code that can recognize the extent of the problem?

Recovery potentially requires restarting many processes: which processes will move automatically to other processors on persistent process restart, and which are process pairs in the affected processors? Process pairs will need to be restarted after processor restart. This error handling might be coded into the application or set as rules in an automation engine.

• Does the program make it feasible for support people to find the problem? Will the program:

Issue them a log message (ideally, to both the operator console and a disk log)?

Capture the file system errors that caused the problem?

• Does the application need a status database that defines the state of the parts of the application: which processes are still running and which processes are dead?

If a process is still running, we need to know what indicates its condition:

Is performance reduced?

Are the queues in the queue file too long?

The ability to cope with these questions can depend on how well the application is instrumented.
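Two of the questions above, whether the program should loop and wait for recovery and whether the application keeps a status database, can be sketched together. The following Python is only an illustration under stated assumptions: status_db is an in-memory stand-in for a real status database, RetryableError stands in for whatever file system error the program has decided to wait out, and the names run_with_wait and record_status, the three states, and the retry limits are hypothetical. Whether a program should give up after a bounded number of attempts, as this sketch does, or wait indefinitely is exactly the design decision the text raises.

```python
import time
from enum import Enum


class State(Enum):
    RUNNING = "running"
    DEGRADED = "degraded"  # still running, but waiting on recovery
    DEAD = "dead"


class RetryableError(Exception):
    """Stand-in for an error the program has decided to wait out."""

    def __init__(self, error_number: int):
        super().__init__(f"file system error {error_number}")
        self.error_number = error_number


# In-memory stand-in for the status database: process name -> status record.
status_db: dict = {}


def record_status(process: str, state: State, last_error=None) -> None:
    """Overwrite the status record for one process."""
    status_db[process] = {"state": state, "last_error": last_error}


def run_with_wait(process, operation, max_attempts=5, delay_seconds=1.0,
                  sleep=time.sleep):
    """Retry `operation` until it succeeds or attempts run out.

    The status record is updated on every outcome, so support staff or an
    automation program can see whether the process is waiting or has given up.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            result = operation()
        except RetryableError as err:
            last_error = err.error_number
            record_status(process, State.DEGRADED, last_error)
            if attempt < max_attempts:
                sleep(delay_seconds)  # wait before trying again
        else:
            record_status(process, State.RUNNING)
            return result
    # All attempts failed: mark the process dead and keep the last error.
    record_status(process, State.DEAD, last_error)
    return None
```

A caller would wrap each data operation, for example `run_with_wait("$APP1", read_next_record)`, where `read_next_record` is a hypothetical function that raises `RetryableError` while the data is unavailable. The `sleep` parameter exists so that the wait can be replaced in tests.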