Availability Guide for Problem Management
Auditing Systems for Fault Tolerance
Availability Guide for Problem Management–125509
NonStop TM/MP has an additional benefit: it not only simplifies application design but
also extends fault tolerance to protect against multiple failures. For example, if both the
primary and mirror disk volumes on which a database resides suffer simultaneous head
crashes, NonStop TM/MP is able to recover the data.
Instrumenting Applications for Fault Tolerance
Even after you have installed the most reliable computer system and have followed
correct application design principles, end users can still lose access to the
application because of full-file or full-disk conditions or because an operational or
procedural error has stopped a critical application component.
Instrumentation allows you to anticipate and detect these kinds of problems and, in
many cases, can help you apply a solution quickly. Tandem provides instrumentation for
most system-level subsystems. It is equally important to instrument your applications to
eliminate problems that might take the applications offline. To be sure that application
downtime is kept to a minimum, Tandem recommends that you apply instrumentation to
all critical modules in your application.
For more information about instrumenting your applications for fault tolerance, refer to
the Availability Guide for Application Design.
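As an illustration of the kind of check that application instrumentation might
perform, the following minimal sketch raises an alert before a file or disk becomes
full. It is written in Python for readability and is not a Tandem interface: the
names CapacityAlert, check_capacity, and check_disk, and the 85 percent threshold,
are assumptions made for this example only.

```python
import shutil
from dataclasses import dataclass
from typing import Optional


@dataclass
class CapacityAlert:
    """Hypothetical alert record; a real application would route this to operators."""
    resource: str
    percent_used: float


def percent_used(used: int, total: int) -> float:
    """Return usage as a percentage; a resource with no capacity counts as 0% used."""
    return 0.0 if total <= 0 else 100.0 * used / total


def check_capacity(resource: str, used: int, total: int,
                   warn_at: float = 85.0) -> Optional[CapacityAlert]:
    """Return an alert when usage reaches the warning threshold, otherwise None."""
    pct = percent_used(used, total)
    return CapacityAlert(resource, pct) if pct >= warn_at else None


def check_disk(path: str, warn_at: float = 85.0) -> Optional[CapacityAlert]:
    """Apply the same check to the disk that holds the given path."""
    usage = shutil.disk_usage(path)
    return check_capacity(path, usage.used, usage.total, warn_at)
```

A critical module could call a check like this on a schedule and report any alert it
returns, so that the full-file or full-disk condition is handled before it stops the
application.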
Testing Applications for Graceful Recovery
Applications should be tested for graceful recovery from full-file and full-disk
conditions, from processor failures, and from failures of data communications resources
external to the Tandem system.
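One way to exercise a full-disk condition without filling a real disk is to
substitute a test double that fails every write. The sketch below, in Python, shows
the idea under stated assumptions: FullDiskWriter, MemoryWriter, and save_record are
hypothetical names, and deferring the record to a pending list is only one possible
recovery policy.

```python
import errno


class FullDiskWriter:
    """Test double: every write fails as if the disk were full."""

    def write(self, record: str) -> None:
        raise OSError(errno.ENOSPC, "No space left on device")


class MemoryWriter:
    """Test double: every write succeeds and is kept in memory."""

    def __init__(self) -> None:
        self.records = []

    def write(self, record: str) -> None:
        self.records.append(record)


def save_record(writer, record: str, pending: list) -> str:
    """Try to persist a record; on a full disk, defer it instead of failing."""
    try:
        writer.write(record)
        return "saved"
    except OSError as exc:
        if exc.errno != errno.ENOSPC:
            raise  # Only the full-disk condition is handled in this sketch.
        pending.append(record)
        return "deferred"
```

A test then asserts that the application keeps running and that no data is lost: the
call returns "deferred" and the record appears in the pending list. Processor and
communications failures can be tested in the same style with their own test doubles.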
Following Tandem Recommendations
One way to achieve software-configuration fault tolerance is to follow Tandem’s
recommendations for fault-tolerant systems. These recommendations are usually
documented in configuration and management manuals for the various Tandem
subsystems.
Using Process Pairs
You can use Guardian procedure calls to the operating system to provide fault tolerance
to your application by means of process pairs: A primary process runs the application,
while a secondary (backup) process in another processor module remains ready to take
over if the primary process fails. The primary process uses checkpoints to copy selected
parts of its environment to the backup process. Using this checkpoint information, the
backup process is able to take over from the primary process without interrupting
service to the user of the application.
The process-pair technique can be used to protect data that cannot be considered part of
a transaction and, therefore, cannot be protected by NonStop TM/MP; for example,
information that remains in memory and is never written to disk.
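The checkpoint-and-takeover idea can be sketched as a small simulation. The Python
code below is not the Guardian interface: a real process pair uses operating-system
procedure calls and runs the backup in another processor module, whereas this sketch
runs both roles in one process. The class names, method names, and the shape of the
state dictionary are all assumptions made for illustration.

```python
import copy


class BackupProcess:
    """Holds the most recent checkpoint and can take over as the new primary."""

    def __init__(self) -> None:
        self.state = {}

    def receive_checkpoint(self, state: dict) -> None:
        # Copy the state so later changes in the primary do not leak into the backup.
        self.state = copy.deepcopy(state)

    def take_over(self) -> "PrimaryProcess":
        # Resume from the last checkpoint; no new backup exists in this sketch.
        return PrimaryProcess(backup=None, state=self.state)


class PrimaryProcess:
    """Runs the application and checkpoints its state after each request."""

    def __init__(self, backup, state=None) -> None:
        self.backup = backup
        self.state = state if state is not None else {"next_seq": 1, "totals": {}}

    def handle(self, account: str, amount: int) -> int:
        totals = self.state["totals"]
        totals[account] = totals.get(account, 0) + amount
        self.state["next_seq"] += 1
        self.checkpoint()
        return totals[account]

    def checkpoint(self) -> None:
        if self.backup is not None:
            self.backup.receive_checkpoint(self.state)


# Usage: the primary handles two requests and then fails; the backup carries on.
backup = BackupProcess()
primary = PrimaryProcess(backup)
primary.handle("acct-1", 10)
primary.handle("acct-1", 5)
new_primary = backup.take_over()
running_total = new_primary.handle("acct-1", 1)  # continues from the checkpointed 15
```

Because the state is checkpointed after every request, the in-memory totals survive
the loss of the primary even though they were never written to disk. Checkpointing
less often would reduce overhead at the cost of losing work done since the last
checkpoint.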