Availability Guide for Problem Management 125509

Auditing Systems for Fault Tolerance
Performing a Fault-Tolerance Audit
A crisis that has been prepared for is much less of a crisis. Handling a wide variety of
problems requires detailed study and preparation. One of the best ways to prepare for
and prevent problems that can cause unplanned outages is to perform a detailed risk
analysis. The fault-tolerance audit, which is one type of risk analysis, can tell you what
is vulnerable in your system environment and to what extent your system environment is
exposed to preventable problems. Auditing your system for fault tolerance can help you
to understand how critical continuous operations and available applications and
databases are to your business.
Once the level of risk is understood, you can implement plans to minimize your system’s
exposure. You should create plans and tests for all levels of problems, from recovering a
failed drive in a mirrored pair to moving processing to a remote site and switching
control over to the other system. Your plans should be thorough, and they should be tested.
Tandem’s Professional Audit Services
Tandem’s Professional Audit Services allow you to benefit from Tandem’s expertise in
availability by having your application/service environment audited to identify areas
where application availability can be improved.
A fault-tolerance audit helps you to determine whether your system can survive the loss
of a resource or object and whether the resource or object can be reintegrated into your
system successfully. To ensure a successful audit of your system’s fault tolerance, you
should:
• Stop or remove each resource or object, simulating both single and multiple
  component failures
• Carefully analyze the effects of these failures
• Restore services made unavailable by the failures
• Evaluate and, where applicable, automate recovery operations
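The first two audit steps can be rehearsed on paper before any real resource is stopped. The following is a minimal Python sketch, not a Tandem tool. It assumes a hypothetical inventory (SERVICES) in which each service needs at least one working member from each of its redundancy groups. The resource names, the service names, and the functions surviving and audit are all invented for illustration. The sketch enumerates every single and double failure and reports which services each one would make unavailable. Restoring services and automating recovery are not modeled.

```python
from itertools import combinations

# Hypothetical inventory (an assumption for illustration): each service needs
# at least one working member from every redundancy group listed for it.
SERVICES = {
    "orders_db": [{"disk_a", "disk_b"}, {"cpu_0", "cpu_1"}],
    "reporting": [{"disk_c"}, {"cpu_0", "cpu_1"}],
}


def surviving(services, failed):
    """Return the names of services still available when `failed` resources are stopped."""
    failed = set(failed)
    return {
        name
        for name, groups in services.items()
        # A group survives if at least one of its members has not failed.
        if all(group - failed for group in groups)
    }


def audit(services, max_failures=2):
    """Simulate every combination of up to `max_failures` stopped resources.

    Returns a dict mapping each failure combination (a sorted tuple) to the
    sorted list of services it makes unavailable. Combinations that cause no
    outage are omitted.
    """
    resources = sorted(
        set().union(*(group for groups in services.values() for group in groups))
    )
    findings = {}
    for count in range(1, max_failures + 1):
        for failed in combinations(resources, count):
            lost = set(services) - surviving(services, failed)
            if lost:
                findings[failed] = sorted(lost)
    return findings


if __name__ == "__main__":
    for failed, lost in audit(SERVICES).items():
        print(f"stop {', '.join(failed)} -> unavailable: {', '.join(lost)}")
```

In this made-up inventory, the only single failure that causes an outage is the unmirrored disk_c, while losing both members of a pair (for example disk_a and disk_b) takes down the service that depends on it. A real audit would replace the model with observations made while the resources are actually stopped.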