Availability Guide for Problem Management–125509
7 Auditing Systems for Fault Tolerance
Overview
Auditing your system for fault tolerance is one of the most important ways to prevent
unplanned outages in your system environment. A fault-tolerance audit identifies any
potential problems that expose your online environment to unnecessary risk. Once these
problems are identified and resolved, you will have moved your system closer to your
goal of 24-hour-a-day, 7-day-a-week, 365-day-a-year (24x7x365) operations.
This section describes:
• Tandem’s philosophy of fault tolerance
• How to audit your system for fault tolerance
• How to configure your hardware and software for fault tolerance
Fault-Tolerant Operation
The hardware, operating system, and application environment of NonStop systems work
together to meet all of the demanding requirements of online processing, including fault
tolerance, system expandability, high performance, and effective networking.
The basic design philosophy of fault tolerance is that no single failure will stop or
contaminate the operating system and thus interrupt the delivery of service to end users.
This capability is called fault-tolerant operation. Redundant hardware, backup power
supplies, alternate data paths and bus paths, redundant controllers, and mirrored disks all
contribute to the fault tolerance of the operating system. The Introduction to Tandem
NonStop Systems describes these features in detail.
Continuous Operations
Properly configured Tandem hardware is designed to provide two forms of
continuous operation when an individual component fails: continuous execution of
processes and continued access to databases. Figure 7-1 illustrates both of these design
goals.