Availability Guide for Problem Management–125509

Auditing Systems for Fault Tolerance

Overview

Auditing your system for fault tolerance is one of the most important ways to prevent
unplanned outages in your system environment. A fault-tolerance audit identifies any
potential problems that expose your online environment to unnecessary risk. Once these
problems are identified and resolved, you will have moved your system closer to your
goal of 24-hour-a-day, 7-day-a-week, 365-day-a-year (24x7x365) operations.

This section describes:

• Tandem’s philosophy of fault tolerance

• How to audit your system for fault tolerance

• How to configure your hardware and software for fault tolerance

Fault-Tolerant Operation

The hardware, operating system, and application environment of NonStop systems work
together to meet all of the demanding requirements of online processing, including fault
tolerance, system expandability, high performance, and effective networking.

The basic design philosophy of fault tolerance is that no single failure will stop or
contaminate the operating system and thus interrupt the delivery of service to end users.
This capability is called fault-tolerant operation. Redundant hardware, backup power
supplies, alternate data paths and bus paths, redundant controllers, and mirrored disks all
contribute to the fault tolerance of the operating system. The Introduction to Tandem
NonStop Systems describes these features in detail.
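
The "no single failure" principle can be illustrated with a small sketch of the kind of
check a fault-tolerance audit performs. The example below is illustrative only: the
inventory format, the component names, and the single_points_of_failure function are
hypothetical and do not correspond to any Tandem utility or command. It assumes that each
component serves one role and that a role is protected only while at least two healthy
components serve it.

```python
from collections import defaultdict

# Hypothetical inventory: (component name, role it serves, is it currently up?).
# These names are assumptions made for illustration, not output of any Tandem tool.
INVENTORY = [
    ("cpu-0", "processor", True),
    ("cpu-1", "processor", True),
    ("psu-a", "power", True),
    ("psu-b", "power", False),           # backup supply is down
    ("ctrl-0", "disk-controller", True),  # no second controller configured
    ("disk-1-primary", "volume-1", True),
    ("disk-1-mirror", "volume-1", True),
]


def single_points_of_failure(inventory):
    """Return the roles that have fewer than two healthy components."""
    healthy = defaultdict(int)
    for _name, role, is_up in inventory:
        # Adding 0 still registers the role, so fully failed roles are reported too.
        healthy[role] += 1 if is_up else 0
    # A role with fewer than two healthy components has no working backup.
    return sorted(role for role, count in healthy.items() if count < 2)


print(single_points_of_failure(INVENTORY))
# prints ['disk-controller', 'power']
```

In this sketch the processors and the mirrored volume each survive one failure, while the
power and disk-controller roles are exposed, which is the type of finding an audit is
meant to surface.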

Continuous Operations

Properly configured Tandem hardware is designed to provide two forms of continuous
operations when an individual component fails: continuous execution of processes and
continued access to databases. Figure 7-1 illustrates both of these design goals.