Availability Guide for Application Design

Overview of Server and Network Fault Tolerance

Availability Guide for Application Design—525637-004

Instrumentation of System Components

Instrumentation is another important technique for keeping system-level components
fully functional. Hardware and software error detection was discussed earlier in this
section, including the need to isolate errant modules. Critical to the success of these
mechanisms is the ability to report errors to the operating system or to operations
staff so that they can respond appropriately.

Instrumentation helps keep the server system functional in the following ways:

• Allowing the status of modules to be queried. In this way, operations staff can
  receive informative data that enables them to schedule action to prevent outages.

• Providing resource availability information using thresholds and alarms. This type
  of information alerts operations when the availability of a resource is reaching a
  critical level; for example, a disk is 95 percent full or a processor is 90 percent
  busy. Human or automated operators can then respond proactively to prevent a
  possible outage.

• Providing information to help with online recovery. In this way, human or automated
  operators can get the server system back up in the least possible time.
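
The threshold-and-alarm idea in the second item can be sketched in a few lines.
This is only a minimal illustration in Python, not HP code: the Threshold class,
the check_thresholds function, and the resource names are hypothetical, and the
utilization readings are assumed to be supplied by the caller rather than read
from a real system.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Threshold:
    resource: str         # hypothetical resource name, for example "disk-1"
    condition: str        # word used in the alarm text, for example "full"
    limit_percent: float  # level at which operations should be alerted


def check_thresholds(readings: Dict[str, float],
                     thresholds: List[Threshold]) -> List[str]:
    """Return one alarm message per resource at or above its threshold.

    Resources with no reading are skipped, not treated as failures.
    """
    alarms = []
    for t in thresholds:
        value = readings.get(t.resource)
        if value is not None and value >= t.limit_percent:
            alarms.append(
                f"ALARM: {t.resource} is {value:.0f}% {t.condition} "
                f"(threshold {t.limit_percent:.0f}%)"
            )
    return alarms
```

Raising the alarm at a level below 100 percent, as in the 95-percent and
90-percent examples above, is what leaves a human or automated operator time
to act before the resource is exhausted.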

Two sets of techniques help provide system-level instrumentation:

• Event messages and Distributed Systems Management (DSM) tools

• First failure data capture and the HP Tandem Failure Data System (TFDS)

System-Level Instrumentation Using DSM

DSM provides a wide range of facilities for generating and handling event messages.
The source of such event messages can be HP subsystem software or application
software. Refer to Section 8, Instrumenting an Application for Availability, for
information on how to apply instrumentation to an application.

Many HP subsystems generate event messages for most of the events you need to be
aware of. You can use EMS programming techniques to generate additional messages
if necessary; refer to the EMS Manual for details. In addition, many subsystems
provide a Subsystem Programmatic Interface (SPI) that facilitates command and
control of the subsystem; refer to the SPI programming manual for the appropriate
subsystem.

You control the distribution and filtering of event messages. In addition, DSM
provides many tools and applications to help you handle the filtered messages. The
Availability Guide for Problem Management provides guidelines for these tasks.
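
As a rough illustration of what filtering and distribution mean in practice,
the following Python sketch routes each event message to every destination
whose filter accepts it. It does not use or imitate the EMS or SPI interfaces:
the Event fields, the distribute function, and the destination names are all
hypothetical and chosen only to show the idea.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Event:
    subsystem: str  # hypothetical name of the reporting subsystem
    number: int     # hypothetical event number within that subsystem
    critical: bool  # whether the event needs prompt attention
    text: str       # human-readable message text


# A filter is any function that decides whether a destination wants an event.
EventFilter = Callable[[Event], bool]


def distribute(events: Iterable[Event],
               routes: Dict[str, EventFilter]) -> Dict[str, List[Event]]:
    """Deliver each event to every destination whose filter accepts it."""
    delivered: Dict[str, List[Event]] = {name: [] for name in routes}
    for event in events:
        for name, accepts in routes.items():
            if accepts(event):
                delivered[name].append(event)
    return delivered
```

For example, a route such as `{"operator_console": lambda e: e.critical}`
would pass only critical events to an operator-facing destination, while a
second route could collect every event from one subsystem for later analysis.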

First Failure Data Capture

Some system code contains embedded calls that report error conditions to the HP
Tandem Failure Data System (TFDS). The kinds of error conditions typically checked
for include: