Availability Guide for Application Design

Overview of Server and Network Fault Tolerance
Availability Guide for Application Design 525637-004
Instrumentation of System Components
Instrumentation is another important technique for keeping system-level components
fully functional. Hardware and software error detection, including the need to isolate
errant modules, was discussed earlier in this section. These mechanisms succeed only
if errors can be reported to the operating system or to operations staff so that they
can respond appropriately.
Instrumentation helps keep the server system functional in the following ways:
• Allowing the status of modules to be queried. Operations staff can then obtain
  the data they need to schedule action that prevents outages.
• Providing resource availability information using thresholds and alarms. This type
  of information alerts operations staff when the availability of a resource is
  reaching a critical level; for example, a disk is 95 percent full or a processor is
  90 percent busy. Human or automated operators can then respond proactively to
  prevent a possible outage.
• Providing information to help with online recovery. Human or automated operators
  can then bring the server system back up in the least possible time.
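The threshold-and-alarm idea in the list above can be illustrated with a minimal Python sketch. It is not DSM or EMS code: the Threshold class, the check_thresholds function, and the sample readings are hypothetical names chosen for this illustration. Only the 95 percent disk and 90 percent processor limits come from the example in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Hypothetical alarm threshold for one monitored resource."""
    resource: str
    limit_percent: float


def check_thresholds(readings, thresholds):
    """Return one alarm message per resource at or above its threshold.

    readings maps a resource name to its current utilization in percent.
    Resources with no reading are skipped rather than treated as alarms.
    """
    alarms = []
    for threshold in thresholds:
        value = readings.get(threshold.resource)
        if value is not None and value >= threshold.limit_percent:
            alarms.append(
                f"ALARM {threshold.resource} at {value:.0f}% "
                f"(threshold {threshold.limit_percent:.0f}%)"
            )
    return alarms


# Limits taken from the example in the text; the readings are made up.
thresholds = [Threshold("disk", 95.0), Threshold("processor", 90.0)]
readings = {"disk": 96.0, "processor": 42.0}

alarms = check_thresholds(readings, thresholds)
for alarm in alarms:
    print(alarm)
```

In a real system the alarm would be issued as an event message for a human or automated operator rather than printed.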
Two sets of techniques help provide system-level instrumentation:
• Event messages and Distributed Systems Management (DSM) tools
• First failure data capture and the HP Failure Data System (TFDS)
System-Level Instrumentation Using DSM
DSM provides a wide range of facilities for generating and handling event messages.
The source of such event messages can be HP subsystem software or application
software. Refer to Section 8, Instrumenting an Application for Availability, for
information on how to apply instrumentation to an application.
Many HP subsystems generate event messages for most of the events you need to be
aware of. You can use Event Management Service (EMS) programming techniques to
generate additional messages if necessary; refer to the EMS Manual for details. In
addition, many subsystems
provide a Subsystem Programmatic Interface (SPI) that facilitates command and
control of the subsystem; refer to the SPI programming manual for the appropriate
subsystem.
You control the distribution and filtering of event messages. In addition, DSM
provides many tools and applications to help you handle the filtered messages. The
Availability Guide for Problem Management provides guidelines for these tasks.
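As a rough illustration of user-controlled filtering and distribution, the following Python sketch routes event messages to destinations according to filter functions. It does not use the EMS or SPI interfaces. The Event class, the distribute function, the subsystem names, and the destination names are all hypothetical and serve only to show the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical event message; real event messages carry more fields."""
    subsystem: str
    critical: bool
    text: str


def distribute(events, routes):
    """Deliver each event to every destination whose filter accepts it.

    routes maps a destination name to a filter function that takes an
    Event and returns True if that destination should receive it.
    """
    delivered = {name: [] for name in routes}
    for event in events:
        for name, accepts in routes.items():
            if accepts(event):
                delivered[name].append(event)
    return delivered


# Made-up events from two hypothetical sources.
events = [
    Event("DISK", True, "volume 95 percent full"),
    Event("DISK", False, "volume mounted"),
    Event("APP", True, "order server stopped"),
]

# Each destination has its own filter, chosen by the installation.
routes = {
    "operator_console": lambda e: e.critical,
    "disk_log": lambda e: e.subsystem == "DISK",
}

delivered = distribute(events, routes)
for name, received in delivered.items():
    print(name, [e.text for e in received])
```

Keeping each filter separate from the distribution loop means a destination's selection criteria can change without touching the code that generates events.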
First Failure Data Capture
Some system code contains embedded calls that report error conditions to the HP
Failure Data System (TFDS). The kinds of error conditions typically checked for
include: