Availability Guide for Problem Management

Automating Operations and Recovery Procedures

Availability Guide for Problem Management–125509

Ensure That Recovery Procedures Are Fully Documented and Tested

Before attempting to automate your operations and recovery procedures, you need to ensure that they are fully documented and tested. Documenting and testing your procedures should be done as part of your system and message management strategy: identifying important messages, defining their severity, and documenting the recovery steps. This will help you produce a document, sometimes known as an operations runbook, that specifies the critical events and describes how operators should react to them. This runbook can also be used to help build a set of EMS filters that select only the events that are relevant to your users’ environment. In addition, your runbook will help you define your operational policies as well as procedures.

Section 3, “Recovering From Unplanned Outages,” and Section 4, “Monitoring Event Messages,” provide more information on this topic.

What Tasks Can Be Automated on Tandem Systems?

Your automation strategy should include each of the following types of operations management tasks:

• Object state monitoring
• Critical resource monitoring
• Intervention and recovery tasks
• Repetitive tasks
• Problem determination steps
• Starting batch jobs

Object State Monitoring

Object state monitoring is an operations management task that can be automated easily to help you determine whether objects in your system environment are in an up, down, unknown, or odd state.

Section 5, “Monitoring Objects,” provides more detailed information on this topic.
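
The four states named above could be modeled as in the following Python sketch. It is illustrative only: the raw status strings, the object names, and the rule that any unrecognized status counts as odd are assumptions made for this example, not definitions taken from this guide.

```python
from enum import Enum
from typing import Dict, Optional


class ObjectState(Enum):
    UP = "up"
    DOWN = "down"
    UNKNOWN = "unknown"
    ODD = "odd"


# Assumed raw status strings; a real monitor would use whatever its subsystems report.
UP_STATUSES = {"STARTED"}
DOWN_STATUSES = {"STOPPED"}


def classify(raw_status: Optional[str]) -> ObjectState:
    """Map a raw status report to one of the four monitoring states."""
    if raw_status is None:
        # No report was received, so the state cannot be determined.
        return ObjectState.UNKNOWN
    status = raw_status.strip().upper()
    if status in UP_STATUSES:
        return ObjectState.UP
    if status in DOWN_STATUSES:
        return ObjectState.DOWN
    # Assumption: anything reported but not recognized is treated as odd.
    return ObjectState.ODD


def objects_needing_attention(
    statuses: Dict[str, Optional[str]]
) -> Dict[str, ObjectState]:
    """Return every object that is not in the up state, with its classified state."""
    classified = {name: classify(raw) for name, raw in statuses.items()}
    return {name: state for name, state in classified.items()
            if state is not ObjectState.UP}
```

For example, calling objects_needing_attention with a dictionary of hypothetical object names and their last reported statuses returns only the objects an operator or an automated recovery step would need to examine.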

Critical Resource Monitoring

The usage level of an object or resource might indicate a gradual degradation in the availability of the object (for example, the utilization of a communication line is reaching its theoretical limit), or it could signal the impending loss of an object (for example, a critical file is 80 percent full). In general, any object that is critical to the operation of a system or application should be monitored, and a usage threshold event should be reported when the usage level of the object exceeds the configured level. Monitoring critical resources is another task that can be automated.
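
The threshold rule described above, reporting an event when usage exceeds the configured level, can be sketched in Python as follows. This is a minimal sketch under stated assumptions: ThresholdEvent and check_usage are hypothetical names, usage is expressed as a percentage of capacity, and how the used and capacity figures are obtained is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ThresholdEvent:
    object_name: str
    usage_percent: float
    threshold_percent: float


def check_usage(object_name: str, used: float, capacity: float,
                threshold_percent: float) -> Optional[ThresholdEvent]:
    """Return a ThresholdEvent if usage exceeds the configured level, else None."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    usage_percent = 100.0 * used / capacity
    # An event is reported only when usage is strictly above the threshold.
    if usage_percent > threshold_percent:
        return ThresholdEvent(object_name, usage_percent, threshold_percent)
    return None
```

With a threshold of 80 percent, a file holding 850 of 1000 units produces an event, while a file at exactly 800 of 1000 units does not, because the comparison is strict. Whether to report at or only above the limit is a policy choice for the site.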

Defining threshold limits for the following critical objects, and monitoring their utilization, can help you prevent the loss of applications and end-user services:

• Disk files and volumes percent full
• Memory queues