Availability Guide for Problem Management
Automating Operations and Recovery Procedures
Availability Guide for Problem Management–125509
Ensure That Recovery Procedures Are Fully Documented and Tested
Before you attempt to automate your operations and recovery procedures, make sure that
they are fully documented and tested. Document and test your procedures as part of your
system and message management strategy: identify the important messages, define their
severity, and document the recovery steps. The result is a document, sometimes known as
an operations runbook, that specifies the critical events and describes how operators
should react to them. You can also use the runbook to help build a set of EMS filters that
select only the events that are relevant to your users’ environment. In addition, the
runbook helps you define your operational policies as well as your procedures.
Section 3, “Recovering From Unplanned Outages,” and Section 4, “Monitoring Event
Messages,” provide more information on this topic.
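The relationship between a runbook and event selection can be sketched in Python. This is only an illustration under stated assumptions: it is not the EMS filter language, and the event names, severity levels, recovery steps, and the helpers `is_relevant` and `select_events` are all hypothetical.

```python
from dataclasses import dataclass

# Severity names and their ordering are assumptions made for this sketch.
SEVERITY_RANK = {"info": 0, "warning": 1, "critical": 2}


@dataclass(frozen=True)
class RunbookEntry:
    event_id: str          # hypothetical event identifier
    severity: str          # one of the keys in SEVERITY_RANK
    recovery_steps: tuple  # ordered actions the operator should take


# Hypothetical runbook contents; real entries come from your own
# documented and tested procedures.
RUNBOOK = {
    "DISK-PATH-DOWN": RunbookEntry(
        "DISK-PATH-DOWN", "critical",
        ("Check the alternate path", "Notify the on-call analyst"),
    ),
    "LINE-RETRY": RunbookEntry(
        "LINE-RETRY", "warning",
        ("Review the line statistics",),
    ),
}


def is_relevant(event_id, min_severity="warning"):
    """Return True if the event is documented and at least min_severity."""
    entry = RUNBOOK.get(event_id)
    if entry is None:
        return False  # undocumented events are not selected
    return SEVERITY_RANK[entry.severity] >= SEVERITY_RANK[min_severity]


def select_events(event_ids, min_severity="warning"):
    """Keep only the events that the runbook marks as relevant."""
    return [e for e in event_ids if is_relevant(e, min_severity)]
```

Because the selection logic reads directly from the runbook, an event that has no documented recovery steps is never selected, which mirrors the advice to document procedures before automating them.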
What Tasks Can Be Automated on Tandem Systems?
Your automation strategy should include each of the following types of operations
management tasks:
• Object state monitoring
• Critical resource monitoring
• Intervention and recovery tasks
• Repetitive tasks
• Problem determination steps
• Starting batch jobs
Object State Monitoring
Object state monitoring is an operations management task that can be automated easily
to help you determine whether objects in your system environment are in an up, down,
unknown, or odd state.
Section 5, “Monitoring Objects,” provides more detailed information on this topic.
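As a minimal sketch of automated object state monitoring, the following Python fragment classifies reported states and lists the objects that are not up. The object names, the raw status strings, and the rule that any unrecognized status is treated as unknown are assumptions made for illustration; the fragment does not reflect any particular monitoring product.

```python
from enum import Enum


class ObjectState(Enum):
    UP = "up"
    DOWN = "down"
    UNKNOWN = "unknown"
    ODD = "odd"


def classify(raw_state):
    """Map a raw status string to an ObjectState.

    Any string that does not match a known state is classified as
    UNKNOWN (an assumption of this sketch).
    """
    try:
        return ObjectState(raw_state.strip().lower())
    except ValueError:
        return ObjectState.UNKNOWN


def objects_needing_attention(statuses):
    """Return the objects whose state is anything other than UP."""
    result = {}
    for name, raw_state in statuses.items():
        state = classify(raw_state)
        if state is not ObjectState.UP:
            result[name] = state
    return result
```

For example, given the hypothetical statuses `{"$DATA1": "UP", "$LINE2": "down"}`, only `$LINE2` is reported, so an operator or an automated recovery routine can concentrate on the objects that need intervention.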
Critical Resource Monitoring
The usage level of an object or resource might indicate a gradual degradation in the
availability of the object (for example, the utilization of a communication line is
approaching its theoretical limit), or it could signal the impending loss of an object (for
example, a critical file is 80 percent full). In general, any object that is critical to the
operation of a system or application should be monitored, and a usage threshold event
should be reported when the usage level of the object exceeds its configured threshold.
Monitoring critical resources is another task that can be automated.
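The threshold check described above can be sketched as follows. This is an illustrative Python fragment, not a product interface: the resource names, the percentage values, and the `ThresholdEvent` and `check_thresholds` names are hypothetical. An event is produced only when usage exceeds the configured level, so usage exactly at the threshold does not trigger one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThresholdEvent:
    resource: str             # hypothetical resource name
    usage_percent: float      # usage level that was measured
    threshold_percent: float  # configured level that was exceeded


def check_thresholds(usage, thresholds):
    """Return one ThresholdEvent per resource whose usage exceeds its limit.

    usage maps resource names to current usage percentages;
    thresholds maps resource names to configured limits.
    Resources with no usage reading are skipped.
    """
    events = []
    for resource, limit in thresholds.items():
        current = usage.get(resource)
        if current is not None and current > limit:
            events.append(ThresholdEvent(resource, current, limit))
    return events
```

Keeping the thresholds in a separate mapping lets you adjust the configured level for each critical object without changing the monitoring logic.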
Defining threshold limits for the following critical objects, and monitoring their
utilization, can help you prevent the loss of applications and end-user services:
• Disk files and volumes percent full
• Memory queues