Availability Guide for Problem Management 125509

6 Automating Operations and Recovery Procedures
Overview
Personnel costs for operations continue to grow in contrast to ongoing improvements in
the price/performance of computer systems. As operating systems and subsystems have
become more complex, the number of operations errors has increased. This situation
demands a transition from old technologies to a new generation of tools that automate
network and system management tasks.
This section describes:
- Automation and its importance in problem management
- Issues to be considered before automating your system management tasks
- How to avoid problems associated with automated recovery
- Tandem tools available for automated recovery
What Is Automation?
Automation allows you to intercept and respond to event messages, schedule jobs, and
recover from both planned and unplanned outages programmatically, without operator
intervention. By intervening programmatically, automated systems reduce the
burdensome workload of the operations staff and increase the availability of the system.
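As a minimal sketch of what "intercept and respond" can look like, the Python below matches incoming event messages against a small rule table and runs a recovery action without an operator. Everything in it is hypothetical and for illustration only: the Event fields, the subsystem and event names, and the restart_process action are assumptions and do not correspond to any Tandem product interface.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Event:
    """A simplified event message (hypothetical fields, not a real format)."""
    subsystem: str
    name: str
    subject: str  # the object the event is about, e.g. a process or device


@dataclass
class Rule:
    """Pairs a predicate (does this event match?) with a recovery action."""
    matches: Callable[[Event], bool]
    action: Callable[[Event], str]


def restart_process(event: Event) -> str:
    """Hypothetical recovery action; a real one would issue the restart."""
    return f"restarted {event.subject}"


# Hypothetical rule table: one rule that reacts to a process-down event.
RULES = [
    Rule(
        matches=lambda e: e.subsystem == "APP" and e.name == "PROCESS-DOWN",
        action=restart_process,
    ),
]


def handle(event: Event, rules=RULES) -> Optional[str]:
    """Run the first matching rule's action; return None if no rule matches."""
    for rule in rules:
        if rule.matches(event):
            return rule.action(event)
    return None  # unmatched events are left for an operator to review
```

Calling handle(Event("APP", "PROCESS-DOWN", "$SRV1")) returns the string produced by restart_process, while an event that matches no rule returns None and would still need operator attention.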
Why Is Automation Important?
As large online transaction-processing (OLTP) systems grow, demands on operations
staffs increase at a dramatic rate. Help-desk personnel become overloaded with mundane
tasks while trying to maintain high levels of service to end users. If single-component
failures are not repaired promptly, they can quickly escalate and cause significant
outages. This situation demands that operations and recovery tasks be automated.
What You Must Do Before You Automate
Before automating your operations and recovery tasks, you must ensure that:
- System and application event messages are being managed effectively.
- Important objects are being monitored.
- Operations and recovery tasks are fully documented.
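The three prerequisites above can be treated as a checklist to run against each task before it is automated. The following Python is a rough sketch under stated assumptions: the Task record and its three boolean fields are hypothetical names invented here to mirror the list, not part of any Tandem tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    """A candidate task for automation (hypothetical record)."""
    name: str
    events_managed: bool     # its event messages are managed effectively
    objects_monitored: bool  # the objects it depends on are monitored
    documented: bool         # its manual procedure is fully documented


def unmet_prerequisites(task: Task) -> List[str]:
    """Return the prerequisites this task does not yet satisfy."""
    checks = {
        "event messages managed": task.events_managed,
        "objects monitored": task.objects_monitored,
        "procedure documented": task.documented,
    }
    return [label for label, ok in checks.items() if not ok]


def ready_to_automate(task: Task) -> bool:
    """A task qualifies only when every prerequisite is met."""
    return not unmet_prerequisites(task)
```

A task whose manual procedure is not yet written down would be reported by unmet_prerequisites as missing "procedure documented", and ready_to_automate would return False for it.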