Introduction to NonStop Operations Management
Problem Management
Introduction to NonStop Operations Management–125507
Management Responsibilities
Managing the problem environment is most effective when problem-reporting and
problem-escalation policies and procedures are developed and enforced, and the staff is
trained in outage prevention and recovery.
Establishing Policies and Procedures
Experience has shown that organizations lacking problem-reporting and
problem-escalation procedures have higher error rates, operate less efficiently, take
longer to recover, and have a greater percentage of dissatisfied users.
Established problem-reporting and problem-escalation policies and procedures help you:
• Ensure that all identified problems are reported, recorded, assigned a priority, and
  resolved
• Track how quickly problems are resolved in order to determine if procedures need to
  be improved and if service-level agreements are being met
• Identify recurring problems in order to eliminate the problems or to help the staff
  resolve the problems more quickly
• Ensure that applications are designed to help your staff resolve problems when they
  occur
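The first three goals above (recording and prioritizing every problem, tracking resolution time against service-level agreements, and spotting recurring problems) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the record layout, the priority scale, the target times, and the sample data are hypothetical and do not describe any NonStop tool or interface.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class ProblemReport:
    """One reported problem (hypothetical record layout)."""
    problem_id: str
    summary: str
    priority: int                      # 1 = most urgent (assumed scale)
    reported_at: datetime
    resolved_at: Optional[datetime] = None

    def time_to_resolve(self) -> Optional[timedelta]:
        """Elapsed time from report to resolution, or None if still open."""
        if self.resolved_at is None:
            return None
        return self.resolved_at - self.reported_at


# Assumed resolution targets per priority; real values come from your own
# service-level agreements.
SLA_TARGETS = {1: timedelta(hours=1), 2: timedelta(hours=8), 3: timedelta(days=3)}


def sla_breaches(reports):
    """Return resolved reports that took longer than their priority's target."""
    breaches = []
    for r in reports:
        elapsed = r.time_to_resolve()
        if elapsed is not None and elapsed > SLA_TARGETS[r.priority]:
            breaches.append(r)
    return breaches


def recurring_problems(reports, min_count=2):
    """Return summaries reported at least min_count times, with their counts."""
    counts = Counter(r.summary.strip().lower() for r in reports)
    return {s: n for s, n in counts.items() if n >= min_count}


# Hypothetical sample data.
reports = [
    ProblemReport("P-1", "Disk path down", 1,
                  datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 30)),
    ProblemReport("P-2", "Disk path down", 1,
                  datetime(2024, 1, 2, 9, 0), datetime(2024, 1, 2, 11, 0)),
    ProblemReport("P-3", "Report job failed", 3,
                  datetime(2024, 1, 3, 9, 0)),
]

print([r.problem_id for r in sla_breaches(reports)])
print(recurring_problems(reports))
```

Matching recurring problems on the summary text is deliberately simplistic; in practice a problem category or component field would likely give more reliable grouping.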
Table 6-1. Unplanned Outage Classes
Outage Class     Description
Physical         Physical faults or failures in the hardware.
                 Examples include system disk failure, network router failure,
                 nonfault-tolerant hardware configurations (such as unmirrored
                 disk drives), and nonfault-tolerant application configurations.
Design           Design errors, such as bugs and design failures in hardware or
                 software.
                 An example is an application change that introduces unexpected
                 problems and makes the application unusable.
Operations       Errors caused by operations personnel through accident,
                 inexperience, or malice.
                 Examples include deleting data, incorrectly installing software,
                 procedural problems (or lack of procedures), lack of operator
                 training, and basic operations and maintenance tasks not being
                 done or not being done correctly.
Environmental    Failures in power, cooling, or network connections; natural
                 disasters (earthquake, flood); terrorism; and accidents.
                 Examples include air-conditioning system failure, power failures
                 (such as dead batteries or no backup generator), and a computer
                 in a basement destroyed by flood.
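If recorded outages are tagged with one of the four classes in Table 6-1, a simple tally shows where unplanned outages concentrate. The following Python sketch assumes a hypothetical outage log whose entries have already been classified by a reviewer; the log format and the sample entries are illustrative only.

```python
from collections import Counter
from enum import Enum


class OutageClass(Enum):
    """The four unplanned outage classes from Table 6-1."""
    PHYSICAL = "Physical"
    DESIGN = "Design"
    OPERATIONS = "Operations"
    ENVIRONMENTAL = "Environmental"


# Hypothetical outage log: (description, class assigned by the reviewer).
outage_log = [
    ("System disk failure", OutageClass.PHYSICAL),
    ("Application change made the application unusable", OutageClass.DESIGN),
    ("Software installed incorrectly", OutageClass.OPERATIONS),
    ("Air-conditioning system failure", OutageClass.ENVIRONMENTAL),
    ("Operator deleted data", OutageClass.OPERATIONS),
]


def outages_by_class(log):
    """Count outages per class, including classes with no outages."""
    counts = Counter(cls for _, cls in log)
    return {cls.value: counts.get(cls, 0) for cls in OutageClass}


print(outages_by_class(outage_log))
```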