Availability Guide for Problem Management

Recovering From Unplanned Outages
Availability Guide for Problem Management 125509
These reports help you measure the performance of your staff, determine whether
service-level agreements are being fulfilled, and determine what training or changes are
needed to improve your problem reporting, tracking, escalation, and recovery
procedures.
Performing Root-Cause Analysis
Root-cause analysis can be an important tool in your problem-review activities. Root-
cause analysis is a problem isolation and detection methodology that can help you
prevent the same problems from recurring in your environment. It allows you to
investigate all of the factors and conditions that contribute to a problem.
One approach to root-cause analysis is to localize the problem by trying to reproduce it.
Another approach is to vary the modules that may have contributed to the problem, for
example:
• Run the application on different hardware
• Substitute a prior release of the software
• Try to duplicate the problem with a minimum (or maximum) configuration of
  hardware and software
• Try different connection methods
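As an illustration only, the variations listed above can be organized as a one-factor-at-a-time comparison against a baseline configuration. The following Python sketch is not part of any product; the function name, the configuration keys, and the simulated test are all hypothetical stand-ins for whatever procedure you actually use to reproduce the problem.

```python
from typing import Callable, Dict, List

Config = Dict[str, str]


def vary_one_factor(
    baseline: Config,
    alternatives: Dict[str, List[str]],
    reproduces: Callable[[Config], bool],
) -> Dict[str, List[str]]:
    """Change one factor at a time and record, for each factor, the
    alternative values under which the problem no longer reproduces."""
    suspects: Dict[str, List[str]] = {}
    for factor, values in alternatives.items():
        for value in values:
            # Copy the baseline and change only the factor under test.
            trial = dict(baseline, **{factor: value})
            if not reproduces(trial):
                suspects.setdefault(factor, []).append(value)
    return suspects


# Hypothetical baseline: the configuration in which the problem was seen.
baseline = {
    "hardware": "system-a",
    "software_release": "current",
    "connection": "lan",
}

# Hypothetical alternatives, one list per factor to vary.
alternatives = {
    "hardware": ["system-b"],
    "software_release": ["previous"],
    "connection": ["dial-up"],
}


def simulated_reproduces(config: Config) -> bool:
    # Stand-in for a real test run. In this invented scenario the
    # problem appears only with the current software release.
    return config["software_release"] == "current"


suspects = vary_one_factor(baseline, alternatives, simulated_reproduces)
print(suspects)  # {'software_release': ['previous']}
```

A factor that appears in the result is one whose change made the problem disappear, which makes it a candidate for closer investigation. It does not prove that the factor is the root cause, especially if the problem is intermittent.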
Essentially, root-cause analysis helps you to determine whether you have detected all of
the causes of a problem. Root-cause analysis might include the following steps:
• Identifying the hardware and software components involved in the problem and the
  “order” and configuration in which they participated in it.
• Eliminating hardware and software components that did not participate in the
  problem.
• Mapping out how the components involved in the problem interact:
    - Among themselves
    - With recent changes to the environment
    - With components missing (or added)
• Determining how the problem reacts to various changes, such as:
    - Starting and stopping the components involved
    - Changing the application load
    - Changing the time of day, patterns of usage, and so on
    - Making other operational changes
• Checking halt and error codes in system and Guardian manuals to see if the problem
  is software-related or hardware-related.
• Checking error codes and event messages to see if the problem is related to an
  external product.
• Testing your hypothesis
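The elimination step above can be pictured as removing one component at a time and checking whether the problem still occurs. The following Python sketch is an assumption-laden illustration: the component names and the simulated test are invented, and the approach shown is a simple greedy pass, not a method prescribed by this guide.

```python
from typing import Callable, List, Sequence


def eliminate_nonparticipants(
    components: Sequence[str],
    reproduces: Callable[[List[str]], bool],
) -> List[str]:
    """Greedy one-at-a-time elimination: drop a component whenever the
    problem still reproduces without it, and keep it otherwise."""
    remaining = list(components)
    for component in list(remaining):
        trial = [c for c in remaining if c != component]
        if reproduces(trial):
            # The problem persists without this component, so it did
            # not participate and can be set aside.
            remaining = trial
    return remaining


# Hypothetical component names, for illustration only.
components = ["application", "database", "terminal-driver", "print-spooler"]


def simulated_reproduces(active: List[str]) -> bool:
    # Stand-in for a real test run. In this invented scenario the
    # problem needs both the application and the database.
    return {"application", "database"} <= set(active)


participants = eliminate_nonparticipants(components, simulated_reproduces)
print(participants)  # ['application', 'database']
```

The components that remain form the hypothesis to test. A single pass like this assumes the problem reproduces reliably; for an intermittent problem, each trial would need to be repeated before a component is ruled out.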