Availability Guide for Problem Management

Contents
3. Recovering From Unplanned Outages
Overview
Systematic Problem Solving
Step 1—Detecting and Isolating the Problem
Monitoring Messages
Monitoring Objects
Monitoring Performance
Step 2—Gathering Facts and Reporting the Problem
Step 3—Identifying the Cause and Developing and Implementing a Solution
Tools for Problem Analysis
Developing and Implementing a Solution
Step 4—Escalating the Problem
Step 5—Reviewing the Problem
Asking the Right Questions
Detecting Trends
Performing Root-Cause Analysis
Tools for Root-Cause Analysis
4. Monitoring Event Messages
Overview
What Are System and Application Event Messages?
Managing System Event Messages
Managing System Event Messages With EMS
Why Is System Event Message Management Important?
Getting Control of System Event Message Management
Step 1—Analyzing System Event Messages
Step 2—Filtering System Event Messages
Step 3—Writing Operations and Recovery Procedures
Step 4—Automating Operations and Recovery Procedures
Managing Application Event Messages
What Is Application Event Message Management?
Why Is Application Event Message Management Important?
Getting Control of Application Event Message Management
Step 1—Instrumenting Applications to Generate EMS Event Messages
Step 2—Analyzing Application Event Messages
Step 3—Filtering Application Event Messages
Step 4—Writing Operations and Recovery Procedures
Creating Separate Event Monitoring Environments
Event Monitoring Tools