Availability Guide for Problem Management

Contents

Availability Guide for Problem Management–125509

3. Recovering From Unplanned Outages

Overview 3-1

Systematic Problem Solving 3-1

Step 1—Detecting and Isolating the Problem 3-2

Monitoring Messages 3-2

Monitoring Objects 3-4

Monitoring Performance 3-6

Step 2—Gathering Facts and Reporting the Problem 3-8

Step 3—Identifying the Cause and Developing and Implementing a Solution 3-14

Tools for Problem Analysis 3-16

Developing and Implementing a Solution 3-18

Step 4—Escalating the Problem 3-19

Step 5—Reviewing the Problem 3-22

Asking the Right Questions 3-22

Detecting Trends 3-22

Performing Root-Cause Analysis 3-23

Tools for Root-Cause Analysis 3-24

4. Monitoring Event Messages

Overview 4-1

What Are System and Application Event Messages? 4-2

Managing System Event Messages 4-2

Managing System Event Messages With EMS 4-3

Why Is System Event Message Management Important? 4-3

Getting Control of System Event Message Management 4-4

Step 1—Analyzing System Event Messages 4-5

Step 2—Filtering System Event Messages 4-7

Step 3—Writing Operations and Recovery Procedures 4-7

Step 4—Automating Operations and Recovery Procedures 4-7

Managing Application Event Messages 4-9

What Is Application Event Message Management? 4-9

Why Is Application Event Message Management Important? 4-9

Getting Control of Application Event Message Management 4-10

Step 1—Instrumenting Applications to Generate EMS Event Messages 4-10

Step 2—Analyzing Application Event Messages 4-14

Step 3—Filtering Application Event Messages 4-15

Step 4—Writing Operations and Recovery Procedures 4-16

Creating Separate Event Monitoring Environments 4-16

Event Monitoring Tools 4-18