Introduction to NonStop Operations Management


Operations Management and Continuous Improvement

Introduction to NonStop Operations Management–125507

Implementing an Operations-Management Improvement Program

  • Selected the important messages for each subsystem, defined their severity, and documented the recovery steps.

  • Produced a document that specified the critical events and described how operators should respond to them.

  • Used the document to build a set of filters managed by the Event Management Service (EMS). The EMS filters selected only the events that were relevant to the users’ environment. The filters also specified whether the events were critical.

  • Used the document to create an online runbook that defined the operational procedures for each critical event.

  For more information about how to manage system messages, refer to the Availability Guide for Problem Management.
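The way a single critical-events document can drive both the event filter and the online runbook can be illustrated with a short sketch. The Python below is not EMS filter syntax and calls no EMS interface. The subsystem names, event numbers, and recovery text are hypothetical, chosen only to show the idea.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CatalogEntry:
    """One row of the critical-events document (all values are hypothetical)."""
    critical: bool
    recovery: str


# Hypothetical catalog keyed by (subsystem, event number).
CATALOG = {
    ("DISK", 101): CatalogEntry(True, "Check the mirrored volume and bring the failed half back up."),
    ("LINE", 210): CatalogEntry(True, "Restart the communication line and notify the network group."),
    ("SPOOL", 305): CatalogEntry(False, "No action needed; note the event in the shift log."),
}


def filter_event(subsystem: str, number: int) -> Optional[CatalogEntry]:
    """Pass only events listed in the catalog; anything else is dropped (None)."""
    return CATALOG.get((subsystem, number))


def runbook_step(subsystem: str, number: int) -> Optional[str]:
    """Return the documented procedure for a critical event, otherwise None."""
    entry = filter_event(subsystem, number)
    if entry is not None and entry.critical:
        return entry.recovery
    return None
```

Keeping the severity and the recovery text in one catalog means the filter and the runbook cannot drift apart, which is the point of building both from the same document.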

• Action 2: Manage application messages. The next problem the improvement team faced was managing application messages. To accomplish this, the improvement team:

  • Established design standards specifying that applications should use EMS to generate events.

  • Used the NonStop Virtual Hometerm Subsystem (VHS) to convert messages from applications developed before EMS into EMS format.

  • Created a second console environment dedicated to applications. This arrangement helped operators understand the cause-and-effect relationship of problems. For example, a communication line going down might generate only one critical system message, whereas the application might generate ten critical messages.

  • Required that development programmers make minor changes in the application programs to reduce the number of informational messages. This reduced the information overload that operators faced. Operators could focus on critical events instead of having to react to every event.

  For more information about how to manage application messages, refer to the Availability Guide for Problem Management.
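The idea of turning free-text application messages into structured events, and then showing operators only the critical ones, can be sketched as follows. This is a conceptual illustration in Python, not a description of how VHS or EMS work internally. The keyword list, the `AppEvent` type, and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class AppEvent:
    """A structured event built from a legacy text message (hypothetical shape)."""
    application: str
    severity: str  # "critical" or "informational"
    text: str


# Hypothetical keywords that mark a legacy message as critical.
CRITICAL_KEYWORDS = ("ABEND", "FAILED", "LINE DOWN")


def to_event(application: str, raw_message: str) -> AppEvent:
    """Wrap a free-text message in a structured event and assign a severity."""
    text = raw_message.strip()
    upper = text.upper()
    is_critical = any(keyword in upper for keyword in CRITICAL_KEYWORDS)
    return AppEvent(application, "critical" if is_critical else "informational", text)


def application_console(events: Iterable[AppEvent]) -> List[AppEvent]:
    """Return only the events an operator at the application console should see."""
    return [event for event in events if event.severity == "critical"]
```

In the account above, the team reduced informational messages by changing the applications themselves; the console filter here only stands in for the resulting effect on what operators see.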

• Action 3: Monitor critical objects. At NAC, more than 10,000 objects interacted to provide end-user services. Processors, disks, printers, communication lines, processes, files, and terminals had to be fully and continuously operational. Operators could not possibly verify the health of this system manually.

  To meet their object-monitoring needs, the improvement team selected the Object Monitoring Facility (OMF) software to:

  • Continuously monitor objects at user-defined intervals as short as one minute.

  • Generate EMS-compatible events that can be filtered, displayed on the operator console, and used by automated-operations software to recover from problems.
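Interval-based object monitoring of the kind described above can be sketched in a few lines. The Python below is a generic polling loop, not OMF itself or its configuration. The object names, the `probe` and `emit` callbacks, and the event fields are hypothetical. The one-minute floor mirrors the shortest interval mentioned in the text.

```python
import time
from typing import Callable, Dict, List

MIN_INTERVAL_SECONDS = 60  # the text cites intervals as short as one minute


def poll_once(objects: List[str], probe: Callable[[str], bool]) -> List[Dict[str, str]]:
    """Check each object once and return one event per object that is not up."""
    return [
        {"object": name, "severity": "critical", "text": f"{name} is not operational"}
        for name in objects
        if not probe(name)
    ]


def monitor(objects, probe, emit, interval_seconds=MIN_INTERVAL_SECONDS,
            cycles=1, sleep=time.sleep):
    """Poll all objects for a fixed number of cycles, emitting an event per failure."""
    if interval_seconds < MIN_INTERVAL_SECONDS:
        raise ValueError("interval must be at least one minute")
    for cycle in range(cycles):
        for event in poll_once(objects, probe):
            emit(event)  # e.g. hand the event to a console or a recovery routine
        if cycle < cycles - 1:
            sleep(interval_seconds)  # wait before the next polling cycle
```

A caller would pass a real status check as `probe` and a routine that forwards events to the console or to automated-recovery software as `emit`; passing a no-op `sleep` makes the loop easy to exercise without waiting.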