Introduction to NonStop Operations Management

Operations Management and Continuous
Improvement
Implementing an Operations-Management
Improvement Program
• Selected the important messages for each subsystem, defined their severity, and
  documented the recovery steps.
• Produced a document that specified the critical events and described how
  operators should respond to them.
• Used the document to build a set of filters managed by the Event Management
  Service (EMS). The EMS filters selected only the events that were relevant to
  the users’ environment. The filters also specified whether the events were
  critical.
• Used the document to create an online runbook that defined the operational
  procedures for each critical event.
For more information about how to manage system messages, refer to the
Availability Guide for Problem Management.
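The filter and runbook steps above can be pictured with a small sketch. It is
written in Python purely for illustration and is not EMS filter syntax; the
Event class, the EVENT_SPEC table, the subsystem labels, the event numbers, and
the procedure texts are all hypothetical. The sketch assumes a single
specification table that drives both the filtering decision and the runbook
lookup, mirroring how the team used one document for both purposes.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Event:
    subsystem: str  # hypothetical subsystem label, e.g. "DISK"
    number: int     # hypothetical event number
    text: str       # message text as received


# Hypothetical stand-in for the team's specification document:
# (subsystem, event number) -> (is the event critical?, recovery procedure)
EVENT_SPEC: Dict[Tuple[str, int], Tuple[bool, str]] = {
    ("DISK", 101): (True, "Check the mirror volume, then notify the storage on-call."),
    ("LINE", 205): (True, "Restart the communication line and confirm traffic resumes."),
    ("SPOOL", 310): (False, "No operator action; review during the daily check."),
}


def filter_event(event: Event) -> Optional[dict]:
    """Drop events that are not in the specification; flag the rest.

    Returns None for an irrelevant event. For a relevant event, returns the
    event, its critical flag, and the runbook procedure (critical events only).
    """
    entry = EVENT_SPEC.get((event.subsystem, event.number))
    if entry is None:
        return None  # not relevant to this environment
    critical, procedure = entry
    return {
        "event": event,
        "critical": critical,
        "procedure": procedure if critical else None,
    }
```

Keeping the severity and the procedure in one table means a change to the
specification updates the filter and the runbook together, which is one
plausible reason to derive both from the same document.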
Action 2: Manage application messages. The next problem the improvement team
faced was managing the application messages. To accomplish this, the
improvement team:
• Established design standards specifying that applications should use EMS to
  generate events.
• Used NonStop Virtual Hometerm Subsystem (VHS) to convert application
  messages to EMS format for applications developed before EMS.
• Created a second console environment dedicated to applications. This
  arrangement helped operators understand the cause-and-effect relationship of
  problems. For example, a communication line going down might generate only
  one critical system message, whereas the application might generate ten
  critical messages.
• Required that development programmers make minor changes in the application
  programs to reduce the number of informational messages. This reduced the
  information overload that operators faced. Operators could focus on critical
  events instead of having to react to every event.
For more information about how to manage application messages, refer to the
Availability Guide for Problem Management.
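The idea behind Action 2, bringing free-text application messages into one
structured event format and keeping informational traffic off the operator's
screen, can be sketched as follows. This is a conceptual Python illustration
only and does not describe how VHS or EMS work internally. The severity
prefixes, the to_event and application_console names, and the rule that
unprefixed lines count as informational are all assumptions made for the
example.

```python
import re

# Hypothetical convention: legacy messages may start with "SEVERITY: text".
_PREFIX = re.compile(r"^(CRITICAL|WARNING|INFO)\s*:\s*(.*)$", re.IGNORECASE)


def to_event(app_name: str, raw_line: str) -> dict:
    """Convert one free-text application message into a structured event."""
    line = raw_line.strip()
    match = _PREFIX.match(line)
    if match:
        severity, text = match.group(1).upper(), match.group(2)
    else:
        # Assumption: a line with no recognized prefix is informational.
        severity, text = "INFO", line
    return {"source": app_name, "severity": severity, "text": text}


def application_console(events: list) -> list:
    """Return only the critical events for the dedicated application console."""
    return [e for e in events if e["severity"] == "CRITICAL"]
```

For example, passing the events for "critical: line down" and "INFO: batch
started" through application_console leaves only the first, so the operator
sees the line failure without the surrounding informational messages.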
Action 3: Monitor critical objects. At NAC, more than 10,000 objects interacted to
provide end-user services. Processors, disks, printers, communication lines,
processes, files, and terminals had to be fully and continuously operational.
Operators could not possibly verify the health of this system manually.
To meet their object-monitoring needs, the improvement team selected the Object
Monitoring Facility (OMF) software to:
• Continuously monitor objects at user-defined intervals as short as one
  minute.
• Generate EMS-compatible events that can be filtered and displayed on the
  operator console and used by automated-operations software to recover from
  problems.
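The monitoring pattern described above, checking each object at a fixed
interval and emitting an event when one is unhealthy, can be sketched in
Python. This is a generic polling loop under stated assumptions and is not the
OMF implementation or its interface. The function names, the event fields, and
the is_healthy and emit callbacks are hypothetical; only the one-minute minimum
interval comes from the text.

```python
import time
from typing import Callable, Iterable, List

MIN_INTERVAL_SECONDS = 60  # the text gives one minute as the shortest interval


def poll_once(objects: Iterable[str], is_healthy: Callable[[str], bool]) -> List[dict]:
    """Check every object once and return an event for each unhealthy one."""
    return [
        {"object": name, "state": "DOWN", "critical": True}
        for name in objects
        if not is_healthy(name)
    ]


def monitor(
    objects: List[str],
    is_healthy: Callable[[str], bool],
    interval_seconds: int,
    cycles: int,
    emit: Callable[[dict], None],
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll the objects for a fixed number of cycles, emitting events as found.

    `emit` stands in for whatever forwards events to the console or to
    automation; `sleep` is injectable so the loop can be exercised without
    waiting.
    """
    if interval_seconds < MIN_INTERVAL_SECONDS:
        raise ValueError("interval must be at least one minute")
    for cycle in range(cycles):
        for event in poll_once(objects, is_healthy):
            emit(event)
        if cycle < cycles - 1:
            sleep(interval_seconds)  # wait before the next polling cycle
```

A production monitor would run indefinitely rather than for a fixed number of
cycles; the bound here only keeps the sketch easy to run and reason about.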