Introduction to NonStop Operations Management
Operations Management and Continuous
Improvement
Introduction to NonStop Operations Management–125507
Implementing an Operations-Management
Improvement Program
• Provide a high-level view of the system that operators can easily interpret. OMF
can represent many thousands of objects and their states on one screen. With a
quick look at this screen, operators get an immediate impression of the health of
the system they have to manage.
The improvement team implemented OMF in stages, beginning with processors,
followed by disks, processes, spooler objects, and finally TMF. This helped
operators gain experience with one or two objects at a time.
For more information about object monitoring, refer to the Availability Guide for
Problem Management.
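The idea of condensing many object states into one view can be sketched in a few lines. The following is a minimal illustration only, not OMF's actual data model or interface: the `State` levels, the `summarize` function, and the sample object names are all hypothetical. It rolls individual object states up into a per-type count and a "worst state", which is the kind of summary an operator can read at a glance.

```python
from collections import Counter
from enum import IntEnum


class State(IntEnum):
    """Hypothetical severity levels; a higher value means a worse state."""
    UP = 0
    WARNING = 1
    DOWN = 2


def summarize(objects):
    """Roll up (object_type, name, state) tuples into per-type counts and worst state."""
    summary = {}
    for obj_type, _name, state in objects:
        entry = summary.setdefault(obj_type, {"counts": Counter(), "worst": State.UP})
        entry["counts"][state] += 1
        entry["worst"] = max(entry["worst"], state)
    return summary


# Made-up sample objects, for illustration only.
objects = [
    ("processor", "CPU 0", State.UP),
    ("processor", "CPU 1", State.UP),
    ("disk", "$DATA1", State.WARNING),
    ("disk", "$DATA2", State.UP),
    ("process", "$SPLS", State.DOWN),
]

summary = summarize(objects)
for obj_type, entry in summary.items():
    total = sum(entry["counts"].values())
    print(f"{obj_type:<10} worst={entry['worst'].name:<8} total={total}")
```

Reporting only the worst state per object type keeps the display small even when the number of monitored objects grows into the thousands.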
• Action 4: Implement automation. Completing the preceding actions allowed
operators to display significant events and detect critical conditions before they
occurred. Now the improvement team was ready to implement an automated
operator product. To accomplish this, the improvement team:
   • Used the default rule set to perform problem recovery for the Pathway, Expand,
     and SNAX subsystems.
   • Wrote customized recovery rules for their specific installation.
   • Used OMF to develop and optimize new rules for objects monitored by OMF.
   • Coded the automated operator so that an event is generated each time a recovery
     rule is executed. This helped operators know when a problem occurred and the
     outcome of the recovery.
For more information about implementing automation, refer to the Availability
Guide for Problem Management.
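The practice of generating an event each time a recovery rule executes can be illustrated with a small sketch. This is not the rule language or interface of any automated operator product: `RecoveryRule`, `run_rules`, the event fields, and the sample line name are hypothetical and chosen only to show the pattern. Each rule execution reports both that a problem occurred and how the recovery ended.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RecoveryRule:
    name: str
    matches: Callable[[dict], bool]   # does this rule apply to the incoming event?
    recover: Callable[[dict], bool]   # attempt the recovery; True on success


def run_rules(event: dict, rules: List[RecoveryRule], emit: Callable[[dict], None]) -> bool:
    """Run the first matching rule and emit an event describing the outcome."""
    for rule in rules:
        if rule.matches(event):
            ok = rule.recover(event)
            # Report every rule execution, whether or not the recovery worked.
            emit({
                "type": "RECOVERY",
                "rule": rule.name,
                "subject": event["subject"],
                "outcome": "succeeded" if ok else "failed",
            })
            return ok
    return False  # no rule matched; the problem is left for manual handling


log = []  # stands in for the event log that operators watch
rules = [
    RecoveryRule(
        name="restart-line",
        matches=lambda e: e["subsystem"] == "EXPAND" and e["condition"] == "line down",
        recover=lambda e: True,  # placeholder for a real recovery action
    ),
]
incoming = {"subsystem": "EXPAND", "subject": "$LINE1", "condition": "line down"}
handled = run_rules(incoming, rules, log.append)
print(handled, log)
```

Because the outcome is recorded as an event of its own, the same log can later be used to count automated recoveries and to find rules that fail often.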
• Action 5: Implement process statistics. After implementing such significant changes,
the improvement team wanted to measure the results. Specifically, they wanted to
review and optimize the automated recovery rules. To accomplish this, the
improvement team used EMS Analyzer (EMSA) to track the efficiency of
automation. They made the following observations:
   • Manual recoveries increased in December after the operations console was
     installed. Because of the improved visibility of messages, operators could detect
     and fix problems that had previously gone unnoticed.
   • After the automated operator was installed, automated recoveries began to
     replace manual recoveries.
   • During the first few months after the automated operator was installed, it
     recovered between 50 and 80 incidents per week without operator intervention.
     After OMF was used to develop and optimize new rules, automated recoveries
     grew to 300 per week.
Figure 13-3 compares the number of problem events recovered manually with the
number recovered by the automated operator during the improvement program.
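A comparison of this kind comes down to counting recovery events per week by how they were handled. The sketch below shows one way to produce such a tally. It does not reproduce EMSA output or its input format: the `(date, kind)` record layout and the `weekly_counts` function are assumptions, and the sample records are invented for illustration and are not the team's data.

```python
from collections import defaultdict
from datetime import date


def weekly_counts(records):
    """Count recoveries per ISO (year, week), split into manual and automated."""
    counts = defaultdict(lambda: {"manual": 0, "automated": 0})
    for day, kind in records:
        iso = day.isocalendar()
        counts[(iso[0], iso[1])][kind] += 1
    return dict(counts)


# Invented sample records: (date of recovery, how it was handled).
records = [
    (date(2024, 1, 2), "manual"),
    (date(2024, 1, 3), "automated"),
    (date(2024, 1, 4), "automated"),
    (date(2024, 1, 9), "automated"),
]

counts = weekly_counts(records)
for (year, week), c in sorted(counts.items()):
    total = c["manual"] + c["automated"]
    share = c["automated"] / total
    print(f"{year}-W{week:02d} manual={c['manual']} automated={c['automated']} share={share:.0%}")
```

Tracking the automated share week by week shows whether new or optimized rules are actually taking work away from operators.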