Introduction to NonStop Operations Management
Operations Management and Continuous Improvement
Introduction to NonStop Operations Management–125507
Problem Scenario
The complexity of NAC’s systems was growing rapidly. Managers in the MIS
department had to ensure that each of the 10,000 objects was installed and configured
correctly and ran efficiently. The business applications and the system generated more
than 15 events (status, warning, and problem messages) per minute. However, most
problems were reported by end users over the phone. Even the most experienced
operators had difficulty detecting, recognizing, and recovering from problems in this
complex environment.
In addition, because business services were now available almost continuously, the
operations group no longer had periods of downtime in which to perform maintenance
and installation tasks.
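To put the event rate in this scenario into perspective, the short sketch below works out the minimum message volume that operators faced. It uses only the figure given above (more than 15 events per minute). Extending that rate to a full day assumes it held around the clock, which the scenario does not state. The function name is illustrative.

```python
# Lower bound from the scenario: "more than 15 events ... per minute".
EVENTS_PER_MINUTE = 15


def events_in(minutes, rate_per_minute=EVENTS_PER_MINUTE):
    """Return the minimum number of events generated in the given minutes."""
    return rate_per_minute * minutes


per_hour = events_in(60)       # 15 * 60
per_day = events_in(24 * 60)   # 15 * 1440, assuming a constant rate all day

print(f"at least {per_hour} events per hour")
print(f"at least {per_day} events per day")
```

Even at this lower bound, the operators would have had to read hundreds of messages every hour.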
Implementing an Operations-Management Improvement Program
As the quality of end-user services decreased, the MIS managers recognized that it
would take a serious effort to cope with these new challenges. The MIS managers
decided to initiate an operations-management improvement program, assigning a team
of two senior support analysts to the project.
The following paragraphs describe the improvement team’s step-by-step implementation
of the improvement program.
Step 1—Assessing the Environment
The improvement team decided to assess their operations management processes by
measuring outages, observing the working environment, and analyzing the effectiveness
of their existing tools and processes. Based on their assessment, they concluded that
their operations management processes were at maturity level 1. The following
paragraphs summarize the improvement team’s assessments.
• Application outages were too frequent. The improvement team required help-desk
  operators to log each outage, the time of occurrence, end-user name, business
  services affected, and the time to repair (outage duration). After analyzing the logs,
  the improvement team determined that during peak hours of the day, the help desk
  received from 20 to 25 phone calls per hour. Each outage took between 5 and 20
  minutes to resolve.
• In most cases, operators did not detect problems. Generally, end users phoned in to
  report problems.
• Sometimes operators learned of a critical situation only when scores of messages
  started printing on hard-copy consoles.
• There were so many messages that the operators could not sift through them and
  take effective action. All application and system messages were directed to
  hard-copy consoles configured as the HOMETERM device.
• All problem recovery was performed manually. The hard-copy console arrangement
  provided inadequate support for problem detection and analysis. Because operators
  had trouble correlating the information on many pages of listings, they couldn’t see
  what was going on in the system and couldn’t control it.
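The outage log described in the first assessment item could be kept and summarized along the following lines. This is a minimal sketch, not the team's actual tooling: the record layout mirrors the fields listed above, but the class, function, and field names, as well as the sample entries, are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class OutageRecord:
    """One help-desk log entry; field names are illustrative."""
    reported_at: datetime          # time of occurrence
    end_user: str                  # end-user name
    services_affected: List[str]   # business services affected
    repair_minutes: float          # time to repair (outage duration)


def calls_per_hour(records):
    """Count logged reports in each clock hour."""
    return Counter(
        r.reported_at.replace(minute=0, second=0, microsecond=0)
        for r in records
    )


def repair_time_range(records):
    """Return the shortest and longest repair time, in minutes."""
    durations = [r.repair_minutes for r in records]
    return min(durations), max(durations)


# Hypothetical sample entries, for illustration only.
start = datetime(2000, 1, 3, 10, 0)
sample_log = [
    OutageRecord(start + timedelta(minutes=3), "user_a", ["service_a"], 5),
    OutageRecord(start + timedelta(minutes=20), "user_b", ["service_a", "service_b"], 20),
    OutageRecord(start + timedelta(minutes=75), "user_c", ["service_b"], 12),
]

for hour, count in sorted(calls_per_hour(sample_log).items()):
    print(hour.strftime("%H:%M"), count)
print(repair_time_range(sample_log))
```

Grouping by clock hour makes peak periods visible, and the repair-time range corresponds to the "between 5 and 20 minutes" kind of figure the team reported.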