Introduction to NonStop Operations Management

Problem Management
Introduction to NonStop Operations Management 125507
Check List
The following check list summarizes the main points of problem management:
1. Maintain a well-trained operations and support staff.
2. Establish problem prevention strategies. Your staff should:
   • Monitor the hardware and software
   • Monitor system and application message logs
   • Automate operations and recovery procedures as much as possible
   • Ensure that the system’s fault-tolerant features are fully used and maintained
   • Design your system to take advantage of quick startup and shutdown techniques
   • Ensure the availability of super-group (255, n) capabilities to solve certain problems
   • Be prepared and trained for environmental problems and disasters
   • Maintain up-to-date and well-tested recovery procedures
3. Establish problem detection procedures. Your staff should:
   • Monitor the hardware and software
   • Monitor system and application software message logs
   • Automate system-monitoring tasks and use monitoring check lists
   • Monitor TSM incident reports
   • Act on information received from users reporting problems
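As an illustration of automating a log-monitoring task, the following minimal Python sketch counts watch-list conditions in a batch of log lines. It is only a sketch under stated assumptions: the pattern names, the regular expressions, and the sample message texts are hypothetical and do not reflect actual NonStop or TSM message formats. A site would substitute patterns taken from its own message logs.

```python
import re
from collections import Counter

# Hypothetical watch list: category name -> pattern to look for.
# These patterns are illustrative only, not real system message formats.
WATCH_PATTERNS = {
    "cpu_down": re.compile(r"processor .* down", re.IGNORECASE),
    "disk_error": re.compile(r"disk .* error", re.IGNORECASE),
    "app_abend": re.compile(r"abend", re.IGNORECASE),
}


def scan_log(lines):
    """Count how many log lines match each watch-list category."""
    hits = Counter()
    for line in lines:
        for name, pattern in WATCH_PATTERNS.items():
            if pattern.search(line):
                hits[name] += 1
    return hits


# Invented sample lines standing in for a message log extract.
sample = [
    "10:01 Processor 3 is down",
    "10:02 Disk $DATA1 write error",
    "10:05 Application server started",
    "10:09 Process ABEND in payroll server",
]
report = scan_log(sample)
for category, count in sorted(report.items()):
    print(f"{category}: {count}")
```

A check of this kind supplements the monitoring check list; it does not replace operator review of the logs or of incident reports.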
4. Establish procedures for reporting problems:
   • Develop a standard problem report form.
   • Create and maintain a system outage log.
   • Designate people responsible for logging problems.
   • Consider establishing a help desk.
   • Train staff and users in problem reporting procedures.
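The reporting items above can be pictured as two simple records: a problem report and an outage log built from the reports that involved an outage. The Python sketch below is one possible shape, not a prescribed form. All field names (such as `logged_by` and `caused_outage`) and the sample data are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class ProblemReport:
    """Hypothetical standard problem report form."""
    report_id: int
    reported_by: str          # user or operator who noticed the problem
    logged_by: str            # person designated to log problems
    opened_at: datetime
    description: str
    caused_outage: bool = False
    resolved_at: Optional[datetime] = None


@dataclass
class OutageLog:
    """Hypothetical system outage log: keeps only reports that caused an outage."""
    entries: List[ProblemReport] = field(default_factory=list)

    def record(self, report: ProblemReport) -> None:
        if report.caused_outage:
            self.entries.append(report)

    def total_downtime_minutes(self) -> float:
        """Sum the downtime of all resolved outages, in minutes."""
        total = 0.0
        for r in self.entries:
            if r.resolved_at is not None:
                total += (r.resolved_at - r.opened_at).total_seconds() / 60
        return total


log = OutageLog()
log.record(ProblemReport(1, "user A", "operator B",
                         datetime(2024, 1, 5, 9, 0), "Terminal not responding"))
log.record(ProblemReport(2, "user C", "operator B",
                         datetime(2024, 1, 5, 9, 0), "Application unavailable",
                         caused_outage=True,
                         resolved_at=datetime(2024, 1, 5, 9, 45)))
print(len(log.entries), log.total_downtime_minutes())
```

Keeping the outage log as a filtered view of the problem reports means that each problem is logged once, by the designated person, and outage totals follow from the same data.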
5. Establish problem-solving techniques for identifying the cause of a problem and
developing a solution. Using a problem-solving worksheet can help operators
systematically list the facts about a problem, list possible causes, identify the cause,
and develop a solution.
6. Establish problem escalation procedures. Your staff should:
   • Know who should work on easy-to-fix problems and who should work on complex problems, and determine the percentage of problems that should be resolved by each level of support.
   • Know how long to work on a problem before escalating the problem to the next level of support.
   • Know whom to contact for help with system-related and application-related problems.
   • Update the problem report form whenever a problem is escalated.
   • Know which person on each shift is the Tandem contact. The Tandem contact should understand when and how to contact Tandem.
   • Know how to take processor memory dumps and obtain copies of system log files.
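The escalation rule "know how long to work on a problem before escalating" can be written down as a small table of support levels and time limits. The Python sketch below shows one way to do that. The level names and the time limits are hypothetical placeholders, not recommended values. Each site would set its own.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class SupportLevel:
    name: str
    time_limit: timedelta  # how long this level works before escalating


# Hypothetical escalation ladder, lowest level first.
LEVELS = [
    SupportLevel("operator", timedelta(minutes=30)),
    SupportLevel("system support", timedelta(hours=2)),
    SupportLevel("vendor contact", timedelta(hours=8)),
]


def level_for(elapsed: timedelta) -> SupportLevel:
    """Return the level that should own a problem open for `elapsed`.

    Each level's time limit is used up in turn; the last level keeps
    the problem once every earlier limit has been exceeded.
    """
    remaining = elapsed
    for level in LEVELS[:-1]:
        if remaining < level.time_limit:
            return level
        remaining -= level.time_limit
    return LEVELS[-1]


for minutes in (10, 45, 180):
    owner = level_for(timedelta(minutes=minutes))
    print(f"open {minutes} min -> {owner.name}")
```

Whenever the owning level changes, the problem report form should be updated to record the escalation.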