Introduction to NonStop Operations Management
Check Lists
Introduction to NonStop Operations Management–125507
Problem Management
1. Maintain a well-trained operations and support staff.
2. Establish problem prevention strategies. Your staff should:
• Monitor the hardware and software
• Monitor system and application message logs
• Automate operations and recovery procedures as much as possible
• Ensure that the system’s fault-tolerant features are fully used and maintained
• Design your system to take advantage of quick startup and shutdown techniques
• Ensure the availability of super-group (255,n) capabilities to solve certain problems
• Be prepared and trained for environmental problems and disasters
• Maintain up-to-date and well-tested recovery procedures
3. Establish problem detection procedures. Your staff should:
• Monitor the hardware and software
• Monitor system and application software message logs
• Automate system-monitoring tasks and use monitoring check lists
• Monitor TSM incident reports
• Act on information received from users reporting problems
4. Establish procedures for reporting problems:
• Develop a standard problem report form.
• Create and maintain a system outage log.
• Designate people responsible for logging problems.
• Consider establishing a help desk.
• Train staff and users in problem reporting procedures.
5. Establish problem-solving techniques for identifying the cause of a problem and
developing a solution. Using a problem-solving worksheet can help operators
systematically list the facts about a problem, list possible causes, identify the cause,
and develop a solution.
6. Establish problem escalation procedures. Your staff should:
• Know who should work on easy-to-fix problems and who should work on complex problems, and determine the percentage of problems that should be resolved by each level of support.
• Know how long to work on a problem before escalating the problem to the next level of support.
• Know whom to contact for help with system-related and application-related problems.
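
The reporting and escalation items in this check list can be sketched in a few lines of Python. This is only an illustration, not part of any NonStop product. The record fields, the three support levels, and the time limits are all hypothetical, and each site would substitute its own. The sketch shows a problem report record, a rule that escalates a problem after it has been worked on too long at one level, and a calculation of the percentage of problems resolved by each level of support.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical time limits per support level; a real site would set its own.
ESCALATION_LIMITS = {
    1: timedelta(minutes=30),  # operators: easy-to-fix problems
    2: timedelta(hours=2),     # specialists: complex problems
}
MAX_LEVEL = 3                  # highest level; nothing above it to escalate to


@dataclass
class ProblemReport:
    """One entry on a problem report form (fields are illustrative only)."""
    report_id: int
    reported_by: str
    opened_at: datetime
    description: str
    category: str                              # "system" or "application"
    level: int = 1                             # level now working the problem
    level_started_at: Optional[datetime] = None
    resolved_at_level: Optional[int] = None

    def __post_init__(self):
        # Work at the first level starts when the report is opened.
        if self.level_started_at is None:
            self.level_started_at = self.opened_at


def escalate_if_overdue(report, now):
    """Move the report up one level once the current level's limit has passed."""
    if report.resolved_at_level is not None or report.level >= MAX_LEVEL:
        return False
    if now - report.level_started_at >= ESCALATION_LIMITS[report.level]:
        report.level += 1
        report.level_started_at = now
        return True
    return False


def resolve(report):
    """Record which support level closed the problem."""
    report.resolved_at_level = report.level


def resolution_share(reports):
    """Percentage of resolved problems that were closed at each support level."""
    counts = Counter(
        r.resolved_at_level for r in reports if r.resolved_at_level is not None
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {level: 100.0 * n / total for level, n in sorted(counts.items())}
```

A scheduled job could call escalate_if_overdue on every open report, and the result of resolution_share could be compared with the percentages the site has chosen for each level of support.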