Introduction to NonStop Operations Management

Problem Management
Introduction to NonStop Operations Management 125507
Problem Prevention Strategies
You can prevent many problems by implementing the following strategies:
Monitor the hardware and software. To ensure that the system is operating properly
and to recognize when a potential problem might occur, continuously monitor the
status of all system and network resources. Resources
commonly monitored include processors, disks, paths, devices, processes, spooler
components, audit trails, audit dumps, NonStop TM/MP transactions, tape mount
requests, communication lines, and programs. Monitoring includes:
  - Monitoring resources as they change states (up or down). (Use the Object
    Monitoring Facility [OMF] or TSM.)
  - Monitoring end-user response time and throughput. (Use ViewSys or NSX.)
  - Monitoring critical resource utilization (threshold limits, percentage full of
    disk files and volumes, memory queues, message queues, disk queues, processor
    utilization, and control block usage). (Use ViewSys or NSX.)
  - Monitoring system and application software message logs by using DSM facilities,
    such as EMS and the TSM EMS Event Viewer. DSM also helps developers create
    applications that generate events and create log files.
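The critical-resource check described above can be sketched in a few lines of Python. This is an illustrative sketch only: the resource names, metric names, sample readings, and threshold values are all hypothetical, and the code does not call ViewSys, NSX, or any other NonStop interface. In practice the readings would come from those tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    resource: str   # hypothetical resource name, such as a disk volume or processor
    metric: str     # hypothetical metric name, such as "percent_full"
    value: float


# Hypothetical warning limits; real limits are site-specific.
THRESHOLDS = {
    "percent_full": 85.0,
    "cpu_busy_percent": 90.0,
    "queue_length": 50.0,
}


def find_violations(readings, thresholds=THRESHOLDS):
    """Return the readings whose value meets or exceeds the limit for their metric."""
    return [
        r for r in readings
        if r.metric in thresholds and r.value >= thresholds[r.metric]
    ]


# Made-up sample data for illustration.
sample = [
    Reading("$DATA1", "percent_full", 91.5),
    Reading("$DATA2", "percent_full", 40.0),
    Reading("CPU 0", "cpu_busy_percent", 72.0),
]

for r in find_violations(sample):
    print(f"WARNING: {r.resource} {r.metric}={r.value}")
```

Keeping the limits in one table, separate from the check itself, means an operator can adjust a limit without touching the logic.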
Automate operations and recovery procedures. Examples of tasks that are typically
automated for problem prevention include:
  - Object state monitoring.
  - Performance monitoring.
  - Critical resource monitoring.
  - Recovery tasks for routine (recurring) problems.
  - Routine (recurring) tasks. If you have to perform a task more than three times,
    automate the task.
  - Problem determination steps. For example, an event is generated when a line
    goes down. Problem analysis tasks, such as gathering information to help you
    determine the cause of the failure, can be automated.
For more information on automating operations and automation tools, refer to
Section 12, “Automating and Centralizing Operations.”
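As a rough illustration of automating problem determination steps, the following Python sketch routes a "line down" event to a routine that gathers information for the operator. The event layout, the resource name, and the list of checks are hypothetical placeholders; this is not an EMS or TSM interface, and a real implementation would obtain events and diagnostic data from those facilities.

```python
from datetime import datetime, timezone


def gather_line_diagnostics(event):
    """Collect the facts an operator would otherwise look up by hand.

    The checks listed here are placeholders, not real commands.
    """
    return {
        "resource": event["resource"],
        "reported_state": event["state"],
        "collected_at": datetime.now(timezone.utc).isoformat(),
        "checks": [
            "current line status",
            "recent events for this line",
            "status of the attached controller",
        ],
    }


# Map (resource type, new state) to the routine that handles it.
HANDLERS = {
    ("line", "down"): gather_line_diagnostics,
}


def handle_event(event):
    """Run the matching handler, or return None if the event needs no action."""
    handler = HANDLERS.get((event["type"], event["state"]))
    return handler(event) if handler else None


# Made-up event for illustration.
report = handle_event({"type": "line", "resource": "$LINE1", "state": "down"})
print(report)
```

New recurring problems can be covered by adding an entry to the handler table, which fits the guideline of automating any task performed more than three times.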
Make sure that your system is fault tolerant. Tandem systems provide continuous
availability and fault-tolerance features; however, it is up to you to make sure that
these unique features are fully used and maintained.
The Availability Guide for Problem Management provides information on auditing
your system for fault tolerance. Guidelines are included to help you determine the
fault tolerance of your software and hardware configurations.
Design your system and application to take advantage of quick startup and shutdown
techniques. The Availability Guide for Change Management provides operational
strategies for reducing startup and shutdown time. The Availability Guide for