Introduction to NonStop Operations Management

Problem Management

Introduction to NonStop Operations Management–125507

6-4

Problem Prevention Strategies

You can prevent many problems by implementing the following strategies:

• Monitor the hardware and software. To ensure that the system is operating properly and to recognize when a potential problem might occur, continuously monitor the status of all system and network resources. Resources commonly monitored include processors, disks, paths, devices, processes, spooler components, audit trails, audit dumps, NonStop TM/MP transactions, tape mount requests, communication lines, and programs. Monitoring includes:

    • Monitoring resources as they change states (up or down). (Use the Object Monitoring Facility [OMF] or TSM.)

    • Monitoring end-user response time and throughput. (Use ViewSys or NSX.)

    • Monitoring critical resource utilization (threshold limits, percent full for disk files and volumes, memory queues, message queues, disk queues, processor utilization, and control block usage). (Use ViewSys or NSX.)
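The two kinds of monitoring described above, state changes and threshold limits, can be illustrated with a minimal sketch. It is not the interface of OMF, TSM, ViewSys, or NSX; the `Reading` type, the function names, the resource names, and the limit values are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    """One utilization sample for a resource (hypothetical structure)."""
    resource: str   # illustrative name, for example a disk volume
    metric: str     # illustrative metric name, for example "percent_full"
    value: float


def state_changes(previous, current):
    """Return (resource, old_state, new_state) for each resource whose state differs."""
    changes = []
    for name in sorted(current):
        old = previous.get(name, "unknown")
        new = current[name]
        if old != new:
            changes.append((name, old, new))
    return changes


def threshold_breaches(readings, limits):
    """Return the readings whose value meets or exceeds the limit for their metric."""
    return [r for r in readings if r.metric in limits and r.value >= limits[r.metric]]


if __name__ == "__main__":
    # Two snapshots of resource states taken at different times (made-up data).
    before = {"CPU 0": "up", "CPU 1": "up", "LINE1": "up"}
    after = {"CPU 0": "up", "CPU 1": "down", "LINE1": "up"}
    for name, old, new in state_changes(before, after):
        print(f"{name}: {old} -> {new}")

    # Threshold limits chosen arbitrarily for the example.
    limits = {"percent_full": 90.0, "cpu_busy": 80.0}
    samples = [
        Reading("$DATA01", "percent_full", 93.5),
        Reading("CPU 0", "cpu_busy", 42.0),
    ]
    for r in threshold_breaches(samples, limits):
        print(f"{r.resource}: {r.metric} = {r.value} (limit {limits[r.metric]})")
```

Comparing snapshots reports only transitions, so an operator sees a resource going down rather than a full status listing on every pass.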

• Monitor system and application software message logs by using DSM facilities, such as EMS and the TSM EMS Event Viewer. DSM also helps developers create applications that generate events and create log files.

• Automate operations and recovery procedures. Examples of tasks that are typically automated for problem prevention include:

    • Object state monitoring.

    • Performance monitoring.

    • Critical resource monitoring.

    • Recovery tasks for routine (recurring) problems.

    • Routine (recurring) tasks. If you have to perform a task more than three times, automate the task.

    • Problem determination steps. For example, when a line goes down and an event is generated, problem analysis tasks, such as gathering information to help you determine the cause of the failure, can be automated.

  For more information on automating operations and automation tools, refer to Section 12, “Automating and Centralizing Operations.”
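As a rough illustration of automating problem determination steps, the sketch below runs a set of information-gathering steps whenever a "line down" event arrives. It is a generic pattern and not an EMS or TSM interface; the event dictionary layout, the `line_down` event name, and every function name are hypothetical.

```python
from collections import defaultdict

# Registry of diagnostic steps per event type. The event type names are
# hypothetical and do not correspond to real EMS event numbers.
_handlers = defaultdict(list)


def on_event(event_type):
    """Decorator that registers a diagnostic step for an event type."""
    def register(func):
        _handlers[event_type].append(func)
        return func
    return register


@on_event("line_down")
def record_line_status(event):
    # Capture the reported status of the failed line.
    return f"status of {event['resource']} at failure: {event.get('status', 'unknown')}"


@on_event("line_down")
def record_recent_errors(event):
    # Capture the error count carried by the event, if any.
    return f"errors before failure on {event['resource']}: {event.get('error_count', 0)}"


def run_diagnostics(event):
    """Run every step registered for the event's type and collect the findings."""
    return [step(event) for step in _handlers.get(event["type"], [])]


if __name__ == "__main__":
    sample = {"type": "line_down", "resource": "LINE1", "status": "down", "error_count": 3}
    for finding in run_diagnostics(sample):
        print(finding)
```

Keeping the steps in a registry means a new information-gathering task can be added for an event type without changing the code that receives events.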

• Make sure that your system is fault tolerant. Tandem systems provide continuous availability and fault-tolerance features; however, it is up to you to make sure that these unique features are fully used and maintained.

  The Availability Guide for Problem Management provides information on auditing your system for fault tolerance. Guidelines are included to help you determine the fault tolerance of your software and hardware configurations.

• Design your system and application to take advantage of quick startup and shutdown techniques. The Availability Guide for Change Management provides operational strategies for reducing startup and shutdown time. The Availability Guide for