Availability Guide for Problem Management

Abstract

This guide describes strategies you can adopt and tools you can use to maximize the availability of your systems and applications by anticipating, quickly recovering from, and preventing recurrence of unplanned outages.

Product Version

N.A.

Supported Releases

This manual supports G01.00 and all subsequent G-series releases until otherwise indicated in a new edition.

Part Number Published Release ID

125509 December 1996 G01.00

Summary of content (184 pages)

PAGE 1
Availability Guide for Problem Management Abstract This guide describes strategies you can adopt and tools you can use to maximize the availability of your systems and applications by anticipating, quickly recovering from, and preventing recurrence of unplanned outages. Product Version N.A. Supported Releases This manual supports G01.00 and all subsequent G-series releases until otherwise indicated in a new edition. Part Number Published Release ID 125509 December 1996 G01.
PAGE 2
Document History
Part Number  Product Version  Published
103395       N.A.             January 1995
114782       N.A.             December 1995
125509       N.A.             December 1996
New editions incorporate any updates issued since the previous edition. A plus sign (+) after a release ID indicates that this manual describes function added to the base release, either by an interim product modification (IPM) or by a new product version on a .99 site update tape (SUT). Ordering Information For manual ordering information: domestic U.S.
PAGE 3
New and Changed Information This guide has been revised for use with Himalaya S-Series servers and the G01.00 release of the NonStop Kernel.
PAGE 5
Contents New and Changed Information iii About This Manual xiii Notation Conventions xvii 1.
PAGE 6
3. Recovering From Unplanned Outages Contents 3.
PAGE 7
5. Monitoring Objects Contents 5.
PAGE 8
7. Auditing Systems for Fault Tolerance Contents
Using Tandem Tools for Automation 6-5
Automating Job Scheduling and Event Response With CA-Unicenter 6-5
Scheduling Routine Tasks With NetBatch 6-6
Performing Automatic Memory Dumps With TFDS 6-6
Automation Examples 6-6
TACL Recovery Macros 6-6
TFDS Automated Recovery 6-8
7.
PAGE 9
. Problem Management Tools Contents
Disaster Recovery Planning 8-4
Step 1—Taking Inventory 8-5
Step 2—Developing the Plan 8-6
Step 3—Testing the Plan and Training the Staff 8-8
Step 4—Revising the Plan 8-8
Backup Sites 8-9
9.
PAGE 10
Figures Contents Figures Figure 3-1. Figure 3-2. Figure 3-3. Figure 3-4. Figure 4-1. Figure 4-2. Figure 4-3. Figure 5-1. Figure 5-2. Figure 5-3. Figure 5-4. Figure 5-5. Figure 7-1. Figure 8-1. Figure 9-1. Figure 9-2. Figure 9-3. Figure 9-4. Figure 9-5. Figure 9-6. Figure 9-7. Figure 9-8. Figure 9-9. Figure 9-10. Figure 9-11. Figure 9-12.
PAGE 11
Tables Contents Tables Table 1-1. Table 1-2. Table 4-1. Table 4-2. Table 5-1. Table 5-2. Table 5-3. Table 6-1. Table 8-1. Table 9-1.
PAGE 13
About This Manual The Availability Guide for Problem Management explains how to maximize system and application availability by preventing problems from becoming unplanned outages.
PAGE 14
About This Manual What Is in This Manual? This manual is organized into nine sections, as follows:
• Section 1, “Introduction to Problem Management,” defines problem management and explains how it relates to the OM framework and online management.
• Section 2, “Preventing Unplanned Outages,” describes common causes of unplanned outages and explains how to predict, prevent, and prepare for them.
PAGE 15
Tandem Professional Audit Services About This Manual • Introduction to Tandem NonStop Systems This manual introduces you to the computing environment of NonStop systems. It describes the online transaction-processing (OLTP) requirements that the NonStop system was designed to meet. It shows how the three layers of the NonStop system (application environment, architecture, and networking) provide a unique and comprehensive solution to the challenges of OLTP.
PAGE 16
Tandem FAXAdvisor About This Manual For more information on Tandem Education courses and training programs, ask your Tandem representative for a copy of the Tandem Education Course Catalog. This catalog contains a complete list of courses, training programs, and training centers. It also contains diagrams showing training paths for a variety of Tandem users, including network managers, programmers, database administrators, systems and operations management, and technical specialists.
PAGE 17
Notation Conventions Change Bar Notation Change bars are used to indicate substantive differences between this edition of the manual and the preceding edition. Change bars are vertical rules placed in the right margin of changed portions of text, figures, tables, examples, and so on. Change bars highlight new or revised information. For example: The message types specified in the REPORT clause are different in the COBOL85 environment and the Common Run-Time Environment (CRE).
PAGE 19
1 Introduction to Problem Management Overview Maintaining operations 24 hours a day, 7 days a week, 365 days a year (24x7x365) was once the exclusive domain of critical applications like emergency services, national defense, and telecommunications systems.
PAGE 20
Introduction to Problem Management What Is an Outage? In general terms, an outage is a period of time during which a system cannot perform useful work. From an end user’s perspective, an outage is any period of time during which an application is not available. There are two types of outages: planned and unplanned. Planned Outages A planned outage is system or application downtime that is planned or scheduled.
PAGE 21
Measuring Outages Introduction to Problem Management Outage Minutes While the computer industry often measures availability as a percentage of total time, Tandem recommends measuring availability by outage minutes, assuming 24x7x365 operations. An outage-minutes-per-year measurement is easy to understand and provides more meaningful data than percentage numbers. Table 1-1 compares percentages with equivalent outage minutes and the resulting user impact. Table 1-1.
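The relationship between an availability percentage and outage minutes per year can be sketched as follows. This is a minimal illustration, assuming a 365-day year of 24x7 operation; the function names are hypothetical and the values are computed, not taken from Table 1-1.

```python
# Minutes in a year, assuming 24x7x365 operations (no leap day).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def outage_minutes_per_year(availability_pct: float) -> float:
    """Convert an availability percentage into outage minutes per year."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    return (100.0 - availability_pct) / 100.0 * MINUTES_PER_YEAR


def availability_from_outage_minutes(outage_minutes: float) -> float:
    """Convert outage minutes per year back into an availability percentage."""
    if not 0.0 <= outage_minutes <= MINUTES_PER_YEAR:
        raise ValueError("outage_minutes must fit within one year")
    return 100.0 * (1.0 - outage_minutes / MINUTES_PER_YEAR)


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99):
        minutes = outage_minutes_per_year(pct)
        print(f"{pct}% available = {minutes:,.1f} outage minutes per year")
```

Expressed this way, a difference that looks small as a percentage becomes a concrete amount of downtime that can be compared against service-level objectives.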
PAGE 22
Introduction to Problem Management Measuring Outages If a transient error in the workstation makes the application unavailable to 1 user for 5 minutes, it counts as 5 user minutes of downtime. If the problem on the server makes the application unavailable for 15 minutes to 100 users, it counts as 1500 user minutes of downtime. The correct way to measure an outage affecting a batch program varies from one application to another.
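The user-minutes arithmetic above can be written as a small calculation. The sketch below is illustrative only; the `Outage` class and its field names are assumptions, not part of any Tandem tool.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Outage:
    """One outage, described by how many users it affected and for how long."""
    description: str
    users_affected: int
    duration_minutes: float

    @property
    def user_minutes(self) -> float:
        # User minutes of downtime = affected users x outage duration.
        return self.users_affected * self.duration_minutes


def total_user_minutes(outages: Iterable[Outage]) -> float:
    """Sum the user minutes of downtime across several outages."""
    return sum(outage.user_minutes for outage in outages)


if __name__ == "__main__":
    workstation = Outage("transient workstation error", 1, 5)
    server = Outage("server problem", 100, 15)
    print(workstation.user_minutes, server.user_minutes)
    print(total_user_minutes([workstation, server]))
```

Weighting each outage by the number of affected users keeps a brief server problem from being counted the same as a brief single-workstation problem.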
PAGE 23
Introduction to Problem Management What Is Problem Management?
• The famous nine-hour breakdown of a long-distance telephone network in early 1990 dramatized the vulnerability of complex computer systems everywhere. The breakdown ultimately cost the company some $60 to $75 million in lost revenues, averaging $130,000 per minute.
• After a bomb exploded in the New York World Trade Center in 1993, one of the banks in the building estimated lost revenues of $20 million per day, or $2,500 per minute.
PAGE 24
Introduction to Problem Management Recovering Quickly From Problems That Do Occur Section 7, “Auditing Systems for Fault Tolerance,” describes the fault tolerant features of the Tandem architecture that allow Tandem systems to tolerate single points of failure in hardware and software. Recovering Quickly From Problems That Do Occur Despite the best planning and prevention, unplanned outages can still occur.
PAGE 25
Introduction to Problem Management Tandem’s Commitment to Problem Management Solutions Tandem’s Commitment to Problem Management Solutions In keeping with your increasing needs for highly available systems, Tandem has introduced a new program designed to provide you with world-class technical support.
PAGE 26
Introduction to Problem Management Reporting Problems To assist you as quickly as possible, your TNSC representative will request the following information:
• Your system number
• Your name and company
• The product involved
• The level of severity
Problem Severity It helps the TNSC if you categorize your problem in one of these severity levels:
• No impact—You have general questions or need information.
PAGE 27
2 Preventing Unplanned Outages Overview When unplanned outages occur, systems or applications may become unavailable to the end user. By preventing unplanned outages, you will move closer to the goal of 24-hour-a-day, 7-day-a-week, 365-day-a-year (24x7x365) operations. This section defines unplanned outages and describes types of unplanned outages that can affect the availability of your system.
PAGE 28
Preventing Unplanned Outages Common Causes of Unplanned Outages Tandem studies repeatedly identify four common causes of unplanned outages, listed here in order of frequency of occurrence:
• Operations management errors
• Nonfault-tolerant hardware configuration
• Nonfault-tolerant application design
• Environmental problems
Operations Management Errors This category is the single most common cause of unplanned outages.
PAGE 29
Preventing Unplanned Outages Preventing Problems From Becoming Outages Preventing Problems From Becoming Outages In most computer environments, the first goal of problem management is to reduce or eliminate problems that can escalate into unplanned outages. Tandem systems are designed to survive any single component failure, but not all double component failures.
PAGE 30
Preventing Unplanned Outages Goals and Strategies An application environment may consist of thousands of objects (processors, terminals, disk drives, communications lines, files, processes, and so on) that need to be present and in the correct state to be available to end users. You need to ensure that critical objects are automatically monitored to keep them available to users. You also need to understand the dependencies that may exist between these objects, for example, disk space and processor cycles.
PAGE 31
Preventing Unplanned Outages Goals and Strategies Operations management documentation should include descriptions of processes and routine tasks, and it should indicate who is responsible for these processes and tasks. • Documenting your problem-detection, escalation, and recovery procedures. Define procedures for monitoring system hardware and software, system and application message logs, and user requests.
PAGE 32
Preventing Unplanned Outages Requirements for Successful Problem Prevention Generally, organizations that do not have established problem-reporting, problem-escalation, and problem-recovery procedures have a higher rate of errors, increased recovery times, and lower levels of user or customer satisfaction. They might also experience an increased number of occurrences of the same problem.
PAGE 33
Preventing Unplanned Outages Well-Trained Staff
• Make sure that recovery procedures documentation is easily accessible to operations and support staff.
• Make sure that full copies of all manuals, including all relevant application user manuals, are available in the operations area, either online or in hard-copy format.
• Make sure that backup tapes are available and readable. Standardize external labeling, and ensure that files can be restored.
PAGE 34
Preventing Unplanned Outages System Configuration Documentation Maintain documentation that describes your system in its “normal” state. Include descriptions of all major system components, their configurations, and how they deliver services. Be able to identify what is, and what should be, running on your system.
• Maintain charts, diagrams, and lists that describe the physical and logical configuration of your system.
PAGE 35
Availability of Super-Group Capabilities Preventing Unplanned Outages Availability of Super-Group Capabilities Make sure that operators, system managers, and others who may need super-user or super-group capabilities have access to them. While a super-group user ID (255,n) is not needed under normal conditions, it may be required to solve certain problems. Having access to a super-group password is often the fastest—and sometimes the only way—to solve a problem.
PAGE 37
3 Recovering From Unplanned Outages Overview Even the best planning and prevention cannot avoid all unplanned outages. When unplanned outages do occur, a methodical approach can help you pinpoint the cause quickly. Using efficient problem-resolution techniques will save you time and money. This section describes how to get your system or application back online quickly after an unplanned outage by implementing efficient problem-resolution techniques.
PAGE 38
Recovering From Unplanned Outages Step 1—Detecting and Isolating the Problem Step 1—Detecting and Isolating the Problem To respond to problems quickly, operations personnel must be aware that a problem exists. Active system monitoring can help reduce the time needed to detect and resolve problems.
PAGE 39
Recovering From Unplanned Outages Monitoring Messages CA-Unicenter for Tandem—Event Management Function The CA-Unicenter Event Management facilities provide for integrated message management across a heterogeneous network environment.
PAGE 40
Recovering From Unplanned Outages Monitoring Objects Monitoring Objects Monitoring important objects in your system environment can help you detect problems that can become unplanned outages.
PAGE 41
Monitoring Objects Recovering From Unplanned Outages Object Monitoring Facility (OMF) OMF allows you to supervise objects such as processors, disks, files, and processes within your Tandem environment. OMF monitors objects for peripherals or subsystems, according to the object configuration stored in and maintained by OMF. When an object does not respond to the configured settings, an informative statement, action attention, or critical event message is routed to an EMS collector.
PAGE 42
Monitoring Performance Recovering From Unplanned Outages Monitoring Performance Monitoring system performance allows you to detect resource consumption problems that could escalate into unplanned outages.
PAGE 43
Monitoring Performance Recovering From Unplanned Outages Network Statistics Extended (NSX) NSX collects and displays processor, process, and Expand process statistics from other NonStop systems operating in a Tandem network. NSX monitors and reports network statistics from multiple network nodes to one or more locations within that network. In addition, NSX collects and reports on the busiest processes in each processor in all monitored systems.
PAGE 44
Recovering From Unplanned Outages Step 2—Gathering Facts and Reporting the Problem Step 2—Gathering Facts and Reporting the Problem When a problem is detected, relevant facts need to be collected, and appropriate personnel must be notified. Consider establishing procedures for reporting problems.
PAGE 45
Recovering From Unplanned Outages Gathering the Facts Facts About the Circumstances
• Who reported the problem and where can they be contacted?
• What was the user doing when the problem occurred?
• What events led up to the problem?
• What information was displayed when the problem occurred?
• What do event messages, error logs, and memory dumps reveal?
• What has changed recently that might have caused the problem?
• What is the current configuration of the hardware and software products affected (i
PAGE 46
Gathering the Facts Recovering From Unplanned Outages Figure 3-2.
PAGE 47
Recovering From Unplanned Outages Gathering the Facts Maintaining an Outage Log Use an outage log to record the event or problem that caused the outage, the time the outage occurred, what action was taken, and when the outage was “closed” or resolved. Outage logs provide a useful tool for tracking outages. For example, they provide an accurate assessment of system availability and can be used for trend analysis, to set and maintain service-level objectives, and to identify improvement areas.
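An outage log entry of the kind described above can be modeled as a simple record. The sketch below is one possible layout, not a Tandem format: the class and field names are hypothetical, chosen to mirror the items the text says to record (event, time opened, action taken, time closed), plus a page and line number for cross-referencing.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class OutageLogEntry:
    """One line of an outage log (hypothetical layout)."""
    page: int
    line: int
    time_open: datetime
    opened_by: str              # initials of the person opening the entry
    event: str                  # event or problem that caused the outage
    action_taken: str = ""
    time_closed: Optional[datetime] = None
    closed_by: str = ""

    @property
    def entry_id(self) -> str:
        # Page number plus line number gives each entry a unique reference.
        return f"{self.page}-{self.line}"

    def close(self, when: datetime, initials: str, action_taken: str) -> None:
        """Record the action taken and mark the outage as resolved."""
        if when < self.time_open:
            raise ValueError("an outage cannot close before it opens")
        self.time_closed = when
        self.closed_by = initials
        self.action_taken = action_taken

    @property
    def outage_minutes(self) -> Optional[float]:
        """Length of the outage in minutes, or None while it is still open."""
        if self.time_closed is None:
            return None
        return (self.time_closed - self.time_open).total_seconds() / 60.0
```

Keeping the open and close times as timestamps makes it straightforward to total outage minutes for availability reports and trend analysis.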
PAGE 48
Gathering the Facts Recovering From Unplanned Outages Figure 3-3. Sample Outage Log (blank form with the header fields Page, Node, Date, and Revised ... by, and 16 numbered rows with the columns Line #, Time Open, Initials, Event, Action Taken, Time Closed, and Initials) Adapt this log to your operations environment.
PAGE 49
Recovering From Unplanned Outages Gathering the Facts cross-references, you might want to identify log entries, perhaps by page number and line item number, thus creating a unique entry. Who Is Responsible for Logging Problems? Designate the people responsible for logging and tracking problems. For example, you could require that all problems found by operators be logged in an operator log, and all problems encountered by users be logged by help-desk operators.
PAGE 50
Recovering From Unplanned Outages Step 3—Identifying the Cause and Developing and Implementing a Solution Step 3—Identifying the Cause and Developing and Implementing a Solution Using the information obtained from the problem-reporting and outage logs, you can speculate about what caused the problem, and you can develop and implement a solution.
PAGE 51
Identifying the Cause Recovering From Unplanned Outages Figure 3-4. Sample Problem-Solving Worksheet PROBLEM-SOLVING WORKSHEET Problem Facts Possible Causes Terminal Hardware Terminal Comm. Config. Lines System Controller TACL Move What? 2 terminals down $WHS2.#TRM7 Yes Yes Yes Yes Yes Yes $WHS4.#TRM20 Yes Yes Yes Yes Yes No $WHS2.#TRM7 on east wall Yes Yes No No Yes No $WHS4.#TRM20 on west wall Yes Yes No No Yes Yes One on Tuesday at 8:00 a.m.
PAGE 52
Recovering From Unplanned Outages Tools for Problem Analysis Tandem provides a variety of tools to help you analyze any problems that may be occurring in your environment. These tools include:
• Event Management Service (EMS) Analyzer
• Tandem Failure Data System (TFDS)
Event Management Service (EMS) Analyzer The EMS Analyzer product selects events from EMS log files. You specify parameters, such as subsystem ID, event number, text, start time, and stop time.
PAGE 53
Recovering From Unplanned Outages Tools for Problem Analysis Tandem Failure Data System (TFDS) TFDS isolates software problems and provides automatic processor failure data collection, diagnosis, and recovery services.
PAGE 54
Recovering From Unplanned Outages Developing and Implementing a Solution
• PEEK/CPUx/ (for all processors) provides statistical information about memory, system tables, and other resources, including configured parameters and high-water marks of system resources since the counts were last reset.
• PATHWAY STATS, depending on the Pathway applications being run on the system, can be issued for TCP, TERM, and SERVER elements in the application.
PAGE 55
Recovering From Unplanned Outages Step 4—Escalating the Problem Step 4—Escalating the Problem Some problems are simple and can be resolved by the person who reports the problem. Other problems must be forwarded or escalated to more knowledgeable personnel for resolution. At each step in the problem-solving process, you must decide whether you should proceed or get help.
PAGE 56
Recovering From Unplanned Outages Deciding Whether to Escalate the Problem
• Develop a list of people who can help the operations staff resolve problems. You should list contacts for each application running on the system, and for system software and hardware. You should also include the names and current contact numbers of your Tandem representatives.
• Update the problem-report log each time a problem is escalated to another level of support.
PAGE 57
Recovering From Unplanned Outages Deciding Whether to Escalate the Problem Information You Should Provide When contacting your Tandem representative, be prepared to provide as much relevant information as possible, including:
• Descriptions of the problem and accompanying symptoms
• Details of error or operator messages generated
• Supporting documentation such as EMS logs, trace files, and a processor dump, if applicable
• System number and the numbers and versions of all related products
What Tandem
PAGE 58
Recovering From Unplanned Outages Step 5—Reviewing the Problem Step 5—Reviewing the Problem When a problem is resolved, the solution can be recorded and the problem report can be closed. Reviewing problems and solutions with a focus on prevention can help the operations staff prevent the same problems from recurring.
PAGE 59
Recovering From Unplanned Outages Performing Root-Cause Analysis These reports help you measure the performance of your staff, determine whether service-level agreements are being fulfilled, and determine what training or changes are needed to improve your problem reporting, tracking, escalation, and recovery procedures. Performing Root-Cause Analysis Root-cause analysis can be an important tool in your problem-review activities.
PAGE 60
Recovering From Unplanned Outages Tools for Root-Cause Analysis Some of the Tandem tools you might employ in your root-cause analysis include:
• Event Management Service (EMS) logs
• Flow Map
• ViewSys
PAGE 61
4 Monitoring Event Messages Overview Subsystems and applications generate messages to report changes in their state. Monitoring these messages is critical to getting the most out of your online environment: event messages advise you about the health and status of your system. You need to monitor event messages in a way that prevents important or critical messages from being overlooked in a flow of predominantly noncritical, informational messages.
PAGE 62
Monitoring Event Messages What Are System and Application Event Messages? What Are System and Application Event Messages? Event messages are a special subset of Subsystem Programmatic Interface (SPI) messages. Like all SPI messages, the information in event messages is contained in tokens but, unlike many SPI messages, event messages take advantage of formatting templates that convert the tokenized information into local-language text.
PAGE 63
Monitoring Event Messages Managing System Event Messages With EMS The Event Management Service (EMS) allows you to manage system event messages and the information they provide from the generation of a message in the subsystem environment to the generation of text for display in the operations environment. EMS provides the following event message management capabilities:
• Event message building.
PAGE 64
Getting Control of System Event Message Management Monitoring Event Messages
• Monitoring a running network or system. Your own management application can be used to recognize situations needing attention as they arise. Depending on the problem and the sophistication of the application, the problem can then be resolved by the operator or the application through the appropriate command-response interface.
• Managing operator tasks.
PAGE 65
Step 1—Analyzing System Event Messages Monitoring Event Messages For each subsystem, you need to analyze the event messages, select the important event messages, and estimate their severity. Important messages are any that report an occurrence that might affect the availability of the system or network. Severity levels might be defined as follows:
Severity Level  Meaning
Warning         A potential problem has been detected.
PAGE 66
Step 1—Analyzing System Event Messages Monitoring Event Messages Using EMS Analyzer You can use EMS Analyzer to examine and analyze events to determine the status of the devices, subsystems, and applications on your system. EMS Analyzer can read all EMS events generated on a NonStop system and produce an ENSCRIBE or comma-separated value (CSV) database of events. Using EMS Analyzer, you can generate reports to create a profile of the system and evaluate the number and type of messages being generated.
PAGE 67
Step 2—Filtering System Event Messages Monitoring Event Messages Step 2—Filtering System Event Messages EMS allows you to filter event messages to reduce the number of messages and highlight messages that require operator attention or intervention. What Are EMS Filters and How Are They Used? The event log file is read by the EMS distributor processes configured onto or started on the system.
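The idea of filtering, passing only the events that match selection criteria on to an operator, can be illustrated in general terms. The sketch below is plain Python and is not the EMS filter language; the `Event` fields, the subsystem names, and the function names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional, Set


@dataclass(frozen=True)
class Event:
    """A simplified event message (hypothetical fields)."""
    subsystem_id: str
    event_number: int
    text: str
    action_needed: bool = False


EventFilter = Callable[[Event], bool]


def make_filter(subsystems: Optional[Set[str]] = None,
                action_only: bool = False) -> EventFilter:
    """Build a predicate that passes only events matching the criteria."""
    def passes(event: Event) -> bool:
        if subsystems is not None and event.subsystem_id not in subsystems:
            return False
        if action_only and not event.action_needed:
            return False
        return True
    return passes


def distribute(events: Iterable[Event], event_filter: EventFilter) -> Iterator[Event]:
    """Yield only the events that the filter passes, in their original order."""
    return (event for event in events if event_filter(event))


if __name__ == "__main__":
    log = [
        Event("DISK", 101, "volume nearly full", action_needed=True),
        Event("DISK", 102, "statistics reset"),
        Event("PATHWAY", 7, "server stopped", action_needed=True),
    ]
    for event in distribute(log, make_filter(action_only=True)):
        print(event.subsystem_id, event.event_number, event.text)
```

Separating the selection criteria from the code that reads the log means the same log can feed several consumers, each with its own filter.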
PAGE 68
Step 4—Automating Operations and Recovery Procedures Monitoring Event Messages Table 4-1. Operations Runbook Daily Tasks General Tasks Specific Tasks Check for messages from system users. Check telephone, fax, electronic mail, and any other messages. Check operator messages. Use a printing distributor or other application. Check system status, including terminals, processors, communication lines, key applications, and system processes.
PAGE 69
Monitoring Event Messages Managing Application Event Messages Managing Application Event Messages When your company offers a new business function, you must identify and meet the service objectives of that function. If the application is down frequently or requires a significant amount of operations support, you fail to meet your service-level objectives, and the cost of providing the business function or service escalates.
PAGE 70
Monitoring Event Messages What Are the Goals of Application Event Message Management? What Are the Goals of Application Event Message Management? The most important goal of application event message management is to solve the “information overload” problem. Large applications may generate so many event messages that operators cannot concentrate on reading and responding to the critical messages that require intervention.
PAGE 71
Step 1—Instrumenting Applications to Generate EMS Event Messages Monitoring Event Messages Deciding Which Events to Report The first task is to determine what events your applications can detect and which of those they should report to EMS. Generate messages for critical or action events only. Once you have decided what events to report, you will need to decide whether or not any of those events should be considered critical or action events.
PAGE 72
Step 1—Instrumenting Applications to Generate EMS Event Messages Monitoring Event Messages In general, make things as easy as possible for the recipients of event messages. A particular event-message type is implemented only once, but it may be filtered, retrieved, and displayed many times.
PAGE 73
Step 1—Instrumenting Applications to Generate EMS Event Messages Monitoring Event Messages enhancements to these standard events. The names of these events indicate the conditions being reported, as follows:
• Object Available
• Object Other State Change
• Object Unavailable
• Operator Attention Completed
• Operator Attention Needed
• Transient Fault
• Usage Threshold
Table 4-2 lists the management functions and the type of events designed to support them. Table 4-2.
PAGE 74
Monitoring Event Messages Step 2—Analyzing Application Event Messages Using EMS FastStart to Develop and Test EMS Event Messages EMS FastStart is a TACL-based code generator that generates and compiles a number of source files that are used to simplify event generation and testing. EMS FastStart enhances development of applications by providing a simple, cost-effective way for programmers to develop and test EMS event messages.
PAGE 75
Step 3—Filtering Application Event Messages Monitoring Event Messages Figure 4-2.
PAGE 76
Step 4—Writing Operations and Recovery Procedures Monitoring Event Messages Creating EMS Event Filters To make a filter for a forwarding, printing, or consumer distributor, you can create an edit file containing the filter-language constructs that express your selection criteria. You then use the filter-language compiler (EMF) to generate an object file suitable for loading to the distributor.
PAGE 77
Creating Separate Event Monitoring Environments Monitoring Event Messages Figure 4-3.
PAGE 78
Event Monitoring Tools Monitoring Event Messages Event Monitoring Tools In addition to the basic event monitoring capabilities of EMS, Tandem provides the Tandem Service Management package (TSM) EMS event viewer and the CA-Unicenter for Tandem Event Management function for monitoring event messages. EMS Collectors, Distributors, and Filters EMS provides basic tools you can use to selectively monitor messages from specific sources.
PAGE 79
5 Monitoring Objects Overview Monitoring important objects in your system environment can help you predict, prevent, and detect problems that may result in unplanned outages. This section defines object monitoring, describes the types of objects you should monitor, and gives examples of tools available to help you monitor critical objects effectively.
PAGE 80
What Should You Monitor? Monitoring Objects What Should You Monitor? To effectively manage your system environment for higher availability, you need to develop an appropriate object monitoring strategy. Tandem products allow you to monitor objects and object states, as well as performance and critical resource utilization. Determining What to Monitor The first step in developing an effective object monitoring strategy is to determine what to monitor.
PAGE 81
Determining What to Monitor Monitoring Objects Figure 5-1. Pathway Object Diagram (diagram of the Pathway objects PATHMON, TCP, Server Class, Terminal, and Program, linked by the relationships Owns, Manages, and Executes) As shown in Figure 5-1, multiple constraints affect a number of related objects. You need to understand the constraints in your applications to determine which objects are critical to those applications and, therefore, require monitoring.
PAGE 82
Commonly Monitored Objects Monitoring Objects Object Behavior For each object identified, you should define its valid states, state transitions, possible conditions that make it change states (such as a user command or an internal error in the subsystem), and the corresponding actions. For each state change caused by an internal or external condition, the object may have a predefined set of actions. Often, the action is to generate an EMS event message that informs the system of the object’s state change.
PAGE 83
Object States Monitoring Monitoring Objects Object States Monitoring Object states monitoring is the process of monitoring the states and state changes of objects in your system environment. An object can have many valid states. The four main categories are: up, down, unknown, and odd. Up An object is up when it is started. In this state, the object is defined in the subsystem and fully meets all of its operational objectives. It can be used to provide services.
PAGE 84
Performance Monitoring Monitoring Objects When an object goes into an odd state, you need sufficient information to bring the object back into an up state. This is preventive recovery, because the object is still providing services; but if the situation is not corrected, a more important problem can occur.
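The four state categories and the detection of state changes can be sketched as follows. This is an illustration only, not how OMF or any Tandem product represents states; the object names and function names are hypothetical.

```python
from enum import Enum
from typing import Dict, List, Tuple


class ObjectState(Enum):
    """The four main state categories named in the text."""
    UP = "up"
    DOWN = "down"
    UNKNOWN = "unknown"
    ODD = "odd"


def needs_attention(state: ObjectState) -> bool:
    """Any state other than up warrants a look; odd calls for preventive recovery."""
    return state is not ObjectState.UP


def state_changes(previous: Dict[str, ObjectState],
                  current: Dict[str, ObjectState]
                  ) -> List[Tuple[str, ObjectState, ObjectState]]:
    """Compare two snapshots and list (object, old state, new state) changes.

    Objects missing from the previous snapshot are assumed to have been UNKNOWN.
    """
    changes = []
    for name, new_state in current.items():
        old_state = previous.get(name, ObjectState.UNKNOWN)
        if old_state is not new_state:
            changes.append((name, old_state, new_state))
    return changes


if __name__ == "__main__":
    before = {"$DATA1": ObjectState.UP, "PATHMON": ObjectState.UP}
    after = {"$DATA1": ObjectState.ODD, "PATHMON": ObjectState.UP}
    for name, old, new in state_changes(before, after):
        print(f"{name}: {old.value} -> {new.value}")
```

Comparing snapshots catches the transition into an odd state while the object is still providing services, which is the point at which preventive recovery is possible.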
PAGE 85
Critical Resource Utilization Monitoring Monitoring Objects Critical Resource Utilization Monitoring The usage level of an object or resource might indicate a gradual degradation in the availability of the object (for example, the utilization of the communication line is reaching its theoretical limit), or it could signal the impending loss of an object (for example, a critical file is 80 percent full).
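A utilization check of this kind reduces to comparing a usage percentage against thresholds. The sketch below assumes a warning level of 80 percent, matching the example in the text, and a critical level of 95 percent, which is an arbitrary assumption; appropriate thresholds depend on the resource.

```python
def utilization_percent(used: float, capacity: float) -> float:
    """Return how much of a resource is in use, as a percentage."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return 100.0 * used / capacity


def classify_utilization(percent: float,
                         warning_at: float = 80.0,
                         critical_at: float = 95.0) -> str:
    """Classify a utilization percentage as 'ok', 'warning', or 'critical'.

    The default thresholds are illustrative assumptions, not recommendations.
    """
    if percent >= critical_at:
        return "critical"
    if percent >= warning_at:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # A hypothetical file using 8,000 of its 10,000 available pages.
    percent = utilization_percent(8_000, 10_000)
    print(f"{percent:.1f}% used: {classify_utilization(percent)}")
```

Run on a schedule, a check like this gives operators time to act before a file or line actually reaches its limit.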
PAGE 86
Disk Space Analysis Program (DSAP)
You can receive DSAP output in several report formats, each of which lets you analyze the disk differently. The report types that DSAP produces are listed in Table 5-2.
Table 5-2. DSAP Report Types
Report Option | Report Name | Report Contents
ANALYSIS | Combines Summary, Free Space, File Extent, File Size, Subvolume, and User reports | Provides all of the reports.
BYSUBVOL | Subvolume Summary | Space allocation for each subvolume.
PAGE 87
Disk Space Analysis Program (DSAP)
Figure 5-2. Sample DSAP Summary Report
Disk Space Analysis Program -- T9074C30 - (31DEC90) -- 10/12/91
Tandem Computers Incorporated 1981, 1983, 1985-1990
Volume %SYSTEM is logical device 6
Device type is 3, subtype 10 ( 4130 -- 415MB )
203,014 pages (2048 bytes) on volume
415,772,672 bytes on volume
Summary of space use on %SYSTEM
39,947 free pages in 580 extents (19.6%).
161,564 allocated pages in 2,785 files in 7,488 extents (79.5%).
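The figures in the sample summary report can be cross-checked with simple arithmetic. The sketch below uses only numbers that appear in the report; the variable names are illustrative, and the remark about truncation is an inference from these numbers, not a documented DSAP behavior.

```python
PAGE_BYTES = 2048          # page size stated in the report
total_pages = 203_014      # pages on volume
free_pages = 39_947        # free pages
allocated_pages = 161_564  # allocated pages

# Volume size in bytes: pages times bytes per page.
total_bytes = total_pages * PAGE_BYTES

# Share of the volume that is free or allocated, as percentages.
free_pct = 100 * free_pages / total_pages
alloc_pct = 100 * allocated_pages / total_pages

print(total_bytes)
print(f"free {free_pct:.3f}%  allocated {alloc_pct:.3f}%")
```

The byte total matches the 415,772,672 bytes shown in the report. The computed percentages are slightly above the reported 19.6% and 79.5%, which is consistent with the report truncating rather than rounding to one decimal place.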
PAGE 88
File Utility Program (FUP)
FUP is a component of the Tandem NonStop Kernel that can help you manage disk files, nondisk devices (printers, terminals, tape drives), and processes on your Tandem system. You can use FUP to create, display, and duplicate files; load data into files; alter file characteristics; and purge files.
PAGE 89
File Utility Program (FUP) Monitoring Objects Figure 5-3.
PAGE 90
Measure Monitoring Objects The DETAIL listing format for NonStop SQL/MP views contains the same type of information as the other DETAIL listing format. Figure 5-4. DETAIL Format for NonStop SQL/MP Views filename date-and-time object-type CATALOG catalog-name BASE TABLE base-table-name PART ( [ \node.] $volume ) . .
PAGE 91
TACL Monitoring Macros
TACL macros can be developed to run as object monitoring tools. Executed on a scheduled basis, TACL macros can inform operators of potential problems such as a stopped process, a disk or file becoming full, or a transaction running too long.
Note. TACL macros do not run in NonStop mode, so they might not be as reliable as OMF.
PAGE 92
• Process statistics, such as
  • Busiest processes
  • Percent processor busy for each process
  • Messages sent and received
  • Receive queue length
  • Average memory pages used
NSX and the Tandem Object Monitoring Facility (OMF)
NSX and OMF have been enhanced to provide an integrated, network-wide view of both the performance and the operational status of the objects in a Tandem network.
PAGE 93
[Figure 5-5. Object Monitoring With OMF. Diagram areas: Management Applications, Monitoring Environment, Subsystem Environment.]
ViewSys
ViewSys is a system resource monitor that displays processor performance statistics and resource consumption for a set polling period.
PAGE 94
Summary of System Resource Monitoring Tools Monitoring Objects Compatibility With Measure ViewSys accesses the same system tables as Measure. Because ViewSys does not write to any of these tables, ViewSys and Measure can be run simultaneously. The impact to Measure is limited to the resources used by the program itself. Most of the impact to Measure is due to the interprocessor communications necessary to gather individual processor values.
PAGE 95
6 Automating Operations and Recovery Procedures Overview Personnel costs for operations continue to grow in contrast to ongoing improvements in the price/performance of computer systems. As operating systems and subsystems have become more complex, the number of operations errors has increased. This situation demands a transition from old technologies to a new generation of tools that automate network and system management tasks.
PAGE 96
Ensure That Messages Are Being Managed Efficiently
Managing system Event Management Service (EMS) event messages is an important part of your automation strategy because it allows operators to be notified quickly of error conditions, state changes, and threshold limits that have been exceeded. Critical events can be highlighted on the system console.
PAGE 97
Ensure That Recovery Procedures Are Fully Documented and Tested
Before attempting to automate your operations and recovery procedures, you need to ensure that they are fully documented and tested. Documenting and testing your procedures should be done as part of your system and message management strategy: identifying important messages, defining their severity, and documenting the recovery steps.
PAGE 98
Repetitive Tasks
• Message queues
• Processor utilization
• Control block usage
• Disk queues
• Spooler cleanup
The Tandem Object Monitoring Facility (OMF) can be used to monitor these objects. For example, when a critical process fails, OMF detects it and generates an EMS event. An automated operator receives the event and executes a customized PROCESS recovery rule, which sends the event-related information to a TACL server.
PAGE 99
Starting Batch Jobs
Use the NetBatch product to automatically schedule jobs, such as those that summarize or post results at the end of the day. NetBatch allows you to run jobs or job steps anywhere in an Expand network, which means you can automate and consolidate reporting for widely distributed applications.
PAGE 100
Scheduling Routine Tasks With NetBatch
NetBatch allows you to automate job scheduling, startup, and management tasks on your system. It increases throughput by enabling job distribution among the system's processors, and it frees operations staff for other work by reducing the need for user intervention. A NetBatch job is a process or a sequence of processes that performs specialized tasks.
PAGE 101
Automating Operations and Recovery Procedures TACL Recovery Macros Warm Starting a Drained Spooler (Manually) Bringing the spooler from the warm state to the active state is called warm starting the spooler. When the spooler is in a dormant state, the supervisor is not running. As soon as you create another supervisor process, the spooler enters the warm state. When you warm start the spooler, you use the same control files and other files that were in use when the spooler was previously drained.
PAGE 102
TFDS Automated Recovery
TFDS monitors processors and automatically initiates a processor dump if a failure occurs. The failed processor is reloaded automatically, and the processor dump is analyzed against the incident database to determine whether the failure is the result of a recurring or known defect. TFDS creates an incident database that tracks specific problem occurrences.
PAGE 103
TFDS Automated Recovery
Table 6-1. TFDS Configuration Parameters
Parameter | Function
AUTOMATIC-BACKUP | If an incident occurs, TFDS does not back up dump information automatically.
BACKUPTIMEOUT | The TFDS backup of the files is delayed 60 minutes.
CRUNCH-FILE | This parameter enforces the use of the CRUNCH process located under $SYSTEM.SYS03.CRUNCHR.
DB-LOCATION | The database files are generated by TFDS under $SYSTEM.TFDS.
PAGE 104
PAGE 105
7 Auditing Systems for Fault Tolerance Overview Auditing your system for fault tolerance is one of the most important ways to prevent unplanned outages in your system environment. A fault-tolerance audit identifies any potential problems that expose your online environment to unnecessary risk. Once these problems are identified and resolved, you will have moved your system closer to your goal of 24-hour-a-day, 7-day-a-week, 365-day-a-year (24x7x365) operations.
PAGE 106
Continuous Operations Auditing Systems for Fault Tolerance Figure 7-1.
PAGE 107
Auditing Systems for Fault Tolerance Continuous Operations Fault Tolerance in the Client/Server Environment Tandem provides fault tolerance in the client/server environment with its new NonStop Access for Networking (NSAN) product, a joint Tandem and Ungermann-Bass effort. NSAN is a networking solution that delivers fault tolerance from the server out to the desktop through the creation of primary and alternate paths between PCs, networking hubs, and Tandem NonStop and Integrity servers.
PAGE 108
Performing a Fault-Tolerance Audit
If a crisis is prepared for, it becomes much less of a crisis. Handling a wide variety of problems requires detailed study and preparation. One of the best ways to prepare for and prevent problems that can cause unplanned outages is to perform a detailed risk analysis.
PAGE 109
Configuring Your Hardware for Fault Tolerance
You can ensure that your hardware configuration is fault tolerant by performing the following tasks (some of which can be automated) in your system environment:
• Testing backup paths
• Performing powerfail testing
• Configuring your hardware adequately for stress periods
• Using mirrored disk drives
• Avoiding a system freeze
Testing Backup Paths
While the preferred path to
PAGE 110
Auditing Systems for Fault Tolerance Configuring Processors for Stress Periods loads the environment stored prior to power off, and processing can continue automatically. The system automatically resumes operations within a few minutes after power is restored. After bringing disks and tapes back to full operating speed, the system recovers any files protected by the NonStop Transaction Manager/MP (TM/MP) that might have been compromised, and resumes processing transactions against these files.
PAGE 111
Auditing Systems for Fault Tolerance Using Mirrored Disk Drives running at an average of 95 percent busy during peak periods. If a single processor were to fail during such a peak period, it is highly unlikely that the remaining processors would be able to perform well enough to take over the processing requirements of the downed processor. Note. If a single processor fails, you need to redistribute the load between the remaining processors.
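The headroom argument above can be checked with a rough calculation. The sketch below assumes that the work of a failed processor can be spread evenly over the survivors, which is a simplification; the processor count of four is hypothetical and is not taken from this guide.

```python
def load_after_failure(processors: int, avg_busy_pct: float, failed: int = 1) -> float:
    """Average percent busy per surviving processor after `failed` processors stop.

    Assumes the total work is unchanged and is redistributed evenly.
    """
    if not 0 <= failed < processors:
        raise ValueError("failed must be at least 0 and less than processors")
    return avg_busy_pct * processors / (processors - failed)


# Hypothetical four-processor system.
print(round(load_after_failure(4, 95.0), 1))  # far above 100: cannot absorb the load
print(round(load_after_failure(4, 60.0), 1))  # below 100: leaves headroom
```

A result above 100 means the surviving processors cannot carry the full workload, so work must be shed or delayed until the failed processor is restored.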
PAGE 112
Configuring Your Software for Fault Tolerance
Fault tolerance requires that all programs (the operating system as well as individual application programs) contribute to the reliability and recoverability of a process if a failure occurs. Therefore, your software should also be audited for fault tolerance.
PAGE 113
Auditing Systems for Fault Tolerance Testing Applications for Graceful Recovery NonStop TM/MP has an additional benefit: it not only simplifies application design but also extends fault tolerance to protect against multiple failures. For example, if both the primary and mirror disk volumes on which a database resides suffer simultaneous head crashes, NonStop TM/MP is able to recover the data.
PAGE 114
Using Persistent Processes
Processes that only supply services to other processes but otherwise maintain no data of their own need only to continue to execute. For such processes, it might be appropriate simply to ensure that the process gets restarted whenever it stops. A monitor process that periodically checks the process status can restart the process. Processes monitored in this way are sometimes called persistent processes.
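The monitor loop for a persistent process can be sketched as follows. This is a simulation under stated assumptions: FakeProcess stands in for a real process, the function and class names are hypothetical, and a real monitor would wait between checks and use the platform's own process-status and start facilities.

```python
def supervise(is_running, restart, polls):
    """Check the process `polls` times and restart it whenever it is found stopped.

    Returns the number of restarts performed.
    """
    restarts = 0
    for _ in range(polls):
        if not is_running():
            restart()
            restarts += 1
        # A real monitor would sleep here for the polling interval.
    return restarts


class FakeProcess:
    """Hypothetical stand-in for a monitored process."""

    def __init__(self):
        self.running = True

    def is_running(self):
        return self.running

    def restart(self):
        self.running = True


proc = FakeProcess()
proc.running = False  # simulate the process stopping
print(supervise(proc.is_running, proc.restart, polls=3))
```

Because such a process keeps no data of its own, restarting it is sufficient; no state has to be recovered first.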
PAGE 115
8 Planning for Disasters Overview Contingency planning can help you prevent, prepare for, and recover from a disaster. Disasters can occur any time and anywhere. In companies where day-to-day business activity is tied to a computer system, a sound recovery plan is imperative. Planning ahead can help you avert some disasters and respond to those disasters you cannot avert.
PAGE 116
This subsection provides tips on reviewing the following:
• The computer center location and facilities
• Security
• Preventive maintenance and system-monitoring procedures
• Network and system configurations
• Data recovery and integrity
• Data archiving procedures
Computer Center Location and Facilities
Review Section 3, “The Operations and Support Areas,” of the Introduction to NonStop Operations Management to ensure that your computer c
PAGE 117
Data Recovery and Integrity Planning for Disasters The Expand subsystem extends fault-tolerant operations to networks of geographically distributed computer systems. You can use Expand to connect Tandem NonStop systems at different locations to form a single network in which communications paths are constantly available, even in the event of a single line or component failure.
PAGE 118
Disaster Recovery Planning
• Ensure that rooms or facilities used for archiving have controls and sensors that detect and warn of extreme temperature, humidity, smoke, or other contamination.
• Determine whether data should be stored at a location separate from the computer facility and whether you need fireproof data vaults. If you do not have an off-site facility for data storage, you can arrange for off-site storage through a vendor.
PAGE 119
Figure 8-1. The Disaster Planning Process
1. Gain Support of Executive Staff
2. Form Planning Team
3. Take Inventory
4. Develop the Plan
5. Test the Plan and Train the Staff
6. Revise and Update the Plan as Needed
Step 1—Taking Inventory
As a first step toward preparing a recovery plan, the planning team usually determines what is at risk and prioritizes the risks. Taking inventory involves answering these questions: 1.
PAGE 120
Step 2—Developing the Plan Planning for Disasters 4. When is a situation a disaster; that is, when should the disaster plan be activated? For example, if there is a fire near a site, when should the disaster plan be activated—when the fire is next door, in the building, or in the computer room? 5. Who has the authority to declare a disaster? 6. Is insurance available? Should your company purchase insurance for loss of equipment or business? 7.
PAGE 121
• A list of all materials and services that must be available during a disaster, along with information on how to access the materials and services.
Note. Contracts and service agreements with third parties might be required for some of these materials and services.
Following are items that should be available:
  • Additional copies of the disaster recovery plan.
PAGE 122
Step 3—Testing the Plan and Training the Staff
• Backup site procedures. If your company has a backup site, the planning team should document the procedures for moving to the alternate site. For more information about backup sites, see “Backup Sites” later in this section.
• Procedures for using the Remote Duplicate Database Facility (RDF) during a disaster.
PAGE 123
Backup Sites
An important part of developing a recovery plan is determining whether or not your company needs a backup site. A backup site is a second site that is available for use when a disaster stops operations at your primary site. Depending on the type of backup site, you can restart operations at the backup location within 10 minutes to 30 days. Your company can maintain the backup site or pay another company to maintain the site.
PAGE 124
Backup Site Alternatives Planning for Disasters Cold Sites A cold site (sometimes called a cold shell) is an empty shell or building with power, air conditioning, data communications lines, and water at the site. When a disaster occurs, you move all necessary equipment, software, data, and personnel to the site. Plan on 20 or more days to make the cold site operational. Cold sites are practical when disasters of major proportions occur. For disasters that last less than 30 days, a cold site is not viable.
PAGE 125
Backup Site Alternatives Planning for Disasters Online-Ready Sites Online-ready sites (also referred to as processing-ready sites) are secondary computer sites that are ready to take over processing from a primary site within an hour without loss of data. Table 8-1. Backup Site Alternatives: Advantages and Disadvantages (page 1 of 2) Backup Site Advantages Disadvantages Cold Site Inexpensive way to acquire or lease a second computer site. No equipment or operating costs until a disaster occurs.
PAGE 126
Backup Site Alternatives Planning for Disasters Table 8-1. Backup Site Alternatives: Advantages and Disadvantages (page 2 of 2) Backup Site Advantages Disadvantages Mutual Backup Site May be least expensive way to establish a backup site. Requires less capital investment. Realistic recovery plan can be tested. During nondisaster periods, site may be shared by participants for development work.
PAGE 127
9 Problem Management Tools Overview Tandem provides a variety of tools you can use to detect, analyze, recover from, and track problems in your operations environment.
PAGE 128
What Is in This Section? Problem Management Tools Table 9-1.
PAGE 129
CA-Unicenter for NonStop Servers
CA-Unicenter for NonStop Servers provides a set of integrated systems management and problem solving functions. You can access these functions through either a graphical user interface or a command-line interface.
PAGE 130
Management and Problem-Solving Tools
Security Management
Use the CA-Unicenter Security Management function to implement policy-based access validation using a central security database. This relational database structure defines the relationships between users and system assets like files, programs, and user IDs.
Spool Management
Use the CA-Unicenter Spool Management function to define printers, manage spool job queues, and control the spooler itself.
PAGE 131
Event Management Service (EMS)
Tandem’s primary tool for event collection is the Event Management Service (EMS), which is a set of processes that collects event messages from Tandem subsystems (including NonStop operating system processes) and user-written subsystems. EMS then selectively distributes those event messages to various destinations, such as a local operator console or a management application running on a remote system.
PAGE 132
How Does EMS Collect, Filter, and Distribute Event Messages?
There are two types of EMS processes that manage the flow of event messages from the subsystem environment to the operations environment: event-message collectors and event-message distributors.
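The division of labor between collectors and distributors can be pictured with a toy model. This sketch is not the EMS programmatic interface: the class names, the dictionary event format, and the filter function are all hypothetical, and it shows only the general pattern of one process logging events while another reads the log, filters, and delivers.

```python
class Collector:
    """Accepts event messages from subsystems and appends them to a log."""

    def __init__(self):
        self.log = []

    def collect(self, event):
        self.log.append(event)


class Distributor:
    """Reads new events from a collector's log and delivers those that pass a filter."""

    def __init__(self, collector, event_filter, destination):
        self.collector = collector
        self.event_filter = event_filter
        self.destination = destination
        self._next = 0  # index of the first log entry not yet processed

    def distribute(self):
        new_events = self.collector.log[self._next:]
        self._next = len(self.collector.log)
        for event in new_events:
            if self.event_filter(event):
                self.destination.append(event)


collector = Collector()
console = []  # hypothetical destination, standing in for an operator console
distributor = Distributor(collector, lambda e: e["critical"], console)

collector.collect({"subsystem": "DISK", "text": "volume nearly full", "critical": True})
collector.collect({"subsystem": "SPOOLER", "text": "job printed", "critical": False})
distributor.distribute()
print(console)
```

Separating collection from distribution means that every event is logged regardless of which destinations are interested in it, and each distributor can apply its own selection criteria.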
PAGE 133
Event Message Collectors
[Figure 9-1. Flow of Event Messages. Labels shown: Tandem Subsystems, User-written Subsystems, Alternate Collectors, Primary Collector ($0), Log Files, Compatibility Distributor ($Z0), Forwarding Distributor, Consumer Distributor, Printing Distributor, Filter, To Remote Collector, Management Application, Printer, Console.]
Legend: Arrows indicate flow of event messages. Solid lines represent original event-message stream.
PAGE 134
Event Message Distributors
EMS provides four distributor processes that collect event messages from the event log file (or alternate log files). Three of these distributor processes format these messages into operator messages and then distribute these operator messages to various destinations.
PAGE 135
Where to Find More Information About EMS Problem Management Tools Filter Language and Compiler To make a filter for a forwarding, printing, or consumer distributor, you can create an edit file containing the filter-language constructs that express your selection criteria. You then use the filter-language compiler (EMF) to generate an object file suitable for loading to the distributor.
PAGE 136
Event Management Service (EMS) Analyzer
EMS messages provide important information about devices, subsystems, and applications. Tandem subsystems, such as Expand, NonStop Transaction Manager/MP (TM/MP), Pathway, and NonStop SQL/MP, as well as your own applications, can generate thousands of messages. EMS Analyzer allows you to examine and analyze this information, which is saved in the EMS event log files.
PAGE 137
[Figure 9-2. EMS Analyzer Architecture]
Where to Find More Information About EMS Analyzer
EMS Analyzer is described in the Event Management Service (EMS) Analyzer User’s Guide and Reference Manual.
PAGE 138
Flow Map
The Flow Map product is a performance-analysis tool with two components, Flow Map Host (FMH) and Flow Map PC (FMP). FMH filters, reduces, and formats performance data collected by the Tandem Performance Data Collector (TPDC) and Measure products. FMP is a Windows workstation graphical user interface. It creates a flow diagram showing how the processes, files, and connections of an application running on the Tandem system interact. Note.
PAGE 139
Where to Find More Information About Flow Map Problem Management Tools Figure 9-3.
PAGE 140
Measure
Measure is a data collection and measurement tool that provides a wide range of performance statistics on system resources. Using Measure, you can gather information from systems, network components, and your business applications. Then you can use this data to balance and tune your system, detect bottlenecks, balance workloads, and perform sizing evaluations for your new applications. You can also use Measure data for capacity planning.
PAGE 141
Where to Find More Information About Measure Problem Management Tools Figure 9-4.
PAGE 142
NetBatch and NetBatch Plus
NetBatch allows you to automate job scheduling, startup, and management of NonStop systems. NetBatch increases job throughput by allowing you to distribute jobs among the system’s processors. It also frees operations staff for other work by reducing the need for operator intervention in repetitive jobs. NetBatch Plus is a screen-driven interface for NetBatch.
PAGE 143
Automating Operations With NetBatch
NetBatch allows you to automate the following operations tasks:
• Distribute job workload according to processor availability and OLTP demands
• Schedule jobs to run automatically or at times specified by a run calendar, or delay execution of jobs for a specified period after submission
• Send information about certain job-related and scheduler-related events to Event Management Service (EMS) collectors
• Tr
PAGE 144
Network Statistics Extended (NSX)
Tandem Network Statistics Extended (NSX) is a NonStop operating system network performance monitor that collects and displays processor, process, and Expand process statistics from other NonStop systems operating in a Tandem network. NSX monitors and reports high-level, real-time network statistics from multiple network nodes to one or more locations within that network.
PAGE 145
Where to Find More Information About NSX
[Figure 9-6. NSX Architecture]
PAGE 146
NonStop Access for Networking (NSAN)
NSAN is a joint Tandem and Ungermann-Bass networking solution that delivers fault tolerance from the server out to the desktop by creating primary and alternate paths between PCs, networking hubs, and Tandem NonStop and Integrity servers. NSAN provides fault tolerance to client and server applications by using fully redundant local area networks (LANs) and LAN connections.
PAGE 147
How NSAN Works Problem Management Tools A status message stating that “I am alive” status packets have stopped arriving indicates that one of the controller pair has a problem. TLAM can determine whether the problem lies in the controller receiving or sending the status messages by noting whether the media packets are still being received. TLAM also monitors the state of the controller and the channel controls for signals of failure.
PAGE 148
NonStop Transaction Manager/MP (TM/MP)
In the NonStop TM/MP product, the TMF subsystem provides transaction protection, database consistency, and database recovery. TMF sustains high performance in high-volume online transaction-processing (OLTP) applications. To support OLTP applications, the TMF subsystem can monitor thousands of complex transactions sent by hundreds of users to a common database.
PAGE 149
How TM/MP Works Problem Management Tools Transaction Protection Because transactions usually consist of a series of operations, more than one transaction at a time can threaten database consistency and concurrency, making transaction management complex. The TMF subsystem protects the transaction as a single unit, making sure that either all or none of the changes in a transaction are applied to the database.
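The all-or-none property described above can be illustrated with a toy example. This sketch is not a description of how TMF is implemented: the function name, the dictionary used as a database, and the rule that a balance may not go negative are all hypothetical. It shows only that a failed change leaves the data exactly as it was.

```python
def apply_transaction(database: dict, changes) -> bool:
    """Apply every (key, delta) change or none of them.

    Changes are made to a working copy; the copy replaces the real data
    only if every change succeeds.
    """
    working = dict(database)
    for key, delta in changes:
        new_value = working.get(key, 0) + delta
        if new_value < 0:
            return False  # abandon the working copy; database is untouched
        working[key] = new_value
    database.clear()
    database.update(working)
    return True


accounts = {"A": 100, "B": 50}

# Transfer 30 from A to B: both changes succeed, so both are applied.
print(apply_transaction(accounts, [("A", -30), ("B", 30)]), accounts)

# Transfer 500 from A to B: the first change fails, so nothing is applied.
print(apply_transaction(accounts, [("A", -500), ("B", 500)]), accounts)
```

The second call returns False and leaves both balances unchanged, which is the behavior that protects database consistency when a transaction cannot complete.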
PAGE 150
Where to Find More Information About NonStop TM/MP
NonStop TM/MP is described in the following manuals:
• Introduction to NonStop Transaction Manager/MP (TM/MP)
• NonStop TM/MP Application Programmer’s Guide
• NonStop TM/MP Configuration and Planning Guide
• NonStop TM/MP Management Programming Manual
• NonStop TM/MP Operations and Recovery Guide
• NonStop TM/MP Reference Manual
• NonStop TM/MP Reference Summary
• NonStop TM/MP
PAGE 151
NonStop Virtual Hometerm Subsystem (VHS)
The NonStop Virtual Hometerm Subsystem (VHS) acts as a virtual home terminal for applications by emulating a 6530 terminal. VHS receives messages normally sent to the home terminal, such as displays, application prompts, COBOL run-time library errors, and Inspect or Debug prompts.
PAGE 152
[Figure 9-8. VHS Architecture. Diagram areas: Management Applications, VHS Subsystem, Application Environment.]
Where to Find More Information About VHS
VHS is described in the NonStop Virtual Hometerm Subsystem (VHS) Manual.
PAGE 153
Object Monitoring Facility (OMF)
The Object Monitoring Facility (OMF) allows you to supervise objects such as processors, disks, files, and processes within your Tandem environment. OMF monitors objects for peripherals or subsystems, according to the object configuration stored in and maintained by OMF.
PAGE 154
How OMF Works
OMF consists of a number of requesters that provide summary and detailed status information. The servers periodically check the status of configured objects and provide that status information to the requesters. OMF uses a configuration file (CONFIG) to store information about the monitored objects, and an edit file containing a list of devices for device monitoring. OMF categorizes objects as being in an up, odd, or down state. Note.
PAGE 155
Open Notification Service (ONS)
The ONS Subagent gathers Event Management Service (EMS) events from the Tandem system event log, $0, and translates these events into Simple Network Management Protocol (SNMP) traps that are sent to a network management platform through the Tandem NonStop agent.
PAGE 156
How ONS Works Problem Management Tools Figure 9-10.
PAGE 157
Where to Find More Information about ONS Problem Management Tools Flow Control ONS provides a flow-control mechanism to avoid sending unwanted subsystem data to the network management platform. At startup, an event filter for all subsystems is enabled. Filter tables in the ONSMIB (the MIB that defines the objects used to control and communicate with $ZONS) allow you to disable event filtering for a given subsystem, either for all events or for individual events.
PAGE 158
Subsystem Control Facility (SCF)
The Subsystem Control Facility (SCF) is used to configure, control, and collect information about Tandem subsystems. On G-series systems, you use SCF to configure, control, and display information about configured objects within SCF subsystems. Each SCF subsystem responds to and processes SCF commands that affect that subsystem.
PAGE 159
Where to Find More Information About SCF Problem Management Tools SCF can also be used programmatically. The programmatic interface to SCP is controlled by the Subsystem Programmatic Interface (SPI), which builds and retrieves information from command, response, and event-message buffers. Figure 9-11 illustrates the SCF architecture. Figure 9-11.
PAGE 160
Tandem Failure Data System (TFDS)
The Tandem Failure Data System (TFDS) is an operations management tool that isolates software problems and provides automatic processor failure data collection, diagnosis, and recovery services. TFDS automatically collects data from dumps of frozen or halted processors, online processor dumps, and saveabend snapshot dumps.
PAGE 161
[Figure 9-12. TFDS Architecture]
Where to Find More Information About TFDS
TFDS is described in the Tandem Failure Data System (TFDS) Manual.
PAGE 162
Tandem Performance Data Collector (TPDC)
The Tandem Performance Data Collector is a Tandem host-based performance data collection and relationship product. TPDC significantly reduces the expertise and manpower required to collect performance data.
PAGE 163
Device Distribution by Type
Device Type Description | Type | Count
OPERATOR CONSOLE | 1 | 2
DISK | 3 | 26
MAGNETIC TAPE | 4 | 1
TERMINAL | 6 | 18
SNAX SERVICE MGR | 13 | 1
TMF MONITOR | 21 | 1
IPB MONITOR | 27 | 1
ZNUP | 28 | 1
6100 CSM | 50 | 2
MULTILAN | 56 | 1
SNAX-SDLC | 58 | 2
NCP | 62 | 1
EXPAND-LH | 63 | 7
Total Devices | | 64
Where to Find More Information About TPDC
TPDC is described in the Tandem Performance Data Collector Manual.
PAGE 164
Tandem Service Management (TSM)
The Tandem Service Management (TSM) package is a client/server application that provides troubleshooting, maintenance, and service tools for Himalaya S-series servers. TSM consists of software components that run on the server and on a PC-compatible workstation. The TSM software on the workstation has a graphical user interface (GUI) with extensive online help.
PAGE 165
TSM Server Software
The TSM server software is the major component of TSM that resides on the Himalaya S-series server. When the NonStop Kernel operating system is running, the workstation communicates with the TSM server software on the Himalaya S-series server.
PAGE 166
TSM Notification Director
The TSM Notification Director receives notifications and incident reports from the Himalaya S-series server, displays them, and allows you to take action or forward the incident reports to your service provider for resolution. The TSM Notification Director runs on the TSM workstation at all times, even when the TSM application is not being used.
PAGE 167
Problem Incident Reports
When critical changes occur on the Himalaya S-series server (changes that might affect the availability of system resources), the TSM server software generates a problem incident report. You can configure the TSM Notification Director to forward (dial out) problem incident reports to a service provider. Problem incident reports refer to attachments.
PAGE 168
ViewSys Resource Monitor
ViewSys is a system resource monitor that displays processor performance statistics and resource consumption for a set polling period. Viewing the resource allocations across processors on a running system allows you to balance the application load more evenly. ViewSys can help you decide when to move user processes to less busy processors and when to relocate disk files or partitions to less busy disk volumes.
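The balancing decision described above can be sketched as a simple rule: compare the busiest and least busy processors and suggest a move only when the gap is large. This is an illustrative sketch, not ViewSys behavior; the function name, the input format (processor number mapped to percent busy), and the 20-point threshold are all assumptions.

```python
def suggest_move(cpu_busy, min_gap=20.0):
    """Suggest moving work from the busiest to the least busy processor.

    cpu_busy maps a processor number to its percent busy over one polling
    period (an assumed input format). Returns (source, target), or None
    when the load is already balanced to within min_gap percentage points.
    """
    if len(cpu_busy) < 2:
        return None  # nothing to balance with a single processor
    source = max(cpu_busy, key=cpu_busy.get)
    target = min(cpu_busy, key=cpu_busy.get)
    if cpu_busy[source] - cpu_busy[target] < min_gap:
        return None
    return source, target


if __name__ == "__main__":
    sample = {0: 85.0, 1: 40.0, 2: 55.0}  # made-up readings
    print(suggest_move(sample))
```

A threshold keeps the rule from recommending moves for small, transient differences between polling periods; the right value depends on the workload.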
PAGE 169
Glossary
This glossary includes a selection of terms used in this manual. Definitions of application and communications subsystem terms are brief and not very detailed; they are intended only to make this manual more meaningful than it would otherwise be to readers unfamiliar with Tandem’s application and communications subsystems.
application subsystem. A Tandem product that provides users with application services.
PAGE 170
command file. A file that contains a series of commands. When the file is executed, the commands within the file are automatically executed. Command files are supported by the Tandem Advanced Command Language (TACL) and many Tandem subsystems.
communications controller. A hardware component that manages communications lines or devices.
communications subsystem. A Tandem product that provides users with access to a set of communications services.
PAGE 171
downtime. Time during which the NonStop system is not capable of doing useful work because of a planned or unplanned outage. From the end user’s perspective, downtime is any time the application is not available. The cost of downtime can be dramatic in lost revenue, lost consumer confidence, and lost productivity.
DSM. See Distributed Systems Management.
DSAP. See Disk Space Analysis Program (DSAP).
DSMS. See Distributed Systems Management Solutions (DSMS).
EMS.
PAGE 172
graphical user interface (GUI). A type of screen interface that typically includes pull-down menus, icons, dialog boxes, and online help.
input/output (I/O) process. A system process that manages communications with I/O devices (such as disks and printers) and communications lines. An I/O process pair logically “owns” one or more I/O devices or communications lines. I/O processes are the system processes that make up a communications subsystem.
PAGE 173
NonStop Virtual Hometerm Subsystem (VHS). A subsystem that acts as a virtual home terminal for applications by emulating a 6530 terminal. NonStop VHS receives messages normally sent to the home terminal, such as displays and application prompts, and uses these messages to generate event messages for EMS, which can in turn be used to inform operations staff of problems.
NSX. See Network Statistics Extended (NSX).
object.
PAGE 174
operations outage class. An outage class that includes errors made by operations personnel as a result of accidents, inexperience, or malice.
operator message. The text displayed for a system operator that describes an event.
outage. Time during which the NonStop system is not capable of doing useful work because of a planned or unplanned outage. From the end user’s perspective, an outage is any time the application is not available.
outage class.
PAGE 175
physical outage class. An outage class that includes physical faults or failures in the hardware. Any type of hardware component failure belongs in this category.
planned outage. Time during which the system is not capable of doing useful work because of a planned interruption. A planned outage can be time when the system is brought down to allow for servicing, upgrades, backup, or general maintenance.
process. A unique execution of a program.
process pair.
PAGE 176
servers. The programs that receive messages from requesters, perform specified operations (for example, database inquiries, database updates, or numerical calculations), and return reply messages to requesters.
SNA. See Systems Network Architecture (SNA).
SNAX product family. The product family that consists of those Tandem software products that provide access to IBM Systems Network Architecture (SNA) networks.
SPI. See Subsystem Programmatic Interface (SPI).
PAGE 177
SYSGEN. The utility program used by Install to generate an operating system image for a given hardware and software configuration.
Systems Network Architecture (SNA). The prevalent IBM communications model.
Syshealth. A maintenance and diagnostic software package that polls devices and monitors system events on Tandem NonStop systems. When a problem occurs, Syshealth generates an alarm describing the problem and sends notification to a remote site.
TACL.
PAGE 178
screen programs and their I/O devices or processes and, with the help of PATHMON, establishes links between screen programs and Pathway server processes.
TFDS. See Tandem Failure Data System (TFDS).
TMDS. See Tandem Maintenance and Diagnostic System.
TPDC. See Tandem Performance Data Collector (TPDC).
transaction. An explicitly delimited operation or set of related operations that alters the content of a database.
unplanned outage.
PAGE 179
Index A B Alarm detection, threshold 5-6 Analysis, root-cause 3-23/3-24 Application design fault tolerant design 2-2 Application messages See Event messages Applications generating EMS event messages 4-10/4-14 testing for graceful recovery 7-9 Architecture for remote support 1-7 ASO See Automated systems operations Audit for fault tolerance, performing 7-4 Automated systems operations (ASO) 6-1 Automating batch jobs, using NetBatch 6-5, 6-6 memory dumps, using Tandem Failure Data System (TFDS) 6-6 operati
PAGE 180
E Index Data (continued) communications, preventing failures 7-10 Data-ready sites 8-9/8-10 Design outage class 2-1 Disasters See also Backup sites 8-9 backup sites 8-9 command posts 8-6 definition of 8-1 planning for 8-1/8-3 preventing 8-1/8-3 recovery planning 8-4/8-8 Disk drives, mirrored 7-7 Disk Space Analysis Program (DSAP), monitoring objects 5-7 Documenting recovery procedures 6-3 system set-up 2-8 Downtime, cost of 1-4 DSAP See Disk Space Analysis Program DSM/NOW ICC in CA-Unicenter 9-4 E Educat
PAGE 181
F Index F M Failure data communications, preventing 7-10 power 7-5 Fault tolerance application design 2-2 configuring hardware for 2-2, 7-5 software for 7-8 performing audit for 7-4 preventing unplanned outages 1-6 using NonStop Transaction Manager/MP (TM/MP) 7-8 Fault-tolerant operations definition of 7-1 how Tandem systems achieve 7-2 using NonStop Access for Networking (NSAN) 7-3 File Utility Program (FUP) 5-10 Flow Map 9-12 Freeze, system avoiding 7-7 FUP See File Utility Program (FUP) Macros, TACL
PAGE 182
N Index Monitoring performance (continued) using Tandem Performance Data Collector (TPDC) 3-7 using ViewSys 3-7 Mutual backup sites 8-9 N NetBatch automating batch jobs 6-5, 6-6 overview 9-16/9-17 Network Statistics Extended (NSX) monitoring objects 5-13 performance 3-7 overview 9-18/9-19 NonStop Access for Networking (NSAN) 8-3 NonStop TM/MP See NonStop Transaction Manager/MP (TM/MP) NonStop Transaction Manager/MP (TM/MP) for fault tolerance 7-8 online dumps 8-3 NonStop VHS See NonStop Virtual Hometerm
PAGE 183
P Index P Pairs, process, using 7-9 Paths, backup, testing 7-5 Performance monitoring using measurements 5-6 Performing audit for fault tolerance 7-4 powerfail testing 7-5 Persistence, hot, immediate 7-10 Persistent processes, using 7-10 Physical outage class 2-1 Planned outages 1-2 definition of 1-2 Power failure 7-5 Powerfail protection, testing 7-5/7-6 Preparing for problems 2-4/2-5 Preventing failures, data communications 7-10 problems 2-4 Problem analysis using EMS Analyzer 3-16 using Tandem Failure
PAGE 184
T Index Supplies and equipment in case of disaster 8-2 System and application messages, monitoring 3-2 System freeze, avoiding 7-7 System messages See Event messages Systematic problem solving 1-6, 3-1 System-setup, documenting 2-8 T TACL macros for monitoring objects 5-13 recovery macros (automation example) 6-6 Tandem Advanced Command Language See TACL Tandem Education Group 2-7 Tandem Failure Data System (TFDS) automating memory dumps 6-6 automation example 6-8 CPUDUMP command 3-17 for problem analysi