Introduction to NonStop Operations Management
Abstract
This manual introduces operations managers to NonStop operations management.
Document History

Part Number    Product Version    Published
103801         N.A.               December 1994
114333         N.A.               December 1995
125507         N.A.               December 1996

New editions incorporate any updates issued since the previous edition.
Ordering Information
For manual ordering information: domestic U.S. customers, call 1-800-243-6886; international customers, contact your local sales representative.
Document Disclaimer
Information contained in a manual is subject to change without notice.
New and Changed Information
The Introduction to NonStop Operations Management manual has been revised to:
• Delete references to all operations management products and features, manuals, and NonStop systems that are not supported in the G01.00 release. Products include: CMI, CSM, DSC/COUP, Envoy, InfoWay, Install, NonStop NET/MASTER, PUP, RCP, RDF, RMI, ROF, Surveyor, Syshealth, Tandem CD Read, TMDS, and ViewPoint.
• In Section 9, “Security Management,” all references to non-supported security management tools such as NonStop NET/MASTER, PUP, and RMI have been removed.
• In Section 10, “Contingency Planning,” all references to RDF have been removed.
• Section 11, “Application Management,” has been updated to document the TSM EMS Event Viewer’s role in application management.
• Section 12, “Automating and Centralizing Operations,” has been updated to document the TSM EMS Event Viewer.
Contents
New and Changed Information
About This Manual
Notation Conventions
1. Overview of NonStop Operations Management
2. The Operations Staff
3. The Operations and Support Areas
   System Installation
      Computer Room Environments
      Office Environments
   Preventive Maintenance
      Both Computer-Room and Office Environments
      Computer Room Environments
      Office Environments
   Support Areas
   Check List
4. Operations Documentation
6.
   Case Study
      Business Background and System Configuration
      Business and Operations Activities
      Problem Scenario
      Gathering Facts About the Problem
      Gathering Facts About the Situation
      Determining the Cause and Resolving the Problem
   Problem Management Tools
   Check List
7. Change and Configuration Management
9. Security Management
   Interoperability With Safeguard Security
   Special Security Concerns
      Program Development
      PROGID Programs
      Licensed Programs
   Check List
10. Contingency Planning
11. Application Management
   Client/Server Processing
   Case Study
      Business Background
      Analysis of Problem
      Implementation of Recommendations
   Check List
12. Automating and Centralizing Operations
   Overview
   Why Automate and Centralize Operations?
   Automating Operations Tasks
   Centralizing System Operations
   Automation and Centralization Tools
   Check List
13.
14.
   Transfer
   TSM EMS Event Viewer
   ViewSys
A. Additional Reading
About This Manual
Overview
The Introduction to NonStop Operations Management manual provides an overview of Tandem operations management concepts, tasks, products, and manuals for NonStop systems. This manual is a prerequisite for reading other Tandem operations manuals.
What’s in This Manual?
This manual is organized in 14 sections, two appendixes, and a glossary. The glossary defines technical terms and acronyms.
Section 1, “Overview of NonStop Operations Management”
This section defines operations management and explains how to apply the operations management model in a Tandem environment.
manage change. It also lists the products Tandem offers to help with change and configuration management tasks.
Section 8, “Performance Management”
This section defines performance management and provides guidelines for managing system and network performance to help you ensure that you get the best return from your NonStop systems and that the systems meet your business needs.
Appendix A, “Additional Reading”
This appendix provides a list of documents that provide additional information about the topics and products mentioned in this manual.
Appendix B, “Check Lists”
The check lists from each section in this manual are reproduced in this appendix so that you can easily use the check lists for note taking or photocopying.
Your Comments Invited
name, company name, address, and phone number in your message. If your comments are specific to a particular manual, also include the part number and title of the manual. Many of the improvements you see in Tandem manuals are a result of suggestions from our customers. Please take this opportunity to help us improve future manuals.
Notation Conventions
General Syntax Notation
The following list summarizes the notation conventions for syntax presentation in this manual.
UPPERCASE LETTERS. Uppercase letters indicate keywords and reserved words; enter these items exactly as shown. Items not enclosed in brackets are required. For example:
MAXATTACH
lowercase italic letters. Lowercase italic letters indicate variable items that you supply. Items not enclosed in brackets are required.
1 Overview of NonStop Operations Management
Overview
Your business benefits from effective operations management practices. With today’s rapidly changing marketplace and business pressures of global competition, educated consumers, and economic conditions, Tandem recognizes that operations organizations are often faced with ever-increasing demands. With thoughtful planning and management of system operations, you will be prepared to run your Tandem NonStop systems efficiently and effectively.
• Optimizing the features of Tandem NonStop systems and software. Through the optimal use of Tandem NonStop systems’ fault-tolerant, scalable, distributed processing, and many other features, you will be able to meet your operations management objectives.
Service-Level Agreements
Every operations organization should consider developing service-level agreements.
• Data security
• Reduced cost of operation
Determining Operations Management Objectives
By determining your operations management objectives, requirements, and standards, and aligning your operations goals with the goals of the company, you can determine:
• The type of staff coverage to provide
• The tasks the staff should perform
• The types of equipment you need
• The type of budget you need
• Your department’s p
Figure 1-1 shows the OM disciplines working together to ensure a stable and predictable OM environment.
Figure 1-1.
Tandem provides a number of tools to manage the production environment, including tools for:
• Monitoring systems, networks, and applications online
• Automating operator procedures
• Managing distributed systems from a central site
• Managing networks, databases, and applications
For guidelines and suggestions on managing the production environment, refer to Section 5, “Production Management.
Change Management
Change management includes the tasks required to manage the maintenance and growth of your NonStop system. Change management involves managing all hardware, software, and procedural changes and includes all of the tasks required to properly manage change within the operations environment.
Performance Management
Performance management includes the tasks required to manage the performance of your computer system.
Managing Operations From an End-User’s Perspective
Today’s globalization of consumers and the demand for increased customer service require that many businesses offer services around the clock. Offering services around the clock requires computer, network, and application services that are available all the time.
Table 1-1. LAN Availability and Down Time per 40-Hour Workweek (Traditional Measurement)

Percentage of Time LAN Is “Up”    Equivalent Number of Minutes LAN Is “Down”
90 percent                        240 minutes
95 percent                        120 minutes
99 percent                        24 minutes

Using an Outage-Minutes-per-Year Measurement
Tandem recommends using a total outage-minutes-per-year measurement to reveal outages.
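The figures in Table 1-1 follow from simple arithmetic: a 40-hour workweek contains 2,400 minutes, so each percentage point of unavailability corresponds to 24 minutes of down time. The following Python sketch reproduces the table. The function names and the 52-week projection to a yearly figure are illustrative assumptions; they are not part of any Tandem product, and a site should substitute its own measurement period.

```python
def downtime_minutes(availability_percent: float, period_hours: float = 40.0) -> float:
    """Return the minutes of down time implied by an availability percentage."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = period_hours * 60.0
    return total_minutes * (100.0 - availability_percent) / 100.0


def yearly_outage_minutes(weekly_minutes: float, weeks_per_year: int = 52) -> float:
    """Project a weekly down-time figure to a year (assumes 52 equal workweeks)."""
    return weekly_minutes * weeks_per_year


for percent in (90, 95, 99):
    weekly = downtime_minutes(percent)
    yearly = yearly_outage_minutes(weekly)
    print(f"{percent} percent up: {weekly:.0f} minutes down per week, "
          f"{yearly:.0f} minutes per year")
```

Expressed per year, even the 99 percent row represents more than 1,200 minutes of lost service under these assumptions, which is the kind of exposure a percentage figure tends to hide.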
Alternate Ways of Measuring Down Time
Depending on specific business needs, down time may be measured in ways other than user-outage minutes. For example, a site might be obligated to pay a penalty for each transaction that does not get processed while an application is down. Such a site might supplement its measure of down time by keeping records of the number of transactions it normally processes by minute and by day of the week.
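A site that keeps such records could estimate the number of transactions missed during an outage along the following lines. This is a minimal sketch only: the per-weekday rates are invented sample values, and the names TRANSACTIONS_PER_MINUTE and estimate_missed_transactions are hypothetical. A real profile would come from the site's own transaction records and might also vary by time of day.

```python
from datetime import datetime, timedelta

# Hypothetical profile: average transactions per minute for each weekday
# (0 = Monday ... 6 = Sunday). These numbers are invented for illustration.
TRANSACTIONS_PER_MINUTE = {0: 120, 1: 125, 2: 130, 3: 128, 4: 140, 5: 60, 6: 40}


def estimate_missed_transactions(start, end, profile=TRANSACTIONS_PER_MINUTE):
    """Estimate transactions not processed between start and end (end exclusive).

    The outage is walked one minute at a time so that an outage spanning
    midnight uses the correct rate for each day. A final partial minute is
    counted as a whole minute.
    """
    if end < start:
        raise ValueError("end must not be earlier than start")
    missed = 0
    minute = start
    while minute < end:
        missed += profile[minute.weekday()]
        minute += timedelta(minutes=1)
    return missed


# Example: a 30-minute outage on a Monday morning.
outage_start = datetime(2024, 1, 1, 10, 0)
outage_end = datetime(2024, 1, 1, 10, 30)
print(estimate_missed_transactions(outage_start, outage_end))
```

Multiplying the result by the penalty per transaction gives a cost figure that can be reported alongside outage minutes.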
Reducing or Eliminating Unplanned Outages
Unplanned outages occur when system or application down time is caused by a problem situation such as faulty hardware, operator error, or disaster. An example of such a problem is an application change that makes the application unusable by introducing unexpected problems.
fault-tolerant operation (one that does not stop because of a single point of failure), you need to make all aspects of the operation fault-tolerant.
Where to Go for More Information
This manual provides an overview of system operations. After reading this manual, you might want to find out more about specific concepts, products, or procedures.
Figure 1-2.
World Wide Web (WWW) Home Page
For customers with Internet access and a Web browser, Tandem maintains a home page on the World Wide Web. The uniform resource locator (URL) for Tandem’s home page is http://www.tandem.
Tandem Hardware and Software Support
Tandem provides hardware and software support.
International Tandem Users’ Group (ITUG)
ITUG is an independent organization of over 2,000 members that:
• Encourages communication and information exchange among Tandem users
• Serves as an exchange for design concepts and software
• Establishes a forum for special interest groups such as banking, manufacturing, and transportation
• Provides feedback to Tandem regarding equipment and programming needs
ITUG holds an inte
Account Quality Planning (AQP) Service
The Tandem AQP provides services for improving your current operations management processes, including:
• Performing a profile assessment and analysis of your operations environment
• Identifying problem areas and targeting improvements for areas that will produce the most benefits for your organization
• Analyzing the root cause of problems
• Developing and implementing an action plan
2 The Operations Staff
Overview
Before receiving your Tandem NonStop system, you should determine what type of operations organization you will need, what type of training you should arrange for current staff, and what type of staff you need to hire (if any). This section provides guidelines to help you make these decisions. If you currently have Tandem NonStop systems, you might use these guidelines to reorganize your current operations staff.
Table 2-1 provides a general description of each level of expertise. The entry-level, intermediate-level, and senior-level skills and tasks are described in more detail in the following subsections.
Who Provides Each Level of Expertise?
Which staff members provide each level of expertise depends on the size of your organization.
Table 2-1. Staff Levels of Expertise

Levels                Description
Entry-Level           Tasks: Most basic tasks in each functional area. Most operations employees start by learning how to perform these tasks.
Intermediate-Level    Tasks: More complex than the entry-level tasks. Staff who perform intermediate-level tasks need more in-depth knowledge and experience, and less supervision, than entry-level personnel.
Senior-Level          Tasks: Most complex tasks.
Staffing Levels Within the Production Function
The production function is divided into two activity areas: operations and support. The following paragraphs describe the staffing levels for these areas.
Staffing the Operations Area
The operations activity comprises a range of tasks and skills from entry level to senior level.
level operators who specialize in different areas of system operations, including network operations and teleprocessing. Other companies have several intermediate-level operators who are supervised by the most experienced intermediate-level operator (often called a lead operator).
Senior-Level Skills and Tasks
Staff performing senior-level tasks require an in-depth knowledge of Tandem system operations, system management, and products.
A person performing entry-level support tasks might be called an operations analyst or operations specialist.
Intermediate-Level Skills and Tasks
Staff performing intermediate-level tasks should have a good understanding of the Event Management Service (EMS), NonStop architecture, and Tandem products such as Expand, NonStop TS/MP, NonStop SQL/MP, and NonStop TM/MP.
Staffing Levels Within the Change Function
The change function is divided into two activity areas: planning and control. The following paragraphs describe the staffing levels for these areas.
Staffing the Planning Area
The planning activity comprises a range of tasks and skills from entry level to senior level. The tasks range from site planning and performance analysis to network design and application review.
Senior-Level Skills and Tasks
Staff performing senior-level tasks require an in-depth knowledge of Tandem system operations, system management, and products.
• Monitoring changes
A person performing entry-level control tasks might be called a systems analyst.
Intermediate-Level Skills and Tasks
Staff performing intermediate-level tasks should have a good knowledge of Tandem utilities and systems, and should know how to find information in manuals.
A Small Operations Group
The example in Figure 2-1 shows an operations group that consists of entry-level through senior-level staff. Operations activities are performed by two computer room operators and one lead operator. The support, planning, and control activities are performed by a part-time operations specialist. The operations manager performs both planning and control activities and provides line-management support.
A Distributed Operations Group
The example in Figure 2-2 shows an operations group that supports a network with three nodes at different locations. The operations group consists of entry-level through line-management staff. The entry-level and intermediate-level staff perform operations activities and are distributed to the three sites. Sites A, B, and C each have one lead operator and a varying number of computer room operators.
A Centralized Operations Group
The example in Figure 2-3 shows an operations group that supports a network with three nodes at different locations. The operations group consists of entry-level through line-management staff, all located at Site A. There might be different numbers of personnel for each shift, depending on the size and complexity of the systems and network. Site A serves as the control node for the other nodes in the network.
A Telecommunications Group
The example in Figure 2-4 shows a telecommunications group that is typical of organizations that support multiple vendors. The group consists of entry-level, intermediate-level, and line-management staff. The help-desk operators answer phone calls from all users. The teleprocessing operators perform intermediate-level tasks and maintain communications lines and modems. The manager and supervisors perform line-management tasks.
A Technical Support Group
The example in Figure 2-5 shows a centralized technical support group for Tandem systems within a large data processing environment. The technical support function can either be centralized into a single group or can become part of the various organizations in an existing data processing organization. Experience has shown that the centralized group approach is best.
Sample Job Descriptions
Once you determine how your operations staff should be organized, you can develop job descriptions for each staff member. By developing formal job descriptions, you can ensure that all levels of required support are provided. Following are sample job descriptions for each operations activity area.
Note. The descriptions do not represent requirements or recommendations.
• Monitor the physical environment in the computer room
• Monitor terminals, processors, communications equipment, applications, and console messages as instructed by other support levels.
Help-Desk Operator
Following is a sample job description of a help-desk operator who performs entry-level tasks. A former user who has a good telephone manner and remains calm under pressure is an ideal candidate for this position.
Job Title
Help-Desk Operator (Entry-Level Position)
Summary of Responsibilities
Help-desk operators answer phone calls and try to resolve user problems.
External Contacts
Help-desk operators interact with users, computer room operators, intermediate-level operators, and Tandem customer engineers (CEs).
Tools/Equipment
Telephones (preferably with headsets, conference call features, and recording machines), a problem-tracking and problem-escalation system, video terminals or workstations, documentation, and appropriate check lists should be available to assist help-desk operators in completing their tasks.
• Write and maintain operations check lists
• Take down and bring up devices
• Differentiate between hardware, software, and firmware problems
• Resolve terminal-related problems
• Manage disks
• Verify system integrity by switching devices to their primary and backup paths
• Manage file space
• Spare bad sectors on disks
• Handle processor failures
• Dump processors
• Reload processors
• Perform hardware diagnostics with the Tandem Service Management (T
Performance is judged by how quickly operators detect and solve system problems and by the number of problems escalated.
External Contacts
Lead operators interact with users, entry-level operators, senior-level operators, Tandem customer engineers (CEs), and Tandem analysts.
Tools/Equipment
Manuals, operator instructions and documentation, video terminals and workstations, and pagers should be available to help lead operators complete their tasks.
Detailed Duties and Responsibilities
A detailed summary of duties and responsibilities includes:
• Develop operational routines using DSM facilities, such as TACL, the Distributed Name Service (DNS), and the Subsystem Programmatic Interface (SPI); and special customized programs that:
   • Process applications regularly or on an as-needed basis (for example, ad hoc reports)
   • Monitor the system at regular times (for example, status checks or
Tools/Equipment
A full set of Tandem manuals, video terminals or workstations, system management utilities, and pagers should be available to help technical support specialists complete their tasks.
The Planning Area
Following is a sample job description for staffing the planning area.
Senior Systems Planner
Following is a sample job description of a senior systems planner who performs senior-level tasks in the planning area.
• Plan for implementation of upgrades and new releases
• Advise development on application design
• Assess, with the network or teleprocessing specialist, the impact of changing the communications environment
Standards/Objectives
Senior systems planners are evaluated on how well users’ level-of-service expectations are met and how well the system is operating.
• Create and maintain appropriate hardware, software, and application configurations
• Install new and changed operating system images and software releases
• Develop and maintain programming and operational standards and procedures for the operations group
• Assist in providing quality assurance testing for new applications and new system software
• Participate in the evaluation of operations management software
• Advise development on application design
Stan
Detailed Duties and Responsibilities
A detailed summary of duties and responsibilities includes:
• Set up the operations organization
• Manage the daily operations and staff.
Training
Once you determine how to allocate the support tasks among your department, you can evaluate your staff’s training needs. Training is available from many sources: Tandem Software Education, Tandem manuals, experienced people within your company, and other vendors’ manuals and classes.
Tandem Education
If you are receiving a Tandem system for the first time, your entire staff will need some training.
can also order Tandem manuals in book form. For a complete list of Tandem manuals, refer to the About This Collection document in the G01.00 TIM collection.
In-House Training
A very useful type of training is on-the-job training. On-the-job training is most effective when it is well planned and is most valuable for entry-level personnel.
3 The Operations and Support Areas
Overview
Before receiving your Tandem system, you need to prepare the operations environment. The operations environment includes both the operations and support areas. The operations area is where you locate the computer systems and peripherals (such as printers). The support area is where the operations staff are located. In some companies, the operations and support areas are the same; in other companies, the areas are separate.
• Reserve storage space for data processing supplies, manuals, equipment, and archived material.
• Depending on your needs, you might want to plan a tape library. A tape library is a separate area or room that contains backup tapes, site update tapes, software release tapes, NonStop TM/MP online dumps and audit dumps, and any tapes required to run applications. Tape libraries help you store, organize, and protect information.
accessed easily and provide opportunities for entry or damage. Basement locations are at greater risk of damage caused by faulty plumbing and flooding.
• Plan the computer room layout to increase the efficiency of work flow and personnel traffic.
• To allow for growth, consider selecting a computer room that provides enough space for the initial installation as well as for future expansion.
• If sensitive information is displayed or stored in the computer room, cover computer room windows. Uncovered windows might allow unauthorized personnel to obtain information.
• Develop procedures for protecting the systems when an environmental system (such as air conditioning) malfunctions. The procedures should include the following information:
   • Whom to contact when a malfunction occurs
   • How long the computer systems can run and when the staff should shut down the systems
   • How to start available backup systems
Physical Security
Your company security policy will determine the physical security precautions you take.
• Determine whether a voice alert system is needed. Voice alert systems send a message over a loud-speaker when major problems occur or when certain people are needed. Your Tandem representative can tell you what types of alert systems are available.
• Provide at least one telephone near the system cabinets and terminals to be used for operations.
Your responsibilities during the installation process include:
• Providing a suitable computer room, environment, and facilities in accordance with current published guidelines, including:
   • Providing electric power to system cabinets and peripherals (CEs provide the specifications)
   • Furnishing and testing AC power requirement as needed
   • Connecting and testing all external communications equipment
   • Furnishing all labor required for unpacking an
system for local or remote access. TSM diagnoses system problems as they occur and often detects failures before they affect the system’s performance.
• Make sure that air vents are not blocked.
• Make sure that all fire-detection and fire-extinguishing equipment works properly.
• Keep computer areas clean. Accumulated debris can cause accidents and fires.
• As appropriate, clean tape drives and printers regularly.
Computer Room Environments
In computer room environments, incorporate the following maintenance tasks into the staff’s regular routine:
• Replace air-conditioning filters at regular intervals to prevent hardware from overheating or failing.
• Monitor the computer-room temperature and humidity constantly.
• Know how to turn off power should the temperature or humidity rise beyond the point considered safe for your systems.
• Provide at least two terminals reserved for system monitoring and problem resolution
• Provide at least one terminal in which the command interpreters run at a high priority (for example, 199)
• Connect the terminals to different controllers to reduce the risk of losing system access if a controller fails
If your group has a help-desk function, you might also have to provide help-desk equipment, such as:
• Problem report forms
• A list of contact
5. Once the site has been prepared and all necessary requirements met, install the systems. Most new computer room systems are installed by Tandem customer engineers (CEs).
6. Plan for preventive maintenance:
   • Develop procedures and schedules
   • Arrange for CE support if needed
   • Keep all equipment and work areas clean
7.
4 Operations Documentation
Overview
Operations documentation can help your operations organization perform efficiently and effectively. This section lists and describes the types of documentation often used in an operations environment. A documentation check list is provided at the end of this section to help you select the appropriate operations documentation for your environment.
• Job duties and performance standards for each staff member, each operations group, and the complete operations organization.
• The company’s disaster recovery plan and the procedures the operations staff should follow to implement the plan.
• Naming conventions for systems, volumes, subvolumes, files, devices, event filters, and programs. Standardized names make it easier for you to find files, monitor programs, and solve problems.
• The acceptable level of service when the system is stressed, such as during peak workloads, partial equipment failures, and medium-term increases in workload.
Creating Service-Level Agreements
When creating service-level agreements, following a few simple steps can help ensure that the operations services provided match the needs and expectations of your users.
As an alternative, you can specify a check list. This method is especially effective for measuring services. For example:
• Monthly preventive-maintenance procedures for G-series systems will include the following. . .
• Proposals for system configuration changes will follow the outline provided and include review signoff from the following groups. . .
Configuration Diagrams and Listings
Configuration diagrams and listings help the operations staff monitor systems, recognize problems, and prepare for configuration changes. Useful diagrams include:
• Network diagrams. Network configuration diagrams show the network nodes and the lines that connect the nodes. On the network diagram, you can also identify specific lines such as Expand, SNAX, X25AM, and so on.
• System diagrams.
Figure 4-1.
Figure 4-2.
Some of the important configuration listings include:
• The Subsystem Control Facility (SCF) INFO command. Use the SCF INFO command to display system configuration information for a specified device object (for example, DISK, TAPE, or ADAPTER), including the current attribute values for that object. Refer to the SCF Reference Manual for the Storage Subsystem for a detailed description of the SCF INFO command.
• The SCF STATUS command.
to a configuration. In addition to the packaged reports, you can use the SQL report writer to design and produce custom reports tailored to your specifications.
• Application configuration listings. These listings help the staff monitor applications and ensure that all required processes are running.
• Database configuration listings. These listings show the major databases and the applications that access them.
Figure 4-3.
Figure 4-4.
Tandem Manuals
Manuals provide you with information about your system hardware and software. They provide introductory, procedural, and reference material. You should have a complete set of Tandem manuals for the Tandem products you use, in addition to manuals from other vendors. You should also have internal manuals. Appendix A, “Additional Reading,” lists the manuals that describe the tasks and products mentioned in this manual.
Operator Logs
Operator logs provide a history of problems encountered during each work shift, a record of unresolved problems, and a record of tasks scheduled and performed or not performed. The logs help the staff track tasks and determine why problems occur. Operator logs are maintained by the operators.
The log book should remain by the system console or system cabinets.
Outage Logs
Tandem recommends maintaining outage logs to help you assess system availability. Outage logs provide a history of any failure or upgrade that causes a system outage. This historical data can be used for trend analysis, which in turn can be used to determine where improvements are needed.
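The way an outage log supports an availability assessment can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a Tandem tool: the Outage record, its field names, the cause labels, and the sample entries are all hypothetical, and the calculation assumes that logged outages do not overlap one another.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Outage:
    """One hypothetical outage-log entry."""
    start: datetime
    end: datetime
    cause: str      # site-chosen label, for example "upgrade"
    planned: bool


def availability(outages, period_start, period_end):
    """Return the fraction of the reporting period the system was up.

    Assumes the logged outages do not overlap each other.
    """
    period = (period_end - period_start).total_seconds()
    down = 0.0
    for outage in outages:
        # Count only the part of the outage inside the reporting period.
        start = max(outage.start, period_start)
        end = min(outage.end, period_end)
        if end > start:
            down += (end - start).total_seconds()
    return 1.0 - down / period


def downtime_by_cause(outages):
    """Total the outage time for each cause label (input to trend analysis)."""
    totals = {}
    for outage in outages:
        duration = outage.end - outage.start
        totals[outage.cause] = totals.get(outage.cause, timedelta()) + duration
    return totals


# Hypothetical log entries for a 30-day reporting period.
log = [
    Outage(datetime(1996, 11, 5, 2, 0), datetime(1996, 11, 5, 4, 0),
           "upgrade", planned=True),
    Outage(datetime(1996, 11, 20, 10, 0), datetime(1996, 11, 20, 11, 36),
           "disk failure", planned=False),
]
month_start = datetime(1996, 11, 1)
month_end = datetime(1996, 12, 1)

print(f"Availability: {availability(log, month_start, month_end):.3%}")
for cause, total in downtime_by_cause(log).items():
    print(f"{cause}: {total}")
```

Comparing the per-cause totals from one reporting period to the next is one simple form of the trend analysis described above.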
Figure 4-5.
Internal Operator Guides
Even though Tandem provides the manuals you need to run your system, you might also need internal operator guides. Internal operator guides (also called runbooks) describe the procedures required for a particular site or a particular organization. These procedures can be copied from Tandem manuals and can be tailored to the needs of your organization.
Online Files
Online files are needed to perform many operations tasks. To ensure that the staff can quickly find the files required, consider documenting the location of the files needed to run the system.
• System-configuration and network-configuration listings
• Flow diagrams
• Tandem and other vendor manuals
• Tandem software release documents
• Internal operator guides
• Documentation on the location and use of online files
• Cause, effect, and recovery information for application error messages
• Anything else applicable to your operation
2. Place the documentation where it is accessible to those who need it.
3.
5 Production Management
Overview
This section describes “production management” and provides guidelines to help you manage the day-to-day support and operations tasks in the production environment.
Monitoring System Status
To ensure that the system is operating properly and to recognize when corrective action is required, continuously monitor the status of all system and network resources. Resources include processors, cabinets, disks, paths, volumes, controllers, communication lines, Expand lines, transaction-processing servers, terminal control processes (TCPs), terminals, spooler devices, and programs.
• Reduce operator errors
• React to problems more quickly and perhaps more accurately than an operator could
For more information on automating operations, refer to Section 12, “Automating and Centralizing Operations.”
Step 1—Establishing a Strategy Production Management Step 1—Establishing a Strategy The accounting strategy is based on the service-level agreements.
Reporting the results on a weekly basis helps track system and network resource usage and can provide the capacity planners with data useful for forecasting future demands.
Providing Daily and Weekly Reports
The operations staff should generate daily and weekly reports containing statistical information about how the Tandem systems are operating.
Creating a Production Schedule
Perform the following tasks to create a production schedule:
1. Use a 24-hour clock worksheet (like Figure 5-1) to list all tasks that are performed daily.
2. Identify which task (either business or operations and management) will be performed and who (either an automated process or a person) will perform it.
3.
Figure 5-1.
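The 24-hour clock worksheet used to create a production schedule can also be kept as a small data structure. The following Python sketch is one possible layout, not the format of Figure 5-1: the ScheduledTask record, the build_worksheet function, and the sample tasks are hypothetical names and data chosen for illustration. Each entry records the two facts the worksheet calls for: what kind of task it is (business or operations) and who performs it (an automated process or a person).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScheduledTask:
    """One hypothetical worksheet entry."""
    hour: int        # hour of day on the 24-hour clock, 0 through 23
    name: str
    kind: str        # "business" or "operations"
    performer: str   # "automated" or "person"


def build_worksheet(tasks):
    """Group daily tasks by hour, with one entry for each of the 24 hours."""
    sheet = {hour: [] for hour in range(24)}
    for task in tasks:
        if task.hour not in sheet:
            raise ValueError(f"hour out of range: {task.hour}")
        sheet[task.hour].append(task)
    return sheet


# Hypothetical sample tasks, for illustration only.
tasks = [
    ScheduledTask(6, "Start-of-day checks", "operations", "person"),
    ScheduledTask(6, "Start performance data collection", "operations", "automated"),
    ScheduledTask(8, "Order entry opens", "business", "person"),
    ScheduledTask(23, "Nightly batch run", "business", "automated"),
]
worksheet = build_worksheet(tasks)

for hour, entries in worksheet.items():
    for task in entries:
        print(f"{hour:02d}:00  {task.name} ({task.kind}, {task.performer})")
```

Keeping an entry for every hour, even when it is empty, makes idle periods as visible as busy ones.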
• Recognize the types of documentation that should be available to your staff, including daily run sheets, configuration listings and diagrams, manuals, and so on. For more information about documentation requirements, refer to Section 4, “Operations Documentation.”
• Evaluate and select hardware, software, and tools.
Routine Operations Tasks
The following check lists of routine tasks can help you determine what types of tasks your staff needs to perform.
Processor Dump
A processor dump is performed to copy the contents of a processor’s memory onto disk or tape. Although a processor dump is useful for most processor or system failures, in some cases you should contact your Tandem representative first.
Processor Reload
A processor reload is performed to bring up a processor in a running system.
System Shutdown
System shutdown is performed to bring down a running system in an orderly manner.
Daily Tasks
The following types of daily tasks help you ensure that the system is running properly and that potential problems are detected early:
• Start-of-day tasks. These tasks are performed by the first shift of the day.
• Start-of-shift tasks. Every operations work shift performs these tasks at the start of the shift.
• During-the-shift tasks. Every operations work shift performs these tasks periodically during the shift.
• End-of-day tasks.
• Make sure that the temperature and humidity are normal.
• Check the spooler. Use the spooler interface (SPOOLCOM) to make sure that the spooler components are working properly.
• Check for operator messages:
  • Check the operator message logs, TSM EMS Event Viewer screens, and your own system management applications for operator messages and event messages.
  • Read all messages that appear in order to detect potential problems early.
• Check files:
  • Note any bad tracks on mirrored disks and spare the bad tracks (use SCF).
  • Make sure that sufficient free space is available for dynamic extent allocation during the day, and monitor space fragmentation. Use the LISTFREE function of the Disk Space Analysis Program (DSAP).
  • Look for excessive file fragmentation or exceptional conditions such as extent overlaps, unspared defective sectors, or lost free space pages. Use DSAP.
• Check printer supplies. Make sure that a supply of printer paper and ribbons is always available.
• Perform preventive maintenance on all tape drives.
End-of-Day Tasks
The following tasks are usually performed during the last shift of the day:
• Stop Measure data collection and generate system performance reports.
• Perform NonStop TM/MP audit dumps.
• Check equipment:
  • Clean the cabinets as needed.
• Maintain inventory records of hardware.
• Order supplies.
• Summarize daily statistics:
  • The number of problems reported, resolved, or still unresolved
  • The amount of time and the levels of support required to solve problems
  • The number of users added to and deleted from the system
  • The number of terminals installed or fixed
  • Any other useful information
• Prepare security audit reports.
• Back up and restore complete disks. Make sure that the on-site and off-site archives are current.
• Reload application files (might be required more or less frequently, depending on the applications).
• Perform system preventive maintenance. (Your Tandem support representative might perform this task, depending on your support contract.)
• Review weekly performance reports. Determine if additional system capacity will be needed.
• Review the guidelines provided in Section 10, “Contingency Planning.” Having a disaster recovery plan in place can help you and your staff recover from a disaster as quickly as possible, with minimal damage to your system and data.
• Document recovery procedures and make them available to operations and support staff. Information on how to determine the cause of a problem should also be documented and available.
Production Management Tools
Tandem provides a number of tools to help your staff with production management tasks. Table 5-1 summarizes the production management tools and their capabilities. For detailed descriptions of these tools, refer to Section 14, “Operations Management Tools.” For a list of automation tools, refer to Section 12, “Automating and Centralizing Operations.”
Check List
The following check list summarizes the main points of this section:
1. Implement operator tools to monitor the systems, networks, applications, processors, disks, and communications lines.
2. Determine which tools to use. Tandem offers these tools:
• DSM/NOW
• EMS
• EMSA
• NetBatch and NetBatch-Plus
• NSX
• OMF
• SeeView
• SCF
• TSM
• TSM EMS Event Viewer
• VHS
• ViewSys
3.
6 Problem Management
Overview
No matter how well managed your system is, errors and problems can occur. Because a problem can mean the loss of availability, your staff needs to know how to report and resolve the problem. If your staff cannot resolve the problem, it must know how to escalate the problem so that recovery occurs. This section describes “problem management” and provides suggestions, guidelines, and tools for administering problems in an operations environment.
Table 6-1. Unplanned Outage Classes
Outage Class    Description
Physical        Physical faults or failure in the hardware. Examples include system disk failure and network router failure, non-fault-tolerant hardware configurations (such as unmirrored disk drives), and non-fault-tolerant application configurations.
Design          Design errors such as bugs in design and design failure in hardware or software.
Providing Outage Prevention and Recovery Training
Outage prevention and recovery training helps the operations staff become more aware of the concept and cost of outages and promotes outage prevention habits.
Problem Prevention Strategies
You can prevent many problems by implementing the following strategies:
• Monitor the hardware and software. To ensure that the system is operating properly and to recognize when a potential problem might occur, continuously monitor the status of all system and network resources.
Application Design provides guidelines for designing applications for high availability.
• Ensure the availability of super-group (255, n) capabilities. While a super-group logon is not needed under normal conditions, it may be required to solve certain problems. Having access to a super-group password is sometimes the fastest way, and even the only way, to solve a problem.
Recovering From Problems
Despite the best planning and prevention, problems can still occur. To get your system or application back online quickly after an unplanned outage, it is important to organize and analyze problem information. By implementing systematic problem-solving techniques, you will be able to pinpoint the cause of a problem and resolve the problem in a timely and efficient manner. Systematic problem solving consists of five steps: 1. 2. 3. 4. 5.
Step 1—Detecting and Isolating the Problem Problem Management Step 1—Detecting and Isolating the Problem To detect problems quickly, operators must be aware that a problem exists. Some of the same techniques used to predict and prevent problems are also used to determine if a problem exists. These are: • • • • • Monitoring hardware and software. Monitoring system and application software message logs. Using Tandem Service Management (TSM) tools, including the TSM EMS Event Viewer.
Step 2—Gathering the Facts and Reporting the Problem Problem Management Following are suggestions for problem reporting requirements: • Develop a standard online or hard-copy problem report form to log problems. Require that all problems be documented with this form. The form should record the facts about the problem and the facts about the situation surrounding the problem. Facts about the problem include what happened, where, when, and the magnitude of the problem.
Figure 6-2.
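A standard problem report form can also be represented as an online record. The Python sketch below is a hypothetical layout, not the form shown in Figure 6-2: the ProblemReport class and its field names are assumptions made for illustration. It records the facts about the problem (what happened, where, when, and the magnitude), leaves a free-text field for facts about the surrounding situation, and allows a report to be closed only after a solution is recorded.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ProblemReport:
    """Hypothetical online problem report record."""
    what: str                       # what happened
    where: str                      # where the problem occurred
    when: datetime                  # when the problem was noticed
    magnitude: str                  # how widespread or severe the problem is
    situation: str = ""             # facts about the surrounding situation
    solution: Optional[str] = None  # filled in when the problem is resolved
    closed: bool = False

    def close(self, solution: str) -> None:
        """Record the solution and close the report."""
        if not solution.strip():
            raise ValueError("record a solution before closing the report")
        self.solution = solution
        self.closed = True


# Sample entry; the date and the magnitude text are illustrative assumptions.
report = ProblemReport(
    what="Terminal is down",
    where="Telephone Order department, warehouse",
    when=datetime(1996, 11, 19, 8, 0),
    magnitude="One terminal affected",
)
print(report)
```

Requiring a solution before a report can be closed keeps the record useful when solved problems are reviewed later.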
Step 3—Identifying the Cause and Developing and Implementing a Solution Problem Management Step 3—Identifying the Cause and Developing and Implementing a Solution Using the information obtained when reporting the problem, you are in the position to speculate about what caused the problem and to develop a solution. The following paragraphs provide guidelines for determining the cause of a problem and developing a solution.
Step 4—Escalating the Problem (If Necessary) Problem Management In-Company Problem Escalation Procedures Problem escalation procedures help you ensure that problems are escalated to the correct people in a timely manner. When establishing problem escalation procedures, consider the following: • • • Easy-to-fix problems should be solved by the lowest level of support. This allows higher levels of support to spend time on more complex problems and on other tasks.
Step 5—Reviewing the Problem Problem Management to solve the problem. If the analyst or CE cannot solve the problem, he or she escalates the problems to the district level, then to the regional level. Step 5—Reviewing the Problem When a problem is resolved, the solution can be recorded, and the problem report can be closed. Reviewing problems and solutions can help the staff prevent the same problems from recurring.
Figure 6-3. Case Study: Just For Children, Inc. (JFC) Computer System
[Figure not reproduced. Labels recoverable from the figure: Headquarters; Himalaya S-Series Server; Communications Controller; Dial-Up Lines; Leased Communication Lines; JFC Retail Stores; Portland; Reno; Sacramento; Warehouse; Telephone Order Department; Cluster Controllers; terminals $WHS2.#TRM1, $WHS2.#TRM7, $WHS4.#TRM15, $WHS4.
Problem Scenario
It is late fall, and the holiday season, the busiest (and most profitable) season of the year for JFC, is fast approaching. Business is on the upswing. On Tuesday morning at 8:00 a.m., the operations group gets a call from the manager of the Telephone Order department at the warehouse, indicating that a terminal is down.
• The temporary workers often neglect to power down the terminals after their shift, leaving the terminals on overnight. (This has been against company policy since the start of the energy conservation program.)
• In both situations, the terminals were plugged in and the cable connections were solid.
Figure 6-4. Problem-Solving Worksheet
PROBLEM-SOLVING WORKSHEET
Problem Facts; Possible Causes (column headings as extracted): Terminal Hardware Terminal Comm. Config. Lines System Controller TACL Move
What? 2 terminals down
$WHS2.#TRM7 Yes Yes Yes Yes Yes Yes
$WHS4.#TRM20 Yes Yes Yes Yes Yes No
$WHS2.#TRM7 on east wall Yes Yes No No Yes No
$WHS4.#TRM20 on west wall Yes Yes No No Yes Yes
One on Tuesday at 8:00 a.m.
Problem Management Tools
Tandem provides a number of tools to help your staff with problem management tasks. Table 6-2 summarizes the problem management tools and their capabilities. For detailed descriptions of these tools, refer to Section 14, “Operations Management Tools.”
Note. For a list of automation tools, refer to Section 12, “Automating and Centralizing Operations.” For a list of performance tools, refer to Section 8, “Performance Management.”
Check List
The following check list summarizes the main points of problem management:
1. Maintain a well-trained operations and support staff.
2. Establish problem prevention strategies.
7. Establish procedures for reviewing problems:
• Periodically meet with your staff to review solved and unsolved problems and to determine if improvements in the procedures can be made to prevent the same problems from occurring in the future.
• Generate reports to provide statistics on the number of problems encountered, solved, and not solved, and on the time and levels of staff required for problem resolution.
8. Determine which tools to use.
7 Change and Configuration Management
Overview
Systems and software often change. For example, you might add hardware to a system, update applications, or install a new release of the operating system. These changes can increase the effectiveness of your operations, or they can create confusion and problems, depending on how your organization handles the changes.
Change and configuration management encompasses the following major areas, which are described briefly in this section:
• Anticipating and planning for change
• Installing and implementing changes to system software and hardware, application subsystems, communications subsystems, and application software
• Controlling the introduction of change
For detailed descriptions of these topics, refer to the Availability Guide
Management Responsibilities
The change-management and configuration-management functions are most effective when policies and procedures are developed and enforced, and the staff is trained.
Anticipating and Planning for Change
By taking the time to anticipate and plan for change, you can avoid taking your system down unnecessarily. Planning for change is especially important in environments that require 24-hour-a-day, 7-day-a-week operations. You can anticipate and plan for change by:
• Evaluating system performance and growth.
• Make sure that there is sufficient power. If there is not, schedule time for adding the power. You might need to add power sockets.
• Make sure that there is sufficient air conditioning. If there is not, schedule time for improving the air-conditioning system.
• Determine whether you need special cables, modems, or cabinets for communications equipment.
• Determine whether the change requires down time.
• SCF allows you to add and change configurations for many (not all) device types while the system is running. SCF also reduces the need to preconfigure software changes.
Performing Subsystem Changes
The Tandem environment consists of application and communications subsystems. The application subsystems enable you to develop and run high-performance, high-volume, and highly available OLTP applications.
Once it is determined that the changes will not affect system security, the staff can prepare to install the software. The following guidelines can be incorporated into a preinstallation check list:
• Determine whether the change requires a new system configuration or reconfiguration of applications.
• Determine whether the change requires down time. If so, schedule the down time with operations and notify users.
Controlling the Introduction of Change
Change occurs all the time. If you do not control who makes the changes and when, you might put your system at risk. If there are no controls, frequent changes and changes by unauthorized personnel might threaten the stability of the system.
• Reviewing the process. Continual process improvement should be an integral part of the change control process.
Case Study
Effective planning and managing of change minimize risk and help ensure that customer service levels are met. The following case study shows how a New England bank implemented change control procedures to manage its growing environment.
User Profile
Allied Bank is a financial institution with 370 offices located in the New England area. Current assets exceed $18 billion.
Implementation of Recommendations
Allied Bank had developed a few change procedures and had assigned one person to coordinate operating system migrations. Although this approach was somewhat effective, it did not provide for ongoing change management. A formal method needed to be developed to manage all change, not just major changes.
Figure 7-1.
Change-Management and Configuration-Management Tools
Tandem provides a number of tools to help your staff with change-management and configuration-management tasks. Table 7-1 summarizes the change-management and configuration-management tools and their capabilities. For detailed descriptions of these tools, refer to Section 14, “Operations Management Tools.”
Check List
The following check list summarizes the main points of change and configuration management:
1. Obtain management commitment to developing policies and procedures, training staff in the policies and procedures, and enforcing policies and procedures.
2. Determine your staffing needs.
3. Anticipate and plan for change by:
• Evaluating system performance and growth to accommodate change.
8 Performance Management
Overview
Performance management helps you ensure that you get the best return from your systems and that the systems meet your business needs. This section provides guidelines for performance management tasks.
• Capacity planning is the process of forecasting future capacity needs based on performance trends and the growth in users, applications, and your company’s business. Capacity planning helps you:
  • Plan for growth in system workloads based on business growth.
Staffing
Staffing needs depend on the size of your company and the number of systems and applications your company runs. The larger the company and the greater the number of systems and applications, the greater the number of people who are involved in performance management. If you have a small operations group, you might need to assign only one person to the function. If you have a large operations group, you might need to assign a group (or groups) to the function.
• The performance-analysis-and-tuning staff should understand how the system and applications are structured and know how to:
  • Collect and analyze statistics
  • Identify peak periods of resource usage
  • Identify and correct current performance problems
Tandem provides training in performance analysis and tuning. Currently, Software Education offers a course called Performance Analysis and Tuning.
Note.
Performance Management Step 3—Reporting Results Step 3—Reporting Results Sizing results are usually reported to the capacity-planning staff and management. Useful reports describe the sizing staff’s assumptions, describe modeling results, list alternatives, and provide recommendations. Capacity Planning Capacity planning consists of four steps: 1. 2. 3. 4.
Performance Management Step 3—Forecasting Step 3—Forecasting Forecasting consists of the following tasks: • • • Developing a model of the current system that reflects current performance characteristics. Using the information provided by the application-sizing staff and by business analysts to estimate projected workload volumes and workload profiles for the next period (for example, the next two years).
Performance Management • • • Step 2—Gathering Performance Information Acceptable performance goals, such as acceptable system response times, deadlines for batch jobs, and transaction volumes. Acceptable level of service when the system is stressed, such as during peak periods and during partial equipment failures. The priorities for resource allocation.
Performance Management Step 4—Optimizing System Performance Step 4—Optimizing System Performance Optimizing system performance involves: • • • • Analyzing the results of the performance measurements Identifying load imbalances and bottlenecks created by applications Determining what should be done to improve system performance Improving system performance by tuning the system, balancing the system, performing online performance troubleshooting, or reconfiguring the system The following guidelines can h
Performance Management • • • Step 5—Reporting Results Each change made to the system affects a number of different system resources. Introducing change in a gradual, planned manner allows you to observe the systemwide effect of a change before trying more changes. If multiple measurements are performed concurrently, the measurements should be coordinated through a single group at each site.
How It Fits Together
Figure 8-1 shows the relationship between application sizing, capacity planning, and performance analysis and tuning.
Figure 8-1.
Analysis of Problem and Recommendations
Because of its growing environment, SJ County Medical is often faced with system performance problems that require action by the performance-analysis-and-tuning (PAT) staff. Over time, the PAT staff has developed a few techniques for accommodating hardware growth, alleviating system performance problems, and minimizing down time.
Performance Management Tools
Tandem provides a number of tools to help your staff with performance management tasks. Table 8-1 summarizes the performance management tools and their capabilities. For detailed descriptions of these tools, refer to Section 14, “Operations Management Tools.”
Check List
1. Establish service-level agreements.
2. Assign staff to the capacity-planning, application-sizing, and performance-analysis-and-tuning functions. Provide training as needed.
3. Establish procedures for application sizing. Typically, the application-sizing staff:
a. Establishes sizing requirements and strategy
b. Forecasts and develops models of future demands
c. Reports results
4. Establish procedures for capacity planning. Typically, the capacity-planning staff: a. b.
6. Select tools to help with performance management.
9 Security Management
Overview
Data is a vital and irreplaceable part of every business. However, data protection is a difficult task. Not only do you have to protect the data, but you also have to protect everything that allows people to access the data, including the computer equipment, storage media, the operating system, and application software. This section describes “security management” and provides suggestions, guidelines, and tools for administering the security process.
Basic Security Rules
Before determining how to secure your hardware and software, you should understand the following basic security rules. Use these rules when establishing your security program or when reviewing a program that is already in place.
Rule 1
The highest levels of management should support and be committed to a security program.
Rule 8
The staff responsible for security monitoring and auditing should periodically review adherence to security rules. Develop or acquire audit tools and reports to support this activity.
Rule 9
The security program must have integrity. Actively test the validity of the security program, the physical barriers, the administrative practices, and the hardware-protection and software-protection mechanisms.
Security Guidelines
Your security policy might range from permissive to restrictive. Initially, it is most helpful to use a somewhat restrictive approach, because it is difficult to tighten security practices once users become accustomed to a permissive approach. Security concepts that can guide your security policy are:
• Least privilege
• Baseline security
Least Privilege
Least privilege dictates that users access the system only when they need to.
Staff Support
Separation of security duties helps you avoid collusion and helps you ensure that your system is well secured. Security administration duties are usually divided between an auditor, a security administrator (or administration team), and the operations staff. Depending on your organization’s structure, the security administrator might also be a member of the operations staff.
• The auditor is responsible for auditing the system.
Organizational Issues
Good security requires that people communicate and cooperate across organizational lines. Figure 9-1 shows the paths of communication needed to sustain a strong security effort.
Figure 9-1.
The Tandem Security System
The Tandem NonStop Kernel operating system and its utilities offer basic system protection. The security software product Safeguard extends security features to include auditing, extended access-control, authentication features, and segregation of administrative tasks. If you are using the OSS environment, you must use the Safeguard software to define users for your system.
Authentication Services Provided by the Tandem NonStop Kernel
The Tandem NonStop Kernel operating system has built-in security that uses passwords for authentication and security strings to control access to files. Utilities, including the File Utility Program (FUP) and the Disk Space Analysis Program (DSAP), help you control and monitor system security.
• German Information Security Agency (GISA), F2/Q3 security-function and F7/Q3 system-availability levels
• Harmonized European Information Technology Security Evaluation Criteria (ITSEC), E3 level
Note. The suggestions in this section are based on the assumption that you use the Safeguard product to help protect your systems. If you do not use the Safeguard product, you should seriously consider doing so.
from the room, putting locks on the doors, and not posting signs that indicate the location of the computer facilities.
Environmental Controls
Access to the power supply and the air conditioning can provide ample opportunity for accidental or malicious damage. Consider controlling access to the power supply and the air conditioning by locking the control panels.
System Cabinets
Protect the system cabinets from accidental damage and deliberate malicious acts.
carefully screening all who request materials, allowing access to approved persons only, and creating explicit hand-over procedures between the storage-area staff (especially staff on contract) and your staff.
Data Encryption
If you cannot provide physical security for data, consider encrypting the data so that intruders cannot easily access the data.
Access-Control Lists (ACLs)
Depending on your organization’s security policy, you might have to restrict access to system software so that only selected users or user groups can execute the software. To restrict access, use Safeguard access-control lists (ACLs). Safeguard ACLs allow you to specify exactly which users have access to which files. The Safeguard product maintains ACLs for all objects under its protection.
Table 9-1. Classes of Special System Users
Users              Typical User Name     User ID
Super ID           SUPER.SUPER           255,255
Super-group user   SUPER.user-name       255,n
Group manager      group-name.MANAGER    n,255
The Super ID
Users with the super ID (255,255) can access all data and devices, and they can log on as any user without knowing the user’s password. You can use the Safeguard product to restrict some of the super-ID capabilities.
The purpose, use, and dangers of the super ID (255,255) are fully described in the Security Management Guide.
Note. In the Open System Services (OSS) environment, the super ID has the user ID 65535 and has the set of special permissions called appropriate privileges. The Guardian user ID (255,255) is the same user ID as the OSS user ID 65535.
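The relationship between the two forms of the super ID can be checked with simple arithmetic. The Python sketch below assumes that a Guardian (group, member) pair packs into one number as group × 256 + member. This manual states only the super-ID case, so the general formula and both function names are illustrative assumptions.

```python
def guardian_to_oss_uid(group: int, member: int) -> int:
    """Pack a Guardian (group, member) pair into a single number.

    Assumption: the packing is group * 256 + member. The manual confirms
    only that (255,255) corresponds to 65535.
    """
    for part in (group, member):
        if not 0 <= part <= 255:
            raise ValueError("group and member numbers must be in the range 0 to 255")
    return group * 256 + member


def oss_uid_to_guardian(uid: int) -> tuple[int, int]:
    """Split a single number back into a (group, member) pair."""
    if not 0 <= uid <= 65535:
        raise ValueError("user ID must be in the range 0 to 65535")
    return divmod(uid, 256)


print(guardian_to_oss_uid(255, 255))  # prints 65535, the super ID
```

Under this assumption, a group manager ID of the form n,255 always maps to the last number in its group’s block of 256.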
Guest-User IDs
You can provide a guest-user ID on your system. A guest-user ID makes your system temporarily available to people who must have physical access to your system, but who do not need long-term access. Before providing a guest-user ID, consider these points:
• Keep the user ID as unprivileged as possible. For example, the guest-user ID should not have access to any sensitive files or system resources.
• Evaluating the risk to any unencrypted password database that the user had access to and, if necessary, changing all passwords stored in that database.
• Changing the guest-user ID if your system has guest-user IDs. If the person is merely moving to a different group and the members of the group are still allowed to use your guest-user ID, this change might be unnecessary.
• Removing references to the user ID from Safeguard access-control lists.
Setting Unexpected Initial Passwords
Don’t derive initial passwords from the user name or user ID, since an inside intruder might log on to a user ID that has been created but not yet assigned.
Enforcing Routine Password Changes
You can use the Safeguard product to force a password to expire after a specified time. This Safeguard feature motivates people to change their passwords before the expiration date.
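The two password guidelines above can be illustrated with a short Python sketch. It is not Safeguard syntax. The password length, the character set, and the function names are assumptions chosen for the example, and the expiration check only mirrors the idea of a password that expires after a specified time.

```python
import secrets
import string
from datetime import date, timedelta

# Assumed character set; a real policy might restrict or extend it.
ALPHABET = string.ascii_letters + string.digits


def make_initial_password(length: int = 12) -> str:
    """Return a random initial password that is unrelated to the user name or ID."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def is_expired(last_changed: date, max_age_days: int, today: date) -> bool:
    """Return True when the password is older than the allowed number of days."""
    return today > last_changed + timedelta(days=max_age_days)
```

For example, `is_expired(date(1996, 1, 1), 90, date(1996, 6, 1))` returns True, because more than 90 days have passed since the last change.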
Authorization Lists
Use authorization-list software to limit dial-up access to a designated subset of the user community. The Safeguard product provides this ability.
Additional External Passwords
Some systems demand an additional system-wide password during the dial-up logon sequence. The system password is roughly the dial-up equivalent of allowing physical access to the main work site.
requires advance network-wide planning. As part of your planning effort, you should consider:
• Reserving a range of group numbers (for example, 200 to 254) for network user IDs, and assigning network user IDs from these groups.
• Deciding on the network-wide names for the groups on an as-needed basis, possibly reserving a particular initial letter (such as N) for network groups.
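The idea of reserving a range of group numbers can be sketched in Python as follows. The range 200 to 254 is the example given above, and member number 255 is set aside for the group manager as in Table 9-1. The function names and the in-memory bookkeeping are hypothetical; they stand in for whatever registry your organization keeps.

```python
NETWORK_GROUP_RANGE = range(200, 255)  # groups 200 through 254 (example range)
MANAGER_MEMBER = 255                   # n,255 is the group manager (Table 9-1)


def is_network_group(group: int) -> bool:
    """Return True if the group number lies in the reserved network range."""
    return group in NETWORK_GROUP_RANGE


def assign_network_user_id(group: int, used: dict[int, set[int]]) -> tuple[int, int]:
    """Assign the lowest free member number in a reserved network group.

    `used` maps each group number to the member numbers already assigned.
    """
    if not is_network_group(group):
        raise ValueError(f"group {group} is not reserved for network user IDs")
    members = used.setdefault(group, set())
    for member in range(MANAGER_MEMBER):  # 0 through 254; 255 stays reserved
        if member not in members:
            members.add(member)
            return (group, member)
    raise RuntimeError(f"group {group} has no free member numbers")
```

Keeping the range in one constant makes it easy for every node on the network to apply the same reservation.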
• Authenticates the user by using smart cards or personal identification numbers (PINs).
• Decides what servers the user is entitled to use.
• Passes the personal ID when it calls the server.
• Resides on a diskless workstation. Diskless workstations can prevent information from being copied to a floppy disk and removed or from being left where someone might break into the workstation to access the hard disk.
File-Sharing Groups
File-sharing groups are particularly important in the OSS environment. Each user has a group list that contains the names of all groups to which that user belongs. When the user attempts to access a file, the file’s group permissions are granted to that user if the user’s group list includes the name of the file’s group.
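The group-list rule described above can be sketched in Python. Only the group check comes from this section. The owner and other permission classes follow the usual UNIX-style model and are an assumption here, as are the type and function names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEntry:
    owner: str
    group: str
    owner_perms: frozenset
    group_perms: frozenset
    other_perms: frozenset


def effective_permissions(user: str, group_list: list[str], entry: FileEntry) -> frozenset:
    """Return the permission set that applies to `user` for `entry`."""
    if user == entry.owner:          # assumed: the owner class is checked first
        return entry.owner_perms
    if entry.group in group_list:    # the rule described in this section
        return entry.group_perms
    return entry.other_perms         # assumed fallback for everyone else
```

A user whose group list includes the file’s group receives the group permissions even if that group is not the first one in the list.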
Implications for Your Security Policy
Your security policy should establish guidelines for:
• File security during the development process. If the development environment and production environment are on the same system, create separate production disk volumes or subvolumes. If the Safeguard product is installed, secure the volumes and subvolumes so that developers do not have create or write authority to production files.
Possible Hazards
Inappropriate design of PROGID programs can result in serious security holes:
• Without sufficient checking of the input data range and form, an incompletely debugged PROGID program can unintentionally provide unauthorized access to restricted data.
• The privileges of a PROGID program propagate to any processes created by the program.
anywhere in the system, disrupt the system, disrupt the network, and do anything that the super ID (255,255) can do (including license another program).
Check List
The following check list summarizes the main points of security planning:
1. Develop a security policy for your organization.
2. Educate the user community and the operations staff about security and their responsibilities for protecting the system.
3. Designate a security administrator and a security administration team to manage security. Set up check lists for the administrator and team members.
4.
• Reserve a range of group numbers for network user IDs, and assign network user IDs from these groups.
• Decide on the network-wide names for the groups on an as-needed basis.
• Designate a particular organization to own each group name and group ID, and make that organization responsible for controlling the allocation of user IDs within its group.
• Determine what applications and users can use network IDs.
• Consider using encryption devices.
10 Contingency Planning
Overview
Contingency planning can help you prevent, prepare for, and recover from a disaster. Disasters can occur any time and anywhere. In companies where day-to-day business activity is tied to a computer system, a sound recovery plan is imperative. Planning ahead can help you prevent some disasters and respond to those you cannot prevent.
• Network and system configurations
• Data recovery and integrity
• Data archiving procedures
Computer Center Location and Facilities
Review Section 3, “The Operations and Support Areas,” to ensure that your computer center and systems are protected. If you follow the guidelines in Section 3, you can avoid many disasters such as flooding, fires, and illegal access, or at least minimize the damage that such adversities would cause.
NonStop Access for Networking provides alternate paths to guard against local area network (LAN) failure in client/server topologies. Most Tandem systems are delivered with a fault-tolerant configuration. It is up to you to maintain a fault-tolerant configuration whenever you change or add hardware. When changing the configuration, follow the guidelines described in Section 7, “Change and Configuration Management.
Disaster Recovery Planning
Disaster planning is a major undertaking and a team effort.
Figure 10-1. The Disaster Planning Process
(Flow shown in the figure: Gain Support of Executive Staff; Form Planning Team; 1. Take Inventory; 2. Develop the Plan; 3. Test the Plan and Train the Staff; 4. Revise and Update the Plan as Needed)
Step 1—Taking Inventory
As a first step toward preparing a recovery plan, the planning team usually determines what is at risk and prioritizes the risks. Taking inventory involves answering these questions:
1.
6. Is insurance available? Should your company purchase insurance for loss of equipment or business?
7. What are the recovery alternatives, the costs associated with each alternative, and the best alternative for your needs? Recovery alternatives usually include the use of a backup site. For a description of backup site options, refer to “Backup Sites,” later in this section.
Figure 10-2. Damage-Assessment Team Responsibilities
(Areas shown in the figure: Hardware, Software and Data, Staff, Facilities, Site)
• Command-post information and procedures. Communication is vital to successful recovery. A command post serves as the focal point of disaster recovery. The command post is responsible for coordinating all activities and receiving and disseminating information internally and externally.
• A list of all materials and services that must be available during a disaster, along with information on how to access the materials and services. Following are items that should be available:
Note. Contracts and service agreements with third parties might be required for some of these materials and services.
• Priority Tandem hardware shipments and Tandem analyst support.
• Backup site procedures. If your company has a backup site, the planning team should document the procedures for moving to the alternate site. For more information about backup sites, see “Backup Sites,” later in this section.
• Procedures for reestablishing operations in the primary site or at a new permanent site.
Backup Sites
An important part of developing a recovery plan is determining whether or not your company needs a backup site. A backup site is a second site that is available for use when a disaster stops operations at your primary site. Depending on the type of backup site, restarting operations at the backup location can take anywhere from 10 minutes to 30 days. Your company can maintain the backup site, or pay another company to maintain the site.
Archived data is sent to the operational-ready site but is not loaded onto the system until a disaster occurs. During a disaster, you convert an operational-ready site to primary-processing status by:
• Backing up and removing the low-priority processing
• Loading the archived data
• Starting the necessary applications
Plan on one or more days to convert an operational-ready site into a primary-processing center.
Table 10-1. Backup-Site Alternatives: Advantages and Disadvantages (page 1 of 2)
Backup Site: Cold Site
Advantages: Inexpensive way to acquire or lease a second computer site. No equipment or operating costs until a disaster occurs.
Disadvantages: Can require 20 days or more to become operational. Everything from furniture to computers must be ordered, delivered, and installed.
Table 10-1. Backup-Site Alternatives: Advantages and Disadvantages (page 2 of 2)
Backup Site: Mutual Backup Site
Advantages: May be least expensive way to establish a backup site. Requires less capital investment. Realistic recovery plan can be tested. During nondisaster periods, site may be shared by participants for development work.
Check List
The following check list covers the main points of disaster prevention and recovery planning.
1. Take preventive steps to limit risks of disaster:
a. Select the best site possible for your organization:
• Should the computer center be located at a remote, computer-only site, or with other business operations?
• Is the site away from known danger zones?
b. Select or design the best facility possible.
• What are the recovery alternatives, the costs associated with each alternative, and the best alternative for your needs?
b.
11 Application Management
Overview
Applications are key to the operation of many businesses. An unavailable application can result in:
• Revenue loss. Many companies sell their ability to deliver services any time of the day or night. If the application responsible for providing the service is unavailable, the customer might call another supplier.
• Lost productivity. Information-based companies rely on computer applications.
The following subsections provide guidelines for:
• Establishing operations-oriented application requirements
• Managing batch applications
• Managing online transaction-processing applications
• Managing client/server applications
• Using Tandem tools for application management
Use these guidelines to plan for application management; to establish schedules, priorities, and job assignments; to determine staffing and training needs; and to pre
• Events and operator messages. The application should use the Event Management Service (EMS) to format events and messages in a standard fashion. Make sure that the application provides the information you need. For example:
  • Make sure that an event or message is generated whenever a problem occurs.
• Performance measurement. Performance-measurement tools or counters should be built into the applications. Determine what types of counters you need and what types of procedures are required.
• Application-specific operations guides. Require documentation for all applications. Your staff needs information on monitoring, installing, starting, and stopping applications, and on resolving simple problems.
• How fault-tolerant is the application? Can it handle minor problems?
• Does the application have recovery procedures for each phase of processing?
• What are the backup procedures?
• How often should NonStop TM/MP dumps be performed?
• What provisions must be made for off-site disaster recovery?
• Are additional tools or utilities (programs, command files, TACL routines) needed before the application can go into production?
• Will the opera
• Is the interface to application security easy to use?
• What are the hardware requirements?
  • What are the disk space requirements and are they reasonable?
  • Has your organization’s capacity-planning and application-sizing staff analyzed the application for future hardware requirements?
  • Should a Tandem analyst be contacted for help with sizing and reviewing the application?
• Have arrangements been made to acquire additiona
Batch, Online, and Client/Server Processing
You might have to manage batch, online, and client/server processing applications. Each type of application requires different system management techniques. The following paragraphs describe the operations requirements for each type of application.
Operations
To perform batch processing on Tandem systems, the operations staff usually follows these steps:
1. They identify the batch job.
2. They submit the job input file to a scheduler program.
Once the job is submitted, the scheduler does the rest of the work. The scheduler performs the following steps:
1. It schedules the job according to the scheduling options.
2. When the time comes to run the job, the scheduler starts the executor program.
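The division of work between the operations staff, the scheduler, and the executor can be sketched in Python. This is a generic illustration of the steps above, not a model of any Tandem batch product. The class names, the single run-time scheduling option, and the callable executor are assumptions.

```python
import heapq
from dataclasses import dataclass, field
from datetime import datetime


@dataclass(order=True)
class Job:
    run_at: datetime                       # the only scheduling option modeled
    name: str = field(compare=False)
    input_file: str = field(compare=False)


class Scheduler:
    def __init__(self, executor):
        self._queue = []           # min-heap ordered by run_at
        self._executor = executor  # called once for each job that becomes due

    def submit(self, job: Job) -> None:
        """Operations staff step: submit the job to the scheduler."""
        heapq.heappush(self._queue, job)

    def run_due(self, now: datetime) -> list[str]:
        """Scheduler step: start the executor for every job whose time has come."""
        started = []
        while self._queue and self._queue[0].run_at <= now:
            job = heapq.heappop(self._queue)
            self._executor(job)
            started.append(job.name)
        return started
```

In this sketch the operator’s involvement ends at `submit`; everything after that point is driven by the scheduler.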
• ATM
• Order entry
• Credit authorization
• Prescription orders
• Stock exchange
• Manufacture automation
• Travel reservations
• Telephone company switches
Figure 11-2 illustrates a typical OLTP application.
Figure 11-2. Online Transaction Processing
(Components shown in the figure: Tandem NonStop System, TP Monitor, Server Program, Database)
Operations
When managing OLTP applications, your most important concerns should be to:
• Maintain a stable environment.
Tools
Table 11-1 lists Tandem software products that can help you develop and manage online transaction-processing applications. For detailed descriptions of these products, refer to Section 14, “Operations Management Tools.”
Table 11-1. Online Transaction-Processing Tools
Product: NonStop Transaction Services/MP (NonStop TS/MP)
Function: Provides the programs and operating environment required for developing and running OLTP applications.
Figure 11-3.
Figure 11-4.
If both the client and its server maintain a transaction log of information necessary for reestablishing a client session, client down time can be significantly reduced. After the client reboots and logs in, the server can reestablish the session, provide the status of client transactions, and continue processing transactions as necessary.
Table 11-2. Client/Server Processing Tools (page 2 of 2)
Product: Remote Server Call (RSC)
Function: Facilitates client/server computing by allowing workstation applications running in Microsoft Windows, Windows NT, MS-DOS, OS/2, UNIX, Winsock, and Apple Macintosh operating environments to access Pathway server classes and Guardian processes.
In the new centralized environment, operations support is provided by the data center personnel. Application users now have less flexibility and control but no longer have the responsibility of staffing operations. The operators at each central site are now required to know the application and all the peculiarities of its environment, such as the executor, scheduler, and change control procedures. Predictably, this raises a new set of problems.
Implementation of Recommendations
While each of these problems can be easily fixed, the larger issue is a lack of standards. NASL had no standards for managing the new applications at each of the data centers. The standards that had been developed previously by each application department in the distributed environment were no longer effective in a centralized environment. Application standards and requirements needed to be developed.
At NASL, the application requirements specifically address the following design requirements:
• All application input data must be verified. The application developers are now required to design applications as follows:
  1. Accept all inputs.
  2. Echo them back.
  3. Ask the operator to verify data. If the operator confirms the data, then proceed. If not, allow the operator to reenter the data.
• All applications must run normally without errors.
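The accept, echo, and verify sequence in the first requirement can be sketched in Python. The function name, the field names, and the limit on the number of attempts are assumptions made for the example; the `ask` and `show` parameters exist only so that the routine can be driven by a script as well as by an operator.

```python
def read_verified(field_names, ask=input, show=print, max_attempts=3):
    """Accept inputs, echo them back, and proceed only after confirmation.

    Returns a dict of confirmed values. Raises RuntimeError if the operator
    does not confirm within max_attempts (an assumed safeguard).
    """
    for _ in range(max_attempts):
        # 1. Accept all inputs.
        values = {name: ask(f"{name}: ") for name in field_names}
        # 2. Echo them back.
        show("You entered:")
        for name, value in values.items():
            show(f"  {name} = {value}")
        # 3. Ask the operator to verify the data.
        if ask("Is this correct? (y/n) ").strip().lower() == "y":
            return values
        # Not confirmed: loop so that the operator can reenter the data.
    raise RuntimeError("input was not confirmed")
```

An interactive call such as `read_verified(["run date", "account range"])` prompts for each field, displays what was typed, and repeats the prompts if the operator answers anything other than y.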
At NASL, the key to the database is the system date and application ID. For example, if the system date is February 28, 1994, and it is before 5:00 p.m., the transaction date is February 28, the next business date is March 1, and the second business date is March 2. The calendar accounts for all weekends, holidays, and partial holidays (where any one location is open for business even though others are not). The PA group is responsible for maintaining the database file.
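The date example above can be reproduced with a small Python sketch. The case study describes only the situation before 5:00 p.m.; rolling the transaction date forward after the cutoff is an assumption, partial holidays are not modeled, and the function names are hypothetical.

```python
from datetime import date, datetime, time, timedelta

CUTOFF = time(17, 0)  # 5:00 p.m.


def is_business_day(day: date, holidays) -> bool:
    """Monday through Friday, excluding listed holidays."""
    return day.weekday() < 5 and day not in holidays


def next_business_day(day: date, holidays) -> date:
    day += timedelta(days=1)
    while not is_business_day(day, holidays):
        day += timedelta(days=1)
    return day


def business_dates(now: datetime, holidays=frozenset()):
    """Return (transaction date, next business date, second business date)."""
    transaction = now.date()
    # Assumption: at or after the cutoff, or on a non-business day,
    # the transaction date moves to the next business day.
    if now.time() >= CUTOFF or not is_business_day(transaction, holidays):
        transaction = next_business_day(transaction, holidays)
    following = next_business_day(transaction, holidays)
    second = next_business_day(following, holidays)
    return transaction, following, second
```

For 9:00 a.m. on February 28, 1994 (a Monday), `business_dates` returns February 28, March 1, and March 2, matching the example in the text.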
12 Automating and Centralizing Operations
Overview
Automating and centralizing operations can help you improve the efficiency and effectiveness of your system operations support staff and help improve system and application availability. This section lists the steps required for automating and centralizing Tandem system operations and describes the products that help you automate and centralize.
Why Automate and Centralize Operations?
Figure 12-1.
Centralizing operations is the process of managing distributed systems, distributed applications, or a whole network from a single site. Centralizing operations:
• Allows fewer expert operators to manage a greater number of systems
• Allows you to leave some systems unattended or supported by only minimal staff
Typically, a central site serves as a service organization to all other sites.
Automating Operations Tasks
Automating operations tasks involves the following steps:
1. Commit resources to system automation. Tandem provides the automation tools, but the operations staff need to use the tools to automate system tasks. You might also have to train your staff so that they can use the automation tools.
2. Determine which tasks should be automated. Select the tasks that will increase staff productivity.
4. Have your intermediate-level or senior-level support personnel develop and test automated procedures. Make sure that every online file contains comments that explain what the file does and what each command in the file does.
Note. When developing automation procedures, be sure to follow the standards or policies that have been implemented by your operations organization or established by your service-level agreements.
5.
6. Develop and test problem recovery procedures. For example, if the communications lines between the central node and a remote node go down, the staff should know what steps to take to perform tasks on the remote node.
7. Document the procedures developed in Steps 5 and 6.
8. Train your staff in the procedures.
Automation and Centralization Tools
Tandem provides a number of tools to help your staff automate and centralize tasks.
Check List
The following check list summarizes the main points of this section:
1. Commit resources to system automation and centralization. Determine staffing needs.
2. Determine which tasks should be automated and centralized.
3. Determine which tools to use.
13 Operations Management and Continuous Improvement
Overview
An operations environment, even one that is performing well, should never remain static. In business, change is vital. Changes in market conditions, technology, business goals, and competition can affect how you manage your operations environment. Successful operations organizations continuously improve the capabilities and efficiency of their operations management processes and tools to adapt to these changes.
Figure 13-1. Causes of System Outages
(Labels shown in the figure: Install, Processes and Procedures, Upgrades, Moves, Configuration; one value is shown as 40%)
Implementing an Operations-Management Improvement Program
Improving your operations environment is more than just selecting tools and products.
The improvement program will be most successful if:
• The improvement goals are aligned with the service-level agreements of your organization.
• It is planned, staffed, and approved by senior management. Assign a sufficient priority to the project so that adequate resources will be assigned and significant actions will take place.
• The entire operations staff is involved.
• Improvements are made in small, tested steps.
Table 13-1 summarizes each of the five levels in the maturity framework.
Table 13-1. The Maturity Framework (page 1 of 2)
Maturity Level: Level 1
Characteristics: The operations environment is driven from crisis to crisis by unplanned priorities and unmanaged change. Operators perform tasks in an ad-hoc fashion. Tools are not well integrated with the process, and operators use the tools informally to solve problems.
Table 13-1. The Maturity Framework (page 2 of 2)
Maturity Level: Level 5
Characteristics: The staff continuously measures, analyzes, and improves its operations management processes to optimize productivity and minimize the risk of down time. The operations staff can plan for and incorporate new procedures and technologies with little risk, because it has established methods for managing and improving processes.
• Automate recovery tasks currently performed by the operators for routine (recurring) problems.
• Automate performance monitoring.
• Document all major system components and their configurations, and define the actions to be taken when problems occur.
• Are you achieving your goals? At regular intervals during your improvement program, reexamine your original goals. Are you still working towards achieving those goals, or have you deviated from them? Before continuing with your improvements, you might have to adjust your improvement program.
Problem Scenario
The complexity of NAC’s systems was growing rapidly. Managers in the MIS department had to ensure that each of the 10,000 objects was installed and configured correctly and ran efficiently. The business applications and the system generated more than 15 events (status, warning, and problem messages) per minute. However, most problems were reported by end users over the phone.
• TACL macros were used to monitor available disk space and processor processes. However, because the macros had limited functions, they had to be recoded each time there were system configuration changes. In addition, operators had to execute and analyze the macros manually. Often when a serious problem occurred, operators were unavailable to execute the macros.
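One way to avoid recoding a monitor after every configuration change is to keep the thresholds in data rather than in the procedure itself. The Python sketch below illustrates only that idea. It is not TACL and it is not the approach NAC adopted; the volume names, the field names, and the 90 percent default are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VolumeUsage:
    name: str
    capacity_mb: int
    used_mb: int


def volumes_over_threshold(volumes, thresholds, default_pct=90.0):
    """Return (volume name, percent used) for each volume at or above its limit.

    `thresholds` maps a volume name to its own limit; volumes that are not
    listed use default_pct, so newly added volumes need no code change.
    """
    alerts = []
    for volume in volumes:
        percent_used = 100.0 * volume.used_mb / volume.capacity_mb
        limit = thresholds.get(volume.name, default_pct)
        if percent_used >= limit:
            alerts.append((volume.name, round(percent_used, 1)))
    return alerts
```

Because the check is a function of its inputs, it can be run on a timer instead of depending on an operator being available to start it.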
3. Improve system visibility by monitoring critical objects.
4. Introduce automated problem recovery software.
5. Improve the efficiency of automation and other management processes by implementing process statistics.
Step 4—Scheduling and Committing Resources
Once the actions were defined, the improvement team could create a schedule and recruit resources.
• Selected the important messages for each subsystem, defined their severity, and documented the recovery steps.
• Produced a document that specified the critical events and described how operators should respond to them.
• Used the document to build a set of filters managed by the Event Management Service (EMS).
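The relationship between the critical-events document and the filters built from it can be sketched in Python. This is not EMS filter syntax or an EMS interface. The subsystem names, event numbers, severities, and recovery text are invented for the example, and an event is represented simply as a (subsystem, number, text) tuple.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CriticalEvent:
    severity: str
    recovery: str


# Hypothetical entries standing in for the critical-events document.
CRITICAL_EVENTS = {
    ("DISK", 101): CriticalEvent("critical", "Switch to the backup path and notify the on-call operator."),
    ("LINE", 7): CriticalEvent("warning", "Restart the line; escalate if it fails twice."),
}


def filter_events(events, table=CRITICAL_EVENTS):
    """Keep only documented events and attach their severity and recovery steps."""
    selected = []
    for subsystem, number, text in events:
        entry = table.get((subsystem, number))
        if entry is not None:
            selected.append((entry.severity, text, entry.recovery))
    return selected
```

Keeping the table separate from the filtering logic means that the document and the filters can be updated together when a subsystem adds or changes messages.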
• Provide a high-level view of the system that operators can easily interpret. OMF can represent many thousands of objects and their states on one screen. With a quick look at this screen, operators get an immediate impression of the health of the system they have to manage.
Figure 13-3. Case Study: Manual Recoveries Versus Automated Recoveries
(Chart: vertical axis from 0 to 400; horizontal axis by month from Nov through Jul; series: Manual and Automated)
Step 6—Assessing the Improvements
After completing their improvement program, the improvement team assessed their operations management processes and concluded that they were now at maturity level 3. The following paragraphs summarize the improvement team’s evaluations.
Check List
The following check list summarizes the steps involved in implementing an operations-management improvement program:
1. Assess the current status of your operations management processes. Use the maturity framework to help you determine the maturity level of your operations environment.
2. Develop a vision of the operations management processes you want to have in place by establishing goals and objectives.
3.
14 Operations Management Tools
Overview
Tandem provides a wide variety of tools that help your staff perform operations tasks.
Tools and the operations management disciplines they support (the X marks that map each tool to a discipline are not recoverable here).
Disciplines (table columns): Production Management, Problem Management, Change and Configuration Management, Performance Management, Security Management, Contingency Planning, Application Management, Automating and Centralizing
Tools (table rows): FUP, Flow Map, GPA, Measure, NetBatch/NetBatch-Plus, NSX, NonStop Access for Networking, NonStop ODBC Server, NonStop SQL/MP, NonStop SQL/MP SQLCI, NonStop TM/MP, NonStop TM/MP TMFCOM/TMFSERVE, NonStop TS/MP, NonStop VHS
Table 14-1.
Disciplines (table columns): Production Management, Problem Management, Change and Configuration Management, Performance Management, Security Management, Contingency Planning, Application Management, Automating and Centralizing
Tools (table rows): NSKCOM, OMF, ONS
$CMON
$CMON is a user-written program that monitors some command-interpreter activities. You can use $CMON to secure your system by auditing and restricting attempts to:
• Log on and log off
• Run a program
• Alter the priority of a process
• Add users to the system or delete users from the system
• Change a user’s logon password and remote passwords
The International Tandem Users’ Group (ITUG) can supply you with a sample copy of $CMON.
Data Access Language (DAL) Server
Table 14-2.
You can specify any one of a number of reports as the output of the DSAP utility. Each report analyzes the disk in a different way, for example:
• The Subvol Summary report analyzes the space usage for each subvolume on a disk.
• The User Summary report analyzes the space usage for each user who owns files on the disk.
• The User Detail report lists the file name and space usage for each file on the disk.
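The difference between the summary reports is only the key by which the same file records are grouped. The Python sketch below illustrates that idea with invented records. It does not reproduce DSAP input, output, or units, and the record fields are assumptions.

```python
from collections import defaultdict


def summarize(files, key):
    """Total the space used by `files`, grouped by the given record field."""
    totals = defaultdict(int)
    for record in files:
        totals[record[key]] += record["pages"]
    return dict(totals)


# Hypothetical file records for one disk.
FILES = [
    {"subvol": "APPDATA", "owner": "OPS.ALICE", "name": "ORDERS", "pages": 120},
    {"subvol": "APPDATA", "owner": "OPS.BOB", "name": "STOCK", "pages": 80},
    {"subvol": "REPORTS", "owner": "OPS.ALICE", "name": "DAILY", "pages": 30},
]

by_subvol = summarize(FILES, "subvol")  # grouping used by a per-subvolume summary
by_owner = summarize(FILES, "owner")    # grouping used by a per-user summary
```

A per-file listing corresponds to the ungrouped records themselves.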
Distributed Systems Management/NonStop Operations for Windows (DSM/NOW)
DSM/NonStop Operations for Windows (DSM/NOW) is a Microsoft Windows client/server operations console environment for NonStop systems. DSM/NOW increases the effectiveness of operations and system management by:
• Allowing you to run multiple management applications such as SCF from a single workstation.
Distributed Systems Management/Software Configuration Manager (DSM/SCM)
DSM/SCM is a tool for the centralized planning, management, and installation of software on distributed (target) Tandem NonStop systems. DSM/SCM running on a Tandem central (host) system receives, archives, configures, and packages software for target sites.
Enform
Enform is a query language service that generates reports. You can use Enform to generate reports from measurement data, including data collected by Measure.
Event Management Service (EMS)
EMS collects and consolidates event information generated by software subsystems and routes this information through the network. An event is any normal or abnormal change in the status of a device, line, or system on the network.
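The collect, consolidate, and route pattern described for EMS can be sketched in a few lines of Python. This is a conceptual model, not the EMS programmatic interface: the `Event` fields, the `Collector` class, and the subsystem and subject names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    subsystem: str          # which subsystem reported the change
    subject: str            # the device, line, or system affected
    text: str               # human-readable description
    critical: bool = False  # abnormal changes that need attention


class Collector:
    """Consolidates events in one log and routes copies to destinations."""

    def __init__(self):
        self.log = []
        self._routes = []  # list of (predicate, destination list) pairs

    def add_route(self, predicate, destination):
        self._routes.append((predicate, destination))

    def report(self, event):
        self.log.append(event)  # consolidate every event
        for predicate, destination in self._routes:
            if predicate(event):  # route only matching events
                destination.append(event)


operator_console = []
collector = Collector()
collector.add_route(lambda e: e.critical, operator_console)
collector.report(Event("DISK", "$DATA", "volume up"))
collector.report(Event("LINE", "$LINE1", "line down", critical=True))
print(len(collector.log), len(operator_console))
```

In this sketch, normal and abnormal changes both reach the consolidated log, while the filter decides which events an operator actually sees.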
Flow Map
Flow Map is an application-process flow-diagram generator. Flow Map analyzes data collected by TPDC and Measure and creates a Microsoft Excel-based graphical representation of the applications running on the system and of their performance. You can create Flow Map diagrams to:
• Depict the application’s processes, files, and the connections between them.
• Monitor the actual flow of message traffic within an application.
Measure
Measure is a performance-measurement tool that lets technical specialists or operators collect and examine statistics for a system. It gives specialists or operators immediate, online access to performance statistics for key system and network components, including complex business applications. Specialists or operators can optimize online transaction-processing applications by using the statistics gathered by Measure.
NetBatch and NetBatch-Plus
NetBatch schedules and controls batch jobs as follows:
• The NetBatch scheduler automatically executes and monitors jobs, based on the specified parameters. Operators can specify the times jobs should run, submit the jobs, and then let the scheduler start the jobs at the right time and send the output to the correct location.
• NetBatch jobs can be run anywhere in an Expand network.
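The scheduling behavior described above (submit a job with a start time and an output location, then let the scheduler start it when it is due) can be modeled with a short Python sketch. This illustrates the general technique only; it is not NetBatch syntax, and the `Job` and `Scheduler` names, fields, and sample values are hypothetical.

```python
import heapq
from dataclasses import dataclass, field
from datetime import datetime


@dataclass(order=True)
class Job:
    start: datetime                     # when the job should run
    name: str = field(compare=False)    # hypothetical job name
    output: str = field(compare=False)  # hypothetical output location


class Scheduler:
    """Holds submitted jobs in start-time order and releases the due ones."""

    def __init__(self):
        self._queue = []

    def submit(self, job):
        heapq.heappush(self._queue, job)

    def run_due(self, now):
        """Start every job whose start time has arrived; return (name, output) pairs."""
        started = []
        while self._queue and self._queue[0].start <= now:
            job = heapq.heappop(self._queue)
            started.append((job.name, job.output))
        return started


scheduler = Scheduler()
scheduler.submit(Job(datetime(1996, 12, 1, 23, 0), "NIGHTLY", "spool-a"))
scheduler.submit(Job(datetime(1996, 12, 1, 6, 0), "MORNING", "spool-b"))
print(scheduler.run_due(datetime(1996, 12, 1, 7, 0)))
```

A priority queue keyed on start time means the operator can submit jobs in any order and the earliest due job is always examined first.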
NonStop SQL/MP SQLCI
SQLCI is the primary interface through which database administrators create and change structures to manage data. SQLCI provides SQL data description language (DDL) statements to define the database, SQL data manipulation language (DML) statements to query and modify database tables, installation commands to install NonStop SQL/MP, a set of database-management utilities, and a report-writer facility.
Operations Management Tools NonStop TM/MP Interfaces (TMFCOM, TMFSERVE) transactions and database consistency, making these operations transparent to users and application programmers.
• The Parallel Transaction Processing (PTP) environment, which runs in the Guardian operating environment, supports the CICS command-level API and enables CICS applications to run on Tandem NonStop systems and to communicate with other CICS applications.
NonStop Virtual Hometerm Subsystem (VHS)
VHS acts as a virtual home terminal for applications by emulating a 6530 terminal. VHS receives messages normally sent to the home terminal, such as displays and application prompts, and uses these messages to generate event messages to EMS to inform operations staff of problems.
When combined with Network Statistics Extended (NSX), OMF can provide a network-wide view of both the performance and status of objects.
Open Notification Service (ONS)
ONS is a set of processes, files, and management information bases (MIBs) that work together to enable Tandem subsystems to be monitored by network management applications that comply with the Simple Network Management Protocol (SNMP).
Pathway/TS includes the terminal control process (TCP) and the SCREEN COBOL compiler and run-time environment:
• A terminal control process (TCP) interprets and executes SCREEN COBOL programs and, with the help of the PATHMON process, coordinates communication between those programs, their terminals, and server processes.
• Write scripts to customize the interface. For example, you can control window placement on the screen, assign each window to a process, decide what text is sent to a process, and determine how the output from a process is displayed.
Tandem Advanced Command Language (TACL)
TACL is the command interpreter for the NonStop Kernel operating system. TACL helps you automate operations tasks by allowing you to write macros to perform commands. A macro is a stored sequence of TACL commands to which you assign a name; entering the macro name invokes the command sequence. Macros can accept arguments.
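The macro concept (a named, stored command sequence that accepts arguments) can be modeled in Python as follows. This sketch does not use TACL syntax: the `MacroTable` class, the `{0}` placeholder style, and the command strings are hypothetical stand-ins chosen only to show how naming, storing, and argument substitution fit together.

```python
class MacroTable:
    """Stores named command sequences and expands them with arguments."""

    def __init__(self):
        self._macros = {}

    def define(self, name, commands):
        # Names are stored in upper case so lookup is case-insensitive
        # (an assumption of this sketch).
        self._macros[name.upper()] = list(commands)

    def expand(self, name, *args):
        """Return the stored commands with positional arguments substituted."""
        return [command.format(*args) for command in self._macros[name.upper()]]


macros = MacroTable()
# Hypothetical command strings; {0} stands for the first macro argument.
macros.define("checkvol", ["show-status {0}", "list-files {0}"])
for line in macros.expand("CHECKVOL", "$DATA"):
    print(line)
```

Entering one name in place of several commands is what makes macros useful for automating routine operations tasks; the argument lets the same sequence be reused for different objects.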
• Generate charts to aid in determining performance management alternatives and recommendations
• Archive and organize historical transaction-oriented data
Tandem Failure Data System (TFDS)
TFDS is an automated diagnostic-and-recovery tool that monitors Tandem processors and automatically initiates a processor dump in the event of a failure, analyzes the failure data, and initiates recovery based on the type of defect discovered.
Tandem Reload Analyzer (Reload Analyzer)
Reload Analyzer is a database-management tool that helps you identify fragmented key-sequenced Enscribe files or key-sequenced NonStop SQL/MP objects. Reload Analyzer makes recommendations based on its calculations and provides general, block, and data chain information to help you decide whether to reorganize a key-sequenced Enscribe file or NonStop object.
• Use TMFCOM, the command interface to NonStop TM/MP, to take online dumps of the Transfer database files and to recover the database files after a system failure. Establish a regular schedule for online dumps of Transfer database files.
• Monitor disk space for Transfer and applications based on Transfer, such as PS MAIL. These applications have a tendency to use large amounts of disk space very quickly.
Operations Management Tools Introduction to NonStop Operations Management– 125507 14 -24 ViewSys
A Additional Reading
Overview
This appendix provides suggestions for additional reading for each section of this book. This information will help you learn how to perform specific tasks, use Tandem products, and gain a better understanding of Tandem systems in general.
Section 3—The Operations and Support Areas
For site planning, configuration, and installation of the NonStop Himalaya S-series servers, refer to:
• Guardian System Operations Guide
• Guardian System Operations Reference Guide
• Himalaya S-Series Installation Guide
• Himalaya S-Series Operations Guide
• Himalaya S-Series Planning and Configuration Guide
• Himalaya S-Series Server Description Manual
• Himalaya S-Series Workstation Installation
Section 5—Production Management
For information on system startup, memory dumps, processor reload, and system shutdown, refer to the appropriate system-specific manual:
• Himalaya S-Series Operations Guide
• Himalaya S-Series Support Guide
• Processor Halt Codes Manual
• Tandem Failure Data System (TFDS) Manual
For information on the other tasks and products mentioned in Section 5, refer to the system-specific manuals and the following:
Section 6—Problem Management
For comprehensive information on problem management, refer to the Availability Guide for Problem Management.
For information about Distributed Systems Management, refer to the Introduction to Distributed Systems Management (DSM).
For information about Distributed Systems Management/NonStop Operations for Windows, refer to the DSM/NonStop Operations for Windows (DSM/NOW) Manual.
Section 7—Change and Configuration Management
For comprehensive information on change and configuration management, refer to the Availability Guide for Change Management.
Section 9—Security Management
For information on security in general, refer to the Security Management Guide.
For information on Safeguard, refer to:
• Safeguard Reference Manual
• Safeguard User’s Guide
• Safeguard Administrator’s Manual
For information on NonStop SQL/MP, refer to the NonStop SQL Installation and Management Manual.
For information on the Tandem NonStop Kernel operating system security features, refer to the Guardian User’s Guide.
Section 11—Application Management
For information on Tandem application software, refer to:
• Data Access Language (DAL) Server Manual
• NetBatch Manual
• NetBatch-Plus Manual
• Introduction to NonStop TM/MP
• Introduction to NonStop Transaction Processing
• Introduction to Transfer Delivery System
• NonStop Access for Networking System Note
• NonStop TM/MP Planning and Administration Guide
• NonStop TM/MP Operations and Recove
• SPI Programming Manual
• TACL Programming Guide
• Tandem Failure Data System (TFDS) Manual
• Tandem Network Statistics Extended (NSX) Manual
• Tandem Object Monitoring Facility (OMF) Manual
Section 13—Operations Management and Continuous Improvement
This section describes an approach to improving operations-management processes. No tools or products are mentioned in this section.
B Check Lists
Overview
The check lists from each section in this book are reproduced here so that you can easily use the check lists for note taking or photocopying.
The Operations Staff
1. Structure your organization so that it most effectively and efficiently provides the entry-level through senior-level operations, planning, control, and support activities your company needs.
2. Define each person’s job duties. Make sure that there is a well-defined path for problem escalation and for career growth.
The Operations and Support Areas
1. Determine the type of environment your systems require.
2. Select the location for your system. The location should:
• Be the safest and most secure available to you
• Provide all system and environmental requirements
• Have enough space for all equipment and for storage areas
• Have all required data communications lines and telephones
3.
Operations Documentation
1. Determine what type of documentation you need in order to run your organization efficiently.
Production Management
1. Implement operator tools to monitor the systems, networks, applications, processors, disks, and communications lines.
2. Determine which tools to use. Tandem offers these tools:
• DSM/NOW
• EMS
• EMSA
• NetBatch and NetBatch-Plus
• NSX
• OMF
• SeeView
• Tandem Service Management (TSM)
• TSM EMS Event Viewer
• VHS
• ViewSys
3.
Problem Management
1. Maintain a well-trained operations and support staff.
2. Establish problem prevention strategies.
• Update the problem report form whenever a problem is escalated.
• Know which person on each shift is the Tandem contact. The Tandem contact should understand when and how to contact Tandem.
• Know how to take processor memory dumps and obtain copies of system log files.
7.
Change and Configuration Management
1. Obtain management commitment to developing policies and procedures, training staff in the policies and procedures, and enforcing policies and procedures.
2. Determine your staffing needs.
3. Anticipate and plan for change by:
• Evaluating system performance and growth to accommodate change.
• Providing adequate computer room resources to allow for growth and avoid unnecessary down time.
Performance Management
1. Establish service-level agreements.
2. Assign staff to the capacity-planning, application-sizing, and performance-analysis functions. Provide training as needed.
3. Establish procedures for application sizing. Typically, the application-sizing staff:
a. Establishes sizing requirements and strategy
b. Forecasts and develops models of future demands
c. Reports results
4. Establish procedures for capacity planning.
Security Management
1. Develop a security policy for your organization.
2. Educate the user community and the operations staff about security and their responsibilities for protecting the system.
3. Designate a security administrator and a security administration team to manage security. Set up check lists for the administrator and team members.
4. Maintain physical security:
• Limit access to the computer room (if applicable).
• Reserve a range of group numbers for network user IDs, and assign network user IDs from these groups.
• Decide on the network-wide names for the groups on an as-needed basis.
• Designate a particular organization to own each group name and group ID, and make that organization responsible for controlling the allocation of user IDs within its group.
• Determine what applications and users can use network IDs.
• Consider using encryption devices.
Contingency Planning
1. Take preventive steps to limit risks of disaster:
a. Select the best site possible for your organization:
• Should the site be located at a remote, computer-only site, or with other business operations?
• Is the site away from known danger zones?
b. Select or design the best facility possible. Follow the guidelines in Section 3, “The Operations and Support Areas.”
c.
Contingency Planning Check Lists b.
Application Management
1. Establish operations requirements for all applications.
2. Participate in application reviews to ensure that your requirements are met.
3. Establish a production-assurance control group to ensure that applications are run with the correct data input or options.
4. Establish procedures for managing batch, online, and client/server processing applications.
5. Establish procedures for using Tandem products to manage the applications.
6.
Automating and Centralizing Operations
1. Commit resources to system automation and centralization. Determine staffing needs.
2. Determine which tasks should be automated and centralized.
3. Determine which tools to use.
Operations Management and Continuous Improvement
1. Assess the current status of your operations-management processes. Use the maturity framework to help you determine the maturity level of your operations environment.
2. Develop a vision of the operations-management processes you want to have in place by establishing goals and objectives.
3. Develop an action list of tasks and the sequence in which to implement them.
4.
Glossary
access-control list (ACL). A Safeguard facility that allows you to restrict access to system objects.
ACL. See access-control list (ACL).
alias. When using TACL, an alias is a name that stands for a command. When using DNS, an alias is a name that stands for a network component. Aliases simplify operations tasks. When using Safeguard, an alias is an alternate name that can be assigned to a user for purposes of logging on to the system.
Alliance program.
callback routine. A routine that allows the system to authenticate a caller’s telephone location before permitting the caller to access the system.
catalog. A NonStop TM/MP database that contains information about audit dumps, online dumps, and tape volumes.
change control. A systematic approach to controlling the introduction of change in the production environment.
change management.
configuration management. The process of configuring the production system hardware and software to adapt to changes. One of the operations disciplines in the operations management model. See operations management model.
CONFLIST. A listing that contains all SYSGEN or SYSGENR commands and responses, including error and warning messages, that occur during processing of the configuration files and building of the new operating system.
Distributed Systems Management/NonStop Operations for Windows (DSM/NOW). A Microsoft Windows client/server application that simplifies NonStop server management through a graphical user interface. DSM/NOW consists of an application launcher, event viewer, and integrated command and control functions.
Distributed Systems Management/Software Configuration Manager (DSM/SCM).
environmental outage class. An outage class that includes failures in power, cooling, network connections, natural disasters (earthquake, flood), terrorism, and accidents. See also outage class.
event. A change in some condition in the system or network, whether minor or serious. Events might be operational errors, notifications of limits exceeded, requests for action, and so on.
Event Management Service (EMS).
hot site. See operational-ready site.
ICC. See Integrated Command and Control (ICC).
Integrated Command and Control (ICC). A Microsoft Windows workstation component of DSM/NOW. ICC provides a point-and-click object browser for Tandem host subsystem objects and maps related commands from various subsystem command interfaces to clickable buttons. See also Distributed Systems Management/NonStop Operations for Windows (DSM/NOW).
International Tandem Users’ Group (ITUG).
MeasTCM. The interface between Measure on the Tandem host and the capacity planning tool TCM on the PC or the Macintosh. MeasTCM runs under TACL on the Tandem host, summarizes the performance data collected by Measure, and formats this data for use by TCM.
Measure. A performance-measurement tool that lets users collect and examine statistics for a system or network.
MultiLan.
NSKCOM. The command interface to the Kernel-Managed Swap Facility (KMSF). NSKCOM is the primary tool for monitoring, configuring, and managing kernel-managed swap files. See also Kernel-Managed Swap Facility (KMSF).
NSX. See Tandem Network Statistics Extended (NSX).
offline. Used to describe tasks that can be performed only when the system is down. Contrast with online.
OM model. See operations management model.
OMF. See Tandem Object Monitoring Facility (OMF).
online.
operations management activities. Activities, as defined by the Tandem operations management model, that support a production system, plan for all aspects of the production system, control the introduction of change into the production system, and operate the production system.
operations management model (OM model).
Pathway Open Environment Toolkit (POET). A set of programs and utilities that assist in the creation and running of client/server applications for Tandem systems.
Pathway/TS. A Tandem product that provides tools for developing and interpreting screen programs to support OLTP applications in the Guardian operating environment. See also NonStop Transaction Services/MP (NonStop TS/MP) and Pathway environment.
PEEK.
remote mirroring. A pair of mirrored disk drives that are used together as a single logical drive in which the primary drive and the backup (mirror) drive are located in geographically distinct (remote) locations. Each byte of data written to the primary drive is also written to the mirror drive. If the primary drive fails, the mirror drive can continue operations.
ServerNet System Area Network (SAN). A wormhole-routed, full-duplex, packet-switched, point-to-point network designed with special attention to reducing latency and ensuring reliability. The ServerNet SAN provides the communication path used for interprocessor messages and for communication between processors and I/O devices.
ServerNet Wide Area Network (SWAN) concentrator.
Subsystem Control Point (SCP). The management process for all Tandem data communications subsystems. There can be several instances of this process. Applications using the Subsystem Programmatic Interface (SPI) send all commands for data communications subsystems to an instance of this process, which in turn sends the commands on to the manager processes of the target subsystems. SCP also processes a few commands itself.
documents that are available on local CD-ROM discs as well as online, Internet-accessible servers. This common interface allows you to merge local and online searches and display local and online windows.
Tandem Network Statistics Extended (NSX). A network management tool that provides operators with a global perspective on the entire network.
TIM. See Tandem Information Manager.
TMFCOM. The NonStop TM/MP command interpreter.
TPDC. See Tandem Performance Data Collector (TPDC).
Transfer. An information delivery system that enables organizations to move and manage information efficiently within a single Tandem system or a network of distributed systems.
TSM. See Tandem Service Management.
TSM EMS Event Viewer. Used to perform a variety of tasks associated with viewing and monitoring EMS event logs.
$CMON.
Index A Access control list (ACL) 9-12 Account Quality Planning (AQP) Service 1-18 ACL 9-12 Activity areas, staffing 2-1 Additional reading A-1 Agreements documentation of 4-4 service-level 1-2, 4-2, 8-2 Alias, user 9-12 Alliance program 1-16 Application management case study 11-14 check list 11-18 description of 11-1 example 11-14 operations tools 11-8, 11-10, 11-13 Application sizing 8-1 Applications batch 11-7/11-8 client/server 11-10/11-13 configuration listings 4-9 controlling 11-2, 11-6 managing 11-1/
D Index Change management (continued) responsibilities of management 7-3 staffing requirements 7-3 Change, planning for 7-4 Check lists, summary of B-1 Client/server processing description 11-10/11-13 security requirements 9-19 Cold sites 10-10 Collusion 9-2, 9-5 Command files 12-1, 14-4 Command post for disaster recovery 10-7 Communications, performing configuration changes 7-6 Computer cabinets, security of 9-10 Computer-room description of 3-1 monitoring 3-4 planning for equipment and supplies 3-5 plan
E Index DNS description 14-6 naming conventions 11-2 Documentation additional operations information A-1 CE logs 4-13 configuration diagrams and listings 4-5 contracts 4-4 error logs 4-13 error message 4-17 flow diagrams 4-9 manuals 4-12 online files 4-17 operator guides 4-16 operator logs 4-13 outage logs 4-14 policies and procedures 4-1 service-level agreements 4-2 software release documents 4-12 Downtime, high cost of 1-8 DSAP 9-8, 9-22, 9-23, 14-5 DSM 1-12, 6-4 DSM/NOW 14-7 DSM/SCM 7-5 configuration r
I Index Hardware (continued) implementing physical changes 7-4 installing 3-6/3-7 securing 9-9/9-11 support for 1-16 Help desks equipment and supplies 3-9 uses of 6-8 Help-desk operators 2-17, 6-8 High-security facilities 3-2 I Improvement program 13-2/13-7 Internal operator guides 4-16 International Tandem User’s Group (ITUG) See ITUG ITUG 1-17, 9-9 J Job descriptions computer-room operator 2-15 help-desk operator 2-17 lead operator 2-18 operations manager 2-24 senior configuration planner 2-23 senior
N Index N Naming conventions 11-2 NetBatch description 14-12 operations tasks 11-8 used to manage batch processing 11-8 NetBatch-Plus description 14-12 operations tasks 11-8 used to manage batch processing 11-8 Network Statistics Extended (NSX) See NSX Networks checking status of 12-3 diagrams of 4-5 encrypting data 9-19 monitoring 5-11, 12-1 security default settings 9-8 NonStop Access for Networking 10-3, 14-12 NonStop Kernel network ID requirements 9-18 security features 9-6, 9-8 NonStop ODBC Server 14
O Index On-site storage, security of 9-10 Open Notification Service (ONS) See ONS Open System Services (OSS) See OSS Operational-ready sites 10-10 Operations agreements 1-2 documentation 4-1 improvement program 13-2/13-7 improving processes 13-1 manuals 1-13 staffing requirements 2-4/2-6 Operations areas equipment and supplies for 3-5 monitoring the environment 3-4 preventive maintenance 3-7 security of 3-5 selecting a location for 3-1 Operations management activities 2-1 and fault-tolerance 1-11 descript
P Index Operators computer-room operator 2-15 help-desk operator 2-17 lead operator 2-18 Optimizing system performance 8-8 OSS file security 9-20 interoperability with Safeguard 9-20 running backups 5-13 security considerations 9-7 user aliases 9-12 Outage logs 4-14 Outages classes 6-1 in a client/server environment 1-9 measuring 1-8 planned 1-9/1-11 prevention and recovery training 6-3 reducing 1-10 unplanned 1-9/1-11, 6-1 Outage-minutes-per-year measurements 1-9 P Passwords 9-16/9-18 PATHCOM 5-11, 14-1
Q Index Production schedule 5-5/5-6 PROGID programs 9-22 Programs development 9-21 licensed 9-23 PROGID 9-22 Q Quality services provided by Tandem See Account Quality Planning (AQP) Service R Recovery procedures 5-15 Reload Analyzer (Tandem Reload Analyzer) 14-22 Remote Server Call (RSC) 14-18 Reporting and tracking, reviewing procedures 5-15 Reports of network data 5-10 of security audits 5-14 of statistical information 5-5 of system performance 5-5, 5-13, 5-14 RSC 14-18 S Safeguard description of 9-8
T Index Service-level agreements creation of 4-2 description of 1-2, 4-2 use of 8-2 Shutdown request forms 5-9 Simple Network Management Protocol (SNMP) 14-19 Site planning 3-1, 3-2 Site update tape (SUT) 4-12 SNMP 14-19 Software education 1-15, 8-3, 8-4 installing changes to 7-6 publications 1-13 release documents 4-12 support for 1-16 Sources of additional reading A-1 Special user IDs 9-12 SPI 14-19 SPOOLCOM 5-11 Spooler, monitoring 5-11, 5-12 Staffing check list 2-27 levels 2-1/2-3 operations managemen
T Index Tandem Alliance program 1-16 education 2-26 manuals 1-13, 2-26 software 1-12 support services 1-16 systems 1-11 World Wide Web home page 1-15 Tandem Advanced Command Language (TACL) 14-20 Tandem Capacity Model (TCM) 14-20 Tandem Failure Data System (TFDS) 14-21 Tandem Network Statistics Extended (NSX) See NSX Tandem NonStop Support Center (TNSC) 6-11 Tandem Performance Data Collector (TPDC) 14-21 Tandem Reload Analyzer (Reload Analyzer) 14-22 Tandem Service Management See TSM Tape drives, maintena
U Index TSM EMS Event Viewer (continued) product description 14-23 U Unplanned outages 1-9/1-11, 6-1 User aliases 9-12 User classes See User IDs User groups 9-11, 9-12 User IDs adding 9-12 deleting 9-15 expiration dates 9-15 freezing 9-15 group manager 9-14 guest-user ID 9-15 network application IDs 9-18 network IDs 9-18 purpose of 9-11 reusing 9-16 special classes of 9-12 super ID 9-13 super-group user 9-14 V ViewSys 14-23 Voice alert systems 3-5 W World Wide Web, Tandem home page 1-15 Special Charac