Availability Guide for Application Design Abstract This guide provides an overview of application availability options that designers and developers can use to support both the business function of the application and the instrumentation of the application. Product Version N.A. Supported Release Version Updates (RVUs) This guide supports D48.00 and all subsequent D-series releases, G06.04 and all subsequent G-series releases, and H06.
Document History
Part Number    Product Version    Published
131117         N.A.               December 1996
124511         N.A.               March 1999
525637-002     N.A.               May 2003
525637-003     N.A.               May 2005
525637-004     N.A.
Contents
What’s New in This Guide
   Guide Information
   New and Changed Information
About This Guide
   How Is This Guide Organized?
   Who Should Read This Introduction?
   Where Else Can You Find Related Information?
   Your Comments Invited
   Notation Conventions
Improving Availability on the Internet (continued)
   Standards and Web Services
   The Pathway Application Server
   Web Server support
   The iTP Secure WebServer
   The iTP WebServer Architecture
   The WebLogic Server
   Server-Specific Features
   Specific Enhancements
4. Data Protection and Recovery (continued)
   The Problem
   Possible Solutions
   Solution Using NonStop Operating System in Support of NetBatch-Plus Software
   Solution Using Low-Priority Transaction Processing to Perform the Batch Function
   Solution Using a Database Snapshot
6. Availability in the Pathway Transaction-Processing Environment (continued)
   Summary and Comparison of Application Components
Figures (continued)
Figure 10-2. Indirect Addressing Allows Multiple Versions of the Data-Interface Module to Be Used Simultaneously
Figure 10-3. Labeled Message Versions
Figure 10-4. Overlapping Ranges of Labeled Message Versions
Figure 10-5. Increasing Application Availability Through Dynamic Name Resolution
Figure 10-6. Increasing Application Availability During SQL Program Upgrade
Figure 10-7. Increasing Application Availability During DDL Changes
Tables
Table i.
What’s New in This Guide
New and Changed Information
• Numerous other minor changes to text and figures have been made.
About This Guide The Availability Guide for Application Design provides an overview of application availability options available to designers and developers. The described options support both the business function of the application and the instrumentation of the application. The guide tells you how to design as much availability as you need into your application.
How Is This Guide Organized?
This guide progresses logically through the products and concepts involved in application availability. Readers who read it from cover to cover will gain the most benefit. Table i shows the organization of the guide.
Table i. Organization and Contents of the Manual
Section or Appendix...    Describes...
Table i. Organization and Contents of the Manual (continued)
Section or Appendix...                   Describes...
9 Minimizing Programming Errors          Techniques that help prevent application downtime by reducing coding errors.
10 Designing Applications for Change     Features that you can build into your application that will make it easier to upgrade your application at a later date with the minimum effect on application availability.
Useful Background Information
The following manuals provide a basic introduction to the major features of HP NonStop systems and are useful starting points for readers new to the products.
Other Availability Guides
The Availability Guide for Change Management explains how to maximize system and application availability while successfully implementing changes to your NonStop system. The Availability Guide for Problem Management helps you to maximize system and application availability by predicting, preventing, and preparing for problems.
Instrumentation Programming Manuals
The EMS Manual describes the Event Management Service (EMS). EMS is a collection of processes, tools, and interfaces that provide event message collection and distribution in the distributed systems environment.
Design Help Service
HP’s Availability Review Service is available to help with all availability issues. Contact your HP representative for details.
Your Comments Invited
After using this manual, please take a moment to send us your comments. You can do this by:
• Completing the online Contact NonStop Publications form if you have Internet access.
General Syntax Notation
UPPERCASE LETTERS. Uppercase letters indicate keywords and reserved words. Type these items exactly as shown. Items not enclosed in brackets are required. For example:
   MAXATTACH
lowercase italic letters. Lowercase italic letters indicate variable items that you supply. Items not enclosed in brackets are required. For example:
   file-name
computer type. Computer type letters within text indicate C and Open System Services (OSS) keywords and reserved words.
… Ellipsis. An ellipsis immediately following a pair of brackets or braces indicates that you can repeat the enclosed sequence of syntax items any number of times. For example:
   M address [ , new-value ]…
   [ - ] {0|1|2|3|4|5|6|7|8|9}…
An ellipsis immediately following a single syntax item indicates that you can repeat that syntax item any number of times. For example:
   "s-char…"
Punctuation.
!i:i. In procedure calls, the !i:i notation follows an input string parameter that has a corresponding parameter specifying the length of the string in bytes. For example:
   error := FILENAME_COMPARE_ ( filename1:length      !i:i
                              , filename2:length ) ;  !i:i
!o:i. In procedure calls, the !o:i notation follows an output buffer parameter that has a corresponding input parameter specifying the maximum length of the output buffer in bytes.
About This Guide Change Bar Notation either vertically, with aligned braces on each side of the list, or horizontally, enclosed in a pair of braces and separated by vertical lines. For example: obj-type obj-name state changed to state, caused by { Object | Operator | Service } process-name State changed from old-objstate to objstate { Operator Request. } { Unknown. } | Vertical Line. A vertical line separates alternatives in a horizontal list that is enclosed in brackets or braces.
1 What Is Application Availability? Application availability means that the application is always available to the end user. While the system manager for a large computer system—such as an enterprise server—might consider availability only in terms of keeping the server up and running, the end user might not care that the server has not been down in many months if the network that connects the end user to the server is constantly failing.
Why Is Availability Important?
Demands for continuous services are increasing in almost all markets. To support these demands, it is becoming increasingly important that computer applications are always available.
businesses also need to be able to check customer credit references before charging that meal at 3 a.m.
Corporate Globalization
In today’s global economy, most large corporations operate in many countries around the world, and this trend is increasing. The result is that it is always 9 a.m. at a corporate facility somewhere.
Revenue Loss
Revenue loss is perhaps the most obvious cost of an outage. Application failure means that the stock brokerage can handle fewer transactions that day, the mail order company cannot process orders, and the manufacturing plant might have to shut down its production altogether. Any organization that depends on its computer system for its revenue-generating functions will suffer a loss of revenue in the event of an outage.
Cost Reduction
In addition to avoiding the costs incurred when the application goes offline, a higher level of availability also decreases the day-to-day operational costs of the application. Fewer calls for help could mean that fewer operators are needed to support the application.
it does break, it takes no time at all to fix it. The computer industry traditionally uses a percentage to represent this value. For example, suppose that over a period of 10,000 minutes, an application has one outage that takes 100 minutes to repair:
   Uptime = 9,900 minutes
   Repair time = 100 minutes
   Availability = 9,900/(9,900 + 100), or 99%.
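The availability calculation above can be sketched in a few lines of Python. This is a minimal illustration only; the function name availability is hypothetical, and the figures are the ones used in the example (9,900 minutes of uptime and 100 minutes of repair time).

```python
def availability(uptime_minutes: float, repair_minutes: float) -> float:
    """Return availability as a fraction: uptime / (uptime + repair time)."""
    total = uptime_minutes + repair_minutes
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return uptime_minutes / total


# Figures from the example in the text: one 100-minute outage
# in a 10,000-minute observation period.
ratio = availability(9_900, 100)
print(f"Availability = {ratio:.2%}")  # prints "Availability = 99.00%"
```

Expressing the result as a fraction rather than a rounded percentage makes it easier to compare applications whose availability differs only in the later decimal places.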
of the application might affect only one user, but to that user the application is down. A failure in part of the network could affect several users. A failure in the server, however, could affect hundreds of users. It is therefore important that an outage in the server be weighted more heavily than an outage in the client. By expressing downtime in terms of user outage minutes, a one-minute outage in the client equals one minute of downtime.
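One way to picture this weighting is to multiply the length of each outage by the number of users it affects. The sketch below assumes that simple product as the definition of user outage minutes; the function name and the user counts (1, 12, and 300) are hypothetical values chosen only to mirror the client, network, and server cases described above.

```python
def user_outage_minutes(outage_minutes: float, users_affected: int) -> float:
    """Weight an outage by the number of users who could not work (assumed definition)."""
    return outage_minutes * users_affected


# Three one-minute outages in different parts of the application.
# The user counts are illustrative assumptions, not measured data.
outages = [
    ("client", 1, 1),             # one workstation, one user
    ("network segment", 1, 12),   # several users
    ("server", 1, 300),           # hundreds of users
]

weighted = {name: user_outage_minutes(minutes, users)
            for name, minutes, users in outages}
total = sum(weighted.values())

for name, value in weighted.items():
    print(f"{name:<16} {value:>5} user outage minutes")
print(f"{'total':<16} {total:>5} user outage minutes")
```

Under these assumed numbers, the one-minute server outage accounts for almost all of the weighted downtime, which is the point of weighting by users affected.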
classes and brings to your attention the need to consider outages in parts of the network in addition to the server.
Outage Classes
Outages fall into the following classes:
• Physical
• Design
• Operations
• Environment
• Reconfiguration
The first four classes listed above are recognized throughout the computer industry. Reconfiguration is an outage class added by HP; other computer vendors call this phenomenon scheduled downtime.
Design Outages
Design outages are usually caused by malfunctioning software, either system software or application software. Again, deterministic faults are rare throughout the industry. Transient problems are more common, as users of most personal computers will testify. Potential causes of a design outage on a NonStop system include a LAN broadcast storm or a degenerating response time.
Reconfiguration Outages
Reconfiguration outages are those times when the application is unavailable because of scheduled downtime.
What Is Application Availability? System-Level Components availability to ensure that its users can depend on its services. This concept is equally true for the application as it is for the hardware, system software, and middleware. Figure 1-1.
What Is Application Availability? Support For Dynamic Linked Libraries against either disk. For random reads, access is faster because the nearest head to the data is used. For sequential reads, access is faster because the heads alternate access. Because, in a typical system, read operations outnumber write operations by about 10 to 1, these features improve performance significantly while enhancing availability.
• A DLL can appear in different virtual addresses in different processes.
• Multiple processes can use different versions of the same DLL simultaneously.
• The same program can run simultaneously in different processes with different DLLs.
• A new version of a DLL can be introduced without altering a program or DLL that references it. In the case of an SRL the same entry points in the same order would be required.
What Is Application Availability? Transaction Support in the Server System the incompatibility of C and C++ heap operations with passive checkpoint. There is no way to cause the passive backup to capture heap allocations and deallocations, whether explicit via malloc() or new, or implicit in any number of functions. Because heap operations are often pervasive in C and C++ programs, the incompatibility is often exaggerated slightly to “C and C++ do not support passive checkpoint.
What Is Application Availability? • • Availability and Application Design The RSC/MP product each run partly on the workstation and partly on the server. They enable client applications written in a standard language to access transaction services directly. The Pathway Terminal Control Process (TCP) interprets traditional HP requester programs running on NonStop systems. These programs are written in COBOL. The Pathway/iTS product provides the TCP and the COBOL compiler.
What Is Application Availability? Design Program Modules for Availability Section 7, Availability Through Process-Pairs and Monitors, for a discussion of process pairs and checkpointing. Application designers can design applications that use process monitors and NonStop process pairs. However, for most needs, that part of the application design is transparently done for you by the transaction services of the NonStop Tuxedo or Pathway transaction-processing environments.
Applications That Run in the NonStop Tuxedo Open Environment
The NonStop Tuxedo product provides an open interface for developing or porting applications for execution on NonStop systems. The combination of this product and the NonStop architecture brings the NonStop fundamentals, including availability, to applications that use this standard interface.
What Is Application Availability? Design Program Modules for Availability RSC/MP The NonStop RSC/MP product provides for highly available applications on hardware from multiple vendors. While not strictly open, RSC/MP allows the client part of the application to be written in a standard language such as COBOL, C, or C++ while gaining access to HP availability through its transaction semantics.
Pathway server. The HP server and its software, however, are designed so that availability features are retained even in a mixed application environment.
Provide Instrumentation
The costs of application downtime were discussed earlier in this section. The burden of keeping those costs low is often carried by the operations staff.
What Is Application Availability? Provide Application Performance Data Refer to Section 8, Instrumenting an Application for Availability, for more information. Provide Application Performance Data In order to establish the costs of downtime or the costs of running a system whose performance is degraded, it is useful to maintain information from which a measure of application activity can be obtained.
What Is Application Availability? Think Ahead About Changing the Application Online Think Ahead About Changing the Application Online Upgrading a departmental application from one revision to the next is often a major reconfiguration effort. This effort can be even larger for an enterprise level application. By taking the right steps when you first design the application, however, you can minimize or even eliminate reduced availability or downtime for this task.
Analyzing Outages and Developing a Strategy
Clearly, the amount of effort you can put into making your application always available is considerable. In most cases, a step-by-step approach to availability is appropriate. Typically, as shown in Figure 1-2, a relatively small amount of effort can yield the most significant results early, when initially improving availability.
Collecting Outage Data
For the next step towards increasing the availability of your application, you need to establish the causes or potential causes of application outage. If you are designing a new application, you should include the ability to collect outage data in your design. You can analyze your data after some appropriate period of time.
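As a rough illustration of what analyzing collected outage data might look like, the sketch below totals outage minutes by outage class and ranks the classes by downtime. The record layout (class name, minutes), the function name, and the sample values are all hypothetical; the five class names are the outage classes listed earlier in this section.

```python
# The five outage classes named in this section.
OUTAGE_CLASSES = ("physical", "design", "operations", "environment", "reconfiguration")


def minutes_by_class(records):
    """Sum outage minutes per class from (outage_class, minutes) records."""
    totals = {name: 0 for name in OUTAGE_CLASSES}
    for outage_class, minutes in records:
        if outage_class not in totals:
            raise ValueError(f"unknown outage class: {outage_class}")
        totals[outage_class] += minutes
    return totals


# Hypothetical outage log collected over some period; not real data.
sample_records = [
    ("reconfiguration", 240),
    ("design", 35),
    ("operations", 20),
    ("reconfiguration", 180),
    ("physical", 5),
]

totals = minutes_by_class(sample_records)
# Rank classes so the largest source of downtime is listed first.
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
for name, minutes in ranked:
    print(f"{name:<16} {minutes:>5} minutes")
```

Ranking the totals in this way gives a simple starting point for deciding which cause of downtime to address first.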
Figure 1-3. Collecting Outage Data
[Figure: two panels, "Event Generation Without Application Monitoring" and "Event Generation Including the Application and the Client". Components shown include the client and server applications, system software, system processes, EMS, the network manager, $0, a filter, a collector, an event repository, manually added outage data, and a report.]
What Is Application Availability? HP Services Help You Manage Availability cost to your organization of 1200 minutes of downtime each year, you can establish a priority for fixing the problem. This systematic approach allows you to increase availability within your budget constraints.
What Is Application Availability? Availability and HP Products recommendations for improving system balancing, resource utilization, and application performance. Performance Review and Analysis This service offers a more comprehensive alternative to the System Performance Audit. We evaluate your NonStop environment's use of system resources such as cache, disk I/O balancing, CPU utilization, and memory pressure; recommend changes to be implemented by your team; and analyze the changes' impact.
• Table 1-3 on page 1-28 lists the HP products and facilities that help to keep open applications available. Section 5, Increasing the Availability of Tuxedo Applications, provides more information about these facilities.
• Table 1-4 on page 1-29 lists the transaction-processing facilities that help to keep applications that run in the Pathway transaction-processing environment available.
Table 1-1. Server System and Network-Level Products That Help Provide Availability (page 2 of 2)
The product...                 Helps provide availability because it...
HP NonStop operating system    Detects and compensates for problems in the system hardware, monitors processors, and provides the facilities to execute process pairs and process monitors.
Table 1-3. Products That Keep Tuxedo Applications Available
The product...              Helps provide availability because it...
NonStop TS/MP               Provides support for server processes in open applications, including server process monitoring and linkage with clients.
NonStop Tuxedo System /T    Is the server component of the NonStop Tuxedo transaction monitor and makes use of the NonStop TS/MP product to provide additional value through HP fundamentals.
Table 1-5. HP Communication Products That Help Provide Availability
The product...              Helps provide availability because it...
Expand                      Provides a multiline, redundant, reliable communications protocol among NonStop systems in a network. Refer to the Expand manuals for details.
Multilan (D-Series only)    Provides a reliable NETBIOS communications protocol between NonStop systems and other systems connected on a LAN.
Table 1-6. Language Compilers That Help Support Availability
The product...                   Helps provide availability because it...
C++                              Supports the concept of reusable code.
COBOL85 and native mode COBOL    Supports a limited set of checkpointing functions and provides transaction control functions. Refer to the COBOL85 Manual for details.
NonStop C                        Supports process pairs with active backup. Refer to the Guardian Programmer’s Guide for details.
Table 1-7. Products and Facilities Related to DSM That Help Provide Availability (page 2 of 2)
The product...    Helps provide availability because it...
EMS Analyzer      Provides an easy way of analyzing EMS messages. Refer to the EMS Manual for details.
EMS FastStart     Provides an easy way to generate and test event messages from the application. Refer to the EMS FastStart Manual for details.
Table 1-8. Products for Client Workstations That Help Provide Availability
The product...      Helps provide availability because it...
NonStop Software    Allows distribution of applications across a three-tier client/server architecture, in which at least one server tier runs Microsoft® Windows NT® Server. Refer to the NonStop Software manuals for details.
2 Overview of Server and Network Fault Tolerance This section provides an overview of the fault-tolerant features of the hardware and system software environment in which the user's continuously available application executes. After introducing the concept of fault tolerance, it talks about fault tolerance on the server, the network, and in the client system. For technical details about a specific HP server, refer to the corresponding server description manual.
This is true only of a TNS/R or TNS/E duplex system. If you are using a triplex system based on the new NonStop advanced architecture of the HP Integrity NonStop NS-series servers, you have two fault-tolerant levels of protection. When a similar failure occurs on equipment from most other vendors, an outage typically occurs. The failed component must be repaired to get the application back online.
Overview of Server and Network Fault Tolerance Fault Tolerance in the Server System For example, if a disk drive has an MTBF of 1 million hours (114 years), you would still endure, on average, 8.8 failures each year in a device population of 1000 disks. An MTBF of 1 million hours does not mean that a particular disk drive will not fail for 1 million hours. MTBF figures indicate the reliability performance of a device population during the useful life of the device.
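The arithmetic behind these figures can be checked with a short sketch. It assumes a constant failure rate during the useful life of the device and a year of 8,760 hours; the function name is hypothetical.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def expected_failures_per_year(mtbf_hours: float, population: int) -> float:
    """Average failures per year in a population, assuming a constant failure rate."""
    return population * HOURS_PER_YEAR / mtbf_hours


# Figures from the text: an MTBF of 1 million hours and a population of 1000 disks.
failures = expected_failures_per_year(1_000_000, 1000)
mtbf_years = 1_000_000 / HOURS_PER_YEAR

print(f"MTBF in years: {mtbf_years:.0f}")            # about 114
print(f"Expected failures per year: {failures:.1f}")  # about 8.8
```

The same function shows how quickly the expected number of failures grows with the size of the device population, which is why a long MTBF alone does not remove the need for fault tolerance.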
Parallel Hardware Components
HP’s parallel architecture provides fault tolerance while remaining cost-effective. Unlike other fault-tolerant schemes, the HP approach is not based on lock-stepped components or idle “hot” standby components that are used only for the duration of a failure. The HP approach makes use of all available modules, thereby maximizing cost-effectiveness.
Figure 2-2. NonStop Hardware Architecture, S-Series Server
[Figure: Processors 0 through 3 connected by dual ServerNet fabrics to ServerNet adapters (one labeled MIOE) and mirrored disks.]
Interprocessor Communications
Each S-series and NS-series processor is connected to all other processors by a pair of high-speed ServerNet fabrics.
Figure 2-3. NonStop Hardware Architecture, K-Series Server
[Figure: Processors 0 through 2, each with an I/O process, connected by dual interprocessor buses and by multiple I/O channels to disk controllers, mirrored disks, and a multifunction controller.]
Normally, both fabrics or both buses are used for communication between processors.
Overview of Server and Network Fault Tolerance Parallel Hardware Components The number of channels per K-series processor depends on which HP NonStop system you are using. Refer to the appropriate server description manual for details on the model of HP NonStop system that you have. Fault-tolerant management of devices on an S-series or NS-series server makes use of the inherent parallelism of the dual ServerNet fabrics to provide two data paths between a processor and an I/O device.
Overview of Server and Network Fault Tolerance Fault Isolation interruption. If a disk is repaired, its contents are updated once it is reintegrated into the server system. The update takes place concurrent with other system and application activity so that there is no interruption of the application. Failure of a mirrored disk is an exceptionally rare occurrence.
Overview of Server and Network Fault Tolerance Fault Isolation periodically sending out “I’m alive” messages to the other processors in addition to sending the same message to itself. Each processor periodically checks for “I’m alive” messages from all the other processors.
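The general idea of checking for “I’m alive” messages can be sketched as follows. This is a simplified illustration of heartbeat-based failure detection, not the NonStop implementation: the class and method names, the timeout value, and the use of explicit timestamps are all assumptions, and the sketch only reports which processors have fallen silent without modeling what happens next.

```python
class HeartbeatMonitor:
    """Tracks the time of the last "I'm alive" message seen from each processor."""

    def __init__(self, processors, timeout):
        self.timeout = timeout                      # hypothetical limit, in seconds
        self.last_seen = {cpu: 0.0 for cpu in processors}

    def record_alive(self, cpu, now):
        """Note that an "I'm alive" message arrived from cpu at time now."""
        self.last_seen[cpu] = now

    def silent_processors(self, now):
        """Return processors whose last message is older than the timeout."""
        return sorted(cpu for cpu, seen in self.last_seen.items()
                      if now - seen > self.timeout)


# Processor 0's view of a four-processor system, including itself.
monitor = HeartbeatMonitor(processors=[0, 1, 2, 3], timeout=2.0)
for cpu in (0, 1, 2, 3):
    monitor.record_alive(cpu, now=1.0)
for cpu in (0, 1, 3):            # processor 2 stops sending
    monitor.record_alive(cpu, now=3.0)

print(monitor.silent_processors(now=4.0))  # prints [2]
```

Passing the current time in explicitly keeps the sketch deterministic; a real monitor would read a clock and run the check on a timer.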
Overview of Server and Network Fault Tolerance Extensive Hardware Error Checking Hardware modules are also independent of each other and do not share critical states with other components. Processors do not share memory with each other. Critical components have backup power supplies and fault-tolerant cooling.
• Disk subsystem
• Power supplies and fans
Checking the ServerNet Fabric
Each ServerNet fabric comprises a set of data routers. Each router has input and output connections to other routers, to a processor, or to a ServerNet addressable controller. Each router contains a self-checking application-specific integrated circuit. An address validation table (AVT) ensures that data is sent to the correct destination.
Overview of Server and Network Fault Tolerance System Process Pairs Data-Control Logic Checking of the data-control logic also involves parity and other checks. Processors The NonStop range of servers compares output from lock-stepped processors. Two identical processors execute the same code at the same time. Special logic verifies that the output of one chip is always the same as the output from the other chip.
Overview of Server and Network Fault Tolerance System Process Pairs are all different in the backup—making it much less likely that the combination of circumstances that caused the problem in the primary will exist in the backup. Not All System Processes Are Process Pairs Not all system processes need a backup running in another processor. Some, such as the memory manager, are concerned only with managing resources on their own processor.
Figure 2-5. A Disk Process: An Example of an I/O Process Pair
[Figure: Processors 0 through 2 with device tables, an application process, the file system, and primary and backup disk processes, connected by dual ServerNet fabrics to ServerNet adapters and mirrored disks.]
Checkpointing Data in the Disk Process
The need to checkpoint data depends on whether the operation is retryable.
Overview of Server and Network Fault Tolerance System Process Pairs Synchronization The disk process and the application each maintain a synchronization block containing a synchronization identifier. When the system is operating normally, these synchronization identifiers are routinely kept synchronized. In the event of a takeover by the backup process, however, they provide a way to determine whether a write operation finished, and therefore whether there is a need to retry the operation.
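The role of a synchronization identifier can be illustrated with a general duplicate-suppression sketch. This is not the actual disk process logic: the class name, the checkpoint method, and the use of a simple increasing integer as the identifier are assumptions made for illustration. The operation shown, adding an amount to a balance, is deliberately one that must not be applied twice.

```python
class DiskProcessSketch:
    """Hypothetical stand-in for a disk process that remembers the last completed write."""

    def __init__(self):
        self.balance = 0
        self.last_sync_id = 0     # identifier of the last write that finished

    def checkpoint_to(self, backup):
        """Copy the state the backup needs in order to take over."""
        backup.balance = self.balance
        backup.last_sync_id = self.last_sync_id

    def add(self, sync_id, amount):
        """Apply a non-retryable update unless this sync_id already finished."""
        if sync_id <= self.last_sync_id:
            return "duplicate"    # the write completed earlier; do not repeat it
        self.balance += amount
        self.last_sync_id = sync_id
        return "applied"


primary = DiskProcessSketch()
backup = DiskProcessSketch()

first = primary.add(sync_id=1, amount=100)
primary.checkpoint_to(backup)

# Assume the primary fails before its reply reaches the application.
# The application retries the same request against the backup.
retry = backup.add(sync_id=1, amount=100)
second = backup.add(sync_id=2, amount=50)

print(first, retry, second, backup.balance)  # prints "applied duplicate applied 150"
```

Because the backup can compare the identifier on the retried request with the one it holds, it can tell that the first write finished and avoid applying it a second time.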
Instrumentation of System Components
Instrumentation is another important technique for keeping system-level components fully functional. Hardware and software error detection was discussed earlier in this section, including the need to be able to isolate errant modules.
• Data structure inconsistency
• Invalid input values (for example, invalid array indexes)
• Noninitialized or null pointers
• Procedure call error returns
TFDS isolates software problems and provides automatic failure data collection, diagnosis, and recovery services.
Overview of Server and Network Fault Tolerance Additional Availability Problems in Client/Server Networks application, it is necessary that all components connecting the user to the server are available. Measuring Downtime of a Client/Server Application Client/server designs also further complicate the way downtime must be measured. A transient system error in a workstation is clearly a problem to the user of the workstation; the application is unavailable to that user, but other users are not affected.
• Duplicate network addresses, such as duplicate IP addresses, are on the same network.
• A virus is spreading across network nodes.
• A name/security server is hung, or the database is incorrect.
• Operations stopped the application.
• A network-wide router reset has occurred.
Many of these problems can be avoided by using the right hardware and application design. All can have their effects minimized.
Figure 2-6. S-Series Client/Server Architecture With a Continuously Available Server
[Figure: an HP server system with Processors 0 and 1, each with an E4SA or GESA, a dedicated LAN, a SWAN converter to a WAN, and a public LAN with a network hub connecting to the LAN card of a client processor.]
Figure 2-7. K-Series Client/Server Architecture With a Continuously Available Server
[Figure: an HP server system with Processors 0 and 1 and a LAN controller, connected through a network hub to the LAN card of a client processor.]
Use a Fault-Tolerant LAN
Redundant components provide the primary means for building a fault-tolerant LAN. Several vendors offer products that can supply varying degrees of fault tolerance.
Figure 2-8. An S-Series Fault-Tolerant LAN
[Figure: an HP server system with Processors 0 and 1 and an E4SA or GESA, connected by a primary LAN and a backup LAN through two network hubs to the adapter card of a client processor.]
Single-point failure of any of these components is tolerated. Rapid fault detection and reporting ensures that the repair can take place as quickly as possible, thereby minimizing the repair window during which the LAN is not fault-tolerant.
tolerance is built into HP network service products at all levels of the OSI layer model, as shown in the next subsection. Section 6, Availability in the Pathway Transaction-Processing Environment, discusses techniques for replicating server processes. Section 7, Availability Through Process-Pairs and Monitors, discusses techniques for creating persistent and NonStop processes.
Overview of Server and Network Fault Tolerance The HP Integrity NonStop NS-Series Server Provides Another Level of Availability Figure 2-9.
Processes called synchronization and rendezvous at the LSUs perform two main functions:
• To keep the individual processor elements (PEs) in a logical processor in loose lock-step through a technique called rendezvous. Rendezvous occurs to:
  ° Periodically synchronize the PEs so they execute the same instruction at the same time.
ServerNet Clustering for Availability and Performance
Dual ServerNet fabrics provide a fast, efficient, and reliable way for the processors to exchange messages. ServerNet technology can also be used to connect servers in groups called ServerNet clusters. ServerNet clusters extend the ServerNet X and Y fabrics outside the system boundary and allow ServerNet to be used for messaging between systems.
ServerNet clustering provides several benefits:
• Performance. For interprocessor communication, ServerNet clusters take advantage of the NonStop message system for low message latencies, low message processor costs, and high message throughput. The same message system is used for interprocessor communication within a node and between cluster nodes.
3 Improving Availability on the Internet
This section contains an overview of open standards that can be used with existing HP application software services to give applications running across the Internet or private networks the HP NonStop server attributes of availability, scalability, and manageability.
Improving Availability on the Internet Standards and the Operating System Figure 3-1, Standards Supported and Enhanced by NonStop Products, on page 3-2, shows some of the current standards and the corresponding NonStop products that transform them into a dynamic computing environment. Figure 3-1.
Standards and Languages
on. OSS supports more than 90 percent of UNIX 98 commands and application program interfaces (APIs), the UNIX Korn command shell, and standard UNIX facilities: shared memory, semaphores, pipes, signals, sockets, message queues, and dynamic link libraries (DLLs). Also provided are the POSIX command shell, utilities, and threading library, as well as X/Open internationalization and localization APIs and utilities.
Standards and the Database
The ETK allows you to edit, compile, build, and deploy applications written in different native programming languages through flexible GUI tools. Key features of the Enterprise Toolkit include:
• Tight integration with the Visual Studio .NET environment
• NonStop server-specific Visual Studio .
Improving Availability on the Internet Standards and the Database NonStop SQL/MP Database was originally introduced in 1987; NonStop SQL/MX Database, introduced in 2001, supports the Core SQL:2003 standard.
Standards and the Network
NonStop systems offer a range of adapters that support many network standards. In addition to Ethernet adapters that operate at speeds up to 1 Gb per second, HP continues to support adapters for asynchronous, bisynchronous, and synchronous data link control (SDLC) communications, as well as several less common protocols and speeds.
Improving Availability on the Internet Standards and Application Integration NLP provides the flexibility for enterprises to create geographic business regions, business division regions, function-oriented regions, or customer-oriented regions.
Figure 3-2. Web Services and NonStop Servers
(Diagram labels: Windows; UNIX; NonStop Server; Java Application; .NET; Tuxedo; CORBA; XML; SOAP; HTTP; NonStop SOAP; Pathway Application; SOAP Client.)
For more information about Web services, see Standards and Web Services on page 3-9. Enterprise Application Integration is important because:
• Integration is one of the top customer challenges, consuming up to 40 percent of total IT budgets.
Improving Availability on the Internet Standards and Web Services to incorporate visual design and runtime management of complex business and integration processes that may be automatic or require human intervention. Business Process Management (BPM) enables the centralization of business processes on a NonStop Server with that server acting as a hub for all the information.
Improving Availability on the Internet The Pathway Application Server All of the core standards for Web services are approved and well known. There are more Web services standards in the works for things like security and complex transactions. All major IT vendors offer support for the core standards, so it is common for Web services to be implemented through a hybrid combination of platforms and vendors. Both .NET and Java have significant product functionality in support of Web services.
Web Server support
method for communications is via a SOAP message or document. To facilitate that communication, the SOAP server can generate a Web Services Description Language (WSDL) module that is passed to the other application. This WSDL tells the other application how to send messages to and receive messages from the Pathway application.
The iTP Secure WebServer
improve overall performance. In addition to file opens, which are already cached, the file information and the actual file content can also be cached.
• Encryption and authentication flexibility
The iTP Secure WebServer supports the use of the HTTP, SSL, PCT, and hardware-based cryptography provided by WebSafe2 units. Secure HTTP supports the simultaneous use of both the SSL and HTTP protocols.
The iTP WebServer Architecture
• Resource Locator Service (RLS)
This service lets you define multiple web servers to be used interchangeably for access to the same URLs. The requester need not know which server handled a request.
For a complete list of the standards supported by the iTP WebServer, consult the iTP Secure WebServer System Administrator’s Guide. That guide contains all the information required to install and run the iTP WebServer.
Improving Availability on the Internet The iTP WebServer Architecture Figure 3-3.
Conventional TCP/IP
In essence, conventional TCP/IP has one listening process on each port. The conventional TCP/IP connections are managed by the Distributor process. The Distributor receives all incoming requests for new connections from the TCP/IP processes and previously distributed them to the iTP Secure WebServer using the NonStop TS/MP Pathsend facility. Beginning with iTP WebServer 4.
automatically in response to changes in workload. NonStop TS/MP can also restart a server process that fails. (The iTP Secure WebServer uses the default value of the PATHCOM AUTORESTART parameter.)
PATHMON Process
The PATHMON process provides centralized monitoring and control of a PATHMON environment consisting of server classes and other types of objects.
Servlet Server Class (SSC)
The Servlet Server Class (SSC), also known as the Web Container, lets you write CGI applications as Java servlets. The servlets execute in SSC processes, which are scalable and persistent because they run under NonStop TS/MP.
Resource Locator Service (RLS)
RLS allows the implementation of replicated web servers, to be used interchangeably and transparently for access to the same content and services.
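The idea of replicated web servers that are used interchangeably can be illustrated with a small selection routine. The following is a minimal Python sketch under stated assumptions, not the RLS implementation: the class name, the round-robin policy, and the mark_down and mark_up methods are hypothetical and are not taken from the iTP WebServer documentation.

```python
class ReplicaSelector:
    """Hypothetical selector that treats replicated web servers as interchangeable."""

    def __init__(self, servers):
        self.servers = list(servers)        # configured replicas, in order
        self.available = set(self.servers)  # replicas currently believed to be up
        self._next = 0                      # round-robin position

    def mark_down(self, server):
        self.available.discard(server)

    def mark_up(self, server):
        if server in self.servers:
            self.available.add(server)

    def pick(self):
        # Try each configured replica at most once, skipping those marked down.
        for _ in range(len(self.servers)):
            candidate = self.servers[self._next % len(self.servers)]
            self._next += 1
            if candidate in self.available:
                return candidate
        raise RuntimeError("no replica available")


selector = ReplicaSelector(["web-a", "web-b", "web-c"])
first = selector.pick()
second = selector.pick()
selector.mark_down("web-c")   # simulate the loss of one replica
third = selector.pick()       # the request is served by a surviving replica
```

The requester only calls pick and does not need to know which replica handled the request, which is the property the text describes.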
The WebLogic Server
The WebLogic Server is a standards-based J2EE application server that provides a foundation for building applications. It must be purchased from BEA Inc. to run on NonStop servers.
Specific Enhancements
balancing and configuration options. The WS Plug-in also supports web applications that utilize session-based dialogs.
• Avitek Medical Records (or MedRec) sample application suite that works with the WebLogic Server.
• Documentation for supporting various software pieces that form the NonStop Server Toolkit.
Note. Applications using WebLogic Server must use SQL/MX to access NonStop SQL tables.
4 Data Protection and Recovery The database-related products introduced in this section provide protection of your data not only if a single component fails, but also in the unlikely event of multiple component failure, system failure, or even catastrophic site failure. You can also protect your data against possible corruption due to concurrent access by simultaneous online transactions or by contention between online transactions and batch-mode operations.
Data Protection and Recovery What Is a Transaction? having to deal with complex checkpointing operations. In addition, the product itself is designed to be highly available by allowing online reconfiguration and tolerance of error conditions. TMF sustains high performance in transaction-processing applications. To support transaction-processing applications, TMF can manage thousands of complex transactions sent by hundreds of users to a common database through multiple interfaces.
Data Protection and Recovery What Is a Transaction? Each transaction is not affected by other transactions that are executing concurrently. Transactions can take place in any order and the correctness of the database transformation is not affected. • It is durable. Once a transaction successfully finishes (it commits), its changes are permanent and will survive even if a failure occurs. Each of these topics is discussed in the following paragraphs.
Data Protection and Recovery Transaction Coordination Transactions Are Isolated To process the database correctly, the application must be able to assume that its input from the database is consistent, regardless of any concurrent changes being made to the database. The following example shows the need to control concurrent access so that each transaction effectively executes in isolation.
Data Protection and Recovery Transaction Coordination protected queue. These server processes might be on one network node or they might be scattered on many nodes throughout the network. Figure 4-1.
Database Recovery
server processes to manipulate the database or place a request on a transaction-protected queue, the system assigns the same transaction identifier to each server process. For multithreaded requesters and servers, each thread that initiates or participates in a transaction will have its own unique transaction identifier. For transactions that successfully finish, the end point involves committing all updated records to disk.
Audit Trails
• File recovery
Figure 4-2 summarizes these protection mechanisms with respect to the transaction paradigm. It shows which mechanisms protect the data associated with one transaction, depending on whether the transaction has not yet begun, has started but not yet finished, or has already committed. An overview of these mechanisms follows. For more details, refer to the Introduction to TMF.
Figure 4-2.
Data Protection and Recovery Transaction Backout The audit trail is kept on a separate disk or mirrored disk volume to ensure that the same failure does not destroy the database file or table and the active audit volume. The audit trail can span multiple volumes, if so configured, and thus reduce the chance of the application becoming unavailable if the audit trail fills up.
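The relationship between an audit trail and transaction backout can be illustrated with a toy model. The following is a minimal Python sketch, assuming that each update logs a before-image that is later used to undo the change; the class, its methods, and the record layout are hypothetical and do not describe the TMF audit-trail format or interface.

```python
_MISSING = object()  # marks "the key did not exist before this update"


class AuditedStore:
    """Toy store that records before-images so a transaction can be backed out."""

    def __init__(self):
        self.data = {}
        self.audit_trail = []  # entries: (transaction_id, key, before_image)

    def write(self, transaction_id, key, value):
        # Log the before-image first, then apply the update.
        before = self.data.get(key, _MISSING)
        self.audit_trail.append((transaction_id, key, before))
        self.data[key] = value

    def commit(self, transaction_id):
        # In this toy model a committed transaction can no longer be backed out.
        self.audit_trail = [r for r in self.audit_trail if r[0] != transaction_id]

    def backout(self, transaction_id):
        # Undo the transaction's updates in reverse order using the before-images.
        for tid, key, before in reversed(self.audit_trail):
            if tid != transaction_id:
                continue
            if before is _MISSING:
                self.data.pop(key, None)
            else:
                self.data[key] = before
        self.audit_trail = [r for r in self.audit_trail if r[0] != transaction_id]


store = AuditedStore()
store.write("t1", "balance", 100)
store.commit("t1")
store.write("t2", "balance", 50)   # uncommitted update
store.write("t2", "fee", 7)        # uncommitted insert
store.backout("t2")                # store.data is {"balance": 100} again
```

Keeping the log separate from the data it describes mirrors the point made above: a single failure should not destroy both the database and the information needed to recover it.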
Online Dumps and Audit Dumps
A combination of online dumps and audit dumps makes file recovery possible in the event of:
• Corruption or loss of an audited data file due to an application error, operations error, or a hardware failure while running a non-fault-tolerant configuration
• Environmental catastrophe
• Corruption of an audit-trail volume
An online dump provides a complete, historical view of a data file while continuing to allow updates
Data Protection and Recovery Interfaces From a NonStop Tuxedo Client Process transaction. This design empowers developers to protect all critical data while minimizing the scope of the transaction. Most supported languages provide an interface to these features. Not only are these features available to programmers writing code to use HP interfaces on an HP NonStop system, but they can also be accessed indirectly from client processes.
RSC/MP Interface
Table 4-2. POET Gateway Interface to Transaction Control Facilities

Function...                In the Microsoft Windows 95 or Windows NT environment is provided by the statement...
To start a transaction     PoetBeginTransaction or PoetWTBeginTransaction
To commit a transaction    PoetEndTransaction or PoetWTEndTransaction
To abort a transaction     PoetAbortTransaction or PoetWTAbortTransaction

The POET API supports both waited and nowaited client interfaces to the HP host.
Interface on an HP NonStop System
For requester, server, or monolithic programs (programs that combine requester and server features) that execute on an HP NonStop system, you can access the transaction control features of TMF through the TMF program interface from the following supported languages: COBOL85, COBOL, NonStop SQL/MP, C, TAL, and Pascal.
Using the Remote Duplicate Database Facility (RDF) Products
HP provides for site-level or environmental fault tolerance with the NonStop RDF (K-series only), NonStop RDF/MP, and NonStop RDF/MPX products.
Data Protection and Recovery Remote Duplicate Transactions time needed for the backup site to take over, and the time needed to switch back again to the primary site once the primary site is restored. The time required for the backup system to take over depends on the application but is usually several minutes. Upon failure at the primary site, RDF processes at the backup site must finish updating the database with the received image records.
Data Protection and Recovery Remote Duplicate Transactions To guard against failure of the switching system, an alternate switching system is available and associated with the backup site. The alternate switching system is able to take over at any time because of the status information it receives from the primary switching system with each transaction.
Figure 4-4. Remote, Duplicate Transactions Provide Protection Against Site Failure
(Diagram labels: New York and Chicago sites; $AUDIT; Server; $DBNY; Switch. Transaction steps: Read Inbound Packet; Begin Transaction; Write to Database in New York; Write to Database in Chicago; Send Status to Chicago Switch; End Transaction; Write Outbound Packet. Legend: Normal Information Flow; Network Flow When Primary Switch Down.)
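The sequence of steps labeled in Figure 4-4 can be sketched in code. The following is a minimal Python illustration under stated assumptions: the function name, the packet layout, and the use of in-memory dictionaries as the two databases are hypothetical, and staging the changes before applying them stands in for the transaction protection that TMF provides in a real application.

```python
def process_packet(packet, db_new_york, db_chicago, chicago_switch_log):
    """Apply one inbound packet to both databases as a single unit of work."""
    account, amount = packet  # read inbound packet

    # Begin transaction: work on staged copies so nothing is visible until the end.
    staged_new_york = dict(db_new_york)
    staged_chicago = dict(db_chicago)

    # Write to the database in New York, then to the database in Chicago.
    staged_new_york[account] = staged_new_york.get(account, 0) + amount
    staged_chicago[account] = staged_chicago.get(account, 0) + amount
    if staged_new_york[account] < 0:
        # Abort: neither database has been changed.
        raise ValueError("insufficient funds; transaction aborted")

    # Status record that lets the alternate switch take over if needed.
    status = ("applied", account, amount)

    # End transaction: make every change visible together.
    db_new_york.update(staged_new_york)
    db_chicago.update(staged_chicago)
    chicago_switch_log.append(status)
    return status  # basis for the outbound packet


new_york = {"ACCT-1": 100}
chicago = {"ACCT-1": 100}
switch_log = []
result = process_packet(("ACCT-1", -30), new_york, chicago, switch_log)
```

Because the status record is written as part of the same unit of work, the alternate switch never sees a status for an update that did not reach both databases.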
Data Protection and Recovery Remote Duplicate Transactions When Transaction Logic Executes in the Server Figure 4-5 shows a similar scheme with the transaction logic running in the server. New York has the primary system. Again, the backup system is located in Chicago. Figure 4-5.
Data Protection and Recovery Database Management and Availability to two locations. The application is more complex to develop and throughput might be reduced while waiting for ready-to-commit messages from remote sites. Database Management and Availability This subsection introduces the features of NonStop SQL/MP that eliminate or reduce downtime of the database or of the SQL programs that access the database.
Queue Files
Tolerance, and by remote duplicate databases as described under Duplicate Databases on page 4-12. To reduce the need for planned outages, the NonStop SQL/MP product has been specifically designed to protect your data while allowing significant reconfiguration and other operations to be done without bringing the application down.
Using Queue Files
• The term trickle catchup describes a technique that can be used in applications where the order of requests must be maintained.
Queue files are a generic concept. For details on how to write programs using a specific implementation, refer to the Queue Manager Manual.
Data Protection and Recovery Transaction Playback You can use queue files to help perform transaction playback to keep such an application available. The simplified application shown in Figure 4-6 on page 4-21 shows this concept. Here, the point-of-sale devices use the application to check credit card references and debit the account.
Normal Operation
Normally, the application proceeds as follows:
1. The merchant enters the credit card information at the point-of-sale device.
2. The client process formulates a credit check request from the data entered by the merchant and sends it to the server.
3.
Trickle Catchup
When the order in which requests are executed is important, you can use a similar technique sometimes known as “trickle catchup.” As with the transaction playback example, this approach can be successful in applications that can get by for a short period without immediate response to the user. Figure 4-7 on page 4-23 shows the design. During normal operation, the client process sends requests to the server process for processing.
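The ordering requirement behind trickle catchup can be illustrated with an in-memory queue. The following is a minimal Python sketch, assuming that requests are held in arrival order while the server is unavailable and are then drained oldest first; the class and method names are hypothetical, and a real application would use a queue file rather than a deque held in memory.

```python
from collections import deque


class TrickleCatchupClient:
    """Hypothetical client that preserves request order across a server outage."""

    def __init__(self, server):
        self.server = server    # callable; raises ConnectionError when unavailable
        self.pending = deque()  # stands in for a queue file

    def submit(self, request):
        # Always enqueue first so that new requests never overtake older ones.
        self.pending.append(request)
        return self.drain()

    def drain(self):
        # Send queued requests oldest first; stop at the first failure.
        while self.pending:
            try:
                self.server(self.pending[0])
            except ConnectionError:
                return False
            self.pending.popleft()
        return True


processed = []
server_up = {"value": False}


def demo_server(request):
    if not server_up["value"]:
        raise ConnectionError("server unavailable")
    processed.append(request)


client = TrickleCatchupClient(demo_server)
client.submit("request-1")   # queued; the server is down
client.submit("request-2")   # queued behind request-1
server_up["value"] = True
client.submit("request-3")   # drains request-1, request-2, then request-3
```

A request is removed from the queue only after the server has accepted it, so an outage in the middle of the catchup leaves the remaining requests in their original order.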
Eliminating Batch Windows
Traditionally, batch-mode applications capture transactions during hours of business and update the database of record during the night. Each day starts with an up-to-date database from which accurate historical data can be gleaned, statements can be printed, and so on.
The Problem
As discussed in Section 1, What Is Application Availability?, businesses are staying open longer. Consider a bank.
Solution Using NonStop Operating System in Support of NetBatch-Plus Software
• Information cataloging
• Decision support queries against production data
Read/write tasks include:
• Inserting a large number of new rows into the database; for example, when loading or merging data
• Updating a large number of existing rows in the database
• Populating new columns in the database
• Computing derived data; for example, in support of decision support or summary reporting
• Delet
Data Protection and Recovery Solution Using Low-Priority Transaction Processing to Perform the Batch Function Design Considerations in Support of Batch Operations You should consider in advance the needs of the batch operations on your database. Consider what types of summary reports, data transformations, maintenance updates, and so on will likely be required.
Figure 4-9. Concurrent Batch and Online Activity
(Diagram labels: Batch-type Transactions; Online Transaction; Utility; Pending Transactions; Requester; Requesters; High-Priority OLTP Server Class; Batch Server Class Running in the Background; Server; Database of Record.)
Data Protection and Recovery Solution Using Low-Priority Transaction Processing to Perform the Batch Function online function requiring no operator and are able to execute concurrently with online processing. What Are the Design Implications for the Transactional Model for Batch Programs? Consider the size of the unit of batch recovery. The simplest approach is to have a single transaction that processes the entire batch job.
Data Protection and Recovery Solution Using a Database Snapshot to run the transaction again if a failure causes the transaction to abort. If your batch application runs as a process pair, then you must checkpoint this information to the backup. Refer to Section 7, Availability Through Process-Pairs and Monitors, for information on checkpointing.
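The tradeoff in choosing the unit of batch recovery can be illustrated by splitting a batch job into small transactions and recording a restart point after each one. The following is a minimal Python sketch under stated assumptions: the function name, the chunking scheme, and the dictionary that holds the restart point are hypothetical; in a process-pair design, the restart point is the information that would be checkpointed to the backup.

```python
def run_batch(records, transform, chunk_size, restart_state, committed):
    """Process records in small transactions, resuming from the saved restart point."""
    while restart_state["next"] < len(records):
        start = restart_state["next"]
        chunk = records[start:start + chunk_size]
        # A failure inside this step aborts only the current chunk:
        # nothing from the chunk has been committed yet.
        staged = [transform(record) for record in chunk]
        # Commit the chunk and advance the restart point together.
        committed.extend(staged)
        restart_state["next"] = start + len(chunk)


fail_once = {"armed": True}


def flaky_transform(record):
    # Simulate a single failure while processing record 5.
    if record == 5 and fail_once["armed"]:
        fail_once["armed"] = False
        raise RuntimeError("simulated failure")
    return record * 10


records = list(range(7))
state = {"next": 0}
output = []
try:
    run_batch(records, flaky_transform, 2, state, output)
except RuntimeError:
    pass                     # the chunk holding records 4 and 5 was aborted
run_batch(records, flaky_transform, 2, state, output)   # resumes at record 4
```

Smaller chunks reduce the work that must be repeated after a failure, at the cost of more commit operations; a single transaction for the whole job is simplest but repeats everything.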
Data Protection and Recovery Solution Using a Database Snapshot application. Using RDF, however, you have a tradeoff between having a completely accurate snapshot and taking the snapshot instantaneously. If you want the snapshot to be completely accurate, with no partially committed transactions, you can do so by tolerating a brief outage on the primary system while the databases are synchronized.
within the lag time period. For example, if the lag time is 7 minutes and the snapshot is taken at 2:07 a.m., the snapshot accurately reflects the database as of 2:00 a.m.
Performing Remote Duplicate Transactions
By writing your own application to perform remote duplicate transactions, you can obtain an accurate snapshot of the database without any interruption to online operations. Writing such an application, however, is not an easy task.
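The lag-time example given earlier in this subsection (a 7-minute lag and a snapshot taken at 2:07 a.m.) is simple time arithmetic. The following minimal Python sketch shows the calculation; the function name and the calendar date are hypothetical and are used only for illustration.

```python
from datetime import datetime, timedelta


def snapshot_consistent_as_of(snapshot_taken_at, lag_time):
    """Return the point in time that the snapshot accurately reflects."""
    return snapshot_taken_at - lag_time


taken_at = datetime(2005, 5, 1, 2, 7)  # hypothetical date, 2:07 a.m.
as_of = snapshot_consistent_as_of(taken_at, timedelta(minutes=7))
```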
5 Increasing the Availability of Tuxedo Applications
This section provides an overview of the availability features of HP’s open transaction-processing environment. It discusses applications that run in the NonStop Tuxedo transaction-processing environment. This environment offers the HP fundamentals of parallelism, scalability, availability, and manageability through a standard interface.
Availability Concepts Used in Open Applications
Figure 5-1. Open Transaction Monitor
(Diagram labels: Tuxedo Client; Front-End Application; Client Component of Transaction Monitor; Tuxedo/WS; HP NonStop Tuxedo Transaction Monitor (System/T); Server Components of Transaction Monitor; Server Class; NonStop Tuxedo Servers; NonStop TS/MP; OSS; Guardian; System API; HP NonStop OS; TNS/R or TNS/E Hardware.)
High-Availability NonStop Tuxedo Applications
• Process pairs with initialized persistence
A process runs with a backup process that takes over if the primary process stops. The backup process is preinitialized by a single checkpoint operation that occurs when the primary process finishes its initialization phase.
• Process pairs with continual checkpointing
A process runs with a backup process that takes over if the primary process stops.
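The idea of a process pair with continual checkpointing can be illustrated with two in-memory objects. The following is a minimal Python sketch, assuming that the primary copies its state to the backup after every request; the class names and methods are hypothetical and do not represent the NonStop process-pair or checkpointing interfaces.

```python
class Worker:
    """Hypothetical process whose state is a running total."""

    def __init__(self, total=0):
        self.total = total


class ProcessPair:
    """Toy model of a primary process with a continually checkpointed backup."""

    def __init__(self):
        self.primary = Worker()
        self.backup = Worker()

    def handle(self, amount):
        self.primary.total += amount
        # Continual checkpointing: copy the state to the backup after each request.
        self.backup.total = self.primary.total
        return self.primary.total

    def fail_primary(self):
        # The backup takes over with the last checkpointed state,
        # and a new backup is created and initialized from it.
        self.primary = self.backup
        self.backup = Worker(self.primary.total)


pair = ProcessPair()
pair.handle(5)
pair.handle(7)
pair.fail_primary()           # simulate the loss of the primary process
total_after_takeover = pair.handle(1)
```

With initialized persistence, by contrast, only the state at the end of initialization would be copied, so work done after that point is not carried over to the backup.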
Increasing the Availability of Tuxedo Applications How Does NonStop Tuxedo Work? program that uses the Application Transaction Monitor Interface (ATMI) or TX standard interface can do this. This subsection introduces the NonStop Tuxedo product and its availability features. First, it establishes how the components that make up this product normally function and includes a skeletal work session.
Increasing the Availability of Tuxedo Applications How Does NonStop Tuxedo Work? capable of handling several clients. It converts ATMI or TX calls made by the client process into calls understood by the NonStop TS/MP and TMF HP core services. Bulletin board support includes: • A bulletin board (BB) that contains configuration and status information about an application. This set of data structures contains information about servers and services and about the clients that connect to the application.
Increasing the Availability of Tuxedo Applications Availability of NonStop Tuxedo Applications process to make sure that enough servers are available to handle the workload, control of the distributor process pairs that provide load balancing among the servers, and server process restart capabilities. • A link manager process that uses information in the bulletin board to provide linkage between a WSH or a native client process and a server.
Increasing the Availability of Tuxedo Applications Availability of NonStop Tuxedo Applications by the application in the client. The WSL and DBBL processes run with immediate persistence, while other processes—such as the WSL process—are restarted as needed. As shown in Figure 5-2 on page 5-5, several processes are involved in executing a NonStop Tuxedo application.
Increasing the Availability of Tuxedo Applications Availability of NonStop Tuxedo Applications addition, active TMF transactions started elsewhere but encompassing the failing server process are also aborted. To recover the application to the point of failure, it is up to the client process to restart the aborted transaction. When the restarted transaction requests a service, the link manager process provides the linkage between the WSH process and any member process of the appropriate server class.
Increasing the Availability of Tuxedo Applications Availability of NonStop Tuxedo Applications processor failure, then the supervisor process restarts it in a different processor. If only one processor is configured, then the WSL process is not restarted until the processor comes back online. When the WSL process restarts, it adopts any WSH processes that it previously orphaned.
Increasing the Availability of Tuxedo Applications Availability of NonStop Tuxedo Applications Whether the TCP/IP process has a backup process or not, all connections to workstation clients through the failing TCP/IP process are lost. In addition, any currently active TMF transactions initiated by the WSH process on behalf of workstation clients are aborted. Workstation clients must detect errors, reconnect, and restart the failed transactions.
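The requirement that workstation clients detect errors, reconnect, and restart failed transactions can be sketched as a retry loop. The following is a minimal Python illustration under stated assumptions: the function name, the use of ConnectionError to signal a lost connection, and the attempt limit are hypothetical and are not part of the ATMI or /WS interfaces.

```python
def run_with_restart(connect, transaction, max_attempts=3):
    """Reconnect and rerun the whole transaction after a connection failure."""
    last_error = None
    for _ in range(max_attempts):
        try:
            session = connect()          # reconnect on every attempt
            return transaction(session)  # restart the transaction from its beginning
        except ConnectionError as error:
            last_error = error           # the aborted transaction left no partial updates
    raise RuntimeError(
        "transaction failed after %d attempts" % max_attempts
    ) from last_error


attempts = {"count": 0}


def demo_connect():
    return "session"


def demo_transaction(session):
    # Simulate two lost connections followed by a success.
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("connection lost")
    return "committed"


outcome = run_with_restart(demo_connect, demo_transaction)
```

Rerunning from the start is safe only because the aborted transaction was backed out; work performed outside the transaction would need its own protection against being repeated.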
Increasing the Availability of Tuxedo Applications Availability of NonStop Tuxedo Applications If the workstation client uses dynamically bound /WS libraries, then, depending on the client platform, the WSH process might not detect the error; in this case, it does not abort transactions. Recovery From a Machine Failure If a machine failure occurs, any currently active transactions that involve the failed machine are aborted.
Increasing the Availability of Tuxedo Applications Design Implications for NonStop Tuxedo Applications Recovery From a Network Failure Between a WSH Process and a /WS Client Process If communications failure occurs between a server machine and a client workstation, then the failure might not be detected until an operation that uses the network is attempted. The WSH process aborts any outstanding TMF transactions started on behalf of the client process.
Increasing the Availability of Tuxedo Applications • • Design Implications for NonStop Tuxedo Applications Ensure that your server design is not prone to performance problems that can slow your application beyond the acceptable limits specified in your company’s business plans. For example, you should avoid mixing long-running transactions and short transactions in the same server.
Increasing the Availability of Tuxedo Applications • Design Implications for NonStop Tuxedo Applications Establish a policy for restarting native/T client processes. These processes can only be restarted manually.
6 Availability in the Pathway Transaction-Processing Environment This section provides an overview of the availability features in the Pathway transaction processing environment. Described here are the traditional HP products and interfaces that support client/server or requester/server applications. Note.
Availability in the Pathway Transaction-Processing Environment Availability Concepts Used in Pathway Applications This section does not discuss the NonStop Tuxedo system. Refer to Section 5, Increasing the Availability of Tuxedo Applications, for details on this interface. Nor does it discuss the Subsystem Programmatic Interfaces (SPI) that enhance availability by supporting instrumentation and command and response interfaces.
• Process pairs with initialized persistence
A process runs with a backup process that takes over if the primary process stops. The backup process is preinitialized by a single checkpoint operation that occurs when the primary process finishes its initialization phase.
NonStop TS/MP and Highly Available Server Processes
NonStop TS/MP supports server processes through monitoring, load balancing, and providing linkage with requesters. Figure 6-2 on page 6-5 shows the major components of NonStop TS/MP and the requesters for which they secure services.
What Is NonStop TS/MP?
The major components of NonStop TS/MP are PATHMON and LINKMON.
Availability of NonStop TS/MP Server Processes
• An operator intentionally or unintentionally stops the server process.
• The processor in which the server process is running fails.
• The server process encounters errors it is unable to recover from.
Figure 6-2.
Availability in the Pathway Transaction-Processing Environment Availability of NonStop TS/MP Server Processes 1. TMF backs out any uncommitted transactions that the server was involved in at the time of the failure. In doing so, the database returns to a consistent state and each requester process resumes at the start of the aborted transaction. Refer to Section 4, Data Protection and Recovery, for a discussion of how TMF does this. 2. The requester or client restarts the transaction. 3.
Design Implications for Server Processes
The availability of server processes under NonStop TS/MP is largely transparent if appropriate application design considerations and appropriate operational considerations are followed. As with all business applications, responsibility for correct design lies with both development and operations.
Pathway/XM and Highly Available Server Classes
• Be sure to specify a list of processors when you specify a server class. Otherwise, all members of the server class run in one processor; if you lose that processor, then you lose the entire server class.
• Consider how many servers you should run in a server class.
What Is Pathway/XM?
Figure 6-3. The Role of Pathway/XM in the Pathway Transaction-Processing Environment
(Diagram labels: Pathway/XM Environment; PXMCOM; PB; SuperCTL; LCS; $CMON; HP NonStop TS/MP Environment; PATHMONs; LINKMONs; TCPs; Distributed Server Classes; Direct Server Classes; Workstations and Terminals; Databases.)
Availability in the Pathway Transaction-Processing Environment Availability of Pathway/XM Server Classes The PB processes are NonStop process pairs that manage system and CPU resource assignments for requesters such as LCS processes and for direct server class servers. Direct server class server processes are legacy Pathway applications that have not been modified to take full advantage of Pathway/XM. The LCS processes manage transaction requests to distributed server classes.
• If a processor fails, the server processes running in that processor are no longer available to service transaction requests, so the length of the request queue increases. That increase causes the LCS process to start new server processes on other processors within the logical Pathway/XM node object. That action automatically redistributes the transaction workload to minimize processing delays.
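The reaction described above, in which a growing request queue leads to new server processes being started, can be sketched as a sizing calculation. The following is a minimal Python illustration, not the LCS algorithm: the function name, the requests-per-server threshold, and the server limit are hypothetical values chosen for the example.

```python
import math


def servers_to_start(queue_length, running_servers, requests_per_server, max_servers):
    """Return how many additional server processes to start for the current queue."""
    wanted = math.ceil(queue_length / requests_per_server)
    # Never scale below what is already running, and never exceed the limit.
    target = min(max(wanted, running_servers), max_servers)
    return target - running_servers


# A processor failure leaves 2 servers running while 40 requests are queued.
extra = servers_to_start(queue_length=40, running_servers=2,
                         requests_per_server=10, max_servers=8)
```

In this example two more servers would be started, which brings the queue back to the assumed 10 requests per server.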
How Does RSC/MP Work?

RSC/MP permits workstations to invoke NonStop Transaction Services/MP (TS/MP) servers on NonStop servers. By providing a client-server environment, RSC/MP can improve the performance of NonStop TS/MP applications while maintaining the ability to handle high transaction volumes. RSC/MP provides the link between NonStop servers and client workstations over existing communication networks.
On the workstation:
• The client process uses the RSC/MP application program interface to establish connections and sessions with the Transaction Delivery Process (TDP) on the server system and to send requests through the TDP to server processes.
3. The client process initiates a session with the TDP (RscBeginSession() function).
4. The client process sends a request to the TDP to begin a transaction (RscBeginTransaction() function).
5. The TDP converts the RSC/MP call in the message into a call to the TMF interface. A TMF transaction begins and the TDP gets a transaction identifier, a value that uniquely identifies the transaction that has just begun.
6.
The remainder of this subsection discusses how the RSC/MP products, the application code, and appropriate operation combine to protect the application from these potential failures.

How Availability Works

RSC/MP provides you with protection against the loss of a communications line or the loss of the TDP.
Design Implications for RSC/MP Applications

Replicated TDP or Initialized Persistent TDP

The TDP is critical to the function of RSC/MP. Loss of the TDP causes termination of all workstation sessions, termination of all outstanding I/O operations, and termination of all outstanding transactions made through that TDP.
Coding Transactions and Saving Context for Availability

To make failure recovery possible, your client program must do the following during normal operation:
• Protect all server requests that cannot be retried with TMF transactions. These are the types of requests that cannot be processed more than once without adverse effects; for example, withdrawing a sum of money from a bank account.
Operational Concerns

To ensure application availability, the operations staff must:
• Decide whether to run the TDP as a process pair with initialized persistence. A process pair with initialized persistence can take over in less time than a replicated TDP.
• Retain the TDP configuration following a total failure of the TDP by maintaining a command file that will correctly configure the TDP at startup.

Availability Through Pathsend
How Does the Pathsend Facility Work?

Pathsend requesters are typically used where the number of transactions is high but the number of devices is low. For example, Pathsend requesters often provide message control for client processes running on systems other than HP systems. The TDP explained under Availability Through RSC/MP on page 6-11 is one example of such a Pathsend requester.
2. The Pathsend requester formulates a request and sends it to a server class using a server-class send operation. The transaction identifier is carried with the I/O request to involve the actions of the server in the transaction.
3. The link manager function checks whether a link already exists with a server process in the specified server class. If not, it works with the PATHMON process to create one.
Availability of Pathsend Applications

To keep the application available to the users of a specific instance of a requester, the requester must run as a process pair.
transactions are backed out. The application remains available, but the end user must be made aware of the failure and from what point to start reentering data.

Requester Process With Initialized Persistence

A requester process that runs as a process pair with initialized persistence can take over in its backup process if the primary process fails. Operations protected by transactions are backed out.
Design Implications for Pathsend Requesters

Nonretryable requests that are not protected by TMF cannot be processed more than once without adverse effects. An example of this kind of request is a request to subtract $50.00 from a bank account balance. For these requests, there is no way for the server class to detect duplicate requests; Pathsend does not support checkpointing of synchronization identifiers.
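The difference between retrying an unprotected nonretryable request and retrying one protected by a transaction can be shown with a small in-memory model. This is a hedged sketch, not Pathsend or TMF code: `Account`, `LostReply`, `debit`, and both retry functions are hypothetical, and TMF backout is modeled simply as restoring a saved balance.

```python
class LostReply(Exception):
    """The server did the work, but the requester never saw the reply."""


class Account:
    def __init__(self, balance):
        self.balance = balance


def debit(account, amount, lose_reply=False):
    # The server applies the debit before the failure is noticed.
    account.balance -= amount
    if lose_reply:
        raise LostReply()


def unprotected_retry(account, amount):
    """Retry without a transaction: the debit is applied twice."""
    try:
        debit(account, amount, lose_reply=True)
    except LostReply:
        debit(account, amount)


def protected_retry(account, amount):
    """Retry inside a modeled transaction: the failed attempt is backed out."""
    saved = account.balance          # stands in for the transaction's undo data
    try:
        debit(account, amount, lose_reply=True)
    except LostReply:
        account.balance = saved      # modeled backout of the aborted transaction
        debit(account, amount)       # safe to retry in a new transaction
```

Starting from a balance of 100 and subtracting 50, the unprotected retry leaves 0 while the protected retry leaves 50, which is why nonretryable requests need transaction protection.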
Operational Concerns

Consider the following operational concerns when designing a Pathsend application:
• If you decide to use neither process persistence nor process pairs in your Pathway requester design, then operations staff might need to take responsibility for getting failed Pathway requesters restarted.

Availability Through Pathway/iTS
How Pathway/iTS Works

Figure 6-6. Components Supporting Pathway/iTS Applications
[Figure: components shown include terminals and automated teller machines, TCPs and an IDS server under Pathway/iTS, and PATHMON, a server class, and a database under NonStop TS/MP.]

A TCP executes screen programs and coordinates communication between such a program, its terminal, and the server processes it calls on.
5. The COBOL program might perform Steps 2, 3, and 4 again with additional requests to other servers.
6. The requester ends the transaction by starting the two-phase commit protocol.

Availability of TCP Applications

TCP requesters depend on a combination of availability techniques to keep the application online. TMF transactions provide a known point of consistency for application restart following a failure.
Design Implications for TCP Requesters

A TCP running without persistence does not provide any protection against TCP failure. The NonStop TS/MP System Management Manual describes how to use the NONSTOP and AUTORESTART parameters. Keeping the application running requires more than simply keeping the TCP running. If the TCP fails during normal operation, the following happens, assuming the TCP is configured to run as a process pair:
1.
Operational Concerns

Consider the following operational issues when designing a highly available TCP application:
• Consider whether the TCP should be run as any of the following, or whether it should be run without any of these features:
  ° A process pair with continual checkpointing
  ° An initialized persistent process
  ° An immediately persistent process
Summary and Comparison of Application Components

The application development team must design and code the feature into the application.
• Not applicable: The technique cannot be used with the application entity.
Table 6-2.
7 Availability Through Process-Pairs and Monitors

Most applications do not need to use process-pair primitives directly.
You can find programming guidelines for active backup process pairs in the Guardian Programmer’s Guide. For passive backup, refer to the appropriate procedure call descriptions in the Guardian Procedure Calls Reference Manual.

When to Use Process Pairs

If the transaction monitor and transaction management facilities will not work for your application, you can consider writing your own process pair.
Approaches to Takeover

The challenge in designing a process pair is making sure that the backup process on takeover has the same context, data, and state information that the primary process had when it failed. To achieve this condition, the backup process must take over processing from a known point slightly before the point where the primary process failed.
gets fired by the backup process. If the missile was fired by the primary process, it is no longer in the silo to be fired by the backup process.

Operations That Are Not Retryable

Operations that are not retryable include firing the next missile, printing a check, and dispensing cash from an automated teller. Clearly, repeating one of these operations would be inappropriate, if not disastrous.
How Do Process Pairs Work?

Figure 7-1. Process Pair: Generic Model
[Figure: a primary process in CPU 0 and a backup process in CPU 1, both running the object file $vol.appl.obj; the primary sends control state and data state to the backup, and system messages from the operating system indicate the need for takeover.]

The primary process uses interprocess communication to send critical information to the backup process.
Passive Backup Model

This subsection discusses how the functions of the general process pair model discussed under How Do Process Pairs Work? on page 7-4 are done in the passive backup model. The discussion assumes that you take advantage of the Guardian checkpointing procedures.
Sending Process-State Information to the Passive Backup

• Critical data. This data is application dependent but might include data just read from a terminal, data about to be written to disk, or data maintained in processor memory in the primary.
• File synchronization information to make file system input or output retryable.
The amount of data that can be checkpointed using a restart checkpoint is a little under 32 kilobytes. For stacks that are larger than this, you can checkpoint in smaller amounts, but only the last checkpoint will contain a restart point. In a NonStop native process, you cannot checkpoint global data with the stack, because global variables are not stored adjacent to the stack.
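The size limit and the rule that only the last checkpoint carries the restart point can be sketched as a planning step that splits a large state image into pieces. The sketch below is illustrative only: `plan_checkpoints` and the value of `MAX_CHECKPOINT_BYTES` are assumptions (the text says only "a little under 32 kilobytes"), and no Guardian checkpointing procedure is called.

```python
# Hypothetical limit; the exact figure must come from the Guardian manuals.
MAX_CHECKPOINT_BYTES = 32 * 1024 - 512


def plan_checkpoints(state, limit=MAX_CHECKPOINT_BYTES):
    """Split state into (chunk, is_restart_point) pairs.

    Every chunk fits within the limit, and only the final chunk is marked
    as carrying the restart point.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    chunks = [state[i:i + limit] for i in range(0, len(state), limit)]
    if not chunks:
        chunks = [b""]  # an empty state still needs one restart checkpoint
    last = len(chunks) - 1
    return [(chunk, index == last) for index, chunk in enumerate(chunks)]
```

A 70,000-byte image planned this way becomes three checkpoints, with the restart flag set only on the third.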
Receiving Information in the Passive Backup

Receiving Checkpoint Information

The CHECKMONITOR procedure receives all checkpoint information from the primary, including:
• File-open information
• Critical data
• Control-flow information

On receipt of a file-open message from the primary process, the CHECKMONITOR procedure performs a backup open on the same file and gets an access control block for it.
Takeover by the Passive Backup

If the processor in which the primary process is running fails, then the operating system on that processor has no way of informing the backup process that the primary process is deleted. The backup process must therefore monitor the primary process’s processor by listening for system messages that indicate loss of the primary’s processor or the inability to communicate with it.
to the corresponding open from the primary process. It is this relationship that enables file synchronization to work. The checkpointed file synchronization information helps to determine how subsequent I/O operations are handled. Some of these operations might already have been executed by the primary process.
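How synchronization information keeps a takeover from repeating a write that the primary already performed can be modeled in a few lines. This is a simplified sketch, assuming a single integer synchronization identifier per open; `DiskProcessModel` and its `write` method are hypothetical and do not reproduce the actual file-system protocol.

```python
class DiskProcessModel:
    """Toy model: a write is accepted only when the identifiers match."""

    def __init__(self):
        self.sync_id = 0      # identifier the disk process expects next
        self.records = []

    def write(self, request_sync_id, record):
        if request_sync_id != self.sync_id:
            # The operation was already executed; do not perform it again.
            return False
        self.records.append(record)
        self.sync_id += 1
        return True
```

If the primary checkpoints identifier 0, performs the write, and then fails, the backup's reissued write with identifier 0 no longer matches and is not applied a second time. If the primary fails after the checkpoint but before the write, the identifiers still match and the backup's write is accepted.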
7. The file system updates the application’s synchronization identifier. The disk process and application are once again synchronized.

If the primary application process fails before its next checkpoint, the backup process takes over and reissues the write request. The disk process accepts the request only if the synchronization identifiers match.

Active Backup Model
standard error files are created with the same file-state information as in the primary process. Once the backup process is created, the primary process can open it and begin sending data-state and control-state messages to it using interprocess communication.

Sending Process-State Information to the Active Backup
performance of the primary process is lower because it is sending additional state update messages; takeover time, however, is relatively fast. Conversely, if the backup process computes the state, then the primary process runs faster but the takeover time is longer due to the extra computation.

Receiving and Processing Information in the Active Backup
Opening Files in the Backup Process

Every file that is open in the primary process and must be synchronized must also be open in the backup process. The backup process must use the __ns_backup_fopen() function to do this. To open each file with the same open status as in the primary process, the backup process must supply the file-open status to the __ns_backup_fopen() function.

Takeover by the Active Backup
To take over as the primary process, the backup process must call the PROCESS_SETINFO_ Guardian procedure and specify that it is to become the primary process. Having done so, it can continue processing from the logical point indicated in the control-flow information received from the primary process. Processing continues using the current data state of the backup process.

Active Backup With TAL
Refer to the Guardian Procedure Calls Reference Manual for details on the Guardian procedures listed in Table 7-1.

Comparing Active Backup With Passive Backup

The passive-backup model achieves fault tolerance by copying data from the primary process to the same address in the backup process. Conversely, the active-backup model achieves fault tolerance by maintaining a logically identical process.
not written specifically for the HP passive-backup model will easily convert to run as a process pair. In the active-backup model, many applications containing hidden state information can be modified to run as a process pair so long as the hidden state information is not critical to the execution of the process.

Passive Backup Is Easier to Design
Nowait I/O and Multiple Requests to the Same File

The following paragraphs describe some of the complexities of performing nowait I/O operations in process pairs. For details on nowait I/O, refer to the Guardian Programmer’s Guide.
A further complication of multithreaded design in the passive-backup model is that checkpoints are always waited operations. Every time a thread issues a procedure call to checkpoint some information, the entire process suspends until the checkpoint operation finishes. As more threads are added, the amount of time the process spends waiting for checkpoint operations to finish increases.

Language Issues
should be written using pTAL; existing TAL applications can be easily migrated to pTAL (refer to the TNS/R Native Application Migration Guide). Both pTAL and TAL offer complete control of all process resources: memory, open files, trap handling, and so on. In addition, there are no restrictions on directly invoking Guardian procedures from pTAL or TAL.

C and C++
For information about invoking system routines from COBOL85 or FORTRAN, refer to the COBOL85 Manual or the FORTRAN Reference Manual.

NonStop Server for Java

NonStop Server for Java allows creation of process pairs through the Java Native Interface (JNI) to C and C++. Such process pairs are subject to the constraints discussed in C and C++ on page 7-21.
HP Process Monitors

HP provides the following process monitors:
• The Tuxedo transaction-processing environment described in Section 5, Increasing the Availability of Tuxedo Applications
• The PATHWAY transaction-processing environment described in Section 6, Availability in the Pathway Transaction-Processing Environment
• The Kernel subsystem (G-series systems only) described at the end of this section

These process monitors provide a
8 Instrumenting an Application for Availability

This section provides an overview of application error handling and instrumentation to encourage the application designer to integrate instrumentation into the initial design of the application. Written from the viewpoint of application development, this section emphasizes instrumentation of the business application itself.
• A discussion of automating object management on a server, using the Distributed Systems Management (DSM) subsystem to illustrate the use of instrumentation in management automation; refer to Automating Object Management on page 8-15.

Design Philosophy for Error Handling
application must go offline, then it should do so as gracefully as possible, leaving the database in a known, consistent state, and leaving a saveabend file for analysis.

Note. The $IMON process must be running on your system in order to create a saveabend file.

Checking for Errors

Always Check the Error Return

Your program should check for all possible error return values.
Writing Code to Handle Problem Errors

Benign conditions, such as a user mistyping a file name or entering some out-of-range value, are taken care of simply by sending a message to the user.

Attempting Recovery

Many errors indicate some temporary loss of service, which can be corrected simply by retrying the operation. The period and number of retries depend on the error.
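A retry loop for temporary errors might look like the following sketch. The policy shown, a few retries at a fixed interval followed by progressively longer waits, is one reasonable assumption rather than a prescribed algorithm; `retry_delays`, `call_with_retry`, and all parameter names and defaults are hypothetical.

```python
import time


def retry_delays(max_attempts, base_delay=1.0, fast_retries=3, factor=2.0):
    """Return the wait before each retry: constant at first, then growing."""
    delays = []
    delay = base_delay
    for attempt in range(1, max_attempts):
        delays.append(delay)
        if attempt >= fast_retries:
            delay *= factor  # reduce retry frequency after the early retries
    return delays


def call_with_retry(operation, is_transient, max_attempts=6, sleep=time.sleep):
    """Call operation(), retrying only errors that is_transient() accepts."""
    delays = retry_delays(max_attempts)
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception as exc:
            # Permanent errors and the final failure go back to the caller,
            # which can then report the problem to the appropriate people.
            if not is_transient(exc) or attempt == max_attempts - 1:
                raise
            sleep(delays[attempt])
```

Passing `sleep` as a parameter keeps the loop testable and lets an application substitute its own timer mechanism.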
When using the technique of looping and waiting, it is preferable to try to recover a few times, over a period, but to reduce the frequency of retries if the first two or three retries fail.
3. Tell appropriate people about the error (get help and warn users).
4. Recognize when the problem is repaired.
5.
Does the program send a message to the operator screen? Are the operators available 24 hours a day? Does the company have a pager system for operations or support personnel? How does that software work on the server system? (It is probably a matter of filtering the message and routing it to an automation program to send it to the pager broadcast system.
What Is Instrumentation and Why Is It Necessary?

Instrumentation provides the interfaces between objects in the system and operations utilities that monitor and control these objects. To facilitate this control, HP subsystems and user applications must:
• Generate events to indicate state changes in objects they control.
How Does Instrumentation Improve Availability?

The key to reducing application downtime through instrumentation is understanding what happens when a problem occurs and what needs to happen before the problem is fixed and the application is back online. Figure 8-2 on page 8-8 shows the steps involved.
Effective instrumentation can help to provide both of these kinds of protection.

Instrumenting for Failure Prevention

Instrumentation for failure prevention includes:
• Providing a command interface to monitor and control critical objects within an application.
Instrumenting for Failure Resolution and Recovery

Instrumentation can help resolve the failure and recover the application through a command interface that can alter the status of objects by starting, stopping, suspending, or activating parts of the application.

A Framework for Planning and Developing Your Instrumentation
Understanding these relationships or constraints makes it easier to see how the failure of an object affects the application. Hence, you can determine which objects are critical to the availability of your application and should be instrumented. Many system objects are already instrumented within the subsystem that owns them.
Define Valid Object States

For each object that is critical to the availability of your application, you need to establish its valid states and the conditions that cause state changes. Valid states for all objects fall into the following general state categories:

Up: An object is in an up state when it is started. The object meets all of its operational objectives and can be used to provide services.
Figure 8-4 shows an example of a dynamic model that might apply to a server process. “Running” would be considered an up state and “stopped” would be considered a down state. “Starting,” “abending,” and “stopping” would all be considered odd states.

Figure 8-4.
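The classification of server-process states into up, down, and odd categories can be captured in a small state model. In the sketch below, the state names and their categories come from the text; the allowed transitions, the `change_state` function, and the shape of the event record are assumptions made for illustration, since the figure itself is not reproduced here.

```python
from enum import Enum


class Category(Enum):
    UP = "up"
    DOWN = "down"
    ODD = "odd"


# State categories as described in the text.
STATE_CATEGORY = {
    "running": Category.UP,
    "stopped": Category.DOWN,
    "starting": Category.ODD,
    "abending": Category.ODD,
    "stopping": Category.ODD,
}

# Hypothetical transitions; the real dynamic model may differ.
TRANSITIONS = {
    "stopped": {"starting"},
    "starting": {"running", "abending"},
    "running": {"stopping", "abending"},
    "stopping": {"stopped"},
    "abending": {"stopped"},
}


def change_state(name, current, new):
    """Validate a state change and return an event record describing it."""
    if new not in TRANSITIONS.get(current, set()):
        raise ValueError(f"invalid state change: {current} -> {new}")
    return {
        "object": name,
        "old_state": current,
        "new_state": new,
        "category": STATE_CATEGORY[new].value,
    }
```

Returning an event record for every valid change reflects the requirement that state changes be reported, so that a monitoring application never has to guess an object's state.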
Defining the Criteria That Indicate the Health of the Application

Finally, you must define the criteria that monitor the health of your application and convey this information to the human or automated operator. For example, you might keep a count of transactions, or monitor queue totals or other application resources.

Who to Notify?
inform the user that there is a temporary delay in processing. In addition, any error that requires the user to reenter data must be reported to the user. Application problems reported to the user must, of course, be worded in a way appropriate to the expected knowledge level of the user.

Automating Object Management
Alternatives to DSM

You can find more information about automating object management using HP’s implementation of SNMP in the SNMP Manager Programmer’s Guide and SNMP Subagent Programmer’s Guide. You can find a full description of DSM in the Distributed Systems Management (DSM) Manual; the following subsections discuss the use of DSM to instrument applications so that they can be managed as objects.
• Providing high-level views of applications, systems, and networks
• Enabling automated operations

You can realize the greatest benefit from automated operations software if your application is compatible with the DSM model and is properly instrumented.

Overview of DSM Architecture
Figure 8-5. Overview of DSM Architecture
[Figure: the operations environment (management services, DSM applications, management applications, DSM tools, operator's console) and the subsystem environment (HP subsystems, business applications, and their objects), connected through DSM services and SPI.]

• The DSM subsystem environment
  The subsystem environment consists of HP subsystems, business applications, and their objects.
• The DSM operations environment
  Applications and tools in the operations environment are available to your operations staff. In addition to the commercial applications written for DSM, you can write your own custom management applications to manage HP subsystems and your business applications.

The Subsystem Programmatic Interface (SPI)
EMS Messages

EMS event messages are a special category of SPI messages that convey information about events or significant occurrences in the subsystem environment. Occurrences and conditions reported by event messages include:
• Changes in the subsystem environment
• Errors encountered during continuous operation
critical situations and lets the operator (with the help of programmatic tools) make the final determination.
• A version identifier for the application
  Different versions of an application have different problem histories. Knowing the version of the application can be key in analyzing a problem and in determining a solution.
applications might rely entirely on an event message log for their input, and an unreported state change could cause confusion.
This kind of event might indicate that a disk is getting full or a message queue is getting full. Or it might result from statistical data indicating, for example, that an error rate has exceeded some threshold value.
If the command message is in response to an event, as is typical when an automated operator is used, then some of the command message contents are derived from the event message as shown in Figure 8-7 on page 8-24. Thus, the subsystem identifier is copied from the event message, the command type and object type are derived from the event number, and the object name is derived from the event subject.
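The derivation of a command message from an event message can be sketched with plain data records. This is a conceptual illustration only: real SPI messages are token-based buffers, and the `EventMessage` and `CommandMessage` classes, the field names, and the event numbers in `EVENT_TO_COMMAND` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventMessage:
    subsystem_id: str
    event_number: int
    subject: str


@dataclass(frozen=True)
class CommandMessage:
    subsystem_id: str
    command: str
    object_type: str
    object_name: str


# Hypothetical table mapping event numbers to (command type, object type).
EVENT_TO_COMMAND = {
    1001: ("START", "SERVER"),
    1002: ("ALTER", "FILE"),
}


def command_for_event(event):
    """Build the command an automated operator would send for this event."""
    command, object_type = EVENT_TO_COMMAND[event.event_number]
    return CommandMessage(
        subsystem_id=event.subsystem_id,   # copied from the event message
        command=command,                   # derived from the event number
        object_type=object_type,           # derived from the event number
        object_name=event.subject,         # derived from the event subject
    )
```

Keeping the event-to-command mapping in a table makes it straightforward to review which events an automated operator acts on.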
You might issue this type of command in response to a file-full threshold event message indicating that the file-full threshold has been crossed.
• Switch to stand-in processing. You would typically issue this kind of command when it is no longer possible to run the application normally. In response to the command, some other processing takes over, perhaps with reduced function, until the normal application is able to resume.

The DSM Subsystem Environment
• Measure programming uses counters to gather statistics about application resource usage, enabling integrated reporting of application and system statistics.

Using EMS FastStart to Generate Events

The EMS FastStart product provides a simple, cost-effective way for programmers to develop and test EMS event messages.
DDL source: DDL definitions based on the parameters in your ACF and on the corresponding source definitions for the C, COBOL85, and TAL programming languages.

Figure 8-8 shows the EMS FastStart architecture, development and production environments.

Figure 8-8.
The following paragraphs give a brief overview of how you can help to increase the availability of your application using SPI programming to generate event messages. A set of standard event messages provided by HP is adequate for most applications. You also have the option of building event messages from scratch. For full details about using SPI procedures or EMS procedures to build event messages, refer to the EMS Manual.
Creating a Data Definition File

From the data definition file, you create SPI definitions for your tokens in the language in which your application is written. If you choose to use standard event messages, a sample DDL file is available through Infoway. You can copy this file and modify it to suit your needs.
How SPI Programming Works

At the highest level, the management interface server must:
1. Read SPI command messages from an SPI buffer and extract tokens from the command message.
2. Execute the command according to the token values contained in the command message.
3. Respond to the requester (the management application).
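The three steps above can be sketched as a single request handler. The sketch is deliberately abstract: a Python dictionary stands in for the SPI buffer and its tokens, and `handle_command`, the token names, and the START, STOP, and STATUS commands are assumptions for illustration, not SPI procedures or a defined command set.

```python
def handle_command(message, objects):
    """Process one command message against a map of object name -> state."""
    # Step 1: extract the tokens from the command message.
    command = message.get("command")
    name = message.get("object_name")

    # Step 2: execute the command according to the token values.
    if name not in objects:
        return {"status": "error", "reason": "unknown object"}
    if command == "START":
        objects[name] = "running"
    elif command == "STOP":
        objects[name] = "stopped"
    elif command != "STATUS":
        return {"status": "error", "reason": "unknown command"}

    # Step 3: respond to the requester (the management application).
    return {"status": "ok", "object_name": name, "state": objects[name]}
```

Returning an error response for unknown objects or commands, rather than failing, keeps the management interface itself available when a requester sends a bad command.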
The counter values stored to disk can be read using Measure procedures by a management application or by the Measure product itself. Statistics gathered by Measure can be presented by the Surveyor or Enform products or further analyzed by a tool such as the Guardian Performance Analyzer (GPA). For full details on defining and using Measure counters, refer to the Measure User’s Guide.

DSM Management Services
Figure 8-9 on page 8-32 shows the flow of event messages in a system. Messages originate in the subsystem environment and are then sent by the subsystems and applications to the primary and alternate collector processes. Collectors write event messages to the log files. Distributors retrieve selected messages from the log files and send them to processes, printers, terminals, and other destinations in the operations environment.
Event Message Collectors

EMS supports two types of event message collector processes:
• A primary collector
• Alternate collectors

Each system (or node) has only one primary event message collector, named $0. It is configured during system generation and always runs as a process pair. $0 is the primary collection point for all event messages generated by all reporting subsystems in a system.
What Are EMS Filters and How Are They Used?

Filters provide a mechanism for reducing message noise. A filter is associated with a distributor and screens out messages that are of no interest to the management applications that read messages from that distributor. The event log file is read by all the EMS distributor processes configured onto or started on the system.
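The relationship between a log, a distributor, and its filter can be modeled as a predicate applied to each logged event. This is a conceptual sketch, not the EMS filter language or distributor interface: `distribute`, `critical_from`, and the dictionary layout of an event are hypothetical.

```python
def distribute(log, event_filter):
    """Return the logged events that pass this distributor's filter."""
    return [event for event in log if event_filter(event)]


def critical_from(subsystem):
    """Build a filter passing only critical events from one subsystem."""
    def event_filter(event):
        return event["subsystem"] == subsystem and event["critical"]
    return event_filter
```

Because every distributor reads the same log, two distributors with different filters can deliver different subsets of the same events to different management applications.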
Using DSM Template Services

The DSM Template Services are used to generate text from tokens in event messages or from tokens in SPI message buffers.
• Which event messages should be forwarded to the primary collector and which event messages should be forwarded to which alternate collectors
• How filters should differentiate messages that are useful from those that are not for the associated distributor
• The printable text representing event tokens for reading by operations staff

The Operations Environment

The operations environment consists of management applications and envi
• Supporting automated operations through issuing commands in response to event messages
• Supporting a human operator interface through the display of filtered messages and a command interface
• Gathering performance statistics on system resources and application resources by using the Measure product
Using the functions listed above, commercial or user-written management applications must work with th
• The backup process of a process pair has encountered the same failure condition encountered in the original primary.
• A Pathway server process, in spite of numerous retry attempts, repeatedly returns errors to a requester.
• A communications link with a remote server process has gone down.
• A hardware component and its backup have simultaneously failed.
3. Send the command to the subsystem using regular Guardian interprocess communication.
4. Retrieve the response returned by the subsystem from the SPI message buffer and use SPI procedures to extract the tokens.
Using SPI to Retrieve Events From an EMS Consumer Distributor
To retrieve events from a consumer distributor, your management application must:
1. Start and establish a connection with a consumer distributor.
2.
DSM Management Tools and Performance Measuring Tools
The DSM operations environment includes management tools that can extend the control of the operations environment. In addition, performance management tools can help predict problems by identifying bottlenecks in the system and can assist in problem analysis.
• Testing user-generated EMS events; you can display all tokens in a message, thus verifying the correctness of events generated by the business application
For complete details about EMS Analyzer, refer to the Event Management Service (EMS) Analyzer User’s Guide and Reference Manual.
When an object does not respond to the configured settings, an informative statement, action attention, or critical event message is routed to an EMS collector. For complete details on OMF, refer to the Object Monitoring Facility (OMF) User’s Guide and Reference Manual.
Subsystem Control Facility (SCF)
SCF is used to configure, control, and collect information about HP data communication subsystems.
9 Minimizing Programming Errors For an application to be continuously available, you need to ensure that the code you write is as free from defects as possible. The hardware and software design of NonStop systems and appropriate design of the application go a long way toward keeping an application running. However, fault-tolerant design is not a substitute for quality, error-free code. Fault-tolerant design does not protect you against most deterministic errors in your application program.
Reusable modules come in two forms:
• Shared run-time libraries that can be used for programs written in traditional languages such as C and pTAL
• Reusable objects appropriate for object-oriented languages such as C++ or Java
Shared Run-Time Libraries
Shared run-time libraries (SRLs), sometimes called shared resource libraries, are object files used by more than one process at a time.
Object-oriented techniques make use of the concepts of object, class, and inheritance to support reusable modules that reduce the propagation of errors. Here, a new object class is built from an existing object class without modifying the existing code. The new code might contain attributes and behavior in addition to those inherited from the original object class, but the inherited code itself is not changed.
2. Missing operations
These problems include operations that should have been performed but were not. For example, pointers and other variables were not initialized, data was not updated, or interprocess communication was missing.
3. Data errors
These errors result in an incorrect constant or variable used within the program.
4.
• Are all related data structures updated on the occurrence of an event?
• Are all related processes notified about the occurrence of an event?
• Are there potential race conditions between vulnerable time windows?
• Are possible hardware errors or timeout situations properly handled?
• Are all error-handling paths testable?
• Is there protection against invalid parameters on procedure calls?
• Can software operate on all po
An eye-catcher is a text string or other constant value placed in a critical data structure. Your program can check that the data structure contains the text string to increase confidence that the data structure is valid.
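The eye-catcher check can be sketched as follows. This is a minimal illustration in Python, not code from any NonStop product; the record layout, the marker value ACCTREC1, and the function names are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical marker value; any constant unlikely to occur by accident will do.
EYE_CATCHER = "ACCTREC1"


@dataclass
class AccountRecord:
    """A critical data structure that carries an eye-catcher field."""
    account_id: int
    balance: int
    eye_catcher: str = EYE_CATCHER


def is_valid(record: AccountRecord) -> bool:
    """Return True only if the eye-catcher still holds the expected constant."""
    return record.eye_catcher == EYE_CATCHER


def debit(record: AccountRecord, amount: int) -> None:
    """Check the eye-catcher before trusting or changing the record."""
    if not is_valid(record):
        raise ValueError("eye-catcher mismatch: record may be corrupted")
    record.balance -= amount
```

A passing check does not prove that the rest of the structure is intact. It only raises confidence, which is why the check is placed in front of every operation that relies on the record.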
A corrupted counter can cause a program to behave unpredictably. You can minimize damage, however, by checking the values of counters. You might do this after each increment, or you might do it periodically and check that, for example, a counter that starts at zero and increments by 7 is always a multiple of 7.
• Check all options when selecting from a choice of possibilities, and then provide a default option.
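Both checks described above can be sketched in a few lines. The following Python example is illustrative only; the class name, the state codes "U" and "D", and the choice to report "unknown" in the default branch are assumptions made for the sketch.

```python
STEP = 7  # the increment used in the example in the text


class CheckedCounter:
    """A counter that starts at zero and is verified after every increment."""

    def __init__(self) -> None:
        self.value = 0

    def increment(self) -> None:
        self.value += STEP
        self.check()

    def check(self) -> None:
        # A valid value is never negative and is always a multiple of STEP.
        if self.value < 0 or self.value % STEP != 0:
            raise RuntimeError(f"counter corrupted: {self.value}")


def describe_state(code: str) -> str:
    """Test every expected option explicitly, then fall through to a default."""
    if code == "U":
        return "up"
    elif code == "D":
        return "down"
    else:
        # Default option: an unexpected code is reported, never silently
        # treated as one of the known cases.
        return "unknown"
```

The check method can also be called periodically, for example from a housekeeping routine, to catch corruption that occurs between increments.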
development phase. A defect found in production typically costs about 100 times what the same defect would cost to fix had it been caught during development, and that does not include additional costs of lost services or revenue or loss of reputation caused by the defect.
Figure 9-1. Relative Costs of Fixing Defects (Development: $X; Quality Assurance: $10X; Production: $100X)
helping the designers and developers to organize their ideas, sharing information and code, and facilitating personnel changes.
Specify and Review the Requirements
The first phase of developing any application must be to specify what you are building in terms of the goals and objectives of the product and to write it down in a requirements specification.
application has several drawbacks. Much of the code will probably remain untested and is likely to contain errors. In addition, errors that are discovered by high-level tests are hard to fix because high-level tests do not always indicate the module in which the problem lies.
• The application continues to operate correctly when a hardware fault occurs.
• For process-pair applications, correct data is checkpointed so that regardless of when a failure might occur, the application can correctly continue where it left off.
• Appropriate error messages and event messages are generated when a failure occurs.
• The application operates correctly when failed components become operational again.
Other High-Level Testing
Other high-level tests that you might design and run depend on the needs of your application. For example, whether you should perform security testing depends on whether security is an issue. Refer to a recommended text for details on the following types of tests:
• Function testing
Function testing is a process of trying to find inconsistencies between the product and its external specification.
10 Designing Applications for Change
Upgrading an application can be a time-consuming effort for several reasons:
• Significant changes in strategic or design requirements can delay completion of an upgraded application for months or years.
• Delayed completion can lead to reduced application availability when the existing version of the application has insufficient capacity for its users or does not support urgently needed new features.
• Using a modular design of application and data for ease of distribution
• Using version-labeled interfaces for intermodule communication
• Supporting implementation through techniques such as avoiding dedicated names or using variable or extensible procedures
• Explicit action to enable upgrade
• Handling changes in initialization information
The following paragraphs discuss these techniques.
Individual APIs
The NonStop Software products (principally Windows NT Server versions of NonStop SQL/MP and NonStop Tuxedo) provide a high degree of isolation for applications from the underlying platform differences.
management system. Many COBOL compilers also provide extensions to cover features such as screen handling or message queuing specific to the platform for which they were developed. Using a client/server model with service-based and framework-based server logic (described in Preserving Investments Through Services and Frameworks on page 10-12) can isolate many of these differences from the body of the business logic.
database management systems, the initiative of the X/Open SQL group has given rise to products such as Microsoft Open Database Connectivity (ODBC). ODBC access is provided today for NonStop SQL/MP. Support for these client tools will be provided in NonStop SQL/MX for both the NonStop operating system and Windows NT Server.
particularly at the more complex lower layers, is normally the task of specialized infrastructure subsystems that represent more compact porting challenges. When implementing an application with a platform-specific version of a communication subsystem, the greatest possible portability problem arises from the semantics of asynchronous request mechanisms rather than the simple syntax of the API commands.
Uniprocessor systems
Uniprocessor systems simply have a single processor with associated memory. Interprocess communication is typically based on shared memory. This is typified by the UNIX system API, which provides several interprocess communication mechanisms, all of which are implemented using shared-memory techniques.
within the single-image APIs of the supporting platform, as they are in the NonStop operating system file system and NonStop Transaction Services/MP (NonStop TS/MP) process management.
Developing For the Platform Architectures
Developing an application capable of exploiting any of these platform architectures requires avoiding direct use of shared-memory structures within the application.
Increasingly, OLTP applications require 24 x 7 availability, with the elimination of traditional batch windows. This requires that batch update programs implement efficient logical units of work and locking strategies to coordinate with concurrent OLTP operations on the same database.
Process Pools
For clustered systems, the preferable model would be the requester/server class model provided by NonStop TS/MP on HP NonStop servers. This model allows scalability by adding instances to the server class and executing the instances across all the available processors of the system. The instances are shared between users, with one instance being allocated for the duration of each request.
• SEER*HPS and TI Composer
• The Dynasty development environment
• Connectivity middleware
• The HP Application Development Environment (ADE)
SEER*HPS and TI Composer
These products provide integrated toolsets for multiple platforms, including the NonStop operating system and Windows NT Server.
• HP extensions to workstation editors and debuggers.
Isolating the Unportable
Regardless of which engineering approach to portability is taken, it may be necessary to include nonportable function calls. It is essential to isolate the mainline application code from these items. In effect, a local API must be invented to give access to an easily replaceable library that maps that API to the required platform-specific functions.
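The idea of a local API in front of a replaceable platform library can be sketched as follows. This Python example is a minimal illustration under stated assumptions: the class names, the log_path function, and the directory values are hypothetical and do not correspond to any HP or Microsoft interface.

```python
class PosixFiles:
    """Replaceable platform library for a POSIX-style system (illustrative)."""
    log_root = "/var/log"
    separator = "/"


class WindowsFiles:
    """Replaceable platform library for a Windows-style system (illustrative)."""
    log_root = "C:\\Logs"
    separator = "\\"


# Porting to a new platform means adding one entry here, not touching callers.
_PLATFORMS = {"posix": PosixFiles, "windows": WindowsFiles}


def log_path(platform_name: str, application: str) -> str:
    """Local API: the only file-naming function that mainline code calls."""
    platform = _PLATFORMS[platform_name]
    return platform.log_root + platform.separator + application + ".log"
```

Mainline code calls log_path and never builds a platform-specific name itself, so the nonportable details stay in one small, easily replaced module.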
In addition to the mapping of call and reply syntax, a framework might need to provide other services to standardize such things as transfer of control, error reporting, and statistics recording.
The NonStop operating system platform contributes scalability and portability to the Windows NT Server platform in the following key ways:
• HP contributes substantial technology derived from the NonStop operating system to the Microsoft Cluster Server (MSCS). This technology is part of the standard MSCS product and delivered with the Windows NT Server product from Microsoft.
now use shared run-time libraries (SRLs), where context can be defined for the SRL within the address space of the calling process as part of the object file construction. This SRL function requires the use of native mode compilation tools. Windows NT Server supports copy-on-write dynamic link libraries.
Figure 10-1. Isolating Data From the Application Through Data Encapsulation
(Figure labels: Module 1, Module 2, Module 3, ..., Module n; Data Interface Module; 01 Data Item1; 01 Data Item2; 02 Subitem1; ...)
Updating the Interface Module
As a step towards continuous operation of the application, you must allow the old version of the interface module to run concurrently with the new version.
Figure 10-2. Indirect Addressing Allows Multiple Versions of the Data-Interface Module to Be Used Simultaneously
(Figure labels: Address of Data-Interface Module; Module 1, ..., Module n; Module n+1)
First contains the address of the original version of the data-interface module when called by Module 1 through Module n. This address is updated to point to the upgraded data-interface module and then called by Module n+1 through Module n+n.
Under What Circumstances Does Data Encapsulation Work Well?
Data encapsulation is a structured programming technique that will make future changes to the data layout easier, in most cases. Depending on the tools you are using, the functions performed by your application, and the specific details of the anticipated change, data encapsulation works better in some situations than in others.
As your business grows, the resources used by your application might get distributed across several nodes in a network. In addition, you might have replaced the terminals you once used as your primary user interface with intelligent workstations.
messages and no longer have to ignore any fields. However, the modules still perform with version N functions.
4. Version N+2 of module B is installed. Module B now contains the upgraded function and is able to make use of the new fields in version N+1 messages. Module A knows of the existence of these new fields but not of their use. However, module A is able to intelligently reject their use.
Note.
receives its messages. However, so long as they have at least one message format in common, they can communicate. Figure 10-4 shows two processes in communication. One process supports message versions A through H. The other supports message versions F through M.
Figure 10-4.
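The overlap between the two processes in this example can be worked out with a short sketch. The following Python code is illustrative only: the function names are hypothetical, and the rule of choosing the most recent common version is an assumption, since the text requires only that at least one format be shared.

```python
import string


def version_range(first: str, last: str) -> set:
    """Return the set of single-letter message versions from first to last."""
    letters = string.ascii_uppercase
    return set(letters[letters.index(first):letters.index(last) + 1])


def common_versions(ours: set, theirs: set) -> list:
    """List the message versions that both processes understand."""
    return sorted(ours & theirs)


def choose_version(ours: set, theirs: set) -> str:
    """Pick the most recent shared version (an assumed policy)."""
    common = ours & theirs
    if not common:
        raise ValueError("no message version in common; cannot communicate")
    return max(common)
```

For the two processes described above, version_range("A", "H") and version_range("F", "M") share versions F, G, and H, so communication is possible, and under the assumed policy the processes would exchange version H messages.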
Supporting Implementation
Many implementation techniques can make application upgrades easier and enhance the availability of your application. Such techniques include:
• Using extensible and variable procedures
Using such procedures makes it easier to add functions to your procedures. Refer to Using Extensible and Variable Procedures, following.
Procedures defined with the extensible attribute allow parameters to be added later. This feature is particularly useful for parameters that you do not yet know about. You simply declare the procedure as extensible so that you can add as many parameters to it as you like at a later time. Again, when you do add parameters to the procedure or function, existing callers do not need to change.
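The effect on existing callers can be illustrated with an analogy. The following Python sketch does not use the extensible attribute itself; it uses optional keyword parameters with defaults, which give a comparable result. The function name and its parameters are hypothetical.

```python
def format_rate(amount, rate, *, decimals=2, currency=None):
    """Convert an amount using an exchange rate and return it as text.

    In this sketch, decimals and currency are assumed to have been added in
    a later version. Both have defaults, so a caller written against the
    first version (amount and rate only) keeps working unchanged.
    """
    text = f"{amount * rate:.{decimals}f}"
    if currency is None:
        return text
    return f"{text} {currency}"
```

A caller written for the first version still uses format_rate(100, 1.5), while a newer caller can pass currency="EUR" or decimals=3 without any change to the older call sites.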
Again, the application must perform some reinitialization in order to read in the new exchange rates. Application managers often reinitialize this kind of information by stopping and restarting the application. This approach works fine if you have a period of inactivity in which to do this. If you need operations 24 hours a day, 7 days a week, however, you cannot stop your applications.
Upgrading a NonStop Tuxedo Service
In NonStop Tuxedo applications, you can upgrade a service without any interruption to client processes. It is possible to do this because the same service can be supplied by more than one Tuxedo server and each Tuxedo server is configured as a separate NonStop TS/MP server class. Client processes invoke servers using a service name.
Refer to Section 4, Data Protection and Recovery, for additional details.
Upgrading a TCP Requester
Programs that execute under the TCP can be automatically upgraded to a new version if the flags on the COBOL directory are set up to do so. When the COBOL compiler generates a new version of a program, it adds the new version to the program file and adds a corresponding entry in the COBOL directory.
3. A new version of PROGC is compiled and added to the program file.
(Figure labels: PROGD, PROGA, PROGB, PROGC, PROGE, PROGF)
Because PROGC loops indefinitely, it never returns to PROGB and, therefore, is never called after its first invocation. The old version of PROGC therefore continues to execute.
• Changing the objects that a SQL program accesses; for example, accessing a different table with the same program
• Upgrading the SQL program to a new revision level
• Making changes to the objects that the application accesses; for example, logical changes such as adding a column to a table or physical changes such as moving a table or adding an index
Traditionally, operations such as these require you to bring the application dow
Without the HP Implementation
Without the HP implementation, a database can become unavailable for an unacceptable period of time while physical database reconfiguration takes place. When moving a partition, for example, you typically need to stop accessing the partition before the move starts, and not start accessing it again until the move is finished.
operations when the number of operations against the database is low. You can specify when this phase executes by using special commit criteria. The application can now safely use the new partition. Application downtime can be further reduced by using execution-time name resolution to access the new partition.
and additional examples of the features described here, refer to the appropriate NonStop SQL/MP programming manual. A similarity check enables the SQL executer to determine which parts of an application need to be recompiled as a result of a change. Unnecessary recompilation is thus reduced, and the availability of the application increases. By default, no similarity check is done.
execution plans that pass the similarity check might not execute as efficiently as those that are recompiled for their new objects. The COMPILE PROGRAM option has an optional STORE SIMILARITY INFO clause that causes similarity information to be saved for future compilations.
Execution-Time Name Resolution
The NonStop SQL/MP product allows you to dynamically resolve the names of SQL objects referenced in SQL statements as the program is executing. This feature is useful, for example, when executing the same transaction code against different SQL tables.
Figure 10-5.
Program Upgrade Without the REGISTERONLY or NOREGISTER Options
Traditionally, a new version of a SQL program is developed on a development system and then copied to the production system when the development and test cycles are finished. On the production system, the program must be registered in the SQL catalog and recompiled to refer to the set of SQL objects that reside on the production system.
Figure 10-6.
Without Similarity Checks
Without similarity checks, the entire SQL program would be invalidated, requiring complete SQL recompilation or automatic recompilation of all SQL statements at execution time.
With Similarity Checks
With similarity checks, you have the option of recompiling or automatically recompiling only those statements that fail the similarity checks.
Figure 10-7. Increasing Application Availability During DDL Changes
(Figure labels. Development System: compile the program with SQL using CHECK INOPERABLE PLANS. Production System: install the program using REGISTERONLY ON; enable the similarity check for the table using ALTER TABLE; change DEFINE to reference Table T2; execute the program; perform DDL operation, with no automatic recompilation if the similarity check passes.)
• Libraries used by the process
The same techniques allow a process pair to be downgraded when fallback to an earlier version of an application is required.
The technique depends upon both V1 and V2 versions of the application program providing the following:
• Active rather than passive backup
This technique is unsuitable for many passive-backup process pairs because they do not use restart checkpoints. Restart checkpoints are needed to contain the current procedure call stack of the primary process, which contains the addresses of functions called within the current object file.
a. The primary process requests that the backup process use the PROCESS_STOP_ procedure to terminate itself.
b. The backup process terminates. The application is no longer running as a process pair.
c. The primary process starts the V2 code in processor 2, specifying in the PROCESS_CREATE_, PROCESS_LAUNCH_, or __ns_start_backup() function that the new process be its backup process.
d. The V2 backup process starts.
e.
The global-switch technique depends upon both V1 and V2 versions of the application program being coded such that they provide the following:
• The equivalent of a transaction monitor to receive and assign user requests to either the V1 or V2 versions of the application. This process must be coded such that requests rejected by the V2 version are routed to the V1 version until the V1 version has no outstanding requests.
Forcing Process Migration
This technique is identical to that discussed under Replacing the Process on page 10-39, except that it is the library bound to each version of the application process that differs, not the application code itself. By forcing a new process pair that uses the new library to take over the functions of the old process pair, you also force migration from one version of the library to the other.
Glossary
ACF. See application configuration file (ACF).
action event. An event that requires operator intervention to resolve. Each subsystem determines which events are action events by including a unique Event Management Service (EMS) token in the event message. Action events are reported as pairs of event messages: an action-attention message to report the problem and an action-completion message to report the problem resolution.
audit record. A before-image, an after-image, or control information for a transaction. Audit records are stored in audit-trail files.
audit trail. A record of database changes that can be used by the TMF product to rebuild a database in the event of a hardware or software failure.
audit-trail file. A disk file containing audit records for transactions that make changes to audited disk files.
availability.
BBL receives configuration and status information from the DBBL and updates the bulletin board accordingly. The BBL is replicated on every CPU that supports the application. See also bulletin board (BB) and distinguished bulletin board liaison (DBBL).
catalog. A set of tables containing the descriptions of SQL objects such as tables, columns, indexes, views, files, and partitions.
change management. The process of managing the maintenance and growth of your NonStop system.
command file. A file that serves as a source for command input. For example, users can prepare a command file containing PATHCOM or SCREEN COBOL Utility Program (SCUP) commands. They can then cause the commands in the file to be executed by issuing the PATHCOM or SCUP OBEY command and specifying the name of the file. Alternatively, they can specify this file as the input file when they execute PATHCOM or SCUP.
Common Run-Time Environment (CRE).
with the address at which the backup must start processing if the primary stops. See also data-state information and process pair.
CRE. See Common Run-Time Environment (CRE).
critical event. A Distributed Systems Management (DSM) event that is considered to be crucial to the operation of the system or network or that could have potentially serious consequences, such as the loss of a device.
DDL. See data definition language (DDL).
defensive programming. The process of enhancing a software design methodology by adding checks that test that the program is operating the way it was designed or by adding checks that catch anomalous conditions where the program is not operating as designed.
design outage class. An outage class that includes bugs in design and design failures in hardware and software.
monitor and control systems and networks from a single terminal. DSMS is particularly useful for managing subsystems and their objects.
distributor process. An Event Management Service (EMS) process that distributes event messages from event logs to requesting management applications, to console message destinations, to specific processes in the ViewPoint application, or to a collector on another node.
Dynamic System Configuration (DSC) utility. An HP utility that allows users to reconfigure an operating system online, without having to run Install to create a new system.
EGEN. A TAL routine generated by Event Management Service (EMS) FastStart from an application configuration file (ACF) to provide a high-level interface for building EMS event messages. See also Event Management Service (EMS) FastStart and application configuration file (ACF).
EMS.
event message. A special kind of SPI message that describes an event occurring in the system or network. Event messages are collected, logged, and distributed by EMS. See also Event Management Service (EMS).
Expand software. The HP NonStop network that extends the concept of fault-tolerant operation to networks of geographically distributed NonStop systems.
front-end process. A process that serves as the intermediary between an application process and communication lines providing connectivity with front-end devices such as terminals and point-of-sale devices. Contrast with back-end process.
gateway process. A process, such as the Transaction Delivery Process (TDP) that is part of RSC/MP, that manages communications between dissimilar environments (for example, a workstation and an HP system).
HP NonStop operating system. The operating system for NonStop systems, which consists of the core and system services. The operating system does not include an application program interface.
HP Object Monitoring Facility (OMF). An HP product that enables operators to supervise objects such as processors, disks, files, and processes within the HP environment.
IDS. See intelligent device support (IDS).
I’m alive message.
intelligent device support (IDS). A feature of Pathway that allows COBOL requester programs to interact with external (other than Pathway) processes that, in turn, control intelligent devices such as personal computers, automated teller machines, and point-of-sale devices.
interprocess communication (IPC). The exchange of messages between processes in a system or network.
IOP. See input/output process (IOP).
IPC. See interprocess communication (IPC).
LAN.
MasterLan II-T2 Adaptor Card. A dual-ported network adaptor that helps to provide a fault-tolerant connection between a PC and a network hub.
master machine. The master machine for a NonStop Tuxedo application as designated in the configuration file. The master machine contains the NonStop Tuxedo administrative programs, the master configuration file, and the DBBL process. All administration of the NonStop Tuxedo application is done from the master machine.
Measure.
programmers from having to write programs for specific LAN protocols. See also Multilan.
Network Control Language (NCL). A structured, high-level language that is part of the NonStop NET/MASTER product. NCL is ideally suited to writing procedures for automated system and network operation tasks, such as monitoring system status and submitting commands to subsystems.
Network Statistics Extended (NSX).
NonStop SQL/MP. See NonStop Structured Query Language/MP (NonStop SQL/MP).
NonStop Structured Query Language/MP (NonStop SQL/MP). The HP relational database-management system that promotes efficient online access to large distributed databases.
NonStop TCP/IP. A reliable communication protocol between NonStop systems and workstations or other systems over an Ethernet connection.
NonStop Testing.
object. Any entity subject to independent reference or control by one or more subsystems. Examples of objects are devices, communications lines, processes, and files. In Distributed Systems Management (DSM), every object has a name and type.
object state. In DSM, the current condition of an object that indicates its readiness to do work.
odd state.
operations outage class. An outage class that includes errors caused by operations personnel due to accidents, inexperience, or other actions.
operator message. The text displayed for a system operator that describes an event.
OSS. See Open System Services (OSS).
OSS environment. The Open System Services (OSS) API, tools, and utilities.
outage. Time during which the NonStop system is not capable of doing useful work because of a planned or unplanned outage.
Pathway server. A Pathsend process can be either a standard requester, which initiates application requests, or a nested server, which is configured as a server class but acts as a requester by making requests to other servers. A Pathsend process is also known as a Pathsend requester.
Pathsend requester. See Pathsend process.
Pathway application. A set of programs that perform online transaction-processing tasks in the Guardian environment, using interfaces defined by HP.
down. Configuring phantom devices is one way you can plan for future growth. Disk drives are commonly configured as phantom devices.
physical outage class. An outage class that includes physical faults or failure in the hardware. Any type of hardware component failure belongs in this category.
planned outage. Time during which the system is not capable of doing useful work because of a planned interruption.
RDF. See Remote Duplicate Database Facility (RDF).
reconfiguration. See reconfiguration outage class.
reconfiguration outage class. An outage class that includes all planned outages. Examples include downtime required for planned maintenance (such as software upgrades) and configuration changes (such as adding a new disk or restructuring a database).
recovery. The return of a database file to a consistent state.
reduced instruction-set computing (RISC).
resource manager. A subsystem that manages some transactional objects; for example, a transactional database manager or a transactional queue manager.
restart checkpoint. A checkpoint operation that includes copying a program address to the backup process. This address becomes the restart point in the backup process if the primary process of a process pair fails.
retryable operation. An operation that can be safely tried again if the outcome of earlier attempts is unknown.
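The retryable operation entry can be illustrated with a short sketch. The Python below is hypothetical and does not use any NonStop interface: `OutcomeUnknown`, `retry_unknown_outcome`, and `set_balance` are names invented for this example. It assumes the operation is idempotent (it sets an absolute value), which is what makes repeating it safe when an earlier reply was lost.

```python
class OutcomeUnknown(Exception):
    """Hypothetical error: the request may or may not have been applied."""


def retry_unknown_outcome(operation, max_attempts=3):
    """Call operation again while its outcome is unknown.

    Safe only for retryable operations, where repeating the call
    leaves the same final state as performing it once.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for _ in range(max_attempts):
        try:
            return operation()
        except OutcomeUnknown as error:
            last_error = error  # outcome unknown: try again
    raise last_error


# Illustrative state, not taken from the guide.
account = {"balance": 100}
attempts = {"count": 0}


def set_balance():
    """Set an absolute balance; repeating it does not change the result."""
    attempts["count"] += 1
    account["balance"] = 250  # the update is applied...
    if attempts["count"] == 1:
        raise OutcomeUnknown("reply lost")  # ...but the first reply is lost
    return account["balance"]


result = retry_unknown_outcome(set_balance)
```

By contrast, an operation such as "add 150 to the balance" would not be retryable under this definition, because repeating it after a lost reply could apply the increment twice.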
SNAX product family. The product family that consists of those HP software products that provide access through a reliable communication protocol to IBM Systems Network Architecture (SNA) networks.
SPI. See Subsystem Programmatic Interface (SPI).
SQL. See Structured Query Language (SQL).
SQL object. An entity that is created, manipulated, or dropped by SQL statements and that is described in an SQL catalog.
Subsystem Programmatic Interface (SPI). A set of procedures for building and decoding commands, responses, and event messages.
sync depth. A parameter that sets the maximum number of operations or messages that a process is allowed to queue before action must be taken or a reply must be performed.
synchronization block. A data block containing a synchronization identifier.
TMDS. See HP Maintenance and Diagnostic System (TMDS).
TMF. See token.
TNS/E. Another name for the Y-series servers.
TNS/R. Another name for the S-series servers.
token. In the Subsystem Programmatic Interface (SPI), a distinguishable unit in a message. Programs place tokens in an SPI buffer using the SSPUT procedure and retrieve them from the buffer with the SSGET procedure. A token has two parts: an identifying token code and a value.
TorusNet.
transaction-processing monitor (TP monitor). An entity that provides a transaction-execution environment on top of the operating system. See also Pathway transaction-processing environment and NonStop Tuxedo System /T.
transient error. A hardware or software error that is difficult to reproduce or that occurs in an unpredictable way. These errors typically occur only when a specific set of conditions coincide.
updating state information. The act of passing control-state information and data-state information to the backup process of an active backup process pair. Contrast with checkpoint. See also control-state information and data-state information.
up state. An object that meets all its operational objectives and can be used by the application to provide service. See also down state, odd state, and unknown state.
variable procedure.
X.25AM. An HP product that provides a reliable communication protocol between NonStop systems and other open systems using the X.25 protocol.
$RECEIVE. A special Guardian file name through which a process receives messages from other processes and optionally makes replies.