Understanding and Designing Serviceguard Disaster Tolerant Architectures
Fourth Edition
Manufacturing Part Number: T1906-90023
December 2007
Legal Notices © Copyright 2007 Hewlett-Packard Development Company, L.P. Confidential computer software. Valid license from HP required for possession, use, or copying. Consistent with FAR 12.211 and 12.212, Commercial Computer Software, Computer Software Documentation, and Technical Data for Commercial Items are licensed to the U.S. Government under vendor’s standard commercial license. The information contained herein is subject to change without notice.
Contents

1. Disaster Tolerance and Recovery in a Serviceguard Cluster
   Evaluating the Need for Disaster Tolerance
   What is a Disaster Tolerant Architecture?
   Understanding Types of Disaster Tolerant Clusters
   Extended Distance Clusters
Printing History

Table 1  Editions and Releases

Printing Date     Part Number    Edition    Operating System Releases
December 2006     B7660-90018    Edition 1  HP-UX 11i v1 and 11i v2
September 2007    B7660-90020    Edition 2  HP-UX 11i v1, 11i v2, and 11i v3
December 2007     T1906-90022    Edition 3  HP-UX 11i v1, 11i v2, and 11i v3
December 2007     T1906-90023    Edition 4  HP-UX 11i v1, 11i v2, and 11i v3

The third edition includes information on new features for Continentalclusters Maintenance mode and Disaster Recovery Rehearsal.
HP Printing Division: ESS Software Division Hewlett-Packard Co. 19111 Pruneridge Ave.
Preface

The following guides describe disaster tolerant cluster solutions using Serviceguard, Serviceguard Extension for RAC, Metrocluster Continuous Access XP, Metrocluster Continuous Access EVA, Metrocluster EMC SRDF, and Continentalclusters:

• Understanding and Designing Serviceguard Disaster Tolerant Architectures

• Designing Disaster Tolerant HA Clusters Using Metrocluster and Continentalclusters

The Understanding and Designing Serviceguard Disaster Tolerant Architectures user's guide provides an overview of disaster tolerant cluster architectures and guidelines for designing them.
• Chapter 2, Designing a Continental Cluster, shows the creation of disaster tolerant solutions using the Continentalclusters product.

• Chapter 3, Building Disaster Tolerant Serviceguard Solutions Using Metrocluster with Continuous Access XP, shows how to integrate physical data replication via Continuous Access XP with metropolitan and continental clusters.
Guide to Disaster Tolerant Solutions Documents

Use the following table as a guide for locating specific Disaster Tolerant Solutions documentation:

Table 2  Disaster Tolerant Solutions Document Road Map

To set up: Extended Distance Cluster for Serviceguard / Serviceguard Extension for RAC
Read: Understanding and Designing Serviceguard Disaster Tolerant Architectures

To set up: Metrocluster with Continuous Access XP
Read: Understanding and Designing Serviceguard Disaster Tolerant Architectures
  • Chapter 1: Disaster Tolerance and Recovery in a Serviceguard Cluster

To set up: Metrocluster with EMC SRDF
Read: Understanding and Designing Serviceguard Disaster Tolerant Architectures
  • Chapter 1: Disaster Tolerance and Recovery in a Serviceguard Cluster
Designing Disaster Tolerant HA Clusters Using Metrocluster and Continentalclusters
  • Chapter 1: Designing a Metropolitan Cluster
  • Chapter 5: Building Disaster Tolerant Serviceguard Solutions Using Metrocluster with EMC SRDF

To set up: Continental Cluster, including Continental Cluster using Continuous Access EVA data replication, Continental Cluster using EMC SRDF data replication, and Continental Cluster using other data replication
Read: Understanding and Designing Serviceguard Disaster Tolerant Architectures
  • Chapter 1: Disaster Tolerance and Recovery in a Serviceguard Cluster
Designing Disaster Tolerant HA Clusters Using Metrocluster and Continentalclusters
  • Chapter 2: Designing a Continental Cluster

To set up: Three Data Center Architecture
Read: Understanding and Designing Serviceguard Disaster Tolerant Architectures
  • Chapter 1: Disaster Tolerance and Recovery in a Serviceguard Cluster
Designing Disaster Tolerant HA Clusters Using Metrocluster and Continentalclusters

To set up: Maintenance Mode, Disaster Recovery Rehearsal, Data Replication Storage Failover Preview; Cross-Subnet Configuration with Serviceguard or Metrocluster
Read:
  • Chapter 1: …
On-line versions of the above documents and other HA documentation are available at http://docs.hp.com -> High Availability.
Related Publications The following documents contain additional useful information: • Clusters for High Availability: a Primer of HP Solutions, Second Edition.
1  Disaster Tolerance and Recovery in a Serviceguard Cluster

This guide introduces a variety of Hewlett-Packard high availability cluster technologies that provide disaster tolerance for your mission-critical applications. It is assumed that you are already familiar with Serviceguard high availability concepts and configurations.
Evaluating the Need for Disaster Tolerance

Disaster tolerance is the ability to restore applications and data within a reasonable period of time after a disaster.
…line inoperable as well as the computers. In this case disaster recovery would be moot, and local failover is probably the more appropriate level of protection. On the other hand, you may have an order processing center that is prone to floods in the winter. The business loses thousands of dollars a minute while the order processing servers are down.
What is a Disaster Tolerant Architecture?

In a Serviceguard cluster configuration, high availability is achieved by using redundant hardware to eliminate single points of failure. This protects the cluster against hardware faults, such as the node failure in Figure 1-1.

Figure 1-1  High Availability Architecture
…impact. For these types of installations, and many more like them, it is important to guard not only against single points of failure, but against multiple points of failure (MPOF), or against single massive failures that cause many components to fail, such as the failure of a data center, of an entire site, or of a small area.
Understanding Types of Disaster Tolerant Clusters

To protect against multiple points of failure, cluster components must be geographically dispersed: nodes can be put in different rooms, on different floors of a building, or even in separate buildings or separate cities.
…formerly known as campus clusters, but that term is not always appropriate because the supported distances have increased beyond the typical size of a single corporate campus. The maximum distance between nodes in an Extended Distance Cluster is set by the limits of the data replication and networking technology. An Extended Distance Cluster is shown in Figure 1-3.
Benefits of Extended Distance Cluster

• This configuration implements a single Serviceguard cluster across two data centers, and uses either MirrorDisk/UX or Veritas VxVM mirroring from Symantec for data replication (see the sketch below). No (cluster) license beyond Serviceguard is required for this solution, making it the least expensive to implement.
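The following is a minimal sketch of how the host-based mirroring named above might be set up with MirrorDisk/UX (LVM). The volume group name, size, and physical volume group layout are illustrative assumptions, not values from this guide; PVG-strict allocation (-s g) assumes /etc/lvmpvg defines one physical volume group per data center, so that each mirror copy lands on a different site's array.

    # Minimal sketch, assuming /dev/vgdata spans arrays in both data
    # centers and /etc/lvmpvg defines one physical volume group per site.
    # Create a 1 GB logical volume with one mirror copy; PVG-strict
    # allocation (-s g) forces the two copies onto different sites.
    lvcreate -L 1024 -m 1 -s g -n lvol_app /dev/vgdata

    # Verify that the two copies reside in different physical volume groups.
    lvdisplay -v /dev/vgdata/lvol_app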
Extended Distance Cluster for RAC

An Extended Distance Cluster for RAC merges Extended Distance Cluster with Serviceguard Extension for RAC (SGeRAC). SGeRAC is a specialized configuration that enables Oracle Real Application Clusters (RAC) to run in an HP-UX environment on high availability clusters.
Metropolitan Cluster

A metropolitan cluster is a cluster that has alternate nodes located in different parts of a city or in adjacent cities. Putting nodes farther apart increases the likelihood that alternate nodes will be available for failover in the event of a disaster.
Figure 1-4  Metropolitan Cluster (Data Center A in East San Francisco and Data Center B in West San Francisco, connected by a high availability network and robust data replication, with two arbitrators in a third location)

A key difference between extended distance clusters and metropolitan clusters is the data replication technology used.
Benefits of Metrocluster

• Metrocluster offers a more resilient solution than Extended Distance Cluster, as it provides full integration between Serviceguard's application package and the data replication subsystem. The storage subsystem is queried to determine the state of the data on the arrays. Metrocluster knows that application package data is replicated between two data centers.
• Disk resynchronization is independent of CPU failure (that is, if the hosts at the primary site fail but the disk remains up, the disk knows it does not have to be resynchronized).
…be implemented in an Extended Distance Cluster. Metrocluster always uses array-based replication/mirroring, and requires storage from the same vendor in both data centers (that is, a pair of XPs with Continuous Access, a pair of Symmetrix arrays with SRDF, or a pair of EVAs with Continuous Access).
…configuration. In this architecture, each cluster maintains its own quorum, so an arbitrator data center is not used for a continental cluster. A continental cluster can use any WAN connection via a TCP/IP protocol; however, due to data replication needs, high speed connections such as T1 or T3/E3 leased lines or switched lines may be required. See Figure 1-5.
• The physical connection is one or more leased lines managed by a common carrier. Common carriers cannot guarantee the same reliability that a dedicated physical cable can. The distance can introduce a time lag for data replication, which creates an issue with data currency.
• You can integrate Continentalclusters with any storage component of choice that is supported by Serviceguard. Continentalclusters provides a structure to work with any type of data replication mechanism.
• Single instance applications using Veritas Cluster Volume Manager (CVM) or Veritas Cluster File System (CFS) are supported by Continentalclusters.

• Configuration of multiple recovery pairs is allowed. A recovery pair in a continental cluster consists of two Serviceguard clusters: one functions as a primary cluster and the other functions as the recovery cluster for a specific application.
NOTE: Maintenance mode is an optional feature. To enable it, configure a non-replicated shared disk with a file system on all recovery clusters, and specify the CONTINENTAL_CLUSTER_STATE_DIR parameter in the Continentalclusters configuration file (see the sketch below). By default, a recovery group can be moved into maintenance mode only if its primary package is running.
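As a hedged illustration, the corresponding entry in the Continentalclusters ASCII configuration file might look like the following; only the parameter name comes from this guide, and the directory path is a hypothetical example.

    # Hypothetical excerpt from a Continentalclusters configuration file.
    # The path is illustrative; it must resolve to a file system on the
    # non-replicated shared disk configured on each recovery cluster.
    CONTINENTAL_CLUSTER_STATE_DIR    /cc/state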
…package whose configuration is similar to that of the recovery package, thereby verifying the recovery environment and procedure. The cmrecovercl command with the -r and -g options is used to start a rehearsal for a recovery group on the recovery cluster.

NOTE: DR Rehearsal startup is allowed only if the recovery group is in maintenance mode.
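A minimal usage sketch follows, assuming a recovery group that has already been placed in maintenance mode; the group name sales_rg is a hypothetical placeholder.

    # Start a disaster recovery rehearsal for one recovery group on the
    # recovery cluster. The group must already be in maintenance mode.
    cmrecovercl -r -g sales_rg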
Continental Cluster With Cascading Failover

A continental cluster with cascading failover uses three main data centers distributed between a metropolitan cluster, which serves as a primary cluster, and a standard cluster, which serves as a recovery cluster.
On site 2, a local mirror is associated with the destination devices (labeled as device B'). The mirror technology is storage specific (for example, Business Copy). This local mirror also acts as a source device for recovery during rolling disasters. A rolling disaster is defined as a disaster that occurs before the cluster is able to recover from a non-disastrous failure.
Figure 1-6  Cascading Failover Data Center Distribution Using Metrocluster
Three Data Center Architecture

A Three Data Center solution integrates Serviceguard, Metrocluster Continuous Access XP, Continentalclusters, and HP StorageWorks XP 3DC Data Replication Architecture. This configuration protects against local and wide-area disasters by using both synchronous replication (for data consistency) and Continuous Access journaling (for long-distance replication).
Figure 1-7  Three Data Center Solution Overview
HP StorageWorks XP in a Three Data Center Architecture

The HP StorageWorks XP Three Data Center architecture enables data to be replicated over three data centers concurrently, using a combination of Continuous Access Synchronous and Continuous Access Journaling data replication. In an XP 3DC design, two configurations are available: Multi-Target and Multi-Hop.
Comparison of Disaster Tolerant Solutions

Table 1-1 summarizes and compares the disaster tolerant solutions that are currently available:

Table 1-1  Comparison of Disaster Tolerant Cluster Solutions

Key Benefit
  Extended Distance Cluster / Extended Distance Cluster for RAC: Excellent in "normal" operations and partial failure.

Key Limitation
  Extended Distance Cluster / Extended Distance Cluster for RAC: No ability to check the state of the data before starting up the application. If the volume group (vg) can be activated, the application will be started.

Maximum Distance
  Extended Distance Cluster: 100 kilometers.
  Extended Distance Cluster for RAC: 100 km (maximum of 2 nodes, with either SLVM or CVM); 10 km (maximum of 2 nodes with SLVM, and 8 nodes with CVM and CFS).

Data Replication Mechanism
  Extended Distance Cluster: Host-based, via MirrorDisk/UX or (Veritas) VxVM.

Application Failover
  Extended Distance Cluster: Automatic (no manual intervention required).
  Extended Distance Cluster for RAC: Instance is already running at the second site.
  Metrocluster: Automatic (no manual intervention required).

Maximum Cluster Size Allowed
  Extended Distance Cluster: 2 to 16 nodes (up to 4 when using dual lock disks).
  Extended Distance Cluster for RAC: 2, 4, 6, or 8 nodes with SLVM or CVM, with a maximum distance of 100 km.

Cluster Network
  Extended Distance Cluster: Single IP subnet, Cross-Subnet.
  Extended Distance Cluster for RAC: Single IP subnet.
  Metrocluster: Single IP subnet, Cross-Subnet.
  Continentalclusters: Two configurations: a single IP subnet for both clusters (LAN connection between clusters), or two IP subnets, one per cluster (WAN connection between clusters).
For the most up-to-date support and compatibility information, see the SGeRAC for SLVM, CVM & CFS Matrix and the Serviceguard Compatibility and Feature Matrix on http://docs.hp.com -> High Availability -> Serviceguard Extension for Real Application Cluster (ServiceGuard OPS Edition) -> Support Matrixes.
Table 1-2  Supported Distances for Extended Distance Cluster Configurations

Cluster type/Volume Manager: Serviceguard A.11.17 with CVM 4.1 or CFS 4.1 mirroring
  Distances up to 10 kilometers: Supported for clusters with 2, 4, 6, or 8 nodes with Serviceguard A.11.17 on 11i v2.
  Distances up to 100 kilometers: Supported for clusters with 2, 4, 6, or 8 nodes with Serviceguard A.11.…
Disaster Tolerant Architecture Guidelines

Disaster tolerant architectures represent a shift away from massive central data centers and toward more distributed data processing facilities.
Protecting Data through Replication

The most significant losses during a disaster are the loss of access to data, and the loss of data itself. You protect against this loss through data replication, that is, creating extra copies of the data. Data replication should:

• Ensure data consistency by replicating data in a logical order so that it is immediately usable or recoverable.
…depending on the volume of data. Some applications, depending on the role they play in the business, may need to have a faster recovery time, within hours or even minutes.

On-line Data Replication

On-line data replication is a method of copying data from one site to another across a link. It is used when very short recovery time, from minutes to hours, is required.
Figure 1-8  Physical Data Replication (physical replication in software, such as MirrorDisk/UX, where direct access to both copies of data is optional, and physical replication in hardware, such as an XP array; the distance between nodes is limited by the link type (SCSI, ESCON, FC) and by the intermediate devices (FC switch, DWDM) on the link path)
Disadvantages of physical replication in hardware are:

• The logical order of data writes is not always maintained in synchronous replication. When a replication link goes down and transactions continue at the primary site, writes to the primary disk are queued in a bit-map.
• Because there are multiple read devices, that is, the node has access to both copies of data, there may be improvements in read performance.

• Writes are synchronous unless the link or disk is down.

Disadvantages of physical replication in software are:

• As with physical replication in the hardware, the logical order of data writes is not maintained.
Figure 1-9  Logical Data Replication (logical replication in software; no direct access to both copies of data)

Advantages of using logical replication are:

• The distance between nodes is limited only by the networking technology.

• There is no additional hardware needed to do logical replication, unless you choose to boost CPU power and network bandwidth.
• If the primary database fails and is corrupted, so that the replica takes over, the process for restoring the primary database so that it can be used as the replica is complex. This often involves recreating the database and doing a database dump from the replica.

• Applications often have to be modified to work in an environment that uses a logical replication database.
Figure 1-10  Alternative Power Sources (each data center is fed by two power circuits: circuits 1 and 2 in Data Center A, circuits 3 and 4 in Data Center B)

Housing remote nodes in another building often implies that they are powered by a different circuit, so it is especially important to make sure all nodes are powered from different sources if the disaster tolerant cluster is located in two data centers in the same building.
Disaster Tolerant Local Area Networking

The configurations described in this section are for FDDI and Ethernet based Local Area Networks.

Figure 1-11  Reliability of the Network is Paramount (Wrong: cables between Data Center A and Data Center B use the same route, so a single accident severs both network cables and makes disaster recovery impossible. Right: cables use different routes.)
Figure 1-12  Highly Available FDDI Network: Two Options (C = FDDI SAS (single attach) with concentrator; S = FDDI DAS (dual attach) with bypass switch; nodes in Data Centers A, B, and C)

Ethernet networks can also be used to connect nodes in a disaster tolerant architecture within the following guidelines:

• Each node…
Figure 1-13  Routing Highly Available Ethernet Connections in Opposite Directions (connections from Data Center A to Data Center B and from Data Center A to Data Center C are routed through separate hubs and bridges)

Disaster Tolerant Wide Area Networking

Disaster tolerant networking for continental clusters is directly tied to the data replication mechanism.
— ATM: high end

• Reliability affects whether or not data replication happens, and therefore the consistency of the data should you need to fail over to the recovery cluster. Redundant leased lines should be used, and should be from two different common carriers, if possible.

• Cost influences both bandwidth and reliability. Higher bandwidth and dual leased lines cost more.
Managing a Disaster Tolerant Environment

In addition to the changes in hardware and software needed to create a disaster tolerant architecture, there are also changes in the way you manage the environment. Configuration of a disaster tolerant architecture needs to be carefully planned, implemented, and maintained.
Even if recovery is automated, you may choose to, or need to, recover from some types of disasters manually. A rolling disaster, which is a disaster that happens before the cluster has recovered from a previous disaster, is an example of when you may want to switch over manually; see the sketch below for the basic Serviceguard commands involved.
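To make the manual path concrete, the following is a minimal sketch using standard Serviceguard commands, with hypothetical node and package names. In a Metrocluster or Continentalclusters environment the actual procedure involves additional, replication-specific steps, so treat this only as the Serviceguard-level outline.

    # Check cluster, node, and package status before acting.
    cmviewcl -v

    # Start the package manually on a surviving node at the other site.
    cmrunpkg -n node3 pkg_orders

    # Re-enable automatic switching for the package once it is running.
    cmmodpkg -e pkg_orders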
Additional Disaster Tolerant Solutions Information

For information on how to build, configure, and manage disaster tolerant cluster solutions using Metrocluster Continuous Access XP, Metrocluster Continuous Access EVA, Metrocluster EMC SRDF, Continentalclusters, and Three Data Center Architecture, refer to the following guide:

• Designing Disaster Tolerant HA Clusters Using Metrocluster and Continentalclusters
2  Building an Extended Distance Cluster Using Serviceguard

Simple Serviceguard clusters are usually configured in a single data center, often in a single room, to provide protection against failures in CPUs, interface cards, and software. Extended Serviceguard clusters are specialized cluster configurations that allow a single cluster to extend across two or three separate data centers for increased disaster tolerance.
Types of Data Link for Storage and Networking

FibreChannel technology lets you increase the distance between the components in a Serviceguard cluster, thus making it possible to design a disaster tolerant architecture. The following table shows some of the distances possible with a few of the available technologies, including some of the FibreChannel alternatives.
NOTE: Increased distance often means increased cost and reduced speed of connection. Not all combinations of links are supported in all cluster types. For a current list of supported configurations and supported distances, refer to the HP Configuration Guide, available through your HP representative. As new technologies become supported, they will be described in that guide.
Two Data Center Architecture

The two data center architecture is based on a standard Serviceguard configuration with half of the nodes in one data center, and the other half in another data center. Nodes can be located in separate data centers in the same building, or even in separate buildings, within the limits of FibreChannel technology.
• MirrorDisk/UX mirroring for LVM and VxVM mirroring are supported for clusters of 2 or 4 nodes. However, the dual cluster lock devices can only be configured in LVM volume groups.

• There can be separate networking and FibreChannel links between the two data centers, or both networking and Fibre Channel can go over DWDM links between the two data centers.
• Due to the maximum of 3 images (1 original image plus two mirror copies) allowed in MirrorDisk/UX, if JBODs are used for application data, only one data center can contain JBODs while the other data center must contain disk arrays with hardware mirroring. Note that having three mirror copies will affect performance on disk writes. VxVM and CVM 3.…
…network to provide backup for both the heartbeat network and the RAC cache fusion network; however, it can only provide failover capability for one of these networks at a time.

NOTE: Serviceguard Extension for Faster Failover (SGeFF), which requires a two-node cluster and the use of a quorum server, is not supported in a two data center architecture.
Two Data Center FibreChannel Implementations

FibreChannel Using Hubs

In a two data center configuration, shown in Figure 2-1, it is required to use a cluster lock disk, which is only supported for up to 4 nodes. This configuration can be implemented using any HP-supported FibreChannel devices. Disks must be available from all nodes using redundant links. Not all links are shown in Figure 2-1.
Figure 2-2  Two Data Centers with FibreChannel Switches and FDDI
DWDM with Two Data Centers

Figure 2-3 is an example of a two data center configuration using DWDM for both storage and networking.
Cross-Subnet Configuration with Two Data Centers

Figure 2-4 is an example of a two data center cross-subnet configuration using DWDM for both storage and networking.
Cross-Subnet Configurations

As of Serviceguard A.11.18 it is possible to configure multiple subnets, joined by a router, both for the cluster heartbeat and for data, with some nodes using one subnet and some another. A cross-subnet configuration allows:

• Automatic package failover from a node on one subnet to a node on another

• A cluster heartbeat that spans subnets
• Because Veritas Cluster File System from Symantec (CFS) requires Low Latency Transport (LLT) communication among the nodes, Serviceguard cannot be configured in cross-subnet configurations with CFS alone. However, CFS is supported in specific cross-subnet configurations with Serviceguard and HP add-on products such as Serviceguard Extension for Oracle RAC (SGeRAC); see the documentation listed below.
IMPORTANT: Although this topology can be implemented on a single site, it is most commonly used by extended-distance clusters, and specifically site-aware disaster-tolerant clusters, which require HP add-on software. Design and configuration of such clusters are covered in the disaster-tolerant documentation delivered with Serviceguard. For more information, see the following documents at http://docs.hp.com.
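To make the topology concrete, the following is a hedged sketch of the cross-subnet portion of a modular package configuration file. The subnet addresses and node names are illustrative assumptions, not values from this guide; consult the Serviceguard A.11.18 package documentation for the authoritative parameter set.

    # Hypothetical fragment of a modular package configuration file for a
    # cross-subnet cluster. Each subnet is reachable from only some of the
    # nodes, so access is declared PARTIAL and nodes are listed per subnet.
    monitored_subnet           15.244.65.0
    monitored_subnet_access    PARTIAL
    ip_subnet                  15.244.65.0
    ip_subnet_node             node1
    ip_subnet_node             node2
    ip_address                 15.244.65.82

    monitored_subnet           15.244.56.0
    monitored_subnet_access    PARTIAL
    ip_subnet                  15.244.56.0
    ip_subnet_node             node3
    ip_subnet_node             node4
    ip_address                 15.244.56.82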
Advantages and Disadvantages of a Two Data Center Architecture

The advantages of a two data center architecture are:

• Lower cost.

• Only two data centers are needed, meaning less space and less coordination between operations staff.

• No arbitrator nodes are needed.
Two Data Center and Third Location Architectures

A two data center and third location architecture has the following configuration requirements:

NOTE: There is no hard requirement on how far the third location has to be from the two main data centers. The third location can be as close as the room next door with its own power source, or as far away as another site across town.
…Serviceguard cluster, or to configure the LAN used for the Quorum Server IP address with at least two LAN interface cards using APA (Automatic Port Aggregation) LAN_MONITOR mode, to improve availability if a LAN failure occurs. Prior to Quorum Server revision A.02.00, running the Quorum Server in a Serviceguard cluster was not supported.
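As a hedged sketch, pointing the cluster at a quorum server in the third location is done with the quorum server parameters in the cluster ASCII configuration file, along the following lines; the hostname and timing values are illustrative assumptions.

    # Hypothetical fragment of a cluster ASCII configuration file.
    # QS_HOST names the quorum server system at the third location; the
    # polling interval and timeout extension are given in microseconds,
    # and the values shown are illustrative only.
    QS_HOST                  qs-site3
    QS_POLLING_INTERVAL      300000000
    QS_TIMEOUT_EXTENSION     2000000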
• Application data must be mirrored between the primary data centers. If MirrorDisk/UX is used, Mirror Write Cache (MWC) must be the Consistency Recovery policy defined for all mirrored logical volumes (see the sketch below). This will allow for resynchronization of stale extents after a node crash, rather than requiring a full resynchronization.
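A minimal sketch of setting and verifying this policy with standard HP-UX LVM commands follows. The volume group and logical volume names are illustrative, and lvchange generally requires the logical volume to be closed before its consistency recovery policy can be changed.

    # Illustrative names; deactivate or close the logical volume first.
    # Enable Mirror Write Cache as the consistency recovery policy.
    lvchange -M y /dev/vgdata/lvol_data

    # Confirm the policy: the output should report Mirror Write Cache: on.
    lvdisplay /dev/vgdata/lvol_data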
• Veritas CVM mirroring is supported for Serviceguard, Serviceguard OPS Edition, or Serviceguard Extension for RAC clusters for distances up to 10 kilometers for 2, 4, 6, or 8 node clusters, and up to 100 kilometers for 2 node clusters*. Since CVM 3.…
The following table shows the possible configurations using a three data center architecture.

Table 2-2  Supported System and Data Center Combinations
(nodes in Data Center A / nodes in Data Center B / Data Center C / Serviceguard version)

1 / 1 / 1 Arbitrator Node / A.11.13 or later
1 / 1 / Quorum Server System / A.11.13 or later
1 / 1 / Quorum Server System / A.11.16 or later (including SGeFF)
2 / 1 / 2 Arbitrator Nodes / A.…
4 / 4 / 2* Arbitrator Nodes / A.11.13 or later
4 / 4 / Quorum Server System / A.11.13 or later
5 / 5 / 1 Arbitrator Node / A.11.13 or later
5 / 5 / 2* Arbitrator Nodes / A.11.13 or later
5 / 5 / Quorum Server System / A.11.13 or later
6 / 6 / 1 Arbitrator Node / A.11.…
NOTE: Serviceguard Extension for RAC clusters are limited to 2, 4, 6, or 8 nodes.
Figure 2-5  Two Data Centers and Third Location with DWDM and Arbitrators
Figure 2-6  Two Data Centers and Third Location with DWDM and Quorum Server
Figure 2-6 is an example of a two data center and third location configuration using DWDM, with a quorum server node at the third site; this example is specifically for an SGeRAC cluster. The DWDM boxes connected between the two primary data centers are configured with redundant dark fibre links, and the standby fibre feature has been enabled.
Rules for Separate Network and Data Links

• The network interfaces used must support DLPI (link level).

• There must be less than 200 milliseconds of latency in the network between the data centers.

• No routing is allowed for the networks between the data centers.

• Routing is allowed to the third data center if a Quorum Server is used in that data center.
• There must be at least two alternately routed Fibre Channel data replication links between each data center. If a third location is used for arbitrator nodes, no Fibre Channel data replication links are required for the third location.

• Fibre Channel hubs are only supported for distances up to 10 kilometers. For distances longer than 10 kilometers, Fibre Channel switches are required.
Guidelines on DWDM Links for Network and Data

• The network interfaces used must support DLPI (link level).

• There must be less than 200 milliseconds of latency in the network between the data centers.

• No routing is allowed for the networks between the data centers.

• Routing is allowed to the third data center if a Quorum Server is used in that data center.
• FibreChannel switches must be used in a DWDM configuration; FibreChannel hubs are not supported. Direct Fabric Attach mode must be used for the ports connected to the DWDM link. Refer to the HP Configuration Guide, available through your HP representative, for more information on supported devices.
Additional Disaster Tolerant Solutions Information

For information on how to build, configure, and manage disaster tolerant cluster solutions using Metrocluster Continuous Access XP, Metrocluster Continuous Access EVA, Metrocluster EMC SRDF, Continentalclusters, and Three Data Center Architecture, refer to the following guide:

• Designing Disaster Tolerant HA Clusters Using Metrocluster and Continentalclusters
Glossary

A

application restart  Starting an application, usually on another node, after a failure. Applications can be restarted manually, which may be necessary if data must be restored before the application can run (for example, Business Recovery Services work like this). Applications can be restarted by an operator using a script, which can reduce human error.
C

campus cluster  A single cluster that is geographically dispersed within the confines of an area owned or leased by the organization, such that it has the right to run cables above or below ground between buildings in the campus. Campus clusters are usually spread out in different rooms in a single building, or in different adjacent or nearby buildings. See also Extended Distance Cluster.
consistency group  A set of Symmetrix RDF devices that are configured to act in unison to maintain the integrity of a database. Consistency groups allow you to configure R1/R2 devices on multiple Symmetrix frames in Metrocluster with EMC SRDF.

continental cluster  A group of clusters that use routed networks and/or common carrier networks for data replication and cluster communication, to support package failover between separate clusters in different data centers.
D

disaster  An event causing the failure of multiple components or entire data centers that renders unavailable all services at a single location; these include natural disasters such as earthquake, fire, or flood, acts of terrorism or sabotage, and large-scale power outages.
filesystem replication  The process of replicating filesystem changes from one node to another.

…filesystem or the database. Complex transactions may result in the modification of many diverse physical blocks on the disk.

LUN (Logical Unit Number)  A SCSI term that refers to a logical disk device composed of one or more physical disk mechanisms, typically configured into a RAID level.
mission critical application  Hardware, software, processes, and support services that must meet the uptime requirements of an organization. Examples of mission critical applications that must be able to survive regional disasters include financial trading services, e-business operations, 911 phone service, and patient record databases.

O

off-line data replication
…for example, EMC's Symmetrix Remote Data Facility or the HP StorageWorks E Disk Array XP Series Continuous Access), or software-based, where data is replicated on multiple disks using dedicated software on the primary node (for example, MirrorDisk/UX).

planned downtime  An anticipated period of time when nodes are taken down for hardware maintenance, software maintenance (OS and application), backup, reorganization, upgrades (software or hardware), etc.
regional disaster  A disaster, such as an earthquake or hurricane, that affects a large region. Local, campus, and proximate metropolitan clusters are less likely to protect from regional disasters.

remote failover  Failover to a node at another data center or remote location.

resynchronization  The process of making the data between two sites consistent and current once systems are restored following a failure. Also called data resynchronization.
…replication. Minimizes the chance of inconsistent or corrupt data in the event of a rolling disaster.

T

transaction processing monitor (TPM)  Software that allows you to modify an application to store in-flight transactions in an external location until that transaction has been committed to all possible copies of the database or filesystem, thus ensuring completion of all copied transactions.