HP Serviceguard Extended Distance Cluster for Linux A.11.20.
© Copyright 2006, 2013 Hewlett-Packard Development Company, L.P. Confidential computer software. Valid license from HP required for possession, use, or copying. Consistent with FAR 12.211 and 12.212, Commercial Computer Software, Computer Software Documentation, and Technical Data for Commercial Items are licensed to the U.S. Government under vendor’s standard commercial license. The information contained herein is subject to change without notice.
Contents
Printing History
Preface
1 Disaster Recovery in a Serviceguard Cluster
1.1 Evaluating the Need for Disaster Recovery Solution
1.2 What is a Disaster Recovery Architecture?
2 Building an Extended Distance Cluster Using Serviceguard and Software RAID
3 Configuring your Environment for Software RAID
4 Configuring the Serviceguard Cluster
5 Configuring Packages for Extended Distance Cluster Software
6 Disaster Scenarios and Their Handling
7 Troubleshooting
7.1 Troubleshooting serviceguard-xdc packages
A Managing an MD Device
A.1 Viewing the Status of the MD Device
Printing History

Table 1 Editions and Releases

November 2006, T2808-90001, Edition 1:
• Red Hat 4 U3 or later
• Novell SUSE Linux Enterprise Server 9 SP3 or later
• Novell SUSE Linux Enterprise Server 10 or later

August 2007, T2808-90004, Edition 2:
• Red Hat 4 U3 or later
• Red Hat 5 or later
• Novell SUSE Linux Enterprise Server 10 or later

May 2008, T2808-90008, Edition 3:
• Red Hat 4 U3 or later
• Red Hat 5 or later
• Novell SUSE Linux Enterprise Server …
Preface
This guide introduces the concept of Serviceguard Extended Distance Clusters (serviceguard-xdc). It describes how to configure and manage HP Serviceguard Extended Distance Clusters for Linux and the associated Software RAID functionality. In addition, this guide includes information on a variety of Hewlett-Packard (HP) high availability cluster technologies that provide disaster recovery for your mission-critical applications.
1 Disaster Recovery in a Serviceguard Cluster
This chapter introduces a variety of Hewlett-Packard high availability cluster technologies that provide disaster recovery for your mission-critical applications. It is assumed that you are already familiar with Serviceguard high availability concepts and configurations.

1.1 Evaluating the Need for Disaster Recovery Solution
Disaster recovery is the ability to restore applications and data within a reasonable period of time after a disaster.
Figure 1 High Availability Architecture (node 1 fails; pkg A fails over from node 1 to node 2)

This architecture, which is typically implemented on one site in a single data center, is sometimes called a local cluster. For some installations, the level of protection given by a local cluster is insufficient.
Figure 2 Disaster Recovery Architecture

1.3 Understanding Types of Disaster Recovery Clusters
To protect against multiple points of failure, cluster components must be geographically dispersed: nodes can be put in different rooms, on different floors of a building, or even in separate buildings or separate cities. The distance between the nodes is dependent on the types of disaster from which you need protection, and on the technology used to replicate data.
The distance between the nodes in an extended distance cluster is set by the limits of the data replication technology and by networking limits. An extended distance cluster is shown in Figure 3.

NOTE: There are no rules or recommendations on how far the third location must be from the two main data centers. The third location can be as close as the room next door with its own power source, or as far away as a site across town.
◦ Reduction in human intervention is also a reduction in human error. Disasters do not happen often, so lack of practice and the stress of the situation may increase the potential for human error.
◦ Automated recovery procedures and processes can be transparent to the clients.
Even if recovery is automated, you may choose to, or need to, recover from some types of disasters manually.
2 Building an Extended Distance Cluster Using Serviceguard and Software RAID
Simple Serviceguard clusters are usually configured in a single data center, often in a single room, to provide protection against failures in CPUs, interface cards, and software. Extended Serviceguard clusters are specialized cluster configurations that allow a single cluster to extend across two separate data centers to provide disaster recovery.
centers is connected to both nodes via two FC switches to provide multiple paths. This configuration supports a distance of up to 100 km between datacenter1 and datacenter2.

Figure 4 Two Data Center Setup

Figure 4 shows a configuration that is supported with separate network and FC links between the data centers. In this configuration, the FC links and the Ethernet networks are not carried over DWDM links, but each of these links is duplicated between the two data centers for redundancy.
Table 2 Link Technologies and Distances

Type of Link                                    Maximum Distance Supported
Gigabit Ethernet Twisted Pair                   50 meters
Short Wave Fiber                                500 meters
Long Wave Fiber                                 10 kilometers
Dense Wave Division Multiplexing (DWDM)         100 kilometers

The development of DWDM technology allows designers to use dark fiber (high speed communication lines provided by common carriers) to extend the distances that were formerly subject to limits imposed by Fibre Channel for storage and Ethernet for network links.
2.4 Guidelines for Separate Network and Data Links
• There must be less than 200 milliseconds of latency in the network between the data centers.
• Routing is allowed to the third data center if a Quorum Server is used in that data center.
• The maximum distance between the data centers for this type of configuration is currently limited by the maximum distance supported for the networking type or Fibre Channel link type being used, whichever is shorter.
See the HP Configuration Guide, available through your HP representative, for more information on supported devices.
3 Configuring your Environment for Software RAID
The previous chapters discussed conceptual information on disaster recovery architectures and procedural information on creating an extended distance cluster. This chapter discusses the procedures you need to follow to configure Software RAID in your extended distance cluster.

3.1 Understanding Software RAID
Redundant Array of Independent Disks (RAID) is a mechanism that provides storage fault tolerance and, occasionally, better performance.
3.2.3 Installing serviceguard-xdc Software

First time installation
If you are installing the serviceguard-xdc software for the first time:
1. Check the Serviceguard version installed on your system by running the command:
   # rpm -q serviceguard
2. If the Serviceguard version is A.11.19.xx, upgrade the Serviceguard version to A.11.20.20 and the serviceguard-xdc version to A.11.20.20. For information about upgrading to Serviceguard A.11.20.20 and serviceguard-xdc A.11.20.20, see HP Serviceguard A.11.20.20 for Linux Release Notes.
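For example, a quick way to check which versions are currently installed before you begin (a convenience check, not part of the official procedure; the RPM file name in the install command is a placeholder that depends on your distribution and release):
# rpm -q serviceguard serviceguard-xdc
# rpm -ivh serviceguard-xdc-A.11.20.20-0.x86_64.rpm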
1. Upgrade the Serviceguard version to A.11.20.10 and the serviceguard-xdc version to A.11.20.10. For information on upgrading to Serviceguard A.11.20.10 and serviceguard-xdc A.11.20.10, see HP Serviceguard A.11.20.10 for Linux Release Notes.
2. Upgrade the Serviceguard version to A.11.20.20 and the serviceguard-xdc version to A.11.20.20. For information on upgrading to Serviceguard A.11.20.20 and serviceguard-xdc A.11.20.20, see HP Serviceguard A.11.20.20 for Linux Release Notes.
hangs when access to a mirror is lost. However, the MD device resumes activity when the specified hang period expires. This ensures that no data is lost. This parameter is required to address a scenario where an entire data center fails, but its components do not all fail at the same time; instead, they undergo a rolling failure. In this case, if access to one disk is lost, the MD layer hangs and data is no longer written to it. Within the hang period, the node goes down and a cluster reformation takes place.
Using component disks of different sizes results in a mirror being created of a size equal to the smaller of the two disks. Be sure to create the mirror using the persistent device names of the component devices. As mentioned earlier, the first step in enabling Software RAID in your environment is to create the Multiple Disk (MD) device using two underlying component disks. This MD device is a virtual device which ensures that any data written to it is written to both component disks.
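The following is a minimal sketch of creating such a mirror with mdadm; the persistent device names are examples and must match the udev links configured on your nodes:
# mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/hpdev/mylink-sde /dev/hpdev/mylink-sdf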
a. Run the following command:
   # mdadm -A -R /dev/md0 /dev/hpdev/sde /dev/hpdev/sdf
b. To keep the name of the MD device consistent across the nodes, see step 2 of the MD device creation procedure.
3. Stop the MD device on the other node by running the following command:
   # mdadm -S /dev/md0
   You must stop the MD device soon after you assemble it on the second node.
For example, if /dev/md0 and /dev/md1 are the two MD devices that are specified in the package configuration file, edit the filter in the /etc/lvm/lvm.conf file:
filter = [ "a|/dev/.*/by-id/.*|", "a|/dev/md0|", "a|/dev/md1|", "r|/dev/hpdev/md0_mirror0|", "r|/dev/hpdev/md0_mirror1|", "r|/dev/hpdev/md1_mirror0|", "r|/dev/hpdev/md1_mirror1|" ]
where /dev/hpdev/md1_mirror0 and /dev/hpdev/md1_mirror1 are the persistent device names for the physical disks that form the MD device /dev/md1.
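Once the filter accepts the MD devices, the LVM volume group used by the package is built on top of the mirror rather than on the individual component disks. A minimal sketch, assuming the hypothetical volume group name vg01 and logical volume lvol1:
# pvcreate /dev/md0
# vgcreate vg01 /dev/md0
# lvcreate -L 1G -n lvol1 vg01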
Repeat this procedure for every node that you add to the cluster.

1. Start the package
Starting a package configured for Software RAID is the same as starting any other package in Serviceguard for Linux; see the command sketch after the guidelines below. Keep the following guidelines in mind before you enable Software RAID for a particular package:
• Ensure that the Quorum Server link is close to the Ethernet links in your setup.
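For example, a package enabled for Software RAID is started with the standard Serviceguard commands; the package and node names below are placeholders:
# cmmodpkg -e pkg1
# cmrunpkg -n node1 pkg1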
4 Configuring the Serviceguard Cluster
You must configure the Serviceguard Cluster before configuring the environment for serviceguard-xdc. For more information about configuring the Serviceguard cluster, see Managing HP Serviceguard A.11.20.20 for Linux.
5 Configuring Packages for Extended Distance Cluster Software
Starting with A.11.20.10, HP Serviceguard introduces a unified method of configuring packages. Packages created with this method are referred to as modular packages. With this new method, you can configure any package using a single file. Similarly, using the modular package method, you can configure the packages in a serviceguard-xdc environment.
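For example, a modular serviceguard-xdc package configuration file can be generated with cmmakepkg and then edited; the module name xdc/xdc is taken from the parameter names shown later in this chapter, the file path is arbitrary, and additional modules may be required depending on what else the package contains:
# cmmakepkg -m xdc/xdc /etc/cmcluster/pkg1/pkg1.conf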
# "rpo_target" is used to specify the Recovery Point Objective Target.
# This refers to the maximum time window allowed after which the raid
# system will be disabled to prevent large data loss, resulting in the
# package not being able to start. This is by default set to 0.
# Recommended value is more than the value set for "raid_monitor_interval".
# Possible values are:
#   1. -1 - To ignore the rpo_target check during startup
#   2. Any positive integer including zero.
# Specify the name of each volume group.
# For example, if this package uses your volume groups vg01 and vg02, enter:
#   vg vg01
#   vg vg02
# The volume group activation method is defined above. The filesystems
# associated with these volume groups are specified below.
# Legal values for vg: /^[0-9A-Za-z\/][0-9A-Za-z_.\/\-]*[0-9A-Za-z]$/, /^[0-9A-Za-z]$/.
service_name                  raid.monitor
service_cmd                   "$SGSBIN/raid_monitor $SG_PACKAGE"
service_restart               3
service_fail_fast_enabled     yes
service_halt_timeout          300

Editing Parameters in the Configuration File
Use the information on the Package Configuration worksheet to complete the file. You can also refer to the comments in the configuration template for additional explanation of each parameter. You can include the following information:
• xdc/xdc/rpo_target
  This parameter specifies the Recovery Point Objective Target (see the example that follows).
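As an illustration only (the value 60 and the file name are placeholders, not recommendations), the parameter is set in the package configuration file and the file is then validated and applied with the usual Serviceguard commands:
xdc/xdc/rpo_target 60
# cmcheckconf -P /etc/cmcluster/pkg1/pkg1.conf
# cmapplyconf -P /etc/cmcluster/pkg1/pkg1.conf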
Figure 5 Package failover sequence

In this figure, nodes N1 and N2 are in Datacenter 1 at Site 1, while N3 and N4 are in Datacenter 2 at Site 2. In the package configuration file, you need to specify the failover sequence such that N1 of Site 1 is followed by a node in Site 2. In this figure, you need to specify that N1 is followed by N3. Similarly, specify that N2 of Site 1 is followed by N4.
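For example, the node_name entries in the package configuration file can alternate between the two sites so that each node of Site 1 is followed by a node of Site 2; the node names follow the figure and are placeholders for your own host names:
node_name N1
node_name N3
node_name N2
node_name N4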
IMPORTANT: It is recommended that you maintain an equal number of nodes at both sites during maintenance.

Configuring the RAID Monitoring Service
By default, the RAID monitoring service is available as part of the package configuration file. If you configure more than one serviceguard-xdc package in the same cluster, edit the service_name field so that it is unique for each package; the service name must be unique across all packages running on the same Serviceguard cluster.
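For example, if two serviceguard-xdc packages pkg1 and pkg2 run in the same cluster, their service names might be distinguished as follows; the suffixes are hypothetical and only uniqueness matters:
service_name raid.monitor.pkg1   (in the configuration file for pkg1)
service_name raid.monitor.pkg2   (in the configuration file for pkg2)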
6 Disaster Scenarios and Their Handling
The previous chapters provided information on deploying Software RAID in your environment. In this chapter, you will find information on how Software RAID addresses various disaster scenarios. All the disaster scenarios described in this section have the following three categories:
• Disaster Scenario
  Describes the type of disaster and provides details regarding the cause and the sequence of failures leading to the disasters in the case of multiple failures.
• What Happens When This Disaster Occurs
  Describes how the cluster and the package behave when the disaster occurs.
• Recovery Process
  Describes the steps to follow to recover from the disaster.
Table 4 Disaster Scenarios and Their Handling (continued)

Disaster Scenario: A package (P1) is running on a node (Node 1). The package uses a mirror (md0) that consists of two storage components: S1 (local to Node 1, /dev/hpdev/mylink-sde) and S2 (local to Node 2).

What Happens When This Disaster Occurs: The package (P1) fails over to Node 2 and starts running with the mirror of md0 that consists of …

Recovery Process: Complete the following procedure to initiate a recovery:
1. …
Table 4 Disaster Scenarios and Their Handling (continued)

Disaster Scenario: This is a multiple failure scenario where the failures occur in a particular sequence, in the configuration that corresponds to figure 2, where the Ethernet and FC links do not go over DWDM.

What Happens When This Disaster Occurs: The package (P1) continues to run on Node 1 after the first failure, with the md0 that consists of only S1.
Table 4 Disaster Scenarios and Their Handling (continued)

In this scenario, no attempts are made to repair the first failure until the second failure occurs.

Recovery Process: Complete the following procedure to initiate a recovery:
1. To recover from the first failure, restore the FC links between the data centers. As a result, S1 is accessible from N2.
2. …
Table 4 Disaster Scenarios and Their Handling (continued)

Disaster Scenario: In this case, the package (P1) runs with RPO-TARGET set to 60 seconds. Initially, the package (P1) is running on node N1.

What Happens When This Disaster Occurs: When the first failure occurs, the package (P1) continues to run on N1 with md0 consisting of only S1.

Recovery Process: Complete the following steps to initiate a recovery:
1. Restore the FC links between the data centers. As a result, S2 is …
7 Troubleshooting
This chapter describes how to troubleshoot issues related to serviceguard-xdc packages.

7.1 Troubleshooting serviceguard-xdc packages
Symptom: /dev/hpdev/mylink-sde cannot be added to /dev/md0, which means /dev/md0 is running with only one mirror half. The following message is logged into the package log file:
mdadm: /dev/hpdev/mylink-sde reports being an active member for /dev/md0, but a --re-add fails.
mdadm: not performing --add as that would convert /dev/hpdev/mylink-sde in to a spare.
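One way this condition is commonly cleared (a generic mdadm sketch, not necessarily the resolution this guide prescribes; confirm that the rejected disk really is the stale mirror half before erasing its metadata) is to zero its superblock and add it back so that it resynchronizes as a full mirror member:
# mdadm --zero-superblock /dev/hpdev/mylink-sde
# mdadm /dev/md0 --add /dev/hpdev/mylink-sde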
A Managing an MD Device
This chapter includes additional information on how to manage the MD device. For the latest information on how to manage an MD device, see The Software-RAID HOWTO manual available at http://www.tldp.org/HOWTO/Software-RAID-HOWTO.html.
A.1 Viewing the Status of the MD Device
After creating an MD device, you can view its status. This tells you whether the device is clean, up and running, or whether there are any errors. To view the status of the MD device, run the following command on any node:
# cat /proc/mdstat
Immediately after the MD devices are created and during some recovery processes, the devices undergo a re-mirroring process. You can view the progress of this process in the /proc/mdstat file.
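For a healthy two-disk mirror, the output looks similar to the following; this is illustrative only, the device names and block counts depend on your configuration, and [UU] indicates that both mirror halves are up:
Personalities : [raid1]
md0 : active raid1 sdf[1] sde[0]
      9766784 blocks [2/2] [UU]
unused devices: <none>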
A.2 Stopping the MD Device
After you create an MD device, it begins to run. You need to stop the device and add the configuration into the raid.conf file. To stop the MD device, run the following command:
# mdadm -S /dev/md0
When you stop this device, all resources that were previously occupied by this device are released. Also, the entry of this device is removed from the /proc/mdstat file.
A.3 Starting the MD Device
After you create an MD device, you need to stop and start it to ensure that it is active. You do not need to start the MD device in any other scenario, as this is handled by the serviceguard-xdc software.
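To start (assemble) the device manually in this situation, the same assemble command shown earlier can be used; this sketch reuses the component device names from the earlier examples:
# mdadm -A -R /dev/md0 /dev/hpdev/mylink-sde /dev/hpdev/mylink-sdf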
A.4 Removing and Adding an MD Mirror Component Disk
In certain failure scenarios, you need to manually remove the mirror component of an MD device and add it again later. For example, if the links between two data centers fail, you need to remove and later re-add the disks that were marked as failed. When a disk within an MD device fails, the /proc/mdstat file of the MD array displays a message.
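As an illustration of removing a mirror half so that it can be replaced or re-added later (the device names follow the earlier examples and are placeholders; the --fail step is needed only if the kernel has not already marked the disk as failed):
# mdadm --manage /dev/md0 --fail /dev/hpdev/mylink-sde
# mdadm --manage /dev/md0 --remove /dev/hpdev/mylink-sde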
Example 4 Adding a new disk as an MD component to the /dev/md0 array
To add a new disk to the /dev/md0 array, run the following command:
# mdadm --add /dev/md0 /dev/hpdev/sde
Following is an example of the status message displayed in the /proc/mdstat file once the disk is added:
Personalities : [raid1]
md0 : active raid1 sde[2] sdf[0]
      9766784 blocks [2/1] [U_]
      [=>...................]  recovery =  8.9% (871232/9766784) finish=2.7min speed=54452K/sec
unused devices: <none>