HP Serviceguard Extended Distance Cluster for Linux A.12.00.
© Copyright 2006, 2014 Hewlett-Packard Development Company, L.P. Confidential computer software. Valid license from HP required for possession, use, or copying. Consistent with FAR 12.211 and 12.212, Commercial Computer Software, Computer Software Documentation, and Technical Data for Commercial Items are licensed to the U.S. Government under vendor’s standard commercial license. The information contained herein is subject to change without notice.
Contents
Printing History
Preface
1 Disaster Recovery in a Serviceguard Cluster
  1.1 Evaluating the Need for a Disaster Recovery Solution
  1.2 What is a Disaster Recovery Architecture?
...
6 Disaster Scenarios and Their Handling
7 Troubleshooting
  7.1 Troubleshooting serviceguard-xdc packages
  7.2 Troubleshooting VxVM Mirroring Package
A Managing an MD Device
Printing History
Table 1 Editions and Releases (columns: Printing Date, Part Number, Edition, Operating System Releases; see Note below)

November 2006, T2808-90001, Edition 1:
• Red Hat 4 U3 or later
• Novell SUSE Linux Enterprise Server 9 SP3 or later
• Novell SUSE Linux Enterprise Server 10 or later

August 2007, T2808-90004, Edition 2:
• Red Hat 4 U3 or later
• Red Hat 5 or later
• Novell SUSE Linux Enterprise Server 10 or later

May 2008, T2808-90008, Edition 3:
• Red Hat 4 U3 or later
• Red Hat 5 or later
• Nov...
Preface
This guide introduces the concept of Serviceguard Extended Distance Clusters (serviceguard-xdc). It describes how to configure and manage HP Serviceguard Extended Distance Clusters for Linux and the associated Software RAID functionality. In addition, this guide includes information on a variety of Hewlett-Packard (HP) high availability cluster technologies that provide disaster recovery for your mission-critical applications.
1 Disaster Recovery in a Serviceguard Cluster
This chapter introduces a variety of Hewlett-Packard high availability cluster technologies that provide disaster recovery for your mission-critical applications. It is assumed that you are already familiar with Serviceguard high availability concepts and configurations.
1.1 Evaluating the Need for a Disaster Recovery Solution
Disaster recovery is the ability to restore applications and data within a reasonable period of time after a disaster.
Figure 1 High Availability Architecture (two panels: "node 1 fails"; "pkg A fails over to node 2")
This architecture, which is typically implemented on one site in a single data center, is sometimes called a local cluster. For some installations, the level of protection given by a local cluster is insufficient.
Figure 2 Disaster Recovery Architecture
1.3 Understanding Types of Disaster Recovery Clusters
To protect against multiple points of failure, cluster components must be geographically dispersed: nodes can be put in different rooms, on different floors of a building, or even in separate buildings or separate cities. The distance between the nodes is dependent on the types of disaster from which you need protection, and on the technology used to replicate data.
The distance between nodes in an extended distance cluster is set by the limits of the data replication technology and networking. An extended distance cluster is shown in Figure 3.
NOTE: There are no rules or recommendations on how far the third location must be from the two main data centers. The third location can be as close as the room next door with its own power source, or as far away as a site across town.
◦ Reduction in human intervention is also a reduction in human error. Disasters don't happen often, so lack of practice and the stress of the situation may increase the potential for human error.
◦ Automated recovery procedures and processes can be transparent to the clients.
Even if recovery is automated, you may choose to, or need to, recover from some types of disasters manually.
2 Building an Extended Distance Cluster Using Serviceguard and Software RAID
Simple Serviceguard clusters are usually configured in a single data center, often in a single room, to provide protection against failures in CPUs, interface cards, and software. Extended Serviceguard clusters are specialized cluster configurations that allow a single cluster to extend across two separate data centers to provide disaster recovery.
imposed by the Fibre Channel link for storage and Ethernet for networks. Storage in both data centers is connected to both nodes through two FC switches to provide multiple paths. This configuration supports a distance of up to 100 km between datacenter1 and datacenter2.
Figure 4 Two Data Center Setup
Figure 4 shows a configuration that is supported with separate network and FC links between the data centers.
2.2 Types of Data Link for Storage and Networking
Fibre Channel technology lets you increase the distance between the components in a Serviceguard cluster, thus making it possible to design a disaster recovery architecture. The following table shows some of the distances possible with a few of the available technologies, including some of the Fiber Optic alternatives.
• Fibre Channel Direct Fabric Attach (DFA) is recommended over Fibre Channel Arbitrated loop configurations, due to the superior performance of DFA, especially as the distance increases. Therefore Fibre Channel switches are recommended over Fibre Channel hubs. • For disaster recovery, application data must be mirrored between the primary data centers. You must ensure that the mirror copies reside in different data centers, as the software cannot determine the locations.
be configured. If the DWDM box supports multiple active DWDM links, that feature can be used instead of the redundant standby feature.
• At least two dark fiber optic links are required between the two primary data centers, with each fibre link routed differently to prevent the “backhoe problem.”
3 Configuring your Environment for Software RAID
The previous chapters discussed conceptual information on disaster recovery architectures and procedural information on creating an extended distance cluster. This chapter discusses the procedures you need to follow to configure Software RAID in your extended distance cluster.
3.1 Understanding Software RAID
Redundant Array of Independent Disks (RAID) is a mechanism that provides storage fault tolerance and, occasionally, better performance.
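For example, the state of a two-disk RAID 1 mirror of the kind used by serviceguard-xdc can be checked at any time through /proc/mdstat; the device names and size below are hypothetical:
# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdf[1] sde[0]
      10485760 blocks super 1.2 [2/2] [UU]
Here [UU] means both mirror halves are active; [U_] or [_U] means one half has failed or been removed.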
3.3.1.1 Setting the Value of the Link Down Timeout Parameter
After installation, you must set the Link Down Timeout parameter for the Fibre Channel cards to a duration equal to the cluster reformation time. The cluster reformation time depends on the heartbeat interval and the node timeout values configured for a particular cluster.
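How this parameter is set depends on the HBA model and its management tool. As an illustration only (not taken from this guide), on a QLogic adapter using the qla2xxx driver a comparable port-down timeout can be set through a module option, and the per-port device-loss timeout can be inspected in sysfs; the value 45 and the rport name are placeholders:
# echo "options qla2xxx qlport_down_retry=45" > /etc/modprobe.d/qla2xxx.conf
# cat /sys/class/fc_remote_ports/rport-0:0-0/dev_loss_tmo
The change takes effect after the driver is reloaded (and, on some distributions, after the initrd is rebuilt).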
A change in the device name prevents the MD mirror from starting. To avoid this problem, HP requires that you make the device names persistent. When there is a disk-related failure and subsequent reboot, there is a possibility that the devices are renamed. Linux names disks in the order they are found. The device that was /dev/sdf may be renamed to /dev/sde if any “lower” device fails or is removed. As a result, you cannot activate the MD device with the original name.
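One common way to obtain persistent names of the /dev/hpdev/mylink-* form used in this guide is a udev rule keyed on the disk's WWID. The rule below is only a sketch: the match key, rule file name, and WWID are assumptions (the WWID shown is the example ID that appears later in this guide), and the mechanism actually shipped with serviceguard-xdc may differ:
Contents of /etc/udev/rules.d/90-hpdev.rules:
KERNEL=="sd*", SUBSYSTEM=="block", ENV{ID_SERIAL}=="3600805f3000b9510a6d7f8a6cdb70054", SYMLINK+="hpdev/mylink-sde"
Reload the rules and retrigger events, for example with:
# udevadm control --reload-rules
# udevadm trigger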
3.3.3.1 Creating and Assembling an MD Device
This example shows how to create the MD device /dev/md0 from one LUN on storage device 1 (/dev/hpdev/sde) and another LUN on storage device 2 (/dev/hpdev/sdf). To create an MD device:
1. Run the following command:
# mdadm --create --verbose /dev/md0 --name=0 --level=1 --raid-devices=2 /dev/hpdev/sde /dev/hpdev/sdf
2. To keep the name of the MD device consistent across the nodes, copy the output of the # mdadm -Db /dev/md0 command to /etc/mdadm.conf.
IMPORTANT: You need to repeat this procedure to create all MD devices that are used in a package.
When data is written to this device, the MD driver writes to both the underlying disks. In the case of read requests, MD reads from one device or the other based on its algorithms. After creating this device, you treat it like any other LUN that is going to have shared data in a Serviceguard environment, and then create a logical volume and a file system on it.
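Once /etc/mdadm.conf has been updated on every node (step 2 above), a quick way to confirm that an adoptive node can bring up the mirror by name is to assemble and stop it there while the package is down. This is a suggested check, not a step from this guide:
# mdadm --assemble /dev/md0
# cat /proc/mdstat
# mdadm -S /dev/md0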
NOTE: If you have added more than one filter in the /etc/lvm/lvm.conf file, the following warning message will be displayed:
WARNING! Ignoring duplicate config node: filter (seeking filter)
The workaround is to edit the /etc/lvm/lvm.conf file to have only one filter, as shown in the example. For example, let us assume the /etc/lvm/lvm.conf file has two filters:
filter = [ "a|/dev/./by-id/. ...
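As an illustration of a single combined filter, the accept patterns can be merged into one filter line that also rejects everything else. The exact patterns below are an assumption and must be adapted to the devices actually used in your configuration:
filter = [ "a|/dev/md.*|", "a|/dev/.*/by-id/.*|", "r|.*|" ]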
To configure the storage to use in a serviceguard-xdc environment:
1. Create a site-consistent disk group with storage across the sites. To create a site-consistent disk group:
# vxdg -g dg_dd1 set siteconsistent=on
For information about creating a disk group, see the Veritas Storage Foundation Cluster File System High Availability Administrator's Guide, available at https://sort.symantec.com/documents/doc_details/sfha/6.0.1/Linux/ProductGuides.
2. Make the existing disk group site-consistent.
3. ...
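A typical sequence for making a cross-site disk group site-consistent looks roughly like the following sketch. The site names, disk names, and volume size are hypothetical, and the exact commands and options should be verified against the Veritas documentation referenced above:
# vxdctl set site=site1                 (run on the nodes in data center 1; use site2 in data center 2)
# vxdisk settag disk01 site=site1
# vxdisk settag disk02 site=site2
# vxdg -g dg_dd1 addsite site1
# vxdg -g dg_dd1 addsite site2
# vxdg -g dg_dd1 set siteconsistent=on
# vxassist -g dg_dd1 make vol01 10g nmirror=2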
4 Configuring the Serviceguard Cluster You must configure the Serviceguard Cluster before configuring the environment for serviceguard-xdc. For more information about configuring the Serviceguard cluster, see Managing HP Serviceguard A.12.00.00 for Linux.
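As a rough orientation only (the authoritative steps are in the manual cited above), the cluster is typically created with the standard Serviceguard commands; the node names and configuration path below are placeholders:
# cmquerycl -v -C /usr/local/cmcluster/conf/cluster.conf -n N1 -n N2 -n N3 -n N4
Edit the generated file (heartbeat networks, member timeout, lock LUN or Quorum Server), then:
# cmcheckconf -C /usr/local/cmcluster/conf/cluster.conf
# cmapplyconf -C /usr/local/cmcluster/conf/cluster.conf
# cmruncl -v
# cmviewcl -v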
5 Configuring Packages for Extended Distance Cluster Software
Starting with A.11.20.10, HP Serviceguard introduces a unified method of configuring packages. Packages created with this method are referred to as modular packages. With this new method, you can configure any package using a single file. Similarly, using the modular package method, you can configure the packages in a serviceguard-xdc environment.
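Before editing the parameters shown in the excerpts below, the package configuration file is normally generated from the modular template with cmmakepkg. The module name xdc/xdc is inferred here from the parameter prefix used later in this chapter and should be verified against the templates installed on your system; the file path is a placeholder:
# cmmakepkg -m xdc/xdc /usr/local/cmcluster/conf/pkg1/pkg1.conf
The generated file contains commented descriptions of each parameter, such as the following excerpts.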
# Raid Monitor Interval
#
# "raid_monitor_interval" is the time interval, in seconds, the raid monitor
# script will wait between checks to verify the accessibility of the component
# mirror disks. If the component mirror disk becomes accessible, the raid
# monitor script will add it back to the MD device.
#
# The default value of "raid_monitor_interval" is set to 30 seconds.
#
# Legal values for xdc/xdc/raid_monitor_interval: (value > 0).
# "service_name", "service_cmd", "service_restart",
# "service_fail_fast_enabled" and "service_halt_timeout" specify a service
# for this package.
#
# "service_cmd" is the command line to be executed to start the service.
#
# The value for "service_restart" can be "unlimited", "none" or any positive
# integer value.
# vxvm_dg dg01
# vxvm_dg dg02
#
# NOTE: A package can have a mix of LVM volume groups and VxVM disk groups.
#
# NOTE: When VxVM is initialized it will store the hostname of the local node
# in its volboot file in a variable called 'hostid'. The Serviceguard package
# control scripts use both the hostname(1m) command and the VxVM hostid. This
# means you must make sure that the VxVM hostid matches the output of the
# hostname(1m) command.
resulting in the package not being able to start. By default, this parameter is set to 0. The value set for this parameter must be more than the value set for the xdc/xdc/raid_monitor_interval parameter. Possible values are:
◦ -1 — To ignore the rpo_target check during startup.
◦ Any positive integer, including zero.
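Taken together, the two timing parameters might appear in the package configuration file as in the following sketch. The values are examples only, and the fully qualified name of rpo_target is assumed here to carry the same xdc/xdc prefix as raid_monitor_interval; confirm both against the generated template:
xdc/xdc/raid_monitor_interval    30
xdc/xdc/rpo_target               60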
Figure 5 Package failover sequence
In this figure, nodes N1 and N2 are in Datacenter 1 at Site 1, while N3 and N4 are in Datacenter 2 in Site 2. In the package configuration file, you need to specify the failover sequence such that N1 of Site 1 is followed by a node in Site 2. In this figure, you need to specify that N1 is followed by N3. Similarly, specify that N2 of Site 1 is followed by N4.
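In the package configuration file, this ordering is expressed by the sequence of node_name entries. For the four nodes in Figure 5, the list would look like this:
node_name    N1
node_name    N3
node_name    N2
node_name    N4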
IMPORTANT: It is recommended that you maintain an equal number of nodes at both sites during maintenance.
Configuring the RAID Monitoring Service
By default, the RAID monitoring service is available as part of the package configuration file. If you configure more than one serviceguard-xdc package in the same cluster, edit the service_name field for each package so that it is unique among all the packages running on the same Serviceguard cluster.
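For a second serviceguard-xdc package in the same cluster, renaming the monitoring service and applying the package might therefore look like the following sketch; the package and service names are illustrative, and the service_cmd should be left as generated by the template:
service_name    pkg2_raid_monitor
# cmcheckconf -P pkg2.conf
# cmapplyconf -P pkg2.conf
# cmrunpkg pkg2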
6 Disaster Scenarios and Their Handling
The previous chapters provided information on deploying Software RAID in your environment. In this chapter, you will find information on how Software RAID addresses various disaster scenarios. All the disaster scenarios described in this section have the following three categories:
• Disaster Scenario: Describes the type of disaster and provides details regarding the cause and the sequence of failures leading to the disasters in the case of multiple failures.
Table 4 Disaster Scenarios and Their Handling (continued)
Disaster Scenario: A package (P1) is running on a node (Node 1). The package uses a mirror (md0) that consists of two storage components: S1 (local to Node 1, /dev/hpdev/mylink-sde) and S2 (local to Node 2). ...
What Happens When This Disaster Occurs: The package (P1) fails over to Node 2 and starts running with the mirror md0 that consists of only the ...
Recovery Process: Complete the following procedure to initiate a recovery: 1. ...
Table 4 Disaster Scenarios and Their Handling (continued)
Disaster Scenario: This is a multiple failure scenario where the failures occur in a particular sequence in the configuration that corresponds to Figure 2, where Ethernet and FC links do not go over DWDM. The RPO_TARGET for the package P1 is set to IGNORE.
What Happens When This Disaster Occurs: The package (P1) continues to run on Node 1 after the first failure, with md0 consisting of only S1.
Table 4 Disaster Scenarios and Their Handling (continued)
Disaster Scenario: In this case, the package (P1) runs with RPO_TARGET set to 60 seconds.
What Happens When This Disaster Occurs: The package (P1) continues to run on N1 with md0 consisting of only S1 after the first failure.
Recovery Process: In this scenario, no attempts are made to repair the first failure until the second failure occurs. Complete the following procedure to initiate a recovery: 1. ...
Table 4 Disaster Scenarios and Their Handling (continued)
Disaster Scenario: In this case, the package (P1) runs with RPO_TARGET set to 60 seconds.
What Happens When This Disaster Occurs: When the first failure occurs, the package (P1) continues to run on N1 with md0 consisting of only S1.
Recovery Process: Complete the following steps to initiate a recovery: 1. Restore the FC links between the data centers. ...
Table 4 Disaster Scenarios and Their Handling (continued)
VxVM disk groups:
Disaster Scenario: A package (P1) is running on a node (N1). Node N1 experiences a failure.
What Happens When This Disaster Occurs: The package (P1) fails over to another node (N2). This node (N2) is configured to take over the package when it fails on node N1. As the network and both the mirrored disk sets are accessible on node N2, and were also accessible when node N1 failed, ...
Table 4 Disaster Scenarios and Their Handling (continued)
Disaster Scenario: A package (P1) is running on a node N1. The package uses a VxVM mirror across sites that consists of two plexes, that is, S1 (local to node N1) and S2 (local to node N2).
What Happens When This Disaster Occurs: The package (P1) fails over to node N2 and starts running with only one plex of the VxVM mirror that consists of ...
Recovery Process: To initiate a recovery: 1. Restore data center 1, node N1, and storage S1. ...
7 Troubleshooting
This chapter describes how to troubleshoot issues related to serviceguard-xdc packages.
7.1 Troubleshooting serviceguard-xdc packages
Symptom: /dev/hpdev/mylink-sde cannot be added to /dev/md0, which means /dev/md0 is running with only one mirror half. The following message is logged in the package log file:
mdadm: /dev/hpdev/mylink-sde reports being an active member for /dev/md0, but a --re-add fails.
mdadm: not performing --add as that would convert /dev/hpdev/mylink-sde in to a spare.
One of the underlying disks of the disk group configured in the serviceguard-xdc configuration is not accessible. The volume is no longer mirrored.
Solution: Check if the disk is accessible. Then, reattach the site and recover the disk group.
• Symptom: The following message is logged in the package log file:
WARNING: VxVM mirroring with more than two plexes is not supported with serviceguard-xdc
Serviceguard Alert message sent to: test1@test.
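The reattach-and-recover solution mentioned above might look like the following sketch once the disk is reachable again; the disk group and site names are the hypothetical ones used earlier in this guide, and the exact procedure should be confirmed in the Veritas documentation:
# vxdisk -o alldgs list
# vxdg -g dg_dd1 reattachsite site1
# vxrecover -g dg_dd1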
A Managing an MD Device
This chapter includes additional information on how to manage the MD device. For the latest information on how to manage an MD device, see The Software-RAID HOWTO manual available at http://www.tldp.org/HOWTO/Software-RAID-HOWTO.html.
Example 1 Stopping the MD Device /dev/md0
To stop the MD device /dev/md0, run the following command:
[root@dlhct1 dev]# mdadm -S /dev/md0
Once you stop the device, its entry is removed from the /proc/mdstat file.
hpdev/mylink-sdc \
disk/by-id/scsi-3600805f3000b9510a6d7f8a6cdb70054-part1 \
disk/by-path/pci-0000:06:01.0-scsi-0:0:1:30-part1
Run the following command to remove a failed component device from the MD array:
# mdadm --remove <md device> <failed component device>
In this example:
# mdadm --remove /dev/md0 /dev/hpdev/mylink-sdc1
This command removes the failed mirrored disk from the array.
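Once the failed disk has been repaired or replaced, it can be returned to the mirror with mdadm, after which a resynchronization runs (a full rebuild, or a quicker one if a write-intent bitmap is enabled). The device names continue the example above; in a running serviceguard-xdc package, the RAID monitoring service normally performs this step automatically:
# mdadm --add /dev/md0 /dev/hpdev/mylink-sdc1
# cat /proc/mdstat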
# echo 50000 > /proc/sys/dev/raid/speed_limit_min
or
# sysctl -w dev.raid.speed_limit_min=50000
A.6 Enabling and Disabling Write Intent Internal Bitmap
You can either enable or disable the write-intent internal bitmap on an active array as follows:
1. To enable bitmap for an MD device /dev/md0:
# mdadm --grow --bitmap=internal /dev/md0
2. To disable bitmap for an MD device /dev/md0:
# mdadm --grow --bitmap=none /dev/md0
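After changing either setting, the new values and the bitmap state can be verified with standard commands; the MD device name is the example used above:
# sysctl dev.raid.speed_limit_min dev.raid.speed_limit_max
# mdadm --detail /dev/md0 | grep -i bitmap
# cat /proc/mdstat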