Managing HP Serviceguard A.11.20.
© Copyright 2006, 2013 Hewlett-Packard Development Company, L.P. Confidential computer software. Valid license from HP required for possession, use, or copying. Consistent with FAR 12.211 and 12.212, Commercial Computer Software, Computer Software Documentation, and Technical Data for Commercial Items are licensed to the U.S. Government under vendor’s standard commercial license. The information contained herein is subject to change without notice.
Contents

Printing History ................................................................ 15
Preface ........................................................................ 17
1 Serviceguard for Linux at a Glance ........................................... 19
1.1 What is Serviceguard for Linux? ............................................ 19
1.1.1 Failover
3.1.2.1 What is WBEM? .......................................................... 37
3.1.2.2 Support for Serviceguard WBEM Provider ................................. 37
3.1.2.3 WBEM Query ............................................................. 37
3.1.2.4 WBEM Indications ....................................................... 38
3.2 How the Cluster Manager Works
3.5.9 Package Switching and Relocatable IP Addresses ........................... 69
3.5.10 Address Resolution Messages after Switching on the Same Subnet .......... 70
3.5.11 VLAN Configurations ..................................................... 70
3.5.11.1 What is VLAN? ......................................................... 70
3.5.11.2 Support for Linux VLAN
4.7.3.2 What Is IPv6-Only Mode? ................................................ 88
4.7.3.2.1 Rules and Restrictions for IPv6-Only Mode ............................ 89
4.7.3.2.2 Recommendations for IPv6-Only Mode ................................... 90
4.7.3.3 What Is Mixed Mode? .................................................... 90
4.7.3.3.1 Rules and Restrictions for Mixed Mode
4.8.13 Configuring a Package: Next Steps ....................................... 132
4.9 Planning for Changes in Cluster Size ....................................... 132
5 Building an HA Cluster Configuration ......................................... 135
5.1 Preparing Your Systems ..................................................... 135
5.2.8.4 Setting up Access-Control Policies ..................................... 160
5.2.8.4.1 Role Conflicts ....................................................... 162
5.2.8.5 Package versus Cluster Roles ........................................... 163
5.2.9 Verifying the Cluster Configuration ...................................... 163
6.1.4.31 service_halt_timeout .................................................. 184
6.1.4.32 generic_resource_name ................................................. 184
6.1.4.33 generic_resource_evaluation_type ...................................... 185
6.1.4.34 generic_resource_up_criteria .......................................... 185
6.1.4.35 vgchange_cmd
7.1.11.6 Status After Halting a Node ........................................... 206
7.1.11.7 Viewing Information about Unowned Packages ............................ 207
7.1.12 Checking the Cluster Configuration and Components ....................... 207
7.1.12.1 Verifying Cluster and Package Components .............................. 208
7.1.12.2 Setting up Periodic Cluster Verification
7.6.4.4 Example: Deleting a Subnet Used by a Package ........................... 232
7.6.5 Updating the Cluster Lock LUN Configuration Online ....................... 233
7.6.6 Changing MAX_CONFIGURED_PACKAGES ........................................ 233
7.7 Configuring a Legacy Package ............................................... 233
7.7.1 Creating the Legacy Package Configuration
8.7.2.1 Sample System Log Entries .............................................. 255
8.7.3 Reviewing Configuration Files ............................................ 256
8.7.4 Reviewing the Package Control Script ..................................... 256
8.7.5 Using the cmquerycl and cmcheckconf Commands ............................. 256
8.7.6 Reviewing the LAN Configuration
A.5 Handling Application Failures .............................................. 273
A.5.1 Create Applications to be Failure Tolerant ............................... 273
A.5.2 Be Able to Monitor Applications .......................................... 274
A.6 Minimizing Planned Downtime ................................................ 274
F Maximum and Minimum Values for Parameters .................................... 295
G Monitoring Script for Generic Resources ...................................... 297
G.1 Launching Monitoring Scripts ............................................... 297
G.2 Template of a Monitoring Script ............................................ 299
H HP Serviceguard Toolkit for Linux
Printing History

Table 1 Printing History

Printing Date   Part Number   Edition
November 2001   B9903-90005   First
November 2002   B9903-90012   First
December 2002   B9903-90012   Second
November 2003   B9903-90033   Third
February 2005   B9903-90043   Fourth
June 2005       B9903-90046   Fifth
August 2006     B9903-90050   Sixth
July 2007       B9903-90054   Seventh
March 2008      B9903-90060   Eighth
April 2009      B9903-90068   Ninth
July 2009       B9903-90073   Tenth
June 2012       701460-001    NA
December 2012   701460-002    NA
May 2013        701460-003
Preface

This guide describes how to configure and manage Serviceguard for Linux on HP ProLiant servers under the Linux operating system. It is intended for experienced Linux system administrators. (For Linux system administration tasks that are not specific to Serviceguard, use the system administration documentation and manpages for your distribution of Linux.) The contents are as follows:
• Chapter 1 (page 19) describes a Serviceguard cluster and provides a roadmap for using this guide.
Information about supported configurations is in the HP Serviceguard for Linux Configuration Guide. For updated information on supported hardware and Linux distributions, refer to the HP Serviceguard for Linux Certification Matrix. Both documents are available at: http://www.hp.com/info/sglx

Problem Reporting

If you have any problems with the software or documentation, please contact your local Hewlett-Packard Sales Office or Customer Service Center.
1 Serviceguard for Linux at a Glance This chapter introduces Serviceguard for Linux and shows where to find different kinds of information in this book. It includes the following topics: • What is Serviceguard for Linux? (page 19) • Using Serviceguard for Configuring in an Extended Distance Cluster Environment (page 21) • Using Serviceguard Manager (page 22) • Configuration Roadmap (page 22) If you are ready to start setting up Serviceguard clusters, skip ahead to Chapter 4 (page 79).
Figure 1 Typical Cluster Configuration

In the figure, node 1 (one of two SPUs) is running package A, and node 2 is running package B. Each package has a separate group of disks associated with it, containing data needed by the package's applications, and a copy of the data. Note that both nodes are physically connected to disk arrays. However, only one node at a time may access the data for a given group of disks.
Figure 2 Typical Cluster After Failover

After this transfer, the package typically remains on the adoptive node as long as the adoptive node continues running. If you wish, however, you can configure the package to return to its primary node as soon as the primary node comes back online. Alternatively, you may manually transfer control of the package back to the primary node at the appropriate time. Figure 2 (page 21) does not show the power connections to the cluster, but these are important as well.
1.3 Using Serviceguard Manager

NOTE: For more information, see Appendix E (page 291), and the section on Serviceguard Manager in the latest version of the Serviceguard Release Notes. For more information about Serviceguard Manager compatibility, see Serviceguard/Serviceguard Manager Plug-in Compatibility and Feature Matrix and the latest Release Notes at http://www.hp.com/go/hpux-serviceguard-docs (Select HP Serviceguard).

Serviceguard Manager is the graphical user interface for Serviceguard.
Figure 3 Tasks in Configuring a Serviceguard Cluster

HP recommends that you gather all the data that is needed for configuration before you start. See Chapter 4 (page 79) for tips on gathering data.
2 Understanding Hardware Configurations for Serviceguard for Linux

This chapter gives a broad overview of how the server hardware components operate with Serviceguard for Linux. The following topics are presented:
• Redundant Cluster Components
• Redundant Network Components (page 25)
• Redundant Disk Storage (page 29)
• Redundant Power Supplies (page 30)
Refer to the next chapter for information about Serviceguard software components.
2.2.1 Rules and Restrictions

• A single subnet cannot be configured on different network interfaces (NICs) on the same node.
• In the case of subnets that can be used for communication between cluster nodes, the same network interface must not be used to route more than one subnet configured on the same node.
• For IPv4 subnets, Serviceguard does not support different subnets on the same LAN interface.
  ◦ For IPv6, Serviceguard supports up to two subnets per LAN interface (site-local and global).
Figure 4 Redundant LANs

In Linux configurations, the use of symmetrical LAN configurations is strongly recommended, with the use of redundant hubs or switches to connect Ethernet segments. The software bonding configuration should be identical on each node, with the active interfaces connected to the same hub or switch.

2.2.3 Cross-Subnet Configurations

As of Serviceguard A.11.
• You should not use the wildcard (*) for node_name in the package configuration file, as this could allow the package to fail over across subnets when a node on the same subnet is eligible; failing over across subnets can take longer than failing over on the same subnet. List the nodes in order of preference instead of using the wildcard.
• You should configure IP monitoring for each subnet; see “Monitoring LAN Interfaces and Detecting Failure: IP Level” (page 66).
IMPORTANT: Although cross-subnet topology can be implemented on a single site, it is most commonly used by extended-distance clusters and Metrocluster. For more information about such clusters, see the following documents at http://www.hp.
• Only IPv4 networks support iSCSI storage devices.
• HP recommends that you do not use a heartbeat LAN for iSCSI storage devices.
The following restrictions are applicable when iSCSI LUNs are used as shared storage:
• An iSCSI storage device does not support configuring a lock LUN.
• Hardware initiator models do not support iSCSI storage.
• iSCSI storage devices that are exposed using SCSI targets are not supported.
representative can provide more details about the layout of power supplies, disks, and LAN hardware for clusters.
3 Understanding Serviceguard Software Components

This chapter gives a broad overview of how the Serviceguard software components work.
• cmlogd—cluster system log daemon
• cmdisklockd—cluster lock LUN daemon
• cmresourced—Serviceguard Generic Resource Assistant daemon
• cmprd—Persistent Reservation daemon
• cmserviced—Service Assistant daemon
• qs—Quorum Server daemon
• cmlockd—utility daemon
• cmsnmpd—cluster SNMP subagent (optionally running)
• cmwbemd—WBEM daemon
• cmproxyd—proxy daemon
Each of these daemons logs to the Linux system logging files.
3.1.1.3 Network Manager Daemon: cmnetd

This daemon monitors the health of cluster networks. It also handles the addition and deletion of relocatable package IPs, for both IPv4 and IPv6 addresses.

3.1.1.4 Log Daemon: cmlogd

cmlogd is used by cmcld to write messages to the system log file. Any message written to the system log by cmcld is written through cmlogd. This is to prevent any delays in writing to syslog from impacting the timing of cmcld. The path for this daemon is $SGLBIN/cmlogd.
is killed. It can also be configured as a Serviceguard package in a cluster other than the one(s) it serves; see Figure 9 (page 42). All members of the cluster initiate and maintain a connection to the quorum server; if it dies, the Serviceguard nodes will detect this and then periodically try to reconnect to it. If there is a cluster re-formation while the quorum server is down and tie-breaking is needed, the re-formation will fail and all the nodes will halt (system reset).
3.1.2 Serviceguard WBEM Provider

3.1.2.1 What is WBEM?

Web-Based Enterprise Management (WBEM) is a set of management and Internet standard technologies developed to unify the management of distributed computing environments, facilitating the exchange of data across otherwise disparate technologies and platforms.
• HP_SGNodePackage
• HP_SGPService
• HP_SGPackagePService
• HP_SGNodePService
• HP_SGLockLunDisk
• HP_SGRemoteQuorumService
• HP_SGLockObject
• HP_SGQuorumServer
• HP_SGLockLun
• HP_SGLockDisk
For more information about WBEM provider classes, see the Managed Object Format (MOF) files for properties.
3.2.2 Heartbeat Messages

Central to the operation of the cluster manager is the sending and receiving of heartbeat messages among the nodes in the cluster. Each node in the cluster exchanges UDP heartbeat messages with every other node over each IP network configured as a heartbeat device. If a cluster node does not receive heartbeat messages from all other cluster nodes within the prescribed time, a cluster re-formation is initiated; see “What Happens when a Node Times Out” (page 75).
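Heartbeat networks and the member timeout are declared in the cluster configuration file. The fragment below is an illustrative sketch only: the cluster name, node names, interface names, and addresses are invented, and MEMBER_TIMEOUT is assumed to be expressed in microseconds (so 14000000 would mean 14 seconds); check the template generated by cmquerycl for the exact syntax and defaults on your release.

```
CLUSTER_NAME    cluster1
MEMBER_TIMEOUT  14000000        # assumed to be microseconds: 14 seconds

NODE_NAME       node1
  NETWORK_INTERFACE bond0
    HEARTBEAT_IP 192.168.1.1

NODE_NAME       node2
  NETWORK_INTERFACE bond0
    HEARTBEAT_IP 192.168.1.2
```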
3.2.5 Dynamic Cluster Re-formation

A dynamic re-formation is a temporary change in cluster membership that takes place as nodes join or leave a running cluster. Re-formation differs from reconfiguration, which is a permanent modification of the configuration files. Re-formation of the cluster occurs under the following conditions (not a complete list):
• An SPU or network failure was detected on an active node.
• An inactive node wants to join the cluster.
When a node obtains the cluster lock, this partition is marked so that other nodes will recognize the lock as “taken.”

NOTE:
• The lock LUN is dedicated for use as the cluster lock, and, in addition, HP recommends that this LUN comprise the entire disk; that is, the partition should take up the entire disk.
• An iSCSI storage device does not support configuring a lock LUN.

The complete path name of the lock LUN is identified in the cluster configuration file.
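As a hedged sketch (the device paths below are examples only), the lock LUN is typically declared per node in the cluster configuration file, because the same LUN may appear under different device names on different nodes:

```
NODE_NAME       node1
  NETWORK_INTERFACE bond0
    HEARTBEAT_IP 192.168.1.1
  CLUSTER_LOCK_LUN /dev/sdb1

NODE_NAME       node2
  NETWORK_INTERFACE bond0
    HEARTBEAT_IP 192.168.1.2
  CLUSTER_LOCK_LUN /dev/sdc1
```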
Figure 8 Quorum Server Operation

A quorum server can provide quorum services for multiple clusters. Figure 9 illustrates quorum server use across four clusters.

Figure 9 Quorum Server to Cluster Distribution

IMPORTANT: For more information about the quorum server, see the latest version of the HP Serviceguard Quorum Server release notes at http://www.hp.com/go/hpux-serviceguard-docs (Select HP Serviceguard Quorum Server Software).
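A quorum server is named once in the cluster configuration file rather than per node. The host name and timing values below are illustrative assumptions (the intervals, like other Serviceguard timers, are assumed to be in microseconds); consult the Quorum Server release notes for supported values:

```
QS_HOST               qshost.example.com
QS_POLLING_INTERVAL   120000000    # assumed microseconds: poll every 2 minutes
QS_TIMEOUT_EXTENSION  2000000      # assumed microseconds: 2 extra seconds
```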
In a cluster with four or more nodes, you may not need a cluster lock since the chance of the cluster being split into two halves of equal size is very small. However, be sure to configure your cluster to prevent the failure of exactly half the nodes at one time. For example, make sure there is no potential single point of failure such as a single LAN between equal numbers of nodes, and that you don’t have exactly half of the nodes on a single power circuit.
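The quorum arithmetic behind this advice can be sketched in a few lines of shell (this is an illustration of the more-than-50% rule, not a Serviceguard interface): a surviving group can re-form the cluster on its own only with a strict majority, while an exact half must win the cluster lock.

```shell
# has_quorum UP TOTAL: succeed only if UP nodes form a strict majority (> 50%).
has_quorum() {
  up=$1
  total=$2
  [ $(( up * 2 )) -gt "$total" ]
}

has_quorum 3 4 && echo "3 of 4 nodes: re-form without the lock"
has_quorum 2 4 || echo "2 of 4 nodes: exact half, cluster lock decides"
has_quorum 1 4 || echo "1 of 4 nodes: no quorum, nodes halt"
```

This is why a two-node cluster always needs a lock: any single-node survivor holds exactly half the votes.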
cluster, and the multi-node package, which can be configured to run on all or some of the nodes in the cluster. System multi-node packages are reserved for use by HP-supplied applications. The rest of this section describes failover packages.

3.3.1.2 Failover Packages

A failover package starts up on an appropriate node (see node_name (page 176)) when the cluster starts. In the case of a service, network, or other resource or dependency failure, package failover takes place.
Failover packages list the nodes in order of priority (that is, the first node in the list is the highest-priority node). In addition, each failover package's configuration file contains three parameters that determine failover behavior: the auto_run parameter, the failover_policy parameter, and the failback_policy parameter.
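In a modular package configuration file, these three parameters appear alongside the ordered node list. The fragment below is a sketch with invented names; the policy values shown (configured_node, manual) are among those discussed in this section.

```
package_name     pkg1
package_type     failover

node_name        node1     # highest priority
node_name        node2

auto_run         yes
failover_policy  configured_node
failback_policy  manual
```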
Figure 11 Before Package Switching

In Figure 12, node1 has failed and pkg1 has been transferred to node2. pkg1's IP address was transferred to node2 along with the package. pkg1 continues to be available and is now running on node2. Also note that node2 now has access both to pkg1's disk and pkg2's disk.

NOTE: For design and configuration information about clusters that span subnets, see the documents listed under “Cross-Subnet Configurations” (page 27).
Figure 12 After Package Switching

3.3.1.2.4 Failover Policy

The Package Manager selects a node for a failover package to run on based on the priority list included in the package configuration file together with the failover_policy parameter, also in the configuration file. The failover policy governs how the package manager selects which node to run a package on when a specific node has not been identified and the package needs to be started.
Table 2 Package Configuration Data

Package Name   NODE_NAME List               FAILOVER_POLICY
pkgA           node1, node2, node3, node4   min_package_node
pkgB           node2, node3, node4, node1   min_package_node
pkgC           node3, node4, node1, node2   min_package_node

When the cluster starts, each package starts as shown in Figure 13.
Figure 14 Rotating Standby Configuration after Failover

NOTE: Under the min_package_node policy, when node2 is repaired and brought back into the cluster, it will then be running the fewest packages, and thus will become the new standby node.

If these packages had been set up using the configured_node failover policy, they would start initially as in Figure 13, but the failure of node2 would cause the package to start on node3, as shown in Figure 15.
Figure 15 configured_node Policy Packages after Failover

If you use configured_node as the failover policy, the package will start up on the highest-priority eligible node in its node list. When a failover occurs, the package will move to the next eligible node in the list, in the configured order of priority.
Figure 16 Automatic Failback Configuration before Failover

Table 3 Node Lists in Sample Cluster

Package Name   NODE_NAME List   FAILOVER POLICY   FAILBACK POLICY
pkgA           node1, node4     configured_node   automatic
pkgB           node2, node4     configured_node   automatic
pkgC           node3, node4     configured_node   automatic

node1 panics, and after the cluster reforms, pkgA starts running on node4:
Figure 17 Automatic Failback Configuration After Failover

After rebooting, node1 rejoins the cluster. At that point, pkgA will be automatically stopped on node4 and restarted on node1.
NOTE: Setting the failback_policy to automatic can result in a package failback and application outage during a critical production period. If you are using automatic failback, you may want to wait to add the package’s primary node back into the cluster until you can allow the package to be taken out of service temporarily while it switches back to the primary node. Serviceguard automatically chooses a primary node for a package when the NODE_NAME is set to '*'.
If there is a common generic resource that needs to be monitored as a part of multiple packages, then the monitoring script for that resource can be launched as part of one package and all other packages can use the same monitoring script. There is no need to launch multiple monitors for a common resource. If the package that has started the monitoring script fails or is halted, then all the other packages that are using this common resource also fail.
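As a sketch of how such a shared resource is named (the resource name and values here are invented; the parameter names match those listed in the package-parameter chapter), each package that depends on the common resource carries the same generic_resource_name, while only one package launches the monitoring script:

```
generic_resource_name             common_disk_res
generic_resource_evaluation_type  during_package_start
generic_resource_up_criteria      >=1
```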
3.4.1 What Makes a Package Run?

There are three types of packages:
• The failover package is the most common type of package. It runs on one node at a time. If a failure occurs, it can switch to another node listed in its configuration file. If switching is enabled for several nodes, the package manager will use the failover policy to determine where to start the package.
• A system multi-node package runs on all the active cluster nodes at the same time.
Figure 19 Legacy Package Time Line Showing Important Events

The following are the most important moments in a package’s life:
1. Before the control script starts. (For modular packages, this is the master control script.)
2. During run script execution. (For modular packages, during control script execution to start the package.)
3. While services are running.
4. If there is a generic resource configured and it fails, then the package will be halted.
5.
3.4.3 During Run Script Execution

Once the package manager has determined that the package can start on a particular node, it launches the script that starts the package (that is, a package’s control script or master control script is executed with the start parameter). This script carries out the following steps:
1. Executes any external_pre_scripts (modular packages only; see “About External Scripts” (page 127))
2. Activates volume groups or disk groups.
3. Mounts file systems.
Normal starts are recorded in the log, together with error messages or warnings related to starting the package. NOTE: After the package run script has finished its work, it exits, which means that the script is no longer executing once the package is running normally. After the script exits, the PIDs of the services started by the script are monitored by the package manager directly.
• Process IDs of the services
• Subnets configured for monitoring in the package configuration file
• Generic resources configured for monitoring in the package configuration file
If a service fails but the restart parameter for that service is set to a value greater than 0, the service will restart, up to the configured number of restarts, without halting the package.
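A restart counter of this kind is set per service in the package configuration file. This fragment is illustrative only; the service name, command path, and values are invented:

```
service_name              pkg1_app
service_cmd               "/usr/local/bin/app_monitor"
service_restart           3      # up to 3 local restarts before the package is halted
service_fail_fast_enabled no
service_halt_timeout      300
```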
1. Halts all package services.
2. Executes any customer-defined halt commands (legacy packages only) or external_scripts (modular packages only; see “external_script” (page 190)).
3. Removes package IP addresses from the LAN card on the node.
4. Unmounts file systems.
5. Deactivates volume groups.
6. Revokes Persistent registrations and reservations, if any.
7. Exits with an exit code of zero (0).
8. Executes any external_pre_scripts (modular packages only; see “external_pre_script” (page 190)).
• 0—normal exit. The package halted normally, so all services are down on this node.
• 1—abnormal exit, also known as no_restart exit. The package did not halt normally. Services are killed, and the package is disabled globally. It is not disabled on the current node, however.
• 2—abnormal exit, also known as restart exit. The package did not halt normally. Services are killed, and the package is disabled globally. It is not disabled on the current node, however.
Table 4 Error Conditions and Package Movement for Failover Packages (continued)

Package Error      Node Failfast    Service Failfast   Linux Status   Halt script      Package Allowed     Package Allowed
Condition          Enabled          Enabled            on Primary     runs after       to Run on Primary   to Run on
                                                       after Error    Error or Exit    Node after Error    Alternate Node
Loss of Network    No               Either Setting     Running        Yes              Yes                 Yes
package depended   Either Setting   Either Setting     Running        Yes              Yes                 when dependency
on failed                                                                                                  is again
Because system multi-node and multi-node packages do not fail over, they do not have relocatable IP addresses. A relocatable IP address is like a virtual host IP address that is assigned to a package. HP recommends that you configure names for each package through DNS (Domain Name System). A program can then use the package’s name like a host name as the input to gethostbyname(3), which will return the package’s relocatable IP address.
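For example, once a DNS entry maps a package name to its relocatable address, any resolver interface will return it. The package name below is a hypothetical illustration, and getent(1) exercises the same lookup path as gethostbyname(3):

```shell
# Hypothetical package name; prints its relocatable IP if the DNS entry exists.
getent hosts pkg-db.example.com || echo "pkg-db.example.com not (yet) in DNS"

# Any resolvable name behaves the same way:
getent hosts localhost
```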
the others are available as backups. If one interface fails, another interface in the bonded group takes over. HP strongly recommends you use channel bonding in each critical IP subnet to achieve highly available network services. Host Bus Adapters (HBAs) do not have to be identical. Ethernet LANs must be the same type, but can be of different bandwidth (for example, 1 Gb and 100 Mb). Serviceguard for Linux supports the use of bonding of LAN interfaces at the driver level.
Figure 23 Bonded NICs (each node runs a bond0 master over slave interfaces eth0 and eth1; the nodes connect through redundant hubs joined by a crossover cable)

In the bonding model, individual Ethernet interfaces are slaves, and the bond is the master. In the basic high availability configuration (mode 1), one slave in a bond assumes an active role, while the others remain inactive until a failure is detected. (In Figure 23, both eth0 slave interfaces are active.)
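On a Red Hat-style distribution, an active-backup (mode 1) bond like the one in Figure 23 is typically defined with ifcfg files similar to the sketch below. The addresses and option values are examples only, and other distributions place and name these files differently:

```
# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
ONBOOT=yes
BOOTPROTO=none
IPADDR=192.168.1.1
NETMASK=255.255.255.0
BONDING_OPTS="mode=1 miimon=100"

# /etc/sysconfig/network-scripts/ifcfg-eth0 (and similarly ifcfg-eth1)
DEVICE=eth0
ONBOOT=yes
BOOTPROTO=none
MASTER=bond0
SLAVE=yes
```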
on-board LAN interfaces) must be used in any combination of channel bonds to avoid a single point of failure for heartbeat connections.

3.5.5 Bonding for Load Balancing

It is also possible to configure bonds in load balancing mode, which allows all slaves to transmit data in parallel, in an active/active arrangement. In this case, high availability is provided by the fact that the bond still continues to function (with less throughput) if one of the component LANs should fail.
• Detects when a network interface fails to send or receive IP messages, even though it is still up at the link level.
• Handles the failure, failover, recovery, and failback.
16.89.120.0
…
Possible IP Monitor Subnets:
IPv4:
16.89.112.0 Polling Target 16.89.112.1
IPv6:
3ffe:1000:0:a801:: Polling Target 3ffe:1000:0:a801::254
…

The IP Monitor section of the cluster configuration file will look similar to the following for a subnet on which IP monitoring is configured with target polling.

IMPORTANT: By default, cmquerycl does not verify that the gateways it detects will work correctly for monitoring.
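Based on the subnet and polling target shown in the cmquerycl output above, such an IP Monitor entry would look something like this (a sketch; verify the parameter names against the template cmquerycl generates on your release):

```
SUBNET 16.89.112.0
  IP_MONITOR ON
  POLLING_TARGET 16.89.112.1
```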
3.5.7.3 Constraints and Limitations

• A subnet must be configured into the cluster in order to be monitored.
• Polling targets are not detected beyond the first-level router.
• Polling targets must accept and respond to ICMP (or ICMPv6) ECHO messages.
• A peer IP on the same subnet should not be a polling target because a node can always ping itself.
NOTE: It is possible to configure a cluster that spans subnets joined by a router, with some nodes using one subnet and some another. This is called a cross-subnet configuration.
• A maximum of 30 network interfaces per node is supported. The interfaces can be physical NIC ports, VLAN interfaces, Channel Bonds, or any combination of these.
• Only port-based and IP-subnet-based VLANs are supported. Protocol-based VLAN is not supported because Serviceguard does not support any transport protocols other than TCP/IP.
• Each VLAN interface must be assigned an IP address in a unique subnet.
• Using VLAN in a Wide Area Network cluster is not supported.
Figure 26 Physical Disks Combined into LUNs

NOTE: LUN definition is normally done using utility programs provided by the disk array manufacturer. Since arrays vary considerably, you should refer to the documentation that accompanies your storage unit. For information about configuring multipathing, see “Multipath for Storage” (page 82).

3.6.2 Monitoring Disks

Each package configuration includes information about the disks that are to be activated by the package at startup.
Unlike exclusive activation for volume groups, which does not prevent unauthorized access to the underlying LUNs, PR controls access at the LUN level. Registration and reservation information is stored on the device and enforced by its firmware; this information persists across device resets and system reboots. NOTE: Persistent Reservations coexist with, and are independent of, activation protection of volume groups.
• If you are using a storage device that does not support SPC-3 PR, disable the PR support using the FORCED_PR_DISABLE flag in the cluster configuration.
• If you are using Serviceguard Manager for creating modular packages, the PR module is displayed as optional. However, HP recommends that you always enable the PR module when creating modular packages.
• The udev alias names must be created using symlinks.
All initiators on each node running the package register with LUN devices using the same PR key, known as the node_pr_key. Each node in the cluster has a unique node_pr_key, which you can see in the output of cmviewcl -f line; for example:

...
node:bla2|node_pr_key=10001

When a failover package starts up, any existing PR keys and reservations are cleared from the underlying LUN devices first; then the node_pr_key of the node that the package is starting on is registered with each LUN.
3.8.1.1.1 Example

Situation. Assume a two-node cluster, with Package1 running on SystemA and Package2 running on SystemB. Volume group vg01 is exclusively activated on SystemA; volume group vg02 is exclusively activated on SystemB. Package IP addresses are assigned to SystemA and SystemB respectively.

Failure. Only one LAN has been configured for both heartbeat and data traffic.
in the appropriate switching behavior. Power protection is provided by HP-supported uninterruptible power supplies (UPS).

3.8.3 Responses to Package and Service Failures

In the default case, the failure of a package, of a generic resource, or of a service within the package causes the package to shut down by running the control script with the stop parameter, and then restarting the package on an alternate node.
“Choosing Switching and Failover Behavior” (page 107) provides advice on choosing appropriate failover behavior. See “Parameters for Configuring Generic Resources” (page 108).

3.8.4.1 Service Restarts

You can allow a service to restart locally following a failure. To do this, you indicate a number of restarts for each service in the package control script. When a service starts, the variable service_restart is set in the service’s environment.
4 Planning and Documenting an HA Cluster

Building a Serviceguard cluster begins with a planning phase in which you gather and record information about all the hardware and software components of the configuration.
your cluster without having to bring it down, you need to plan the initial configuration carefully. Use the following guidelines:
• Set the Maximum Configured Packages parameter (described later in this chapter under “Cluster Configuration Planning” (page 86)) high enough to accommodate the additional packages you plan to add.
• Networks should be pre-configured into the cluster configuration if they will be needed for packages you will add later while the cluster is running.
4.2.2 Supported cluster configuration options Following are the supported cluster configuration options when using VMware or KVM guests as cluster nodes: • Cluster with VMware or KVM guests from a single host as cluster nodes (cluster-in-a-box; not recommended) NOTE: This configuration is not recommended because failure of the host brings down all the nodes in the cluster; the host is therefore a single point of failure.
Subnet Name The IP address for the subnet. Note that heartbeat IP addresses must be on the same subnet on each node.
Interface Name The name of the LAN card as used by this node to access the subnet. This name is shown by ifconfig after you install the card.
IP Address The IP address to be used on this interface. An IPv4 address consists of four decimal numbers (octets), each from 0 to 255, separated by periods, in this form: nnn.nnn.nnn.
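As an aside, a small helper (not part of Serviceguard) can check that an address is a well-formed IPv4 dotted-quad before you enter it in the configuration:

```shell
# Hypothetical helper: check that a string is a well-formed IPv4
# address -- four decimal octets, each 0-255, separated by periods.
is_ipv4() {
    echo "$1" | awk -F. '
        NF == 4 {
            for (i = 1; i <= 4; i++)
                if ($i !~ /^[0-9]+$/ || $i + 0 > 255) exit 1
            exit 0
        }
        NF != 4 { exit 1 }'
}

is_ipv4 192.168.1.1 && echo "valid"      # well-formed address
is_ipv4 192.168.1   || echo "invalid"    # too few octets
```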
multipath failover mode on Linux systems application note”, which you can find by entering the terms qlogic multipath application into the search box of www.hp.com. NOTE: With the rapid evolution of Linux, the multipath mechanisms may change, or new ones may be added. Serviceguard for Linux supports DeviceMapper multipath (DM-MPIO) with some restrictions; see the Serviceguard for Linux Certification Matrix at the address provided in the Preface to this manual for up-to-date information.
4.4 Power Supply Planning There are two sources of power for your cluster which you will have to consider in your design: line power and uninterruptible power supplies (UPS). Loss of a power circuit should not bring down the cluster. Frequently, servers, mass storage devices, and other hardware have two or three separate power supplies, so they can survive the loss of power to one or more power supplies or power circuits.
4.5.1 Cluster Lock Requirements A one-node cluster does not require a lock. Two-node clusters require the use of a cluster lock, and a lock is recommended for larger clusters as well. Clusters larger than four nodes can use only a quorum server as the cluster lock. For information on configuring lock LUNs and the Quorum Server, see “Setting up a Lock LUN” (page 143), section “Specifying a Lock LUN” (page 155), and HP Serviceguard Quorum Server Version A.04.00, or later Release Notes at http://www.hp.
• You must not group two different high availability applications, services, or data, whose control needs to be transferred independently, on the same volume group. • Your root disk must not belong to a volume group that can be activated on another node. 4.6.
NOTE: After the configuration is complete, you cannot add nodes.
• Does not set up lock LUN or quorum server.
• Does not ensure that all other network connections between the servers are valid.
Before You Start
IMPORTANT: The nodes given as input must not already be configured in a cluster. Before you start, you should have done the planning and preparation as described in previous sections.
4.7.2 Heartbeat Subnet and Cluster Re-formation Time The speed of cluster re-formation depends on the number of heartbeat subnets. If the cluster has only a single heartbeat network, and a network card on that network fails, heartbeats will be lost while the failure is being detected and the IP address is being switched to a standby interface. The cluster may treat these lost heartbeats as a failure and re-form without one or more nodes.
to IPv6 addresses. The single exception to this is each node's IPv4 loopback address, which cannot be removed from /etc/hosts. NOTE: How the clients of IPv6-only cluster applications handle hostname resolution is a matter for the discretion of the system or network administrator; there are no HP requirements or recommendations specific to this case.
NOTE: This applies to all IPv6 addresses, whether HOSTNAME_ADDRESS_FAMILY is set to IPV6 or ANY. • Cross-subnet configurations are not supported in IPv6-only mode. • Virtual machines are not supported. You cannot have a virtual machine that is either a node or a package if HOSTNAME_ADDRESS_FAMILY is set to ANY or IPV6. 4.7.3.2.2 Recommendations for IPv6-Only Mode IMPORTANT: Check the latest Serviceguard for Linux release notes for the latest instructions and recommendations.
• Cross-subnet configurations are not supported. This also applies if HOSTNAME_ADDRESS_FAMILY is set to IPV6. See “Cross-Subnet Configurations” (page 27) for more information about such configurations. • Virtual machines are not supported. You cannot have a virtual machine that is either a node or a package if HOSTNAME_ADDRESS_FAMILY is set to ANY or IPV6. 4.7.4 Cluster Configuration Parameters You need to define a set of cluster parameters.
Quorum Server host names. Valid values are IPV4, IPV6, and ANY. The default is IPV4. • IPV4 means Serviceguard will try to resolve the names to IPv4 addresses only. • IPV6 means Serviceguard will try to resolve the names to IPv6 addresses only. • ANY means Serviceguard will try to resolve the names to both IPv4 and IPv6 addresses. IMPORTANT: See “About Hostname Address Families: IPv4-Only, IPv6-Only, and Mixed Mode” (page 88) for important information.
information, see “Cluster Lock Planning” (page 84) and “Specifying a Quorum Server” (page 155). IMPORTANT: For special instructions that may apply to your version of Serviceguard and the Quorum Server, see “Configuring Serviceguard to Use the Quorum Server” in the latest version of the HP Serviceguard Quorum Server Version A.04.00 Release Notes, at http://www.hp.com/go/hpux-serviceguard-docs (Select HP Serviceguard Quorum Server Software).
Do not use the full domain name. For example, enter ftsys9, not ftsys9.cup.hp.com. A cluster can contain up to 16 nodes.
IMPORTANT: SITE must be 39 characters or less and is case-sensitive; each SITE entry must exactly match one of the SITE_NAME entries. Duplicate SITE entries are not allowed. NETWORK_INTERFACE The name of each LAN that will be used for heartbeats or for user data on the node identified by the preceding NODE_NAME. An example is eth0. See also HEARTBEAT_IP, STATIONARY_IP, and “About Hostname Address Families: IPv4-Only, IPv6-Only, and Mixed Mode” (page 88).
For more details of the IPv6 address format, see “IPv6 Address Types” (page 285). Heartbeat IP addresses on a given subnet must all be of the same type: IPv4 or IPv6 site-local or IPv6 global. For information about changing the configuration online, see “Changing the Cluster Networking Configuration while the Cluster Is Running” (page 230).
NOTE: IPv6 heartbeat subnets are not supported in a cross-subnet configuration. NOTE: The use of a private heartbeat network is not advisable if you plan to use Remote Procedure Call (RPC) protocols and services. RPC assumes that each network adapter device or I/O card is connected to a route-able network. An isolated or private heartbeat LAN is not route-able, and could cause an RPC request-reply, directed to that LAN, to time out without being serviced.
package weight to determine if the package can run on that node. CAPACITY_NAME name can be any string that starts and ends with an alphanumeric character, and otherwise contains only alphanumeric characters, dot (.), dash (-), or underscore (_). Maximum length is 39 characters. CAPACITY_NAME must be unique in the cluster. CAPACITY_VALUE specifies a value for the CAPACITY_NAME that precedes it. It must be a floating-point value between 0 and 1000000.
If you enter a value greater than 60 seconds (60,000,000 microseconds), cmcheckconf and cmapplyconf will note the fact, as confirmation that you intend to use a large value. Minimum supported values:
• 3 seconds for a cluster with more than one heartbeat subnet.
• 14 seconds for a cluster that has only one heartbeat LAN.
With the lowest supported value of 3 seconds, a failover time of 4 to 5 seconds can be achieved.
microseconds), keeping in mind that a value larger than the default will lead to slower re-formations than the default. A value in this range is appropriate for most installations. See also “What Happens when a Node Times Out” (page 75), “Cluster Daemon: cmcld” (page 34), and the white paper Optimizing Failover Time in a Serviceguard Environment (version A.11.19 and later) at http://www.hp.com/go/linux-serviceguard-docs. Can be changed while the cluster is running.
The following are the failure/recovery detection times for different values of Network Polling Interval (NPI) for an IP monitored Ethernet interface:
Table 5 Failure Recovery Detection Times for an IP Monitored Ethernet Interface
Network Polling Interval (NPI) (in seconds)    Failure/Recovery Detection Times (in seconds)
1                                              ~ NPI x 8 - NPI x 9
2                                              ~ NPI x 4 - NPI x 5
3                                              ~ NPI x 3 - NPI x 4
4 to 8                                         ~ NPI x 2 - NPI x 3
>=8                                            ~ NPI x 1 - NPI x 2
IMPORTANT: HP strongly recommends using the default.
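The table can be read as a simple calculation. The sketch below (an illustration, not a Serviceguard tool) prints the approximate detection window for a given NPI, following the table's rows; values of 8 and above are treated under the last row, which is an assumption, since the table's ranges overlap at 8:

```shell
# Illustration only: approximate failure/recovery detection window
# for a given Network Polling Interval (NPI), per Table 5.
npi_window() {
    npi=$1
    case $npi in
        1)       lo=8; hi=9 ;;
        2)       lo=4; hi=5 ;;
        3)       lo=3; hi=4 ;;
        4|5|6|7) lo=2; hi=3 ;;
        *)       lo=1; hi=2 ;;   # NPI >= 8 (assumption; ranges overlap at 8)
    esac
    echo "$((npi * lo))-$((npi * hi)) seconds"
}

npi_window 2   # prints "8-10 seconds"
```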
and the cluster nodes, sum the values for each path and use the largest number. CAUTION: Serviceguard supports NFS-mounted file systems only over switches and routers that support MBTD. If you are using NFS-mounted file systems, you must set CONFIGURED_IO_TIMEOUT_EXTENSION as described here. For more information about MBTD, see the white paper Support for NFS as a filesystem type with HP Serviceguard A.11.20 on HP-UX and Linux available at http://www.hp.com/go/linux-serviceguard-docs.
By default, the IP_MONITOR parameter is set to OFF. If a gateway is detected for the SUBNET in question, and POLLING_TARGET entries are populated with the gateway addresses, setting the IP_MONITOR parameter to ON enables target polling. For more information, see the description for POLLING_TARGET.
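For example, a cluster configuration file excerpt along these lines enables target polling for a subnet (the subnet and gateway addresses are illustrative; see the parameter descriptions in this chapter for exact syntax):

```
SUBNET 192.168.1.0
IP_MONITOR ON
POLLING_TARGET 192.168.1.254
```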
case, the default can be overridden for an individual package via the weight_name and weight_value parameters in the package configuration file. For more information and examples, see “Defining Weights” (page 124). IMPORTANT: CAPACITY_NAME, WEIGHT_NAME, and weight_value must all match exactly. NOTE: A weight (WEIGHT_NAME, WEIGHT_DEFAULT) has no meaning on a node unless a corresponding capacity (CAPACITY_NAME, CAPACITY_VALUE) is defined for that node.
NOTE: As of Serviceguard A.11.18, there is a new and simpler way to configure packages. This method allows you to build packages from smaller modules, and eliminates the separate package control script and the need to distribute it manually; see Chapter 6: “Configuring Packages and Their Services ” (page 169), for complete instructions. This manual refers to packages created by the newer method as modular packages, and to packages created by the older method as legacy packages.
NOTE: Generic resources influence the package based on their status. The actual monitoring of the resource should be done in a script and this must be configured as a service. The script sets the status of the resource based on the availability of the resource. See “Monitoring Script for Generic Resources” (page 297). Create a list by package of volume groups, logical volumes, and file systems. Indicate which nodes need to have access to common file systems at different times.
• Only NFS client-side locks (local locks) are supported. Server-side locks are not supported. • Because exclusive activation is not available for NFS-imported file systems, you must take the following precautions to ensure that data is not accidentally overwritten. ◦ The server must be configured so that only the cluster nodes have access to the file system. ◦ The NFS file system used by a package must not be imported by any other system, including other nodes in the cluster.
The following table describes different types of failover behavior and the settings in the package configuration file that determine each behavior. See “Package Parameter Explanations” (page 174) for more information.
Table 6 Package Failover Behavior
Switching Behavior                                        Parameters in Configuration File
Package switches normally after detection of service or network failure, generic resource failure, or when a configured dependency is not met. Halt script runs before switch takes place.
• generic_resource_name: defines the logical name used to identify a generic resource in a package.
• generic_resource_evaluation_type: defines when the status of a generic resource is evaluated. This can be set to during_package_start or before_package_start. If not specified, during_package_start is the default.
◦ during_package_start means the status of generic resources is evaluated during the course of start of the package.
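A package configuration excerpt using these parameters might look like this (the resource name is illustrative):

```
generic_resource_name             sfm_disk
generic_resource_evaluation_type  during_package_start
```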
NOTE: Generic resources must be configured to use the monitoring script. It is the monitoring script that contains the logic to monitor the resource and set the status of a generic resource accordingly by using cmsetresource(1m). These scripts must be written by end-users according to their requirements. The monitoring script must be configured as a service in the package if the monitoring of the resource is required to be started and stopped as a part of the package.
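A minimal monitoring-script sketch is shown below. The resource name, the health check, and the polling interval are assumptions for illustration, and the cmsetresource option syntax should be verified against cmsetresource(1m); the real logic to determine resource health must be written to suit your environment:

```shell
# Sketch of a generic-resource monitoring script, intended to be
# configured as a package service. Names and the check performed
# are illustrative assumptions.
RESOURCE_NAME=sfm_disk   # hypothetical generic resource
POLL_INTERVAL=30         # seconds between checks

check_resource() {
    # Replace with the real health check for your resource;
    # here: does the (hypothetical) block device exist?
    [ -b /dev/sdc ]
}

monitor_once() {
    if check_resource; then
        cmsetresource -r "$RESOURCE_NAME" -s up
    else
        cmsetresource -r "$RESOURCE_NAME" -s down
    fi
}

# In the real service, loop until the package halts the service:
# while true; do monitor_once; sleep "$POLL_INTERVAL"; done
```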
ATTRIBUTE_NAME    ATTRIBUTE_VALUE
Style             modular
Priority          no_priority
The cmviewcl -v -f line output (snippet) will be as follows:
cmviewcl -v -f line -p pkg1 | grep generic_resource
generic_resource:sfm_disk|name=sfm_disk
generic_resource:sfm_disk|evaluation_type=during_package_start
generic_resource:sfm_disk|up_criteria="N/A"
generic_resource:sfm_disk|node:node1|status=unknown
generic_resource:sfm_disk|node:node1|current_value=0
generic_resource:sfm_disk|node:node2|status=unknown
generic_resource:sfm_disk|n
4.8.6.2 Online Reconfiguration of Generic Resources Online operations such as addition, deletion, and modification of generic resources in packages are supported. The following operations can be performed online: • Addition of a generic resource of generic_resource_evaluation_type set to during_package_start, whose status is not down. Ensure that the corresponding monitor is available when you add a generic resource; if it is not, add the monitor at the same time.
Serviceguard adds two new capabilities: you can specify broadly where the package depended on must be running, and you can specify that it must be down. These capabilities are discussed later in this section under “Extended Dependencies” (page 117). You should read the next section, “Simple Dependencies” (page 113), first. 4.8.7.1 Simple Dependencies A simple dependency occurs when one package requires another to be running on the same node.
• A package cannot depend on itself, directly or indirectly. That is, not only must pkg1 not specify itself in the dependency_condition (page 180), but pkg1 must not specify a dependency on pkg2 if pkg2 depends on pkg1, or if pkg2 depends on pkg3 which depends on pkg1, etc.
NOTE: Keep the following in mind when reading the examples that follow, and when actually configuring priorities: 1. auto_run (page 176) should be set to yes for all the packages involved; the examples assume that it is. 2. Priorities express a ranking order, so a lower number means a higher priority (10 is a higher priority than 30).
If pkg1 depends on pkg2, and pkg1’s priority is lower than or equal to pkg2’s, pkg2’s node order dominates. Assuming pkg2’s node order is node1, node2, node3, then:
• On startup:
◦ pkg2 will start on node1, or node2 if node1 is not available or does not at present meet all of its dependencies, etc.
◦ pkg1 will start on whatever node pkg2 has started on (no matter where that node appears on pkg1’s node_name list) provided all of pkg1’s other dependencies are met there.
Note that the nodes will be tried in the order of pkg1’s node_name list, and pkg2 will be dragged to the first suitable node on that list whether or not it is currently running on another node.
• On failover:
◦ If pkg1 fails on node1, pkg1 will select node2 to fail over to (or node3 if it can run there and node2 is not available or does not meet all of its dependencies; etc.)
◦ pkg2 will be dragged to whatever node pkg1 has selected, and restart there; then pkg1 will restart there.
• You can specify whether the package depended on must be running or must be down. You define this condition by means of the dependency_condition, using one of the literals UP or DOWN (the literals can be upper or lower case). We'll refer to the requirement that another package be down as an exclusionary dependency; see “Rules for Exclusionary Dependencies” (page 118).
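For example, an exclusionary dependency might be expressed in the package configuration file as follows (the package name is illustrative, and the exact syntax and allowed dependency_location values should be checked against the dependency parameter descriptions):

```
dependency_name       pkg2_exclusion
dependency_condition  pkg2 = DOWN
dependency_location   all_nodes
```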
4.8.7.4.2 Rules for different_node and any_node Dependencies These rules apply to packages whose dependency_condition is UP and whose dependency_location is different_node or any_node. For same-node dependencies, see Simple Dependencies (page 113); for exclusionary dependencies, see “Rules for Exclusionary Dependencies” (page 118). • Both packages must be failover packages whose failover_policy (page 178) is configured_node.
• these are failover packages, and
• the failing package can “drag” these packages to a node on which they can all run.
Otherwise the failing package halts and the packages it depends on continue to run. 4. Starts the packages the failed package depends on (those halted in step 3, if any).
4.8.10.3 Simple Method Use this method if you simply want to control the number of packages that can run on a given node at any given time. This method works best if all the packages consume about the same amount of computing resources. If you need to make finer distinctions between packages in terms of their resource consumption, use the Comprehensive Method (page 122) instead. To implement the simple method, use the reserved keyword package_limit to define each node's capacity.
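For example (values illustrative), to allow at most two packages at a time on a node, you could define the reserved capacity in that node's section of the cluster configuration file:

```
CAPACITY_NAME   package_limit
CAPACITY_VALUE  2
```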
you wanted to ensure that the larger packages, pkg2 and pkg3, did not run on node1 at the same time, you could raise the weight_value of one or both so that the combination exceeded 10 (or reduce node1's capacity to 8). 4.8.10.3.2 Points to Keep in Mind The following points apply specifically to the Simple Method (page 121). Read them in conjunction with the Rules and Guidelines (page 126), which apply to all weights and capacities.
memory weight does not exceed 1000. But Serviceguard has no knowledge of the real-world meanings of the names processor and memory; there is no mapping to actual processor and memory usage and you would get exactly the same results if you used the names apples and oranges. For example, suppose you have the following configuration: • A two node cluster running four packages. These packages contend for resource we'll simply call A and B. • node1 has a capacity of 80 for A and capacity of 50 for B.
NOTE: You do not have to define capacities for every node in the cluster. If any capacity is not defined for any node, Serviceguard assumes that node has an infinite amount of that capacity.
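The bookkeeping Serviceguard performs can be illustrated with simple arithmetic. The sketch below (not a Serviceguard interface) checks whether a combination of package weights fits the node1 capacities from the example above:

```shell
# Illustrative arithmetic only: does a combination of package
# weights fit within node1's capacities (A = 80, B = 50)?
node_capacity_A=80
node_capacity_B=50

fits() {   # usage: fits <total_weight_A> <total_weight_B>
    [ "$1" -le "$node_capacity_A" ] && [ "$2" -le "$node_capacity_B" ]
}

fits 60 40 && echo "packages fit on node1"
fits 90 10 || echo "capacity A exceeded"
```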
NOTE: Option 4 means that the package is “weightless” as far as this particular capacity is concerned, and can run even on a node on which this capacity is completely consumed by other packages. (You can make a package “weightless” for a given capacity even if you have defined a cluster-wide default weight; simply set the corresponding weight to zero in the package's cluster configuration file.
to move; see “How Package Weights Interact with Package Priorities and Dependencies” (page 126)). This is true whenever a package has a weight that exceeds the available amount of the corresponding capacity on the node. 4.8.10.5 Rules and Guidelines The following rules and guidelines apply to both the Simple Method (page 121) and the Comprehensive Method (page 122) of configuring capacities and weights. • You can define a maximum of four capacities, and corresponding weights, throughout the cluster.
its priority is set to the default, no_priority) will not be halted to make room for a down package that has no priority. Between two down packages without priority, Serviceguard will decide which package to start if it cannot start them both because there is not enough node capacity to support their weight. 4.8.10.7.1 Example 1 • pkg1 is configured to run on nodes turkey and griffon. It has a weight of 1 and a priority of 10. It is down and has switching disabled.
• 0 - indicating success. • 1 - indicating the package will be halted, and should not be restarted, as a result of failure in this script. • 2 - indicating the package will be restarted on another node, or halted if no other node is available. NOTE: In the case of the validate entry point, exit values 1 and 2 are treated the same; you can use either to indicate that validation failed.
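The entry-point convention above can be sketched as a skeleton external script (the structure and comments are illustrative; the real entry points must perform your application's work):

```shell
# Skeleton of a modular-package external script, reflecting the
# exit-code convention described above. Serviceguard invokes the
# script with an entry point such as start, stop, or validate.
main() {
    case "$1" in
        start)
            : "bring up application resources here"
            return 0 ;;          # success
        stop)
            : "halt application and clean up here"
            return 0 ;;
        validate)
            : "check prerequisites here"
            return 0 ;;          # return 1 or 2 to signal failure
        *)
            return 1 ;;
    esac
}

# In the real script: main "$@"; exit $?
```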
while (( i < ${#SG_SERVICE_NAME[*]} ))
do
    case ${SG_SERVICE_CMD[i]} in
        *monitor.
4.8.11.2 Determining Why a Package Has Shut Down You can use an external script (or CUSTOMER DEFINED FUNCTIONS area of a legacy package control script) to find out why a package has shut down.
monitored_subnet_access unconfigured for a monitored subnet is equivalent to FULL). (For legacy packages, see “Configuring Cross-Subnet Failover” (page 239)). • You should not use the wildcard (*) for node_name in the package configuration file, as this could allow the package to fail over across subnets when a node on the same subnet is eligible; failing over across subnets can take longer than failing over on the same subnet. List the nodes in order of preference instead of using the wildcard.
Assuming nodeA is pkg1’s primary node (where it normally starts), create node_name entries in the package configuration file as follows:
node_name nodeA
node_name nodeB
node_name nodeC
node_name nodeD
4.8.12.2.2 Configuring monitored_subnet_access In order to monitor subnet 15.244.65.0 or 15.244.56.0, depending on where pkg1 is running, you would configure monitored_subnet and monitored_subnet_access in pkg1’s package configuration file as follows: monitored_subnet 15.244.65.
If you intend to remove a node from the cluster configuration while the cluster is running, ensure that the resulting cluster configuration will still conform to the rules for cluster locks described above. See “Cluster Lock Planning” (page 84) for more information. If you are planning to add a node online, and a package will run on the new node, ensure that any existing cluster-bound volume groups for the package have been imported to the new node.
5 Building an HA Cluster Configuration This chapter and the next take you through the configuration tasks required to set up a Serviceguard cluster. You carry out these procedures on one node, called the configuration node, and Serviceguard distributes the resulting binary file to all the nodes in the cluster. In the examples in this chapter, the configuration node is named ftsys9, and the sample target node is called ftsys10.
#########################################################################
SGROOT=/opt/cmcluster                          # SG root directory
SGCONF=/opt/cmcluster/conf                     # configuration files
SGSBIN=/opt/cmcluster/bin                      # binaries
SGLBIN=/opt/cmcluster/bin                      # binaries
SGLIB=/opt/cmcluster/lib                       # libraries
SGRUN=/opt/cmcluster/run                       # location of core dumps from daemons
SGAUTOSTART=/opt/cmcluster/conf/cmcluster.rc   # SG Autostart file
Throughout this document, system filenames are usually given with one of these location prefixes.
the file $SGCONF/cmclnodelist. This is sometimes referred to as a “bootstrap” file because Serviceguard consults it only when configuring a node into a cluster for the first time; it is ignored after that. It does not exist by default, but you will need to create it. You may want to add a comment such as the following at the top of the file: ########################################################### # Do not edit this file! # Serviceguard uses this file only to authorize access to an # unconfigured node.
Serviceguard nodes can communicate over any of the cluster’s shared networks, so the network resolution service you are using (such as DNS, NIS, or LDAP) must be able to resolve each of their primary addresses on each of those networks to the primary hostname of the node in question. In addition, HP recommends that you define name resolution in each node’s /etc/hosts file, rather than rely solely on a service such as DNS.
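For example, each node's /etc/hosts might contain entries like the following (the addresses and names are illustrative):

```
15.13.164.1    ftsys9.cup.hp.com     ftsys9
15.13.164.2    ftsys10.cup.hp.com    ftsys10
```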
IMPORTANT: Serviceguard does not support aliases for IPv6 addresses. For information about configuring an IPv6–only cluster, or a cluster that uses a combination of IPv6 and IPv4 addresses for the nodes' hostnames, see “About Hostname Address Families: IPv4-Only, IPv6-Only, and Mixed Mode” (page 88). 5.1.5.
NOTE: HP recommends that you also make the name service itself highly available, either by using multiple name servers or by configuring the name service into a Serviceguard package. 5.1.6 Ensuring Consistency of Kernel Configuration Make sure that the kernel configurations of all cluster nodes are consistent with the expected behavior of the cluster during failover.
DEVICE=bond0
IPADDR=192.168.1.1
NETMASK=255.255.255.0
NETWORK=192.168.1.0
BROADCAST=192.168.1.255
ONBOOT=yes
BOOTPROTO=none
USERCTL=no
For Red Hat 5 and Red Hat 6 only, add the following line to the ifcfg-bond0 file:
BONDING_OPTS='miimon=100 mode=1'
2. Create an ifcfg-ethn file for each interface in the bond. All interfaces should have SLAVE and MASTER definitions.
5.1.8.3 Viewing the Configuration You can test the configuration and transmit policy with ifconfig. For the configuration created above, the display should look like this:
/sbin/ifconfig
bond0     Link encap:Ethernet  HWaddr 00:C0:F0:1F:37:B4
          inet addr:192.168.1.1  Bcast:192.168.1.255  Mask:255.255.255.
REMOTE_IPADDR=''
STARTMODE='onboot'
BONDING_MASTER='yes'
BONDING_MODULE_OPTS='miimon=100 mode=1'
BONDING_SLAVE0='eth0'
BONDING_SLAVE1='eth1'
The above example configures bond0 with an MII monitoring interval of 100 milliseconds and active-backup mode. Adjust the IP, BROADCAST, NETMASK, and NETWORK parameters to correspond to your configuration. As you can see, you are adding the configuration options BONDING_MASTER, BONDING_MODULE_OPTS, and BONDING_SLAVE.
Table 7 Changing Linux Partition Types
Prompt                        Response    Action Performed
1. Command (m for help):      n           Create new partition
2. Partition number (1-4):    1           Partition affected
3.
Creating a Lock LUN on a Whole LUN The lock LUN can be created on a whole LUN of at least 100 KB once the following patches are installed. On Red Hat Enterprise Linux servers, you must install the following patches on Serviceguard Linux Version A.11.20.00: • SGLX_00339 for Red Hat Enterprise Linux 5 (x86_64 architecture) • SGLX_00340 for Red Hat Enterprise Linux 6 (x86_64 architecture) Patches can be downloaded from the HP Support Center at http://www.hp.com/go/hpsc.
You can build a cluster (next section) before or after defining volume groups for shared data storage. If you create the cluster first, information about storage can be added to the cluster and package configuration files after the volume groups are created. See “Volume Managers for Data Storage” (page 71) for an overview of volume management in HP Serviceguard for Linux.
In this example, the disk described by device file /dev/sda has already been partitioned for Linux, into partitions named /dev/sda1 - /dev/sda7. The second internal device /dev/sdb and the two external devices /dev/sdc and /dev/sdd have not been partitioned. NOTE: fdisk may not be available for SUSE on all platforms. In this case, using YAST2 to set up the partitions is acceptable. 5.1.12.
Prompt                     Response    Action Performed
Command (m for help):      p           Display partition data
Command (m for help):      w           Write data to the partition table
The following example of the fdisk dialog sets the partition on device file /dev/sdc to the Linux LVM partition type (hex code 8e), and appears as follows:
fdisk /dev/sdc
Command (m for help): t
Partition number (1-4): 1
HEX code (type L to list codes): 8e
Command (m for help): p
Disk /dev/sdc: 64 heads, 32 sectors, 4067 cylinders
Units = cylinders of 2048 * 512
where node is the value of uname -n. 5. Run vgscan: vgscan NOTE: At this point, the setup for volume-group activation protection is complete. Serviceguard adds a tag matching the uname -n value of the owning node to each volume group defined for a package when the package runs and deletes the tag when the package halts. The command vgs -o +tags vgname will display any tags that are set for a volume group.
5.1.12.5 Building Volume Groups and Logical Volumes 1. Use Logical Volume Manager (LVM) to create volume groups that can be activated by Serviceguard packages. For an example showing volume-group creation on LUNs, see “Building Volume Groups: Example for Smart Array Cluster Storage (MSA 2000 Series)” (page 149). (For Fibre Channel storage you would use device-file names such as those used in the section “Creating Partitions” (page 147)). 2. 3.
NOTE: Use vgchange --deltag only if you are implementing volume-group activation protection. Remember that volume-group activation protection, if used, must be implemented on each node.
2. To get the node ftsys10 to see the new disk partitioning that was done on ftsys9, reboot:
reboot
The partition table on the rebooted node is then rebuilt using the information placed on the disks when they were partitioned on the other node.
NOTE: You must reboot at this time.
3.
2. On ftsys10, activate the volume group, mount the file system, write a date stamp on to the shared file, and then look at the content of the file:
vgchange --addtag $(uname -n) vgpkgB
vgchange -a y vgpkgB
mount /dev/vgpkgB/lvol1 /extra
echo "Written by `hostname` on `date`" >> /extra/datestamp
cat /extra/datestamp
You should see something like the following, including the date stamp written by the other node:
Written by ftsys9.mydomain on Mon Jan 22 14:23:44 PST 2006
Written by ftsys10.
NOTE: Be careful if you use YAST or YAST2 to configure volume groups, as that may cause all volume groups to be activated. After running YAST or YAST2, check that volume groups for Serviceguard packages not currently running have not been activated, and use LVM commands to deactivate any that have. For example, use the command vgchange -a n /dev/sgvg00 to deactivate the volume group sgvg00. Red Hat It is not necessary to prevent vgscan on Red Hat.
5.2.1 cmquerycl Options 5.2.1.1 Speeding up the Process In a larger or more complex cluster with many nodes, networks or disks, the cmquerycl command may take several minutes to complete. To speed up the configuration process, you can direct the command to return selected information only by using the -k and -w options: -k eliminates some disk probing, and does not return information about potential cluster lock volume groups and lock physical volumes.
cmquerycl -v -h ipv6 -C $SGCONF/clust1.conf -n ftsys9 -n ftsys10 • -h ipv4 tells Serviceguard to discover and configure only IPv4 subnets. If it does not find any eligible subnets, the command will fail. • -h ipv6 tells Serviceguard to discover and configure only IPv6 subnets. If it does not find any eligible subnets, the command will fail.
A cluster lock LUN or quorum server, is required for two-node clusters. To obtain a cluster configuration file that includes Quorum Server parameters, use the -q option of the cmquerycl command, specifying a Quorum Server hostname or IP address, for example (all on one line): cmquerycl -q -n ftsys9 -n ftsys10 -C .
15.13.164.0 15.13.172.0 15.13.165.0 15.13.182.0 15.244.65.0 15.244.56.0 lan1 lan1 lan1 lan1 lan2 lan2 lan2 lan2 lan3 lan3 lan4 lan4 (nodeA) (nodeB) (nodeC) (nodeD) (nodeA) (nodeB) (nodeC) (nodeD) (nodeA) (nodeB) (nodeC) (nodeD) lan3 lan3 lan3 lan3 (nodeA) (nodeB) (nodeC) (nodeD) IPv6: 3ffe:1111::/64 3ffe:2222::/64 Possible Heartbeat IPs: 15.13.164.0 15.13.164.1 15.13.164.2 15.13.172.0 15.13.172.158 15.13.172.159 15.13.165.0 15.13.165.1 15.13.165.2 15.13.182.0 15.13.182.158 15.13.182.
The heartbeat can comprise multiple IPv4 subnets joined by a router. In this case at least two heartbeat paths must be configured for each cluster node. See also the discussion of HEARTBEAT_IP (page 95), and “Cross-Subnet Configurations” (page 27). 5.2.6 Specifying Maximum Number of Configured Packages This value must be equal to or greater than the number of packages currently configured in the cluster. The count includes all types of packages: failover, multi-node, and system multi-node.
Figure 27 Access Roles 5.2.8.3 Levels of Access Serviceguard recognizes two levels of access, root and non-root: • Root access: Full capabilities; only role allowed to configure the cluster. As Figure 27 shows, users with root access have complete control over the configuration of the cluster and its packages. This is the only role allowed to use the cmcheckconf, cmapplyconf, cmdeleteconf, and cmmodnet -a commands.
IMPORTANT: Users on systems outside the cluster can gain Serviceguard root access privileges to configure the cluster only via a secure connection (rsh or ssh). • Non-root access: Other users can be assigned one of four roles: ◦ Full Admin: Allowed to perform cluster administration, package administration, and cluster and package view operations. These users can administer the cluster, but cannot configure or create a cluster. Full Admin includes the privileges of the Package Admin role.
Access control policies are defined by three parameters in the configuration file: • Each USER_NAME can consist either of the literal ANY_USER, or a maximum of 8 login names from the /etc/passwd file on USER_HOST. The names must be separated by spaces or tabs, for example:

# Policy 1:
USER_NAME john fred patrick
USER_HOST bit
USER_ROLE PACKAGE_ADMIN

• USER_HOST is the node where USER_NAME will issue Serviceguard commands.
USER_HOST bit
USER_ROLE PACKAGE_ADMIN

If this policy is defined in the cluster configuration file, it grants user john the PACKAGE_ADMIN role for any package on node bit. User john also has the MONITOR role for the entire cluster, because PACKAGE_ADMIN includes MONITOR. If the policy is defined in the package configuration file for PackageA, then user john on node bit has the PACKAGE_ADMIN role only for PackageA. Plan the cluster’s roles and validate them as soon as possible.
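For illustration, a second policy could grant read-only monitoring rights; the user name operator1 is hypothetical, and this sketch assumes the MONITOR role and the CLUSTER_MEMBER_NODE host keyword described in this chapter:

```
# Policy 2:
USER_NAME operator1
USER_HOST CLUSTER_MEMBER_NODE
USER_ROLE MONITOR
```

A user matching this policy could run read-only commands such as cmviewcl from any cluster node, but could not administer packages.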
5.2.8.5 Package versus Cluster Roles Package configuration will fail if there is any conflict in roles between the package configuration and the cluster configuration, so it is a good idea to have the cluster configuration file in front of you when you create roles for a package; use cmgetconf to get a listing of the cluster configuration file.
# Warning: Neither a quorum server nor a lock lun was specified.
# A Quorum Server or a lock lun is required for clusters of only two nodes.

If you attempt to configure both a quorum server and a lock LUN, the following message appears on standard output when issuing the cmcheckconf or cmapplyconf command:

Duplicate cluster lock, line 55. Quorum Server already specified.
3. Verify that nodes leave and enter the cluster as expected using the following steps:
• Halt the node. You can use Serviceguard Manager or the cmhaltnode command.
• Check the cluster membership to verify that the node has left the cluster. You can use the Serviceguard Manager main page or the cmviewcl command.
• Start the node. You can use Serviceguard Manager or the cmrunnode command.
• Verify that the node has returned to operation.
NOTE: The /sbin/init.d/cmcluster file may call files that Serviceguard stores in $SGCONF/rc. (See “Understanding the Location of Serviceguard Files” (page 135) for information about Serviceguard directories on different Linux distributions.) This directory is for Serviceguard use only! Do not move, delete, modify, or add files in this directory.
If you must disable identd, do the following on each node after installing Serviceguard but before each node rejoins the cluster (for example, before issuing a cmrunnode or cmruncl).

For Red Hat and SUSE:
1. Change the value of the server_args parameter in the file /etc/xinetd.d/hacl-cfg from -c to -c -i.
2. Restart xinetd: /etc/init.d/xinetd restart

5.3.6 Deleting the Cluster Configuration
You can delete a cluster configuration by means of the cmdeleteconf command.
Table 8 describes the various scenarios for rebuilding the deadman driver:

Table 8 Rebuilding the Deadman Driver

Scenario: Online OS upgrade between minor releases (for example, RHEL 6.1 to RHEL 6.2)
Should the deadman driver be rebuilt? Yes
Description: You must manually rebuild the deadman driver, as the OS upgrade process would have updated the kernel.

Scenario: Fresh installation of the OS
Should the deadman driver be rebuilt? No
Description: Whenever you install the OS for the first time, Serviceguard must be installed afresh.
6 Configuring Packages and Their Services

Serviceguard packages group together applications and the services and resources they depend on. The typical Serviceguard package is a failover package that starts on one node but can be moved (“failed over”) to another if necessary. For more information, see “What is Serviceguard for Linux?” (page 19), “How the Package Manager Works” (page 43), and “Package Configuration Planning” (page 104).
When you have made these decisions, you are ready to generate the package configuration file; see “Generating the Package Configuration File” (page 191). 6.1.1 Types of Package: Failover, Multi-Node, System Multi-Node There are three types of packages: • Failover packages. This is the most common type of package. Failover packages run on one node at a time.
and start the package for the first time. But if you then halt the multi-node package via cmhaltpkg, it can be re-started only by means of cmrunpkg, not cmmodpkg. • If a multi-node package is halted via cmhaltpkg, package switching is not disabled. This means that the halted package will start to run on a rebooted node, if it is configured to run on that node and its dependencies are met.
Table 9 Base Modules

Module name: failover
Parameters: package_name (page 175) *, module_name (page 175) *, module_version (page 175) *, package_type (page 175), package_description (page 175) *, node_name (page 176), auto_run (page 176), node_fail_fast_enabled (page 177), run_script_timeout (page 177), halt_script_timeout (page 177), successor_halt_script_timeout (page 178)
Comments: Base module. Use as primary building block for failover packages.
Table 10 Optional Modules

Module name: dependency
Parameters: dependency_name (page 180) *, dependency_condition (page 180), dependency_location (page 180)
Comments: Add to a base module to create a package that depends on one or more other packages.

Module name: weight
Parameters: weight_name (page 181) *, weight_value (page 181) *
Comments: Add to a base module to create a package that has weight that will be counted against a node's capacity.
Table 10 Optional Modules (continued)

Module name: acp
Parameters: user_name (page 191), user_host (page 190), user_role (page 191)
Comments: Add to a base module to configure Access Control Policies for the package.

Module name: all
Parameters: all parameters
Comments: Use if you are creating a complex package that requires most or all of the optional parameters; or if you want to see the specifications and comments for all available parameters.
NOTE: For more information, see the comments in the editable configuration file output by the cmmakepkg command, and the cmmakepkg (1m) manpage.
6.1.4.6 node_name The node on which this package can run, or a list of nodes in order of priority, or an asterisk (*) to indicate all nodes. The default is *. For system multi-node packages, you must specify node_name *. If you use a list, specify each node on a new line, preceded by the literal node_name, for example:

node_name <node1>
node_name <node2>
node_name <node3>

The order in which you specify the node names is important.
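For example, a prioritized three-node list might look like this (ftsys9, ftsys10, and ftsys11 are illustrative node names; Serviceguard tries ftsys9 first):

```
node_name ftsys9
node_name ftsys10
node_name ftsys11
```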
6.1.4.8 node_fail_fast_enabled Can be set to yes or no. The default is no. yes means the node on which the package is running will be halted (rebooted) if the package fails; no means Serviceguard will not halt the system.
If a timeout occurs: • Switching will be disabled. • The current node will be disabled from running the package. If a halt-script timeout occurs, you may need to perform manual cleanup. See Chapter 8: “Troubleshooting Your Cluster” (page 249). 6.1.4.11 successor_halt_timeout Specifies how long, in seconds, Serviceguard will wait for packages that depend on this package to halt, before halting this package. Can be 0 through 4294, or no_timeout. The default is no_timeout.
• configured_node means Serviceguard will attempt to start the package on the first available node in the list you provide under node_name (page 176). • min_package_node means Serviceguard will start the package on whichever node in the node_name list has the fewest packages running at the time. • site_preferred means Serviceguard will try all the eligible nodes on the local SITE before failing the package over to a node on another SITE.
If you assign a priority, it must be unique in this cluster. A lower number indicates a higher priority, and a numerical priority is higher than no_priority. HP recommends assigning values in increments of 20 so as to leave gaps in the sequence; otherwise you may have to shuffle all the existing priorities when assigning priority to a new package. IMPORTANT: Because priority is a matter of ranking, a lower number indicates a higher priority (20 is a higher priority than 40).
6.1.4.21 weight_name, weight_value These parameters specify a weight for a package; this weight is compared to a node's available capacity (defined by the CAPACITY_NAME and CAPACITY_VALUE parameters in the cluster configuration file) to determine whether the package can run there. Both parameters are optional, but if weight_value is specified, weight_name must also be specified, and must come first. You can define up to four weights, corresponding to four different capacities, per cluster.
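As a sketch of how weights and capacities fit together (the name memory_units and the values are invented for illustration): the cluster configuration file defines each node's capacity, and the package configuration file defines the weight counted against it. A package with weight 40 could not start on a node whose remaining memory_units capacity is less than 40.

```
# Cluster configuration file (defined per node):
CAPACITY_NAME    memory_units
CAPACITY_VALUE   100

# Package configuration file:
weight_name      memory_units
weight_value     40
```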
6.1.4.23 monitored_subnet_access In cross-subnet configurations, specifies whether each monitored_subnet is accessible on all nodes in the package’s node_name list (page 176), or only some. Valid values are PARTIAL, meaning that at least one of the nodes has access to the subnet, but not all; and FULL, meaning that all nodes have access to the subnet. The default is FULL, and it is in effect if monitored_subnet_access is not specified.
6.1.4.25 ip_subnet_node In a cross-subnet configuration, specifies which nodes an ip_subnet is configured on. If no ip_subnet_nodes are listed under an ip_subnet, it is assumed to be configured on all nodes in this package’s node_name list (page 176). Can be added or deleted while the package is running, with these restrictions: • The package must not be running on the node that is being added or deleted.
NOTE: Be careful when defining service run commands. Each run command is executed in the following way: • The cmrunserv command executes the run command. • Serviceguard monitors the process ID (PID) of the process the run command creates. • When the command exits, Serviceguard determines that a failure has occurred and takes appropriate action, which may include transferring the package to an adoptive node.
• generic_resource_name
• generic_resource_evaluation_type
• generic_resource_up_criteria

See the descriptions that follow. The following is an example of defining generic resource parameters:

generic_resource_name             cpu_monitor
generic_resource_evaluation_type  during_package_start
generic_resource_up_criteria      <50

See the package configuration file for more examples. 6.1.4.33 generic_resource_evaluation_type Defines when the status of a generic resource is evaluated.
NOTE: Operators other than the ones mentioned above are not supported. This attribute does not accept more than one up criterion; for example, "> 10, < 100" is not valid.
fs_fsck_opt ""
fs_type "ext3"

A logical volume must be built on an LVM volume group. Logical volumes can be entered in any order. A gfs file system can be configured using only the fs_name, fs_directory, and fs_mount_opt parameters; see the configuration file for an example. Additional rules apply for gfs as explained under fs_type.

NOTE: Red Hat GFS is not supported in Serviceguard A.11.20.00.

For an NFS-imported file system, see the discussion under fs_name (page 187) and fs_server (page 188).
For an NFS-imported file system, the additional parameters required are fs_server, fs_directory, fs_type, and fs_mount_opt; see fs_server (page 188) for an example. CAUTION: Before configuring an NFS-imported file system into a package, make sure you have read and understood the rules and guidelines under “Planning for NFS-mounted File Systems” (page 106), and configured the cluster parameter CONFIGURED_IO_TIMEOUT_EXTENSION, described under “Cluster Configuration Parameters ” (page 91).
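As a hypothetical sketch only (the server name, paths, and comments are invented; see fs_server (page 188) for the manual's own example and the exact parameter semantics), an NFS-imported file system entry might look like:

```
fs_name       /var/opt/app/exports    # path exported by the NFS server (hypothetical)
fs_server     nfs-server1             # NFS server hostname (hypothetical)
fs_directory  /mnt/app_data           # local mount point (hypothetical)
fs_type       "nfs"
```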
Table 11 File System Types and Platforms

File system type   Supported platform
ext3               Red Hat Enterprise Linux 5; Red Hat Enterprise Linux 6; SUSE Linux Enterprise Server 11
ext4               Red Hat Enterprise Linux 5 [1]; Red Hat Enterprise Linux 6
XFS                Red Hat Enterprise Linux 6; SUSE Linux Enterprise Server 11

[1] This is supported from the SGLX_00354.tar.shar patch and later.
6.1.4.48 pv Physical volume on which persistent reservations (PR) will be made if the device supports it. IMPORTANT: This parameter is for use only by HP partners, who should follow the instructions in the package configuration file. For information about Serviceguard's implementation of PR, see “About Persistent Reservations” (page 72). 6.1.4.49 pev_ Specifies a package environment variable that can be passed to external_pre_script, external_script, or both, by means of the cmgetpkgenv command.
only the hostname portion, of the fully qualified domain name). As with user_name, be careful to spell the keywords exactly as given. 6.1.4.53 user_name Specifies the name of a user who has permission to administer this package. See also user_host (page 190) and user_role; these three parameters together define the access control policy for this package (see “Controlling Access to the Cluster” (page 158)). These parameters must be defined in this order: user_name, user_host, user_role.
modules. This file will consist of a base module (failover, multi-node or system multi-node) plus the modules that contain the additional parameters you have decided to include. 6.2.1 Before You Start Before you start building a package, create a subdirectory for it in the $SGCONF directory, for example: mkdir $SGCONF/pkg1 (See “Understanding the Location of Serviceguard Files” (page 135) for information about Serviceguard pathnames.) 6.2.
• To generate a configuration file adding the Persistent Reservation module to an existing package:
cmmakepkg -i $SGCONF/pkg1/pkg1.conf -m sg/pr_cntl
• To create a serviceguard-xdc package in a serviceguard-xdc environment:
cmmakepkg -m sg/all -m xdc/xdc pkg_xdc.conf
cmcheckconf -P pkg_xdc.conf
cmapplyconf -P pkg_xdc.conf

6.2.3 Next Step
The next step is to edit the configuration file you have generated; see “Editing the Configuration File” (page 193).
• node_name. Enter the name of each cluster node on which this package can run, with a separate entry on a separate line for each node. • auto_run. For failover packages, enter yes to allow Serviceguard to start the package on the first available node specified by node_name, and to automatically restart it later if it fails. Enter no to keep Serviceguard from automatically starting the package. • node_fail_fast_enabled.
◦ enter values for service_fail_fast_enabled and service_halt_timeout if you need to change them from their defaults. ◦ service_restart if you want the package to restart the service if it exits. (A value of unlimited can be useful if you want the service to execute in a loop, rather than exit and halt the package.) Include a service entry for disk monitoring if the package depends on monitored disks.
• If the package will run an external script, use the external_script parameter (see (page 190)) to specify the full pathname of the script, for example, $SGCONF/pkg1/script1. See “About External Scripts” (page 127), and the comments in the configuration file, for more information. • Configure the Access Control Policy for up to eight specific users or any_user. The only user role you can configure in the package configuration file is package_admin for the package in question.
For more information, see the manpage for cmcheckconf (1m) and “Verifying Cluster and Package Components” (page 208). When cmcheckconf has completed without errors, apply the package configuration, for example: cmapplyconf -P $SGCONF/pkg1/pkg1.conf This adds the package configuration information to the binary cluster configuration file in the $SGCONF directory and distributes it to all the cluster nodes.
email_id specified in the package configuration file is sguser@xyz.com. The following e-mail notification is sent to sguser@xyz.com:

Date: Tue, 9 Oct 2012 23:18:01 -0700
From: root
Message-Id: <201210100618.q9A6I1d9023167@node1.hp.com>
To: sguser@xyz.com
Subject: Serviceguard Alert: Package xdcpkg has lost access to my_disk2 of md0 on node1

Hi,
There seems to be an issue in the package xdcpkg in your Serviceguard cluster. For more information, check the package and system logs of node1.
7 Cluster and Package Maintenance

This chapter describes the cmviewcl command, then shows how to start and halt a cluster or an individual node, how to perform permanent reconfiguration, and how to start, halt, move, and modify packages during routine maintenance of the cluster.
• starting - The cluster is in the process of determining its active membership. At least one cluster daemon is running. • unknown - The node on which the cmviewcl command is issued cannot communicate with other nodes in the cluster. 7.1.4 Node Status and State The status of a node is either up (active as a member of the cluster) or down (inactive in the cluster), depending on whether its cluster daemon is running or not.
• detached - A package is said to be detached from the cluster or node where it was running when the cluster or node is halted with the -d option. Serviceguard no longer monitors this package. The last known status of the package before it was detached from the cluster was up.
• unknown - Serviceguard could not determine the status at the time cmviewcl was run.

A system multi-node package is up when it is running on all the active cluster nodes.
7.1.6 Package Switching Attributes cmviewcl shows the following package switching information: • AUTO_RUN: Can be enabled or disabled. For failover packages, enabled means that the package starts when the cluster starts, and Serviceguard can switch the package to another node in the event of failure. For system multi-node packages, enabled means an instance of the package can start on a new node joining the cluster (disabled means it will not).
Failover packages can also be configured with one of two values for the failback_policy parameter (page 179), and these are also displayed in the output of cmviewcl -v: • automatic: Following a failover, a package returns to its primary node when the primary node becomes available again. • manual: Following a failover, a package will run on the adoptive node until moved back to its original node by a system administrator. 7.1.
NOTE: The Script_Parameters section of the PACKAGE output of cmviewcl shows the Subnet status only for the node that the package is running on. In a cross-subnet configuration, in which the package may be able to fail over to a node on another subnet, that other subnet is not shown (see “Cross-Subnet Configurations” (page 27)). 7.1.11.
UNOWNED_PACKAGES

PACKAGE      STATUS       STATE        AUTO_RUN     NODE
pkg2         down         unowned      disabled     unowned

Policy_Parameters:
POLICY_NAME     CONFIGURED_VALUE
Failover        configured_node
Failback        manual

Script_Parameters:
ITEM               STATUS    NODE_NAME    NAME
Service            down                   service2
Generic Resource   up        ftsys9       sfm_disk1
Subnet             up                     15.13.168.0
Generic Resource   up        ftsys10      sfm_disk1

Node_Switching_Parameters:
NODE_TYPE    STATUS       SWITCHING
Primary      up           enabled
Alternate    up           enabled
Policy_Parameters:
POLICY_NAME     CONFIGURED_VALUE
Failover        configured_node
Failback        manual

Script_Parameters:
ITEM               STATUS    MAX_RESTARTS    RESTARTS    NAME
Service            up        0               0           service2
Service            up        0               0           sfm_disk_monitor
Subnet             up                                    15.13.168.0
Generic Resource   up        0               0           sfm_disk

Node_Switching_Parameters:
NODE_TYPE    STATUS       SWITCHING    NAME
Primary      up           enabled      ftsys10   (current)
Alternate    up           enabled      ftsys9

NODE         STATUS       STATE
ftsys10      up           running

Network_Parameters:
INTERFACE    STATUS       NAME
PRIMARY      up           eth0
PRIMARY      up           eth1
7.1.11.7 Viewing Information about Unowned Packages The following example shows packages that are currently unowned, that is, not running on any configured node.
NOTE: • You can consider setting up a cron (1m) job to run the cmcheckconf command regularly. For more information, see “Setting up Periodic Cluster Verification” (page 210). • These new checks are not done for legacy packages. For information about legacy and modular packages, see Chapter 6: Configuring Packages and Their Services (page 169). • The cmapplyconf command performs the same verification as the cmcheckconf command.
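For example, a root crontab entry for nightly verification might look like this (the schedule and log path are arbitrary choices, and cmcheckconf is assumed to be in root's PATH):

```
# Verify the cluster and package configuration every night at 2:00 a.m.
0 2 * * * cmcheckconf -v >> /var/log/cmcheckconf.log 2>&1
```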
Table 12 Verifying Cluster and Package Components (continued)

Component (Context): Quorum Server (cluster)
Tool or Command; More Information: cmcheckconf (1m), cmapplyconf (1m)
Description: These commands verify that the quorum server, if used, is running and all nodes are authorized to access it; and, if more than one IP address is specified, that the quorum server is reachable from all nodes through both the IP addresses.
Table 12 Verifying Cluster and Package Components (continued)

Component (Context): External scripts and pre-scripts (modular package)
Tool or Command; More Information: cmcheckconf (1m), cmapplyconf (1m)
Description: A non-zero return value from any script causes the commands to fail.

Component (Context): NFS server connectivity (package)
Tool or Command; More Information: cmcheckconf (1m), cmapplyconf (1m)
Description: If the package configuration file contains an NFS file system, the commands validate the following:
• Connectivity to the NFS server from all the package nodes.
• Unreachable DNS server.
• Consistency of settings in .rhosts.
• Nested mount points.
cmruncl -v -n ftsys9 -n ftsys10 CAUTION: HP Serviceguard cannot guarantee data integrity if you try to start a cluster with the cmruncl -n command while a subset of the cluster's nodes are already running a cluster. If the network connection is down between nodes, using cmruncl -n might result in a second cluster forming, and this second cluster might start up the same applications that are already running on the other cluster. The result could be two applications overwriting each other's data on the disks.
This halts any packages running on the node ftsys9 by executing the halt instructions in each package's master control script. ftsys9 is halted and the packages start on the adoptive node, ftsys10. 7.2.4 Halting the Entire Cluster You can use Serviceguard Manager, or Serviceguard commands as shown below, to halt a running cluster. The cmhaltcl command can be used to halt the entire cluster. This command causes all nodes in a configured cluster to halt their HP Serviceguard daemons.
• Restart normal package monitoring by restarting the node (cmrunnode) or the cluster (cmruncl).
• You can forcefully halt a detached node (cmhaltnode (1m)) with the -f option.

7.3.2 Rules and Restrictions
The following rules and restrictions apply.
• All the nodes in the cluster must be running Serviceguard A.11.20.10 or later.
• All the configured cluster nodes must be reachable by an available network.
you would need to run cmhaltpkg (1m) to halt the package on the node where it is detached. • You cannot halt a package that is in a transitory state such as STARTING or HALTING. For more information about package states, see “Package Status and State” (page 200). • A package that is in a DETACHED or MAINTENANCE state cannot be moved to a halt_aborted state or vice versa. For more information, see “Handling Failures During Package Halt” (page 218). 7.3.
• When a node having detached packages comes back up after a reboot, it can:
◦ Rejoin the cluster, in which case the detached packages move to the "running" or "failed" state. If the detached packages are moved to the running state, they must then be halted and rerun, because they may have several inconsistencies after the reboot.
◦ Not rejoin the cluster, in which case the detached packages remain detached. Such packages must be halted and rerun to avoid any inconsistencies that can be caused by the reboot.
NOTE: If you do not do this, the cmhaltcl in the next step will fail.

3. Halt the cluster with the -d (detach) option:

cmhaltcl -d

NOTE: -d and -f are mutually exclusive. See cmhaltcl (1m) for more information.

To re-attach the packages, restart the cluster:

cmrunnode node1
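Putting the steps above together, a maintenance session using detached packages might look like the following sketch (node names are illustrative, and any preparatory steps required by the numbered procedure above are assumed to have been done):

```
cmhaltcl -d        # halt the cluster; packages keep running, detached
# ... perform cluster-wide maintenance ...
cmrunnode node1    # restart each node; its detached packages re-attach
cmrunnode node2
```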
You can use Serviceguard Manager to start a package, or Serviceguard commands as shown below. Use the cmrunpkg command to run the package on a particular node, then use the cmmodpkg command to enable switching for the package; for example: cmrunpkg -n ftsys9 pkg1 cmmodpkg -e pkg1 This starts up the package on ftsys9, then enables package switching. This sequence is necessary when a package has previously been halted on some node, since halting the package disables switching. 7.4.1.
NOTE: Non-native Serviceguard modules are those that are not delivered with the Serviceguard product. These are additional modules such as those supplied with the HP Serviceguard toolkits (for example, the HP Serviceguard Contributed Toolkit Suite, Oracle, the NFS toolkit, EDB PPAS, Sybase, and so on). This allows errors to be cleaned up manually during the halt process, thus minimizing the risk of other follow-on errors and reducing package downtime.
To move the package, first halt it where it is running using the cmhaltpkg command. This action not only halts the package, but also disables package switching. After it halts, run the package on the new node using the cmrunpkg command, then re-enable switching as described below. 7.4.4 Changing Package Switching Behavior There are two options to consider: • Whether the package can switch (fail over) or not. • Whether the package can switch to a particular node or not.
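For example, the following sequence (package and node names are illustrative) moves pkg1 to ftsys10 and then restores normal failover behavior:

```
cmhaltpkg pkg1              # halts pkg1 and disables package switching
cmrunpkg -n ftsys10 pkg1    # run the package on the new node
cmmodpkg -e pkg1            # re-enable package switching
```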
• Maintenance mode is chiefly useful for modifying networks while the package is running. See “Performing Maintenance Using Maintenance Mode” (page 223). • Partial-startup maintenance mode allows you to work on package services, file systems, and volume groups. See “Performing Maintenance Using Partial-Startup Maintenance Mode” (page 224). • Neither maintenance mode nor partial-startup maintenance mode can be used for legacy packages, multi-node packages, or system multi-node packages.
◦ A script times out ◦ The limit of a restart count is exceeded 7.5.1.1 Rules for a Package in Maintenance Mode or Partial-Startup Maintenance Mode IMPORTANT: See the latest Serviceguard release notes for important information about version requirements for package maintenance. • The package must have package switching disabled before you can put it in maintenance mode. • You can put a package in maintenance mode only on one node.
7.5.1.2 Dependency Rules for a Package in Maintenance Mode or Partial-Startup Maintenance Mode You cannot configure new dependencies involving a package running in maintenance mode, and in addition the following rules apply (we'll call the package in maintenance mode pkgA). • The packages that depend on pkgA must be down and disabled when you place pkgA in maintenance mode.
7.5.3 Performing Maintenance Using Partial-Startup Maintenance Mode To put a package in partial-startup maintenance mode, you put it in maintenance mode, then restart it, running only those modules that you will not be working on. 7.5.3.1 Procedure Follow this procedure to perform maintenance on a package. In this example, we'll assume a package pkg1 is running on node1, and that we want to do maintenance on the package's services. 1. Halt the package: cmhaltpkg pkg1 2.
You can also use -e in combination with -m. This has the effect of starting all modules up to and including the module identified by -m, except the module identified by -e. In this case the excluded (-e) module must be earlier in the execution sequence (as listed near the top of the package's configuration file) than the -m module. For example: cmrunpkg -m sg/services -e sg/package_ip pkg1 NOTE: The full execution sequence for starting a package is: 1. The master control script itself 2.
7.6.1 Previewing the Effect of Cluster Changes Many variables affect package placement, including the availability of cluster nodes; the availability of networks and other resources on those nodes; failover and failback policies; and package weights, dependencies, and priorities, if you have configured them. You can preview the effect on packages of certain actions or events before they actually occur.
cmmodpkg -e -t pkg1

You will see output something like this:

package:pkg3|node:node2|action:failing
package:pkg2|node:node2|action:failing
package:pkg2|node:node1|action:starting
package:pkg3|node:node1|action:starting
package:pkg1|node:node1|action:starting
cmmodpkg: Command preview completed successfully

This shows that pkg1, when enabled, will “drag” pkg2 and pkg3 to its primary node, node1. It can do this because of its higher priority; see “Dragging Rules for Simple Dependencies” (page 114).
Running cmeval confirms that all three packages will successfully start on node2 (assuming conditions do not change between now and when you actually enable pkg1, and there are no failures in the run scripts.) NOTE: cmeval cannot predict run and halt script failures.
1. Use the following command to store a current copy of the existing cluster configuration in a temporary file in case you need to revert to it: cmgetconf -C temp.conf 2. Specify a new set of nodes to be configured and generate a template of the new configuration (all on one line): cmquerycl -C clconfig.conf -c cluster1 -n ftsys8 -n ftsys9 -n ftsys10 3. 4. Edit clconfig.conf to check the information about the new node. Verify the new configuration: cmcheckconf -C clconfig.conf 5.
6. From ftsys8 or ftsys9, apply the changes to the configuration and distribute the new binary configuration file to all cluster nodes:

cmapplyconf -C clconfig.conf

NOTE: If you are trying to remove an unreachable node on which many packages are configured to run, you may see the following message:

The configuration change is too large to process while the cluster is running. Split the configuration change into multiple requests or halt the cluster.
• You cannot delete a subnet or IP address from a node while a package that uses it (as a monitored_subnet, ip_subnet, or ip_address) is configured to run on that node. Information about these parameters begins at monitored_subnet (page 181). • You cannot change the IP configuration of an interface (NIC) used by the cluster in a single transaction (cmapplyconf).
NETWORK_INTERFACE lan3
NODE_NAME ftsys10
NETWORK_INTERFACE lan1
HEARTBEAT_IP 192.3.17.19
NETWORK_INTERFACE lan0
HEARTBEAT_IP 15.13.170.19
NETWORK_INTERFACE lan3

3. Verify the new configuration:

cmcheckconf -C clconfig.conf

4. Apply the changes to the configuration and distribute the new binary configuration file to all cluster nodes:

cmapplyconf -C clconfig.conf

If you were configuring the subnet for data instead, and wanted to add it to a package configuration, you would now need to:
7.6.5 Updating the Cluster Lock LUN Configuration Online Proceed as follows. IMPORTANT: See “What Happens when You Change the Quorum Configuration Online” (page 43) for important information. 1. 2. 3. In the cluster configuration file, modify the value of CLUSTER_LOCK_LUN for each node. Run cmcheckconf to check the configuration. Run cmapplyconf to apply the configuration. If you need to replace the physical device, see “Replacing a Lock LUN” (page 251). 7.6.
prioritized list of cluster nodes on which the package can run together with definitions of the acceptable types of failover allowed for the package. 7.7.1.1 Using Serviceguard Manager to Configure a Package You can create a legacy package and its control script in Serviceguard Manager; use the Help for detailed instructions. 7.7.1.2 Using Serviceguard Commands to Configure a Package Use the following procedure to create a legacy package. 1.
• FAILBACK_POLICY. For failover packages, enter the failback_policy (page 179). • NODE_NAME. Enter the node or nodes on which the package can run; as described under node_name (page 176). • AUTO_RUN. Configure the package to start up automatically or manually; as described under auto_run (page 176). • NODE_FAIL_FAST_ENABLED. Enter the policy as described under node_fail_fast_enabled (page 177). • RUN_SCRIPT and HALT_SCRIPT.
7.7.2 Creating the Package Control Script For legacy packages, the package control script contains all the information necessary to run all the services in the package, monitor them during operation, react to a failure, and halt the package when necessary. You can use Serviceguard Manager, Serviceguard commands, or a combination of both, to create or modify the package control script. Each package must have a separate control script, which must be executable.
7.7.2.2 Adding Customer Defined Functions to the Package Control Script You can add additional shell commands to the package control script to be executed whenever the package starts or stops. Enter these commands in the CUSTOMER DEFINED FUNCTIONS area of the script. If your package needs to run short-lived processes, such as commands to initialize or halt a packaged application, you can also run these from the CUSTOMER DEFINED FUNCTIONS.
7.7.3 Verifying the Package Configuration Serviceguard checks the configuration you created and reports any errors. For legacy packages, you can do this in Serviceguard Manager: click Check to verify the package configuration you have done under any package configuration tab, or to check changes you have made to the control script. Click Apply to verify the package as a whole. See the local Help for more details.
7.7.4.3 Distributing the Binary Cluster Configuration File with Linux Commands
Use the following steps from the node on which you created the cluster and package configuration files:
• Verify that the configuration file is correct. Use the following command:
cmcheckconf -C $SGCONF/cmcl.conf -P $SGCONF/pkg1/pkg1.conf
• Generate the binary configuration file and distribute it across the nodes:
cmapplyconf -v -C $SGCONF/cmcl.conf -P $SGCONF/pkg1/pkg1.conf
NOTE: Configuring monitored_subnet_access as FULL (or not configuring monitored_subnet_access) for either of these subnets will cause the package configuration to fail, because neither subnet is available on all the nodes. 7.7.5.3 Creating Subnet-Specific Package Control Scripts Now you need to create control scripts to run the package on the four nodes.
NOTE: The cmmigratepkg command requires Perl version 5.8.3 or higher on the system on which you run the command. 7.8.2 Reconfiguring a Package on a Running Cluster You can reconfigure a package while the cluster is running, and in some cases you can reconfigure the package while the package itself is running; see “Allowable Package States During Reconfiguration ” (page 243). You can do this in Serviceguard Manager (for legacy packages), or use Serviceguard commands.
6. You can now safely delete the original external script on all nodes that are configured to run the package. 7.8.4 Reconfiguring a Package on a Halted Cluster You can also make permanent changes in the package configuration while the cluster is not running. Use the same steps as in “Reconfiguring a Package on a Running Cluster ”. 7.8.5 Adding a Package to a Running Cluster You can create a new package and add it to the cluster configuration while the cluster is up and while other packages are running.
7.8.8 Allowable Package States During Reconfiguration
In many cases, you can make changes to a package’s configuration while the package is running. The table that follows shows exceptions — cases in which the package must not be running, or in which the results might not be what you expect — as well as differences between modular and legacy packages.
Table 14 Types of Changes to Packages (continued)

Change to the Package                           Required Package State
Change halt script contents: legacy package     Package can be running, but should not be halting.
                                                Timing problems may occur if the script is changed
                                                while the package is halting.
Add or delete a service: modular package        Package can be running.
Add or delete a service: legacy package         Package must not be running.
Change service_restart: modular package         Package can be running.
Table 14 Types of Changes to Packages (continued)

Change to the Package                           Required Package State
Remove a volume group: legacy package           Package must not be running.
Change a file system: modular package           Package should not be running (unless you are
                                                only changing fs_umount_opt).
Table 14 Types of Changes to Packages (continued)

Change to the Package                           Required Package State
Add or delete a configured dependency           Both packages can be either running or halted.
                                                Special rules apply to packages in maintenance
                                                mode; see “Dependency Rules for a Package in
                                                Maintenance Mode or Partial-Startup Maintenance
                                                Mode” (page 223). For dependency purposes, a
                                                package being reconfigured is considered to be UP.
7.8.8.1 Changes that Will Trigger Warnings
Changes to the following will trigger warnings, giving you a chance to cancel, if the change would cause the package to fail.
NOTE: You will not be able to cancel if you use cmapplyconf -f.
• Package nodes
• Package dependencies
• Package weights (and also node capacity, defined in the cluster configuration file)
• Package priority
• auto_run
• failback_policy
7.11 Removing Serviceguard from a System
If you want to remove a node permanently from Serviceguard, use the rpm -e command to delete the software.
CAUTION: Remove the node from the cluster first. If you run the rpm -e command on a server that is still a member of a cluster, it will cause that cluster to halt and the cluster to be deleted.
To remove Serviceguard:
1. If the node is an active member of a cluster, halt the node first.
8 Troubleshooting Your Cluster This chapter describes how to verify cluster operation, how to review cluster status, how to add and replace hardware, and how to solve some typical cluster problems.
You can also test the package manager using generic resources. Perform the following procedure for each package on the cluster:
1. Obtain the generic resource that is configured in a package by entering:
cmviewcl -v -p <package_name>
2. Set the status of the generic resource to DOWN using the following command:
cmsetresource -r <generic_resource_name> -s down
3. To view the package status, enter:
cmviewcl -v
The package should be running on the specified adoptive node.
4.
• All cables • Disk interface cards Some monitoring can be done through simple physical inspection, but for the most comprehensive monitoring, you should examine the system log file (/var/log/messages) periodically for reports on all configured HA devices. The presence of errors relating to a device will show the need for maintenance. 8.3 Replacing Disks The procedure for replacing a faulty disk mechanism depends on the type of disk configuration you are using.
part of the recovery. Use the $SGCONF/scripts/sg/pr_cleanup script to do this. (The script is also in $SGCONF/bin/. See “Understanding the Location of Serviceguard Files” (page 135) for the locations of Serviceguard directories on various Linux distributions.)
7. If necessary, add the node back into the cluster using the cmrunnode command. (You can omit this step if the node is configured to join the cluster automatically.) Now Serviceguard will detect that the MAC address (LLA) of the card has changed from the value stored in the cluster binary configuration file, and it will notify the other nodes in the cluster of the new MAC address. The cluster will operate normally after this.
4. Start the quorum server as follows:
• Use the init q command to run the quorum server.
Or
• Create a package in another cluster for the Quorum Server, as described in the Release Notes for your version of Quorum Server. They can be found at http://www.hp.com/go/hpux-serviceguard-docs (select HP Serviceguard Quorum Server Software).
5. All nodes in all clusters that were using the old quorum server will connect to the new quorum server.
TX packets:5741486 errors:1 dropped:0 overruns:1 carrier:896 collisions:26706 txqueuelen:100 Interrupt:9 Base address:0xdc00 eth1 Link encap:Ethernet HWaddr 00:50:DA:64:8A:7C inet addr:192.168.1.106 Bcast:192.168.1.255 Mask:255.255.255.
Dec 14 14:34:45 star04 cmcld[2048]: Examine the file /usr/local/cmcluster/pkg5/pkg5_run.log for more details.
The following is an example of a successful package starting:
Dec 14 14:39:27 star04 CM-CMD[2096]: cmruncl
Dec 14 14:39:27 star04 cmcld[2098]: Starting cluster management protocols.
It doesn't check: • The correct setup of the power circuits. • The correctness of the package configuration script. 8.7.6 Reviewing the LAN Configuration The following networking commands can be used to diagnose problems: • ifconfig can be used to examine the LAN configuration. This command lists all IP addresses assigned to each LAN interface card. • arp -a can be used to check the arp tables. • cmscancl can be used to test IP-level connectivity between network interfaces in the cluster.
Unable to halt the detached package on node <node_name> as the node is not reachable. Retry once the node is reachable.
In such a case, the node should be powered up and be accessible. You must then rerun the cmhaltpkg command.
8.8.3 Cluster Re-formations Caused by Temporary Conditions
You may see Serviceguard error messages, such as the following, which indicate that a node is having problems:
Member node_name seems unhealthy, not receiving heartbeats from it.
For more information, including requirements and recommendations, see the MEMBER_TIMEOUT discussion under “Cluster Configuration Parameters ” (page 91). 8.8.5 System Administration Errors There are a number of errors you can make when configuring Serviceguard that will not show up when you start the cluster.
specified in the package control script appear in the ifconfig output under the inet addr: field in the ethX:Y block, use cmmodnet to remove them:
cmmodnet -r -i <ip_address> <subnet>
where <ip_address> is the address indicated above and <subnet> is the result of masking the <ip_address> with the mask found in the same line as the inet address in the ifconfig output.
3. Ensure that package volume groups are deactivated. First unmount any package logical volumes which are being used for file systems.
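The masking arithmetic described above can be sketched with Python's ipaddress module; the address and mask here are illustrative stand-ins for the values read from the ifconfig output:

```python
import ipaddress

# Illustrative values from an ifconfig "inet addr" line:
ip = "192.168.1.106"       # relocatable address still configured on ethX:Y
mask = "255.255.255.0"     # the Mask: field from the same line

# <subnet> is the result of masking <ip_address> with the netmask.
subnet = ipaddress.ip_interface(f"{ip}/{mask}").network.network_address

print(f"cmmodnet -r -i {ip} {subnet}")   # -> cmmodnet -r -i 192.168.1.106 192.168.1.0
```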
Feb 11 17:18:36 root@abc.hp.com ... (repeated syslog entries)
Unable to set client version at quorum server 192.6.7.2: reply timed out
Probe of quorum server 192.6.7.2 timed out
These messages could be an indication of an intermittent network problem; or the default quorum server timeout may not be sufficient. You can set the QS_TIMEOUT_EXTENSION to increase the timeout, or you can increase the MEMBER_TIMEOUT value. See “Cluster Configuration Parameters” (page 91) for more information about these parameters.
8.10 Troubleshooting Serviceguard Manager
The following section describes how to troubleshoot issues related to Serviceguard Manager.

Problem: “Service Temporarily Unavailable” when trying to launch Serviceguard Manager.
Solution:
• Ensure that a loopback address is present in the /etc/hosts file:
127.0.0.1 localhost.localdomain localhost
• If the Tomcat process has not started, run the Tomcat startup command:
/opt/hp/hpsmh/tomcat/bin/startup.sh
A Designing Highly Available Cluster Applications This appendix describes how to create or port applications for high availability, with emphasis on the following topics: • Automating Application Operation • Controlling the Speed of Application Failover (page 266) • Designing Applications to Run on Multiple Systems (page 269) • Restoring Client Connections (page 272) • Handling Application Failures (page 273) • Minimizing Planned Downtime (page 274) Designing for high availability means reducing
• Minimize the reentry of data. • Engineer the system for reserve capacity to minimize the performance degradation experienced by users. A.1.2 Define Application Startup and Shutdown Applications must be restartable without manual intervention. If the application requires a switch to be flipped on a piece of hardware, then automated restart is impossible. Procedures for application startup, shutdown and monitoring must be created so that the HA software can perform these functions automatically.
running the application. After failover, if these data disks are filesystems, they must go through filesystems recovery (fsck) before the data can be accessed. To help reduce this recovery time, the smaller these filesystems are, the faster the recovery will be. Therefore, it is best to keep anything that can be replicated off the data filesystem. For example, there should be a copy of the application executables on each system rather than having one copy of the executables on a shared filesystem.
the beginning. This capability makes the application more robust and reduces the visibility of a failover to the user. A common example is a print job. Printer applications typically schedule jobs. When that job completes, the scheduler goes on to the next job.
A.2.7 Design for Replicated Data Sites Replicated data sites are a benefit for both fast failover and disaster recovery. With replicated data, data disks are not shared between systems. There is no data recovery that has to take place. This makes the recovery time faster. However, there may be performance trade-offs associated with replicating data. There are a number of ways to perform data replication, which should be fully investigated by the application designer.
A.3.1.1 Obtain Enough IP Addresses Each application receives a relocatable IP address that is separate from the stationary IP address assigned to the system itself. Therefore, a single system might have many IP addresses, one for itself and one for each of the applications that it normally runs. Therefore, IP addresses in a given subnet range will be consumed faster than without high availability. It might be necessary to acquire additional IP addresses.
over time if the application migrates. Applications that use gethostname() to determine the name for a call to gethostbyname(3) should also be avoided for the same reason. Also, the gethostbyaddr() call may return different answers over time if called with a stationary IP address. Instead, the application should always refer to the application name and relocatable IP address rather than the hostname and stationary IP address.
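The advice above can be illustrated with a short Python sketch. Here "localhost" stands in for a relocatable application name (a real package would use its own configured name, which is an assumption of this example):

```python
import socket

# Stand-in for a relocatable application name such as "pkg1app";
# "localhost" is used only so the sketch resolves on any machine.
APP_NAME = "localhost"

# Resolve the application name at the time an address is needed,
# instead of calling gethostname(), so the answer follows the
# package to whichever node it is currently running on.
infos = socket.getaddrinfo(APP_NAME, None)
addr = infos[0][4][0]
print(addr)
```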
With UDP datagram sockets, however, there is a problem. The client may connect to multiple servers utilizing the relocatable IP address and sort out the replies based on the source IP address in the server’s response message. However, the source IP address given in this response will be the stationary IP address rather than the relocatable application IP address.
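A commonly used remedy (not spelled out in the text above) is for the server to bind its UDP socket to the relocatable address rather than INADDR_ANY, so outgoing datagrams carry that address as their source. A minimal sketch, with 127.0.0.1 standing in for a relocatable IP so it runs anywhere:

```python
import socket

# 127.0.0.1 stands in for a package's relocatable IP address.
RELOCATABLE_IP = "127.0.0.1"

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# Binding the server socket to the relocatable address (rather than
# INADDR_ANY) makes the kernel use that address as the source of
# outgoing datagrams, so clients can match replies to their requests.
sock.bind((RELOCATABLE_IP, 0))
src = sock.getsockname()[0]
print(src)                      # -> 127.0.0.1
sock.close()
```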
give up after 2 minutes and go for coffee and don't come back for 28 minutes, the perceived downtime is actually 30 minutes, not 5. Factors to consider are the number of reconnection attempts to make, the frequency of reconnection attempts, and whether or not to notify the user of connection loss. There are a number of strategies to use for client reconnection: • Design clients which continue to try to reconnect to their failed server.
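One of the reconnection strategies listed above (retrying with a delay between attempts, and notifying the user of connection loss) can be sketched as follows; the function name and defaults are illustrative, not part of Serviceguard:

```python
import time

def reconnect(connect, attempts=5, first_delay=1.0, backoff=2.0, notify=print):
    """Retry `connect` with exponential backoff.

    Returns whatever `connect` returns on success, or None when all
    attempts fail; `notify` lets the client tell the user about the
    connection loss, one of the design factors mentioned above.
    """
    delay = first_delay
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except OSError as exc:
            notify(f"attempt {attempt}/{attempts} failed: {exc}")
            if attempt < attempts:
                time.sleep(delay)
                delay *= backoff
    return None
```

The number of attempts, the delay growth, and the notification channel are exactly the tunables the paragraph above says must be weighed against the user's perceived downtime.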
Ideally, if one process fails, the other processes can wait a period of time for that component to come back online. This is true whether the component is on the same system or a remote system. The failed component can be restarted automatically on the same system and rejoin the waiting processing and continue on. This type of failure can be detected and restarted within a few seconds, so the end user would never know a failure occurred.
The trade-off is that the application software must operate with different revisions of the software. In the above example, the database server might be at revision 5.0 while some of the application servers are at revision 4.0. The application must be designed to handle this type of situation.
A.6.1.2 Do Not Change the Data Layout Between Releases
Migration of the data to a new format can be very time intensive. It also almost guarantees that rolling upgrade will not be possible.
B Integrating HA Applications with Serviceguard The following is a summary of the steps you should follow to integrate an application into the Serviceguard environment: 1. Read the rest of this book, including the chapters on cluster and package configuration, and the appendix “Designing Highly Available Cluster Applications.” 2.
B.1.1 Defining Baseline Application Behavior on a Single System 1. Define a baseline behavior for the application on a standalone system: • Install the application, database, and other required resources on one of the systems. Be sure to follow Serviceguard rules in doing this: ◦ Install all shared data on separate external volume groups. ◦ Use a Journaled filesystem (JFS) as appropriate. • Perform some sort of standard test to ensure the application is running correctly.
# cmhaltpkg pkg1 # cmrunpkg -n node1 pkg1 # cmmodpkg -e pkg1 2. 3. • Fail one of the systems. For example, turn off the power on node 1. Make sure the package starts up on node 2. • Repeat failover from node 2 back to node 1. Be sure to test all combinations of application load during the testing. Repeat the failover processes under different application states such as heavy user load versus no user load, batch jobs versus online transactions, etc.
C Blank Planning Worksheets This appendix reprints blank versions of the planning worksheets described in the “Planning” chapter. You can duplicate any of these worksheets that you find useful and fill them in as a part of the planning process.
Disk Unit __________________________ Power Supply _______________________ Disk Unit __________________________ Power Supply _______________________ Disk Unit __________________________ Power Supply _______________________ Disk Unit __________________________ Power Supply _______________________ Disk Unit __________________________ Power Supply _______________________ ============================================================================ Tape Backup Power: Tape Unit __________________________
Physical Volume Name: _________________ Physical Volume Name: _________________ Physical Volume Name: _________________ ============================================================================= Volume Group Name: ___________________________________ Physical Volume Name: _________________ Physical Volume Name: _________________ Physical Volume Name: _________________ C.
Package AutoRun Enabled? ______ Node Failfast Enabled? ________ Failover Policy:_____________ Failback_policy:___________________________________ Access Policies: User:_________________ From node:_______ Role:_____________________________ User:_________________ From node:_______ Role:______________________________________________ Log level____ Log file:_______________________________________________________________________________________ Priority_____________ Successor_halt_timeout____________ dependency_n
D IPv6 Network Support
This appendix describes some of the characteristics of IPv6 network addresses, specifically:
• IPv6 Address Types
• Network Configuration Restrictions (page 288)
• Configuring IPv6 on Linux (page 288)
D.1 IPv6 Address Types
Several types of IPv6 addressing schemes are specified in RFC 2373 (IPv6 Addressing Architecture). IPv6 addresses are 128-bit identifiers for interfaces and sets of interfaces. There are various address formats for IPv6 defined by RFC 2373.
D.1.2 IPv6 Address Prefix IPv6 Address Prefix is similar to CIDR in IPv4 and is written in CIDR notation. An IPv6 address prefix is represented by the notation: IPv6-address/prefix-length where ipv6-address is an IPv6 address in any notation listed above and prefix-length is a decimal value representing how many of the leftmost contiguous bits of the address comprise the prefix. Example: fec0:0:0:1::1234/64 The first 64-bits of the address fec0:0:0:1 forms the address prefix.
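The prefix notation above can be checked with Python's ipaddress module, using the example address from the text:

```python
import ipaddress

# The example from the text: address fec0:0:0:1::1234 with a 64-bit prefix.
iface = ipaddress.ip_interface("fec0:0:0:1::1234/64")

print(iface.network)            # the 64-bit address prefix: fec0:0:0:1::/64
print(iface.network.prefixlen)  # 64
```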
Table 18

| 80 bits | 16 bits | 32 bits      |
| zeros   | FFFF    | IPv4 address |

Example: ::ffff:192.168.0.1
D.1.4.3 Aggregatable Global Unicast Addresses
The global unicast addresses are globally unique IPv6 addresses. This address format is very well defined in the RFC 2374 (An IPv6 Aggregatable Global Unicast Address Format). The format is:

Table 19

| 3 bits | 13 bits | 8 bits | 24 bits | 16 bits | 64 bits      |
| FP     | TLA ID  | RES    | NLA ID  | SLA ID  | Interface ID |

where FP = Format prefix. Value of this is “001” for Aggregatable Global unicast addresses.
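The IPv4-mapped layout in Table 18 (80 zero bits, 16 one bits, then the 32-bit IPv4 address) can be verified with Python's ipaddress module:

```python
import ipaddress

# The example address from the text, in the layout of Table 18.
mapped = ipaddress.IPv6Address("::ffff:192.168.0.1")

print(mapped.ipv4_mapped)          # the embedded IPv4 address: 192.168.0.1
print(mapped.packed[10:12].hex())  # the 16-bit FFFF marker: ffff
```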
“FF” at the beginning of the address identifies the address as a multicast address. The “flags” field is a set of 4 flags “000T”. The higher order 3 bits are reserved and must be zero. The last bit ‘T’ indicates whether it is permanently assigned or not. A value of zero indicates that it is permanently assigned otherwise it is a temporary assignment. The “scop” field is a 4-bit field which is used to limit the scope of the multicast group.
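The flags and scop nibbles described above can be extracted in Python; ff02::1 (the all-nodes, link-local multicast address) serves as the example:

```python
import ipaddress

addr = ipaddress.IPv6Address("ff02::1")
b = addr.packed

assert b[0] == 0xFF         # "FF" marks a multicast address
flags = (b[1] >> 4) & 0xF   # the "000T" flags field
scope = b[1] & 0xF          # the 4-bit scop field

print(flags, scope)         # -> 0 2 (permanently assigned, link-local scope)
```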
D.3.1 Enabling IPv6 on Red Hat Linux
Add the following lines to /etc/sysconfig/network:
NETWORKING_IPV6=yes    # Enable global IPv6 initialization
IPV6FORWARDING=no      # Disable global IPv6 forwarding
IPV6_AUTOCONF=no       # Disable global IPv6 autoconfiguration
IPV6_AUTOTUNNEL=no     # Disable automatic IPv6 tunneling
D.3.5 Configuring a Channel Bonding Interface with Persistent IPv6 Addresses on SUSE Configure the following parameters in /etc/sysconfig/network/ifcfg-bond0: BOOTPROTO=static BROADCAST=10.0.2.255 IPADDR=10.0.2.10 NETMASK=255.255.0.0 NETWORK=0.0.2.
E Using Serviceguard Manager HP Serviceguard Manager is a web-based, HP System Management Homepage (HP SMH) tool that replaces the functionality of the earlier Serviceguard management tools. Serviceguard Manager allows you to monitor, administer and configure a Serviceguard cluster from any system with a supported web browser. The Serviceguard Manager Main Page provides you with a summary of the health of the cluster including the status of each node and its packages.
1. Enter the standard URL http://<hostname>:2301/. For example, http://clusternode1.cup.hp.com:2301/
2. When the System Management Homepage login screen appears, enter your login credentials and click Sign In. The System Management Homepage for the selected server appears.
3. From the Serviceguard Cluster box, click the name of the cluster.
NOTE: If a cluster is not yet configured, you will not see the Serviceguard Cluster section on this screen.
NOTE: Serviceguard Manager can be launched by HP Systems Insight Manager version 5.10 or later if Serviceguard Manager is installed on an HP Systems Insight Manager Central Management Server. For a Serviceguard A.11.19 cluster, Systems Insight Manager will attempt to launch Serviceguard Manager B.02.00 from one of the nodes in the cluster; for a Serviceguard A.11.18 cluster, Systems Insight Manager will attempt to launch Serviceguard Manager B.01.01 from one of the nodes in the cluster.
F Maximum and Minimum Values for Parameters
Table 23 shows the range of possible values for cluster configuration parameters.

Table 23 Minimum and Maximum Values of Cluster Configuration Parameters

Cluster Parameter: Member Timeout
Minimum Value: See MEMBER_TIMEOUT under “Cluster Configuration Parameters” in Chapter 4.
Maximum Value: See MEMBER_TIMEOUT under “Cluster Configuration Parameters” in Chapter 4.
G Monitoring Script for Generic Resources Monitoring scripts are the scripts written by an end-user and must contain the core logic to monitor a resource and set the status of a generic resource. These scripts are started as a part of the package start. • You can set the status/value of a simple/extended resource respectively using the cmsetresource(1m) command. • You can define the monitoring interval in the script.
For resources of evaluation_type: before_package_start • Monitoring scripts can also be launched outside of the Serviceguard environment, init, rc scripts, etc. (Serviceguard does not monitor them). • The monitoring scripts for all the resources in a cluster of type before_package_start can be configured in a single multi-node package by using the services functionality and any packages that require the resources can mention the generic resource name in their package configuration file.
generic_resource_evaluation_type before_package_start

generic_resource_name lan1
generic_resource_evaluation_type before_package_start

dependency_name generic_resource_monitors
dependency_condition generic_resource_monitors = up
dependency_location same_node

Thus, the monitoring scripts for all the generic resources of type before_package_start are configured in one single multi-node package and any package that requires this generic resource can just configure the generic resource name.
# * --------------------------------* # * The following utility functions are sourced in from $SG_UTILS * # * ($SGCONF/scripts/mscripts/utils.sh) and available for use: * # * * # * sg_log * # * * # * By default, only log messages with a log level of 0 will * # * be output to the log file.
{ sg_log 5 "start_command" # ADD your service start steps here return 0 } ######################################################################### # # stop_command # # This function should define actions to take when the package halts # # ######################################################################### function stop_command { sg_log 5 "stop_command" # ADD your halt steps here exit 1 } ################ # main routine ################ sg_log 5 "customer defined monitor script" #####################
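The core monitor-and-set logic that the shell template above wraps can be sketched in Python. This is an illustration, not a Serviceguard deliverable: the command runner is injectable so the cmsetresource call (built with the -r and -s flags shown earlier in this book) can be stubbed out for testing, and the function name is hypothetical.

```python
import subprocess

def monitor_once(resource, check, run=subprocess.run):
    """Probe one generic resource and set its status via cmsetresource.

    `check` is a caller-supplied function returning True when the
    resource is healthy; `run` executes the resulting command, so a
    test can substitute a stub instead of the real cmsetresource.
    """
    status = "up" if check() else "down"
    run(["cmsetresource", "-r", resource, "-s", status])
    return status

# In a real monitoring script this call would sit inside a loop with a
# configurable sleep, mirroring the monitoring interval noted above.
```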
H HP Serviceguard Toolkit for Linux The HP Serviceguard Toolkits such as, Contributed Toolkit, NFS, EDB PPAS, Sybase, and Oracle Toolkits are used for the integration of applications such as, Apache, MySQL, NFS, Oracle database, EDB PPAS, Sybase, and so on with the Serviceguard for Linux environment. The Toolkit documentation describes how to customize the package for your needs. For more information, see the Release Notes of these toolkits at http://www.hp.com/go/linux-serviceguard-docs.
Index A Access Control Policies, 158 active node, 20 adding a package to a running cluster, 242 adding cluster nodes advance planning, 132 adding nodes to a running cluster, 212 adding packages on a running cluster, 198 administration adding nodes to a running cluster, 212 halting a package, 218 halting the entire cluster, 213 moving a package, 219 of packages and services, 217 of the cluster, 211 reconfiguring a package while the cluster is running, 241 reconfiguring a package with the cluster offline, 242
cluster node parameter, 91, 93 defined, 38 dynamic re-formation, 40 heartbeat subnet parameter, 95 initial configuration of the cluster, 38 main functions, 38 maximum configured packages parameter, 104 member timeout parameter, 99 monitored non-heartbeat subnet, 97 network polling interval parameter, 100, 104 planning the configuration, 91 quorum server parameter, 93 testing, 250 cluster node parameter in cluster manager configuration, 91, 93 cluster parameters initial configuration, 38 cluster re-formation
planning for, 107 explanations package parameters, 174 F failback policy used by package manager, 50 FAILBACK_POLICY parameter used by package manager, 50 failover controlling the speed in applications, 266 defined, 20 failover behavior in packages, 108 failover package, 43, 170 failover policy used by package manager, 47 FAILOVER_POLICY parameter used by package manager, 47 failure kinds of responses, 75 network communication, 78 response to hardware failures, 76 responses to package and service failures,
integrating HA applications with Serviceguard, 277 introduction Serviceguard at a glance, 19 understanding Serviceguard hardware, 25 understanding Serviceguard software, 33 IP in sample package control script, 236 IP address adding and deleting in packages, 63 for nodes and packages, 62 hardware planning, 82, 85 portable, 62 reviewing for packages, 254 switching, 45, 46, 70 IP_MONITOR defined, 102 iSCSI, 29 J JFS, 267 K kernel hang, and TOC, 75 safety timer, 34 kernel consistency in cluster configuration,
for clusters, 140 networking redundant subnets, 81 networks binding to IP addresses, 271 binding to port addresses, 271 IP addresses and naming, 269 node and package IP addresses, 62 packages using IP addresses, 270 supported types in Serviceguard, 25 writing network applications as HA services, 266 no cluster lock choosing, 42 node basic concepts, 25 halt (TOC), 75 in Serviceguard cluster, 19 IP addresses, 62 timeout and TOC example, 76 node types active, 20 primary, 20 NODE_FAIL_FAST_ENABLED effect of set
cluster manager configuration, 91 disk I/O information, 83 for expansion, 107 hardware configuration, 81 high availability objectives, 79 overview, 79 package configuration, 104 power, 84 quorum server, 85 SPU information, 81 volume groups and physical volumes, 85 worksheets, 83 planning and documenting an HA cluster, 79 planning for cluster expansion, 79 planning worksheets blanks, 281 point of failure in networking, 26 POLLING_TARGET defined, 103 ports dual and single aggregated, 65 power planning power s
in sample package control script, 236 Serviceguard install, 135 introduction, 19 Serviceguard at a Glance, 19 Serviceguard behavior in LAN failure, 25 in monitored resource failure, 25 in software failure, 25 Serviceguard commands to configure a package, 234 Serviceguard Manager, 22 overview, 22 Serviceguard software components figure, 33 serviceguard WBEM provider, 37 shared disks planning, 83 shutdown and startup defined for applications, 266 single point of failure avoiding, 19 single-node operation, 166
in package control script, 236 VGChange, 191 volume group for cluster lock, 40, 41 planning, 85 volume group and physical volume planning, 85 W WEIGHT_DEFAULT defined, 103 WEIGHT_NAME defined, 103 What is Serviceguard?, 19 worksheet blanks, 281 cluster configuration, 104, 283 hardware configuration, 83, 281 package configuration, 283, 284 power supply configuration, 84, 281, 282 use in planning, 79 312 Index