Managing HP Serviceguard for Linux, Seventh Edition
Manufacturing Part Number: B9903-90054
July 2007
Legal Notices The information in this document is subject to change without notice. Hewlett-Packard makes no warranty of any kind with regard to this manual, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose. Hewlett-Packard shall not be held liable for errors contained herein or for direct, indirect, special, incidental, or consequential damages in connection with the furnishing, performance, or use of this material.
Trademark Notices HP Serviceguard® is a registered trademark of Hewlett-Packard Company, and is protected by copyright. NIS™ is a trademark of Sun Microsystems, Inc. UNIX® is a registered trademark of The Open Group. Linux® is a registered trademark of Linus Torvalds. Red Hat® is a registered trademark of Red Hat Software, Inc. SUSE® is a registered trademark of SUSE AG, a Novell Business.
Contents
1. Serviceguard for Linux at a Glance
   What is Serviceguard for Linux?
   Failover
   Using Serviceguard Manager
   Monitoring Clusters with Serviceguard Manager
How Packages Run
   What Makes a Package Run?
   Before the Control Script Starts
   During Run Script Execution
Disk I/O Information  97
Hardware Configuration Worksheet  99
Power Supply Planning  100
   Power Supply Configuration Worksheet  100
Cluster Lock Planning
Implementing Channel Bonding (SUSE)  160
   Restarting Networking  161
Creating the Logical Volume Infrastructure  163
   Displaying Disk Information  165
   Creating Partitions
Before You Start
   cmmakepkg Examples
   Next Step
Editing the Configuration File
Creating the Package Control Script
Verifying the Package Configuration
Distributing the Configuration
Reconfiguring a Package
   Reconfiguring a Package on a Running Cluster
System Administration Errors
Package Movement Errors
Node and Network Failures
Quorum Server Messages
Lock LUN Messages
C. Integrating HA Applications with Serviceguard
   Checklist for Integrating HA Applications  344
   Defining Baseline Application Behavior on a Single System  344
   Integrating HA Applications in Multiple Systems  345
   Testing the Cluster  346
D. Blank Planning Worksheets
Printing History
Table 1
  Printing Date    Part Number    Edition
  November 2001    B9903-90005    First
  November 2002    B9903-90012    First
  December 2002    B9903-90012    Second
  November 2003    B9903-90033    Third
  February 2005    B9903-90043    Fourth
  June 2005        B9903-90046    Fifth
  August 2006      B9903-90050    Sixth
  July 2007        B9903-90054    Seventh
The last printing date and part number indicate the current edition, which applies to the A.11.18 version of HP Serviceguard for Linux.
Preface This guide describes how to configure and manage Serviceguard for Linux on HP ProLiant and HP Integrity servers under the Linux operating system. It is intended for experienced Linux system administrators. (For Linux system administration tasks that are not specific to Serviceguard, use the system administration documentation and manpages for your distribution of Linux.)
Related Publications • Appendix C, “Integrating HA Applications with Serviceguard,” presents suggestions for integrating your existing applications with Serviceguard for Linux. • Appendix D, “Blank Planning Worksheets,” contains a set of empty worksheets for preparing a Serviceguard configuration. The following documents contain additional useful information: • HP Serviceguard for Linux Version A.11.18 Release Notes • HP Serviceguard Quorum Server Version A.02.
Serviceguard for Linux at a Glance 1 Serviceguard for Linux at a Glance This chapter introduces Serviceguard for Linux and shows where to find different kinds of information in this book.
Serviceguard for Linux at a Glance What is Serviceguard for Linux? What is Serviceguard for Linux? Serviceguard for Linux allows you to create high availability clusters of HP ProLiant and HP Integrity servers. A high availability computer system allows application services to continue in spite of a hardware or software failure. Highly available systems protect users from software failures as well as from failure of a system processing unit (SPU), disk, or local area network (LAN) component.
Serviceguard for Linux at a Glance What is Serviceguard for Linux? In the figure, node 1 (one of two SPU's) is running package A, and node 2 is running package B. Each package has a separate group of disks associated with it, containing data needed by the package's applications, and a copy of the data. Note that both nodes are physically connected to disk arrays. However, only one node at a time may access the data for a given group of disks.
Serviceguard for Linux at a Glance What is Serviceguard for Linux? Failover Under normal conditions, a fully operating Serviceguard cluster simply monitors the health of the cluster's components while the packages are running on individual nodes. Any host system running in the Serviceguard cluster is called an active node. When you create the package, you specify a primary node and one or more adoptive nodes.
Serviceguard for Linux at a Glance What is Serviceguard for Linux? Serviceguard is designed to work in conjunction with other high availability products, such as disk arrays, which use various RAID levels for data protection; and HP-supported uninterruptible power supplies (UPS), such as HP PowerTrust, which eliminates failures related to power outage. These products are highly recommended along with Serviceguard to provide the greatest degree of availability.
Serviceguard for Linux at a Glance Using Serviceguard Manager Using Serviceguard Manager Serviceguard Manager is a web-based, HP System Management Homepage (HP SMH) plug-in application that replaces the functionality of the earlier Serviceguard management tools. HP Serviceguard Manager allows you to monitor, administer and configure a Serviceguard A.11.18 cluster from any system with a web browser: • Monitor: you can see properties, status, and alerts of cluster, nodes, and packages.
Serviceguard for Linux at a Glance Using Serviceguard Manager Configuring Clusters with Serviceguard Manager You can configure clusters and legacy packages in Serviceguard Manager; modular packages must be configured by means of Serviceguard commands (see “How the Package Manager Works” on page 46; “Configuring Packages and Their Services” on page 191; and “Configuring a Legacy Package” on page 262). You must have root (UID=0) access to the cluster nodes.
Serviceguard for Linux at a Glance Configuration Roadmap Configuration Roadmap This manual presents the tasks you need to perform in order to create a functioning HA cluster using Serviceguard. These tasks are shown in Figure 1-3. Figure 1-3 Tasks in Configuring a Serviceguard Cluster The tasks in Figure 1-3 are covered in step-by-step detail in chapters 4 through 7. It is strongly recommended that you gather all the data that is needed for configuration before you start.
Understanding Hardware Configurations for Serviceguard for Linux 2 Understanding Hardware Configurations for Serviceguard for Linux This chapter gives a broad overview of how the server hardware components operate with Serviceguard for Linux. The following topics are presented: • Redundancy of Cluster Components • Redundant Network Components • Redundant Disk Storage • Redundant Power Supplies Refer to the next chapter for information about Serviceguard software components.
Understanding Hardware Configurations for Serviceguard for Linux Redundancy of Cluster Components Redundancy of Cluster Components In order to provide a high level of availability, a typical cluster uses redundant system components, for example two or more SPUs and two or more independent disks. Redundancy eliminates single points of failure. In general, the more redundancy, the greater your access to applications, data, and supportive services in the event of a failure.
Understanding Hardware Configurations for Serviceguard for Linux Redundant Network Components Redundant Network Components To eliminate single points of failure for networking, each subnet accessed by a cluster node is required to have redundant network interfaces. Redundant cables are also needed to protect against cable failures. Each interface card is connected to a different cable and hub or switch. Network interfaces are allowed to share IP addresses through a process known as channel bonding.
Understanding Hardware Configurations for Serviceguard for Linux Redundant Network Components In Linux configurations, the use of symmetrical LAN configurations is strongly recommended, with the use of redundant hubs or switches to connect Ethernet segments. The software bonding configurations also should be identical on both nodes, with the active interfaces being connected to the same hub or switch.
Understanding Hardware Configurations for Serviceguard for Linux Redundant Disk Storage Redundant Disk Storage Each node in a cluster has its own root disk, but each node may also be physically connected to several other disks in such a way that more than one node can obtain access to the data and programs associated with a package it is configured for. This access is provided by the Logical Volume Manager (LVM).
Understanding Hardware Configurations for Serviceguard for Linux Redundant Disk Storage disk failure occurs on one node, the monitor will cause the package to fail, with the potential to fail over to a different node on which the same disks are available. Sample Disk Configurations Figure 2-2 shows a two node cluster. Each node has one root disk which is mirrored and one package for which it is the primary node.
Understanding Hardware Configurations for Serviceguard for Linux Redundant Power Supplies Redundant Power Supplies You can extend the availability of your hardware by providing battery backup to your nodes and disks. HP-supported uninterruptible power supplies (UPS) can provide this protection from momentary power loss. Disks should be attached to power circuits in such a way that disk array copies are attached to different power sources.
Understanding Serviceguard Software Components 3 Understanding Serviceguard Software Components This chapter gives a broad overview of how the Serviceguard software components work.
Understanding Serviceguard Software Components Serviceguard Architecture Serviceguard Architecture The following figure shows the main software components used by Serviceguard for Linux. This chapter discusses these components in some detail.
Understanding Serviceguard Software Components Serviceguard Architecture Each of these daemons logs to the Linux system logging files. The quorum server daemon logs to a user-specified log file, for example /usr/local/qs/log/qs.log on Red Hat or /var/log/qs/qs.log on SUSE, and cmomd logs to /usr/local/cmom/log/cmomd.log on Red Hat or /var/log/cmom/log/cmomd.log on SUSE.
Understanding Serviceguard Software Components Serviceguard Architecture NOTE The file cmcluster.conf contains the mappings that resolve the symbolic references to $SGCONF, $SGROOT, $SGLBIN etc., used in this manual. See “Understanding the Location of Serviceguard Files” on page 140 for details. NOTE Two of the central components of Serviceguard—Package Manager, and Cluster Manager—run as parts of the cmcld daemon. This daemon runs at priority 94 and is in the SCHED_RR class.
Understanding Serviceguard Software Components Serviceguard Architecture Cluster Object Manager Daemon: cmomd This daemon is responsible for providing information about the cluster to clients—external products or tools that depend on knowledge of the state of cluster objects. Clients send queries to cmomd and receive responses from it. Clients send queries to the object manager and receive responses from it (this communication is done indirectly, through a Serviceguard API).
Understanding Serviceguard Software Components Serviceguard Architecture All members of the cluster initiate and maintain a connection to the quorum server. If the quorum server dies, the Serviceguard nodes will detect this and then periodically try to reconnect to the quorum server until it comes back up. If there is a cluster reconfiguration while the quorum server is down and there is a partition in the cluster that requires tie-breaking, the reconfiguration will fail.
Understanding Serviceguard Software Components How the Cluster Manager Works How the Cluster Manager Works The cluster manager is used to initialize a cluster, to monitor the health of the cluster, to recognize node failure if it should occur, and to regulate the re-formation of the cluster when a node joins or leaves the cluster. The cluster manager operates as a daemon process that runs on each node.
Understanding Serviceguard Software Components How the Cluster Manager Works as before. In such cases, packages do not halt or switch, though the application may experience a slight performance impact during the re-formation. If heartbeat and data are sent over the same LAN subnet, data congestion may cause Serviceguard to miss heartbeats during the period of the heartbeat timeout and initiate a cluster re-formation that would not be needed if the congestion had not occurred.
Understanding Serviceguard Software Components How the Cluster Manager Works During startup, the cluster manager software checks to see if all nodes specified in the startup command are valid members of the cluster, are up and running, are attempting to form a cluster, and can communicate with each other. If they can, then the cluster manager forms the cluster. Automatic Cluster Startup An automatic cluster startup occurs any time a node reboots and joins the cluster.
Understanding Serviceguard Software Components How the Cluster Manager Works Cluster Quorum to Prevent Split-Brain Syndrome In general, the algorithm for cluster re-formation requires a cluster quorum of a strict majority (that is, more than 50%) of the nodes previously running. If both halves (exactly 50%) of a previously running cluster were allowed to re-form, there would be a split-brain situation in which two instances of the same cluster were running.
Understanding Serviceguard Software Components How the Cluster Manager Works The operation of the lock LUN is shown in Figure 3-2. Figure 3-2 Lock LUN Operation Serviceguard periodically checks the health of the lock LUN and writes messages to the syslog file if the disk fails the health check. This file should be monitored for early detection of lock disk problems. Use of the Quorum Server as a Cluster Lock The cluster lock in Linux can also be implemented by means of a quorum server.
Understanding Serviceguard Software Components How the Cluster Manager Works The operation of the quorum server is shown in Figure 3-3. When there is a loss of communication between node 1 and node 2, the quorum server chooses one node (in this example, node 2) to continue running in the cluster. The other node halts. Figure 3-3 Quorum Server Operation Types of Quorum Server Configuration The quorum server can be configured as a Serviceguard package or as a stand alone installation.
Understanding Serviceguard Software Components How the Cluster Manager Works If the first cluster created, in a group of clusters, needs a quorum device, that cluster must use a stand alone quorum server or lock LUN. Figure 3-4 illustrates quorum server use across four clusters. Figure 3-4 Quorum Server to Cluster Distribution HP recommends that two clusters running quorum server packages do not provide quorum services for each other.
Understanding Serviceguard Software Components How the Package Manager Works How the Package Manager Works Packages are the means by which Serviceguard starts and halts configured applications. A package is a collection of services, disk volumes and IP addresses that are managed by Serviceguard to ensure they are available. Each node in the cluster runs an instance of the package manager; the package manager residing on the cluster coordinator is known as the package coordinator.
Understanding Serviceguard Software Components How the Package Manager Works Failover Packages A failover package starts up on an appropriate node when the cluster starts. A package failover takes place when the package coordinator initiates the start of a package on a new node. A package failover involves both halting the existing package (in the case of a service, network, or resource failure), and starting the new instance of the package.
Understanding Serviceguard Software Components How the Package Manager Works Deciding When and Where to Run and Halt Failover Packages The package configuration file assigns a name to the package and includes a list of the nodes on which the package can run. Failover packages list the nodes in order of priority (i.e., the first node in the list is the highest priority node). In addition, failover packages’ files contain three parameters that determine failover behavior.
Understanding Serviceguard Software Components How the Package Manager Works Figure 3-6 Before Package Switching Figure 3-7 shows the condition where Node 1 has failed and Package 1 has been transferred to Node 2. Package 1's IP address was transferred to Node 2 along with the package. Package 1 continues to be available and is now running on Node 2. Also note that Node 2 can now access both Package1’s disk and Package2’s disk.
Understanding Serviceguard Software Components How the Package Manager Works Figure 3-7 After Package Switching Failover Policy The Package Manager selects a node for a failover package to run on based on the priority list included in the package configuration file together with the failover_policy parameter, also in the configuration file.
Understanding Serviceguard Software Components How the Package Manager Works If you use min_package_node as the value for the failover policy, the package will start up on the node that is currently running the fewest other packages. (Note that this does not mean the lightest load; the only thing that is checked is the number of packages currently running on the node.)
Understanding Serviceguard Software Components How the Package Manager Works If a failure occurs, any package would fail over to the node containing fewest running packages, as in Figure 3-9, which shows a failure on node 2: Figure 3-9 Rotating Standby Configuration after Failover NOTE Using the min_package_node policy, when node 2 is repaired and brought back into the cluster, it will then be running the fewest packages, and thus will become the new standby node.
Understanding Serviceguard Software Components How the Package Manager Works Figure 3-10 CONFIGURED_NODE Policy Packages after Failover If you use configured_node as the failover policy, the package will start up on the highest priority node in the node list, assuming that the node is running as a member of the cluster. When a failover occurs, the package will move to the next highest priority node in the list that is available.
Understanding Serviceguard Software Components How the Package Manager Works Figure 3-11 Automatic Failback Configuration before Failover
Table 3-2 Node Lists in Sample Cluster
  Package Name    NODE_NAME List    FAILOVER POLICY    FAILBACK POLICY
  pkgA            node1, node4      CONFIGURED_NODE    AUTOMATIC
  pkgB            node2, node4      CONFIGURED_NODE    AUTOMATIC
  pkgC            node3, node4      CONFIGURED_NODE    AUTOMATIC
Node1 panics, and after the cluster reforms, pkgA starts running on node4:
Figure 3-12 Automatic Failback Configuration
Understanding Serviceguard Software Components How the Package Manager Works After rebooting, node 1 rejoins the cluster. At that point, pkgA will be automatically stopped on node 4 and restarted on node 1. Figure 3-13 Automatic Failback Configuration After Restart of Node 1 NOTE Setting the failback_policy to automatic can result in a package failback and application outage during a critical production period.
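As a brief illustration of how this behavior is specified (the package and node names here are hypothetical), a modular package configuration file controlling the failover and failback behavior described above would contain entries such as:

package_name       pkgA
node_name          node1
node_name          node4
failover_policy    configured_node
failback_policy    automatic

With failback_policy set to manual instead, the package would remain on node4 after node1 rejoins the cluster, until an administrator moves it back.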
Understanding Serviceguard Software Components How the Package Manager Works For full details of the current parameters and their default values, see Chapter 6, “Configuring Packages and Their Services,” on page 191, and the package configuration file template itself. Choosing Package Failover Behavior To determine failover behavior, you can define a package failover policy that governs which nodes will automatically start up a package that is not running.
Understanding Serviceguard Software Components How the Package Manager Works Table 3-3 Package Failover Behavior (Continued) Options in Serviceguard Manager Switching Behavior Parameters in Configuration File Package is automatically halted and restarted on its primary node if the primary node is available and the package is running on a non-primary node. • Failback policy set to automatic. • failback_policy set to automatic.
Understanding Serviceguard Software Components How Packages Run How Packages Run Packages are the means by which Serviceguard starts and halts configured applications. Failover packages are also units of failover behavior in Serviceguard. A package is a collection of services, disk volumes and IP addresses that are managed by Serviceguard to ensure they are available. There can be a maximum of 150 packages per cluster and a total of 900 services per cluster.
Understanding Serviceguard Software Components How Packages Run nodes, or that the package has a dependency that is not being met. When a package has failed on one node and is enabled to switch to another node, it will start up automatically in a new location where its dependencies are met. This process is known as package switching, or remote switching. A failover package starts on the first available node in its configuration file; by default, it fails over to the next available one in the list.
Understanding Serviceguard Software Components How Packages Run NOTE This diagram applies specifically to legacy packages. Differences for modular scripts are called out below. Figure 3-14 Legacy Package Time Line Showing Important Events The following are the most important moments in a package’s life: 1. Before the control script starts. (For modular packages, this is the master control script.) 2. During run script execution.
Understanding Serviceguard Software Components How Packages Run Before the Control Script Starts First, a node is selected. This node must be in the package’s node list, it must conform to the package’s failover policy, and any resources required by the package must be available on the chosen node. One resource is the subnet that is monitored for the package. If the subnet is not available, the package cannot start on this node. Another type of resource is a dependency on another package.
Understanding Serviceguard Software Components How Packages Run Figure 3-15 Legacy Package Time Line At any step along the way, an error will result in the script exiting abnormally (with an exit code of 1). For example, if a package service is unable to be started, the control script will exit with an error. NOTE This diagram is specific to legacy packages. Modular packages also run external scripts and “pre-scripts” as explained above.
Understanding Serviceguard Software Components How Packages Run the package is running. If a number of Restarts is specified for a service in the package control script, the service may be restarted if the restart count allows it, without re-running the package run script. Normal and Abnormal Exits from the Run Script Exit codes on leaving the run script determine what happens to the package next.
Understanding Serviceguard Software Components How Packages Run legacy package; for more information about configuring services in modular packages, see the discussion starting on page 209, and the comments in the package configuration template file.
Understanding Serviceguard Software Components How Packages Run Package halting normally means that the package halt script executes (see the next section). However, if a failover package’s configuration has the service_fail_fast_enabled flag set to yes for the service that fails, then the node will halt as soon as the failure is detected. If this flag is not set, the loss of a service will result in halting the package gracefully by running the halt script.
Understanding Serviceguard Software Components How Packages Run During Halt Script Execution Once the package manager has detected the failure of a service or package that a failover package depends on, or when the cmhaltpkg command has been issued for a particular package, the package manager launches the halt script. That is, a package’s control script or master control script is executed with the stop parameter. This script carries out the following steps (also shown in Figure 3-16): 1.
Understanding Serviceguard Software Components How Packages Run messages are written to a log file. For legacy packages, this is in the same directory as the run script and has the same name as the run script and the extension .log. For modular packages, the pathname is determined by the script_log_file parameter in the package configuration file (see page 204). Normal starts are recorded in the log, together with error messages or warnings related to halting the package.
Understanding Serviceguard Software Components How Packages Run Table 3-4 Error Conditions and Package Movement for Failover Packages
(Columns: Package Error Condition / Node Failfast Enabled / Service Failfast Enabled / Linux Status on Primary after Error / Halt Script Runs after Error or Exit / Package Allowed to Run on Primary Node after Error / Package Allowed to Run on Alternate Node)
Service Failure / YES / YES / system reset / No / N/A (system reset) / Yes
Service Failure / NO / YES / system reset / No / N/A (system reset) / Yes
Understanding Serviceguard Software Components How Packages Run Table 3-4 Error Conditions and Package Movement for Failover Packages (continued)
Halt Script Timeout / YES / Either Setting / system reset / N/A / N/A (system reset) / Yes, unless the timeout happened after the cmhaltpkg command was executed.
Understanding Serviceguard Software Components How Packages Run Table 3-4 Error Conditions and Package Movement for Failover Packages (continued)
Loss of Monitored Resource / NO / Either Setting / Running / Yes / Yes, if the resource is not a deferred resource. / Yes
Understanding Serviceguard Software Components How the Network Manager Works How the Network Manager Works The purpose of the network manager is to detect and recover from network card and cable failures so that network services remain highly available to clients. In practice, this means assigning IP addresses for each package to the primary LAN interface card on the node where the package is running and monitoring the health of all interfaces.
Understanding Serviceguard Software Components How the Network Manager Works In addition to the stationary IP address, you normally assign one or more unique IP addresses to each package. The package IP address is assigned to the primary LAN interface card. The IP addresses associated with a package are called relocatable IP addresses (also known as IP aliases, package IP addresses or floating IP addresses) because the addresses can actually move from one cluster node to another.
Understanding Serviceguard Software Components How the Network Manager Works Bonding of LAN Interfaces On the local node, several LAN interfaces can be grouped together in a process known in Linux as channel bonding. In the bonded group, one interface is used to transmit and receive data, while the others are available as backups. If one interface fails, another interface in the bonded group takes over.
Understanding Serviceguard Software Components How the Network Manager Works address and MAC address. In this example, the aggregated ports are collectively known as bond0, and this is the name by which the bond is known during cluster configuration. Figure 3-18 shows a bonded configuration using redundant hubs with a crossover cable.
Understanding Serviceguard Software Components How the Network Manager Works After the failure of a card, messages are still carried on the bonded LAN and are received on the other node, but now eth1 has become active in bond0 on node1. This situation is shown in Figure 3-19.
Understanding Serviceguard Software Components How the Network Manager Works Bonding for Load Balancing It is also possible to configure bonds in load balancing mode, which allows all slaves to transmit data in parallel, in an active/active arrangement. In this case, high availability is provided by the fact that the bond still continues to function (with less throughput) if one of the component LANs should fail.
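As a rough sketch only (the file names, addresses, and option values below are illustrative; the supported procedure is described under "Implementing Channel Bonding" in Chapter 5 and in your distribution's documentation), a Red Hat-style active/standby bond is typically defined by loading the bonding driver and making the physical interfaces slaves of bond0:

# /etc/modprobe.conf
alias bond0 bonding
options bond0 miimon=100 mode=1    # mode=1 = active-backup

# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
IPADDR=192.168.1.10
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none

# /etc/sysconfig/network-scripts/ifcfg-eth0 (and similarly for each slave interface)
DEVICE=eth0
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none

A load-balancing bond would use a different mode value (for example mode=0, balance-rr) in place of the active-backup mode shown here.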
Understanding Serviceguard Software Components How the Network Manager Works Remote Switching A remote switch (that is, a package switch) involves moving packages and their associated IP addresses to a new system. The new system must already have the same subnetwork configured and working properly, otherwise the packages will not be started. With remote switching, TCP connections are lost. TCP applications must reconnect to regain connectivity; this is not handled automatically.
Understanding Serviceguard Software Components Volume Managers for Data Storage Volume Managers for Data Storage A volume manager is a tool that lets you create units of disk storage that are more flexible than individual disk partitions. These units can be used on single systems or in high availability clusters. HP Serviceguard for Linux uses the Linux Logical Volume Manager (LVM) which creates redundant storage groups. This section provides an overview of volume management with LVM.
Understanding Serviceguard Software Components Volume Managers for Data Storage Examples of Storage on Smart Arrays Figure 3-21 shows an illustration of storage configured on a Smart MSA500 storage. Physical disks are configured by an array utility program into logical units or LUNs which are then seen by the operating system. Figure 3-21 Physical Disks Combined into LUNs NOTE LUN definition is normally done using utility programs provided by the disk array manufacturer.
Understanding Serviceguard Software Components Volume Managers for Data Storage Figure 3-22 shows LUNs configured with a Smart Array cluster storage with single pathways to the data.
Understanding Serviceguard Software Components Volume Managers for Data Storage Finally, the Smart Array LUNs are configured into volume groups as shown in Figure 3-23. Figure 3-23 Smart Array LUNs Configured in Volume Groups Figure 3-24 shows an illustration of storage configured on a disk array. Physical disks are configured by an array utility program into logical units or LUNs which are then seen by the operating system.
Understanding Serviceguard Software Components Volume Managers for Data Storage NOTE LUN definition is normally done using utility programs provided by the disk array manufacturer. Since arrays vary considerably, make sure you read the documentation that accompanies your storage unit. Multipathing and LVM How multipathing is implemented depends on the storage sub-system attached to the cluster and the HBA in the servers. Check the documentation that accompanied your storage sub-system and HBA.
Understanding Serviceguard Software Components Volume Managers for Data Storage NOTE Do not use the MD configuration parameters in the legacy package control script for MD multipath. Instead, for multipath only, MD activation can be done at system boot. Monitoring Disks Each package configuration includes information about the disks that are to be activated by the package at startup. If monitoring is used, the health of the disks is checked at package startup.
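As a minimal sketch of the sequence this section describes (the device and volume names are illustrative; the step-by-step procedure is given under "Creating the Logical Volume Infrastructure" in Chapter 5), LUNs presented by the array are turned into shared LVM storage with commands such as:

pvcreate /dev/sdc1
pvcreate /dev/sdd1
vgcreate vg_pkgA /dev/sdc1 /dev/sdd1
lvcreate -L 10G -n lv_data vg_pkgA
mkfs -t ext3 /dev/vg_pkgA/lv_data

The resulting volume group is then listed in the package configuration so that Serviceguard activates it on whichever node runs the package.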
Understanding Serviceguard Software Components Responses to Failures Responses to Failures HP Serviceguard responds to different kinds of failures in specific ways. For most hardware failures, the response is not user-configurable, but for package and service failures, you can choose the system’s response, within limits. Reboot When a Node Fails The most dramatic response to a failure in a Serviceguard cluster is a system reboot.
Understanding Serviceguard Software Components Responses to Failures 1. The node tries to reform the cluster. 2. If the node cannot get a quorum (if it cannot get the cluster lock) then 3. The node halts (system reset). Example Situation. Assume a two-node cluster, with Package1 running on SystemA and Package2 running on SystemB. Volume group vg01 is exclusively activated on SystemA; volume group vg02 is exclusively activated on SystemB. Package IP addresses are assigned to SystemA and SystemB respectively.
Understanding Serviceguard Software Components Responses to Failures For more information on cluster failover, see the white paper Optimizing Failover Time in a Serviceguard Environment at http://www.docs.hp.com->High Availability->Serviceguard->White Papers.
Understanding Serviceguard Software Components Responses to Failures Responses to Package and Service Failures In the default case, the failure of the package or of a service within a package causes the package to shut down by running the control script with the 'stop' parameter, and then restarting the package on an alternate node. A package will also fail if it is configured to have a dependency on another package, and that package fails.
Understanding Serviceguard Software Components Responses to Failures Network Communication Failure An important element in the cluster is the health of the network itself. As it continuously monitors the cluster, each node listens for heartbeat messages from the other nodes confirming that all nodes are able to communicate with each other.
Planning and Documenting an HA Cluster 4 Planning and Documenting an HA Cluster Building a Serviceguard cluster begins with a planning phase in which you gather and record information about all the hardware and software components of the configuration.
Planning and Documenting an HA Cluster General Planning General Planning A clear understanding of your high availability objectives will quickly help you to define your hardware requirements and design your system. Use the following questions as a guide for general planning: 1. What applications must continue to be available in the event of a failure? 2. What system resources (processing power, networking, SPU, memory, disk space) are needed to support these applications? 3.
Planning and Documenting an HA Cluster General Planning additional disk hardware for shared data storage. If you intend to expand your cluster without the need to bring it down, careful planning of the initial configuration is required. Use the following guidelines: • Set the Maximum Configured Packages parameter (described later in this chapter under “Cluster Configuration Planning” on page 105) high enough to accommodate the additional packages you plan to add.
Planning and Documenting an HA Cluster Hardware Planning Hardware Planning Hardware planning requires examining the physical hardware itself. One useful procedure is to sketch the hardware configuration in a diagram that shows adapter cards and buses, cabling, disks and peripherals. Then record the information on the Hardware Worksheet (page 348). Indicate which device adapters occupy which slots.
Planning and Documenting an HA Cluster Hardware Planning LAN Information While a minimum of one LAN interface per subnet is required, at least two LAN interfaces are needed to eliminate single points of network failure. It is recommended that you configure heartbeats on all subnets, including those to be used for client data. On the worksheet, enter the following for each LAN interface: Subnet Name Enter the IP address for the subnet.
Planning and Documenting an HA Cluster Hardware Planning Information from this section of the worksheet is used in creating the subnet groupings and identifying the IP addresses in the configuration steps for the cluster manager and package manager.
Planning and Documenting an HA Cluster Hardware Planning Shared Storage SCSI may be used for up to four node clusters, or FibreChannel can be used for clusters of up to 16 nodes.
Planning and Documenting an HA Cluster Hardware Planning Multipath for Storage The method for achieving a multipath solution is dependent on the storage sub-system attached to the cluster and the HBA in the servers. Please check the documentation that accompanied your storage sub-system and HBA. For fibre channel attached storage, the multipath function within the HBA driver should be used, if it is supported by HP.
Planning and Documenting an HA Cluster Hardware Planning NOTE MD also supports software RAID; but this is not currently supported with Serviceguard for Linux. Disk I/O Information This part of the worksheet lets you indicate where disk device adapters are installed. Enter the following items on the worksheet for each disk connected to each disk device adapter on the node: Bus Type Indicate the type of bus. Supported busses are F/W SCSI and FibreChannel.
Planning and Documenting an HA Cluster Hardware Planning • ls /dev/hd* (non-SCSI/FibreChannel disks) • ls /dev/sd* (SCSI and FibreChannel disks) • du • df • mount • vgdisplay -v • lvdisplay -v See the man pages on these commands for information about specific usage. The commands should be issued from all nodes after installing the hardware and rebooting the system. The information will be useful when doing LVM and cluster configuration.
Planning and Documenting an HA Cluster Hardware Planning Hardware Configuration Worksheet The Hardware configuration worksheet on page 348 will help you organize and record your specific cluster hardware configuration. Make as many copies as you need. Complete the worksheet and keep it for future reference.
Planning and Documenting an HA Cluster Power Supply Planning Power Supply Planning There are two sources of power for your cluster which you will have to consider in your design: line power and uninterruptible power supplies (UPS). Loss of a power circuit should not bring down the cluster. No more than half of the nodes should be on a single power source.
Planning and Documenting an HA Cluster Cluster Lock Planning Cluster Lock Planning The purpose of the cluster lock is to ensure that only one new cluster is formed in the event that exactly half of the previously clustered nodes try to form a new cluster. It is critical that only one new cluster is formed and that it alone has access to the disks specified in its packages. You can specify a lock LUN or a quorum server as the cluster lock.
Planning and Documenting an HA Cluster Cluster Lock Planning • Can be used with up to 50 clusters, not exceeding 100 nodes total. • Can support a cluster with any supported number of nodes. Networking Recommendations for a Quorum Server • Ideally the Quorum Server and the cluster or clusters it serves should communicate over a subnet that does not handle other traffic. This helps to ensure that the Quorum Server is available when it is needed.
Planning and Documenting an HA Cluster Cluster Lock Planning Supported Node Names Enter the name (39 characters or fewer) of each cluster node that will be supported by this quorum server. These entries will be entered into qs_authfile on the system that is running the quorum server process.
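For example, the authorization file for a quorum server supporting one two-node cluster might contain nothing more than one hostname per line (the node names here are illustrative):

ftsys9.cup.hp.com
ftsys10.cup.hp.com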
Planning and Documenting an HA Cluster Volume Manager Planning Volume Manager Planning When designing your disk layout using LVM, you should consider the following: • The volume groups that contain high availability applications, services, or data must be on a bus or buses available to the primary node and all adoptive nodes. • High availability applications, services, and data should be placed in volume groups that are separate from non-high availability applications, services, and data.
Planning and Documenting an HA Cluster Cluster Configuration Planning Cluster Configuration Planning A cluster should be designed to provide the quickest possible recovery from failures. The actual time required to recover from a failure depends on several factors: • The length of the cluster heartbeat interval and node timeout. See the parameter descriptions for HEARTBEAT_INTERVAL and NODE_TIMEOUT under “Cluster Configuration Parameters” starting on page 106 for recommendations.
Planning and Documenting an HA Cluster Cluster Configuration Planning Quorum Server Information The quorum server (QS) provides tie-breaking services for Linux clusters. The QS is described in chapter 3 under “Cluster Quorum to Prevent Split-Brain Syndrome.” You can use either a lock LUN or a quorum server as a tie-breaker.
Planning and Documenting an HA Cluster Cluster Configuration Planning The cluster name must not contain any of the following characters: space, slash (/), backslash (\), and asterisk (*). In addition, the following characters must not be used in the cluster name if you are using the Quorum Server: at-sign (@), equal-sign (=), or-sign (|), semicolon (;).
Planning and Documenting an HA Cluster Cluster Configuration Planning NODE_NAME The hostname of each system that will be a node in the cluster. Do not use the full domain name. For example, enter ftsys9, not ftsys9.cup.hp.com. A cluster can contain up to 16 nodes.
Planning and Documenting an HA Cluster Cluster Configuration Planning NOTE Heartbeat IP addresses must be on the same subnet on each node, and must be IPv4 addresses. For information about changing the configuration online, see “Changing the Cluster Networking Configuration while the Cluster Is Running” on page 256.
Planning and Documenting an HA Cluster Cluster Configuration Planning STATIONARY_IP The IP address of each monitored subnet, other than those that carry the cluster heartbeat. You can identify any number of subnets to be monitored. If you want to separate application data from heartbeat messages, define monitored non-heartbeat subnets here. A stationary IP address can be either an IPv4 or an IPv6 address. For more information about IPv6 addresses, see Appendix E, “IPv6 Network Support,” on page 357.
Planning and Documenting an HA Cluster Cluster Configuration Planning • To ensure the fastest cluster reformations, use the default value. But keep in mind that this setting can lead to reformations that are caused by short-lived system hangs or network load spikes. • For fewer reformations, use a setting in the range of 5,000,000 to 8,000,000 microseconds (5 to 8 seconds). But keep in mind that this will lead to slower reformations than the default value.
Planning and Documenting an HA Cluster Cluster Configuration Planning NETWORK_POLLING_INTERVAL The frequency at which the networks configured for Serviceguard are checked. In the ASCII cluster configuration file, this parameter is NETWORK_POLLING_INTERVAL. Default is 2,000,000 microseconds in the ASCII file. Thus every 2 seconds, the network manager polls each network interface to make sure it can still send and receive information. Changing this value can affect how quickly a network failure is detected.
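Taken together, these cluster-level parameters appear in the ASCII cluster configuration file in a form similar to the following sketch (the names, addresses, and values shown are illustrative; in practice the template is generated with cmquerycl and then edited):

CLUSTER_NAME              cluster1
QS_HOST                   qshost.cup.hp.com
QS_POLLING_INTERVAL       300000000
NODE_NAME                 ftsys9
  NETWORK_INTERFACE       bond0
    HEARTBEAT_IP          192.10.25.18
NODE_NAME                 ftsys10
  NETWORK_INTERFACE       bond0
    HEARTBEAT_IP          192.10.25.19
HEARTBEAT_INTERVAL        1000000
NODE_TIMEOUT              2000000
NETWORK_POLLING_INTERVAL  2000000
MAX_CONFIGURED_PACKAGES   150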
Planning and Documenting an HA Cluster Package Configuration Planning Package Configuration Planning Planning for packages involves assembling information about each group of highly available services. NOTE As of Serviceguard A.11.18, there is a new and simpler way to configure packages.
Planning and Documenting an HA Cluster Package Configuration Planning NOTE To prevent an operator from accidentally activating volume groups on other nodes in the cluster, versions A.11.16.07 and later of Serviceguard for Linux include a type of VG activation protection. This is based on the “hosttags” feature of LVM2. This feature is not mandatory, but HP strongly recommends you implement it as you upgrade existing clusters and create new ones.
Planning and Documenting an HA Cluster Package Configuration Planning Create an entry for each logical volume, indicating its use for a file system or for a raw device. CAUTION Do not use /etc/fstab to mount file systems that are used by Serviceguard packages. For information about creating, exporting, and importing volume groups, see “Creating the Logical Volume Infrastructure” on page 163. Planning for Expansion You can add packages to a running cluster.
Planning and Documenting an HA Cluster Package Configuration Planning Make a package dependent on another package if the first package cannot (or should not) function without the services provided by the second. For example, pkg1 might run a real-time web interface to a database managed by pkg2. In this case it might make sense to make pkg1 dependent on pkg2. In considering whether or not to create a dependency between packages, consider the Rules and Guidelines that follow.
Planning and Documenting an HA Cluster Package Configuration Planning — Preferably the nodes should be listed in the same order if the dependency is between packages whose failover_policy is configured_node; cmcheckconf and cmapplyconf will warn you if they are not. • A package cannot depend on itself, directly or indirectly.
Planning and Documenting an HA Cluster Package Configuration Planning The broad rule is that a higher-priority package can drag a lower-priority package, forcing it to start on, or move to, a node that suits the higher-priority package. NOTE This applies only when the packages are automatically started (package switching enabled); cmrunpkg will never force a package to halt. Keep in mind that you do not have to set priority, even when one or more packages depend on another.
Planning and Documenting an HA Cluster Package Configuration Planning 3. All packages with no_priority are by definition of equal priority, and there is no other way to assign equal priorities; a numerical priority must be unique within the cluster. See priority (page 206) for more information. If pkg1 depends on pkg2, and pkg1’s priority is lower than or equal to pkg2’s, pkg2’s node order dominates.
Planning and Documenting an HA Cluster Package Configuration Planning — If the priorities are equal, neither package will fail back (unless pkg1 is not running; in that case pkg2 can fail back). — If pkg2’s priority is higher than pkg1’s, pkg2 will fail back to node1; pkg1 will fail back to node1 provided all of pkg1’s other dependencies are met there; — if pkg2 has failed back to node1 and node1 does not meet all of pkg1’s dependencies, pkg1 will halt.
Planning and Documenting an HA Cluster Package Configuration Planning Guidelines As you can see from the “Dragging Rules” on page 117, if pkg1 depends on pkg2, it can sometimes be a good idea to assign a higher priority to pkg1, because that provides the best chance for a successful failover (and failback) if pkg1 fails. But you also need to weigh the relative importance of the packages.
Planning and Documenting an HA Cluster Package Configuration Planning About External Scripts As of Serviceguard A.11.18, the package configuration template for modular packages explicitly provides for external scripts. These replace the CUSTOMER DEFINED FUNCTIONS in legacy scripts. Each external script must have three entry points: start, stop, and validate, and should exit with one of the following values: NOTE • 0 - indicating success.
Planning and Documenting an HA Cluster Package Configuration Planning The scripts are also run when the package is validated by cmcheckconf and cmapplyconf. A package can make use of both kinds of script, and can launch more than one of each kind; in that case the scripts will be executed in the order they are listed in the package configuration file (and in the reverse order when the package shuts down). A sample script follows. It assumes there is another script called monitor.sh, configured as a Serviceguard service, that does the actual monitoring of the application.

#!/bin/sh

# Source utility functions.
if [[ -z $SG_UTILS ]]
then
    . /etc/cmcluster.conf
    SG_UTILS=$SGCONF/scripts/mscripts/utils.sh
fi

if [[ -f ${SG_UTILS} ]]; then
    . ${SG_UTILS}
    if (( $? != 0 ))
    then
        echo "ERROR: Unable to source package utility functions file: ${SG_UTILS}"
        exit 1
    fi
else
    echo "ERROR: Unable to find package utility functions file: ${SG_UTILS}"
    exit 1
fi

# Get the package environment (SG_* and PEV_* variables) through the
# utility function sg_source_pkg_env().
sg_source_pkg_env $*

function validate_command
{
    typeset -i ret=0
    typeset -i i=0
    typeset -i found=0

    # Check that the monitoring service this script expects (monitor.sh)
    # is configured as one of the package's services.
    while (( i < ${#SG_SERVICE_CMD[*]} ))
    do
        case ${SG_SERVICE_CMD[i]} in
            *monitor.sh*)       # found our monitoring service
                found=1
                break
                ;;
            *)
                ;;
        esac
        (( i = i + 1 ))
    done
    if (( found == 0 ))
    then
        sg_log 0 "ERROR: monitoring service not configured!"
        ret=1
    fi
    if (( ret == 1 ))
    then
        sg_log 0 "Script validation for $SG_PACKAGE_NAME failed!"
    fi
    return $ret
}

function start_command
{
    sg_log 5 "start_command"

    # log current PEV_MONITORING_INTERVAL value; a PEV_ attribute can be changed
    # while the package is running
    sg_log 0 "PEV_MONITORING_INTERVAL for $SG_PACKAGE_NAME is $PEV_MONITORING_INTERVAL"
    return 0
}

function stop_command
{
    sg_log 5 "stop_command"
    return 0
}

typeset -i exit_val=0

case ${1} in
    start)
        start_command $*
        exit_val=$?
        ;;
    stop)
        stop_command $*
        exit_val=$?
        ;;
    validate)
        validate_command $*
        exit_val=$?
        ;;
    *)
        sg_log 0 "Unknown entry point $1"
        ;;
esac

exit $exit_val

For more information about integrating an application with Serviceguard, see the white paper Framework for HP Serviceguard Toolkits, which includes a suite of customizable scripts.
Planning and Documenting an HA Cluster Package Configuration Planning Package Configuration File Parameters Before editing the package configuration file, assemble the following information and enter it on the worksheet for each package. For more details about each parameter, see the “Package Parameter Explanations” on page 201; you may also want to generate and print out a complete package configuration file; use a command such as: cmmakepkg -m sg/all $SGCONF/sg-all package_name The name of the package.
Planning and Documenting an HA Cluster Package Configuration Planning A node name can be up to 39 characters (bytes) long; more details in Chapter 6 under node_name (see page 202). auto_run If auto_run is set to yes, Serviceguard will automatically start the package on an eligible node if one is available, and will automatically fail the package over to another node if it fails.
Planning and Documenting an HA Cluster Package Configuration Planning The default is no. run_script_timeout and halt_script_timeout If you specify a timeout value and the script does not complete in that time, Serviceguard will terminate the script. Timeout values are in seconds. Can be 0 through 4292, or no_timeout. More details in Chapter 6, under run_script_timeout (see page 203).
Planning and Documenting an HA Cluster Package Configuration Planning The alternative policy is min_package_node, which tells the package manager to select (from the list of nodes that can run this package) the node that is running the fewest packages. See also “About Package Dependencies” on page 115.
Planning and Documenting an HA Cluster Package Configuration Planning (Package dependency parameters) dependency_name - A unique identifier for the dependency dependency_condition - pkgname = up dependency_location - same_node See “About Package Dependencies” on page 115 for more information. monitored_subnet Specify the IP subnets that are to be monitored for the package. ip_subnet and ip_address Specify an IP subnet and relocatable IP addresses used by the package.
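In the package configuration file these entries take a form like the following sketch (the names and addresses are illustrative):

dependency_name       pkg1_needs_pkg2
dependency_condition  pkg2 = up
dependency_location   same_node

monitored_subnet      192.10.25.0
ip_subnet             192.10.25.0
ip_address            192.10.25.12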
Planning and Documenting an HA Cluster Package Configuration Planning service_fail_fast_enabled For each service, indicates whether or not the failure of the service will result in the failure of the node. Valid values are yes or no. If the parameter is set to yes, and the service fails, Serviceguard will halt the node on which the service is running (system halt). The default is no.
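A typical set of service definitions in the package configuration file looks something like this sketch (the service name and command are illustrative):

service_name               pkg1_monitor
service_cmd                "$SGCONF/pkg1/monitor.sh"
service_restart            none
service_fail_fast_enabled  no
service_halt_timeout       300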
Planning and Documenting an HA Cluster Package Configuration Planning fs_mount_retry_count The number of mount retries for each file system. The default is zero. Details in Chapter 6 under “fs_mount_retry_count” on page 212. fs_umount_retry_count The number of unmount retries allowed for each file system during package shutdown. The default is zero.
Planning and Documenting an HA Cluster Package Configuration Planning For these filesystem types, a logical volume must be built on an LVM volume group. Logical volumes can be entered in any order, regardless of the type of storage group that is used. A gfs file system can be configured using only the fs_name, fs_directory, and fs_mount_opt parameters; see the configuration file for an example. Additional rules apply for gfs as explained in Chapter 6 under fs_type (see page 213).
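For example, a package that activates one LVM volume group and mounts a single ext3 file system might use entries like the following sketch (the device and mount-point names are illustrative):

vg              vg01
fs_name         /dev/vg01/lvol1
fs_directory    /mnt/pkg1data
fs_type         ext3
fs_mount_opt    "-o rw"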
Planning and Documenting an HA Cluster Package Configuration Planning (Access Control Policies) You can configure package administration access for this package. Be sure the policy is not in conflict with or redundant to an access policy defined in the cluster configuration file. For more information, see “Editing Security Files” on page 142. Package Configuration Worksheet Assemble your package configuration data in a separate worksheet for each package, as shown in the following example.
Planning and Documenting an HA Cluster Package Configuration Planning Package Configuration File Data: ========================================================================== Package Name: ______pkg11____________Package Type:___Failover____________ Primary Node: ______ftsys9_______________ First Failover Node:____ftsys10_______________ Additional Failover Nodes:__________________________________ Run Script Timeout: _no_timeout_____ Halt Script Timeout: _no_timeout___ Package AutoRun Enabled? Node Failfas
Planning and Documenting an HA Cluster Package Configuration Planning fs_name____________________fs_directory________________fs_mount_opt____________ fs_umount_opt_____________ fs_fsck_opt_________________fs_type_________________ fs_mount_retry_count: ____________fs_umount_retry_count:___________________ Concurrent mount/umount operations: ______________________________________ Concurrent fsck operations: ______________________________________________ ========================================================
Planning and Documenting an HA Cluster Package Configuration Planning Additional Parameters Used Only by Legacy Packages IMPORTANT The following parameters are used only by legacy packages. Do not try to use them in modular packages. See “Creating the Legacy Package Configuration” on page 262 for more information. PATH Specifies the path to be used by the script. SUBNET Specifies the IP subnets that are to be monitored for the package. RUN_SCRIPT and HALT_SCRIPT Use the full pathname of each script.
Building an HA Cluster Configuration 5 Building an HA Cluster Configuration This chapter and the next take you through the configuration tasks required to set up a Serviceguard cluster. These procedures are carried out on one node, called the configuration node, and the resulting binary file is distributed by Serviceguard to all the nodes in the cluster. In the examples in this chapter, the configuration node is named ftsys9, and the sample target node is called ftsys10.
Building an HA Cluster Configuration Preparing Your Systems Preparing Your Systems Before configuring your cluster, ensure that Serviceguard is installed on all cluster nodes, and that all nodes have the appropriate security files, kernel configuration and NTP (network time protocol) configuration. Understanding the Location of Serviceguard Files Serviceguard uses a special file, /etc/cmcluster.conf, to define the locations for configuration and log files within the Linux file system.
Building an HA Cluster Configuration Preparing Your Systems file. For example, if SGCONF is /usr/local/cmcluster/conf, then the complete pathname for file $SGCONF/cmclconfig would be /usr/local/cmcluster/conf/cmclconfig. Enabling Serviceguard Command Access To allow the creation of a Serviceguard configuration, you should complete the following steps on all cluster nodes before running any Serviceguard commands: 1. Make sure the root user’s path includes the Serviceguard executables.
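A minimal way to accomplish this, sketched below, is to source /etc/cmcluster.conf from root's shell profile and append the Serviceguard binary directories to the PATH; the variable names SGSBIN and SGLBIN are assumed here to be among those defined in the default cmcluster.conf, so verify them on your installation:

# Add to root's .profile (or equivalent) on each cluster node
. /etc/cmcluster.conf
export PATH=$PATH:$SGSBIN:$SGLBIN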
Building an HA Cluster Configuration Preparing Your Systems Editing Security Files Serviceguard daemons grant access to commands by matching incoming hostname and username against defined access control policies. To understand how to properly configure these policies, administrators need to understand how Serviceguard handles hostnames, IP addresses, usernames and the relevant configuration files. For redundancy, Serviceguard utilizes all available IPv4 networks for communication.
Building an HA Cluster Configuration Preparing Your Systems
10.8.1.132      sly.uksr.hp.com   sly
15.145.162.150  bit.uksr.hp.com   bit

NOTE Serviceguard recognizes only the hostname portion of a fully qualified domain name (FQDN). For example, two nodes gryf.uksr.hp.com and gryf.cup.hp.com could not be in the same cluster, as they would both be treated as the same host gryf. Serviceguard also supports domain name aliases.
Building an HA Cluster Configuration Preparing Your Systems Username Validation Serviceguard relies on the identd daemon (usually started from /etc/init.d/xinetd) to verify the username of the incoming network connection. If the Serviceguard daemon is unable to connect to the identd daemon, permission will be denied. For Serviceguard to recognize a remote user as the root user on that remote node, identd must return the username root.
Building an HA Cluster Configuration Preparing Your Systems server_args = -f /user/local/cmom/log/cmomd.log -r /user/local/cmom/run to server_args = -i -f /user/local/cmom/log/cmomd.log -r /user/local/cmom/run 3. Restart xinetd: /etc/init.d/xinetd restart Access Roles Serviceguard access control policies define what a user on a remote node can do on the local node. These are known as Access Roles or Role Based Access (RBA). This manual uses Access Roles.
Building an HA Cluster Configuration Preparing Your Systems — Full Admin: These users can administer the cluster. They can issue these commands in their cluster: cmruncl, cmhaltcl, cmrunnode, and cmhaltnode. Full Admins cannot configure or create a cluster. Full Admin includes the privileges of the Package Admin role. NOTE When you upgrade a cluster from Version A.11.
Building an HA Cluster Configuration Preparing Your Systems Using the cmclnodelist File The cmclnodelist file is not created by default in new installations. When you create it, you may want to add a comment such as the following at the top of the file: ########################################################### # Do not edit this file! # Serviceguard uses this file only to authorize access to an # unconfigured node. Once the node is configured, # Serviceguard will not consult this file.
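Below the comment block, list one hostname/username pair per line for each node and user that should be authorized on the unconfigured node. The entries below are purely illustrative, reusing the example node names from this chapter:

# Hostname            Username
gryf.uksr.hp.com      root
sly.uksr.hp.com       root
bit.uksr.hp.com       root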
Building an HA Cluster Configuration Preparing Your Systems NOTE Users on systems outside the cluster cannot gain root access to cluster nodes. Define access control policies for a cluster in the cluster configuration file, and for a specific package in the package configuration file. Any combination of hosts and users can be assigned roles for the cluster. You can define up to 200 access policies for each cluster.
Building an HA Cluster Configuration Preparing Your Systems NOTE You do not have to halt the cluster or package to configure or modify access control policies. Here is an example of an access control policy: USER_NAME john USER_HOST bit USER_ROLE PACKAGE_ADMIN If this policy is defined in the package configuration for PackageA, then user john from node bit has the PACKAGE_ADMIN role only for PackageA. User john also has the MONITOR role for the entire cluster.
Building an HA Cluster Configuration Preparing Your Systems Plan the cluster’s roles and validate them as soon as possible. If your organization’s security policies allow it, you may find it easiest to create group logins. For example, you could create a MONITOR role for user operator1 from ANY_CLUSTER_NODE. Then you could give this login name and password to everyone who will need to monitor your clusters.
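As a sketch, such a group-login policy would look like this in the cluster configuration file (operator1 is an example login name; substitute whatever your site uses):

USER_NAME operator1
USER_HOST ANY_CLUSTER_NODE
USER_ROLE MONITOR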
Building an HA Cluster Configuration Setting up the Quorum Server Setting up the Quorum Server The quorum server software, which has to be running during cluster configuration, must be installed on a system other than the nodes on which your cluster will be running. • Ideally the Quorum Server and the cluster or clusters it serves should communicate over a subnet that does not handle other traffic. This helps to ensure that the Quorum Server is available when it is needed.
Building an HA Cluster Configuration Setting up the Quorum Server Installing the Quorum Server Use the Linux rpm command to install the quorum server, product number B8467BA, on the system or systems where it will be running. You do not need to install the product on nodes that are simply using quorum services. More details on installation are in the Quorum Server Release Notes for your version of Quorum Server.
Building an HA Cluster Configuration Setting up the Quorum Server Running the Quorum Server The quorum server must be running when you use cmquerycl or cmapplyconf. By default, quorum server run-time messages go to stdout and stderr. It is suggested that you capture these messages by redirecting stdout and stderr to the file /var/log/qs/qs.log. You must have root permission to execute the quorum server.
Building an HA Cluster Configuration Setting up the Lock LUN Setting up the Lock LUN The lock LUN requires a partition of one cylinder of at least 100K defined (via the fdisk command) as type Linux (83). You will need the pathnames for the lock LUN as it is seen on each cluster node. On one node, use the fdisk command to define a partition of 1 cylinder, type 83, on this LUN.
Building an HA Cluster Configuration Setting up the Lock LUN
Command (m for help): p

Disk /dev/sdc: 64 heads, 32 sectors, 4067 cylinders
Units = cylinders of 2048 * 512 bytes

   Device Boot    Start      End    Blocks   Id  System
/dev/sdc1              1        1      1008   83  Linux

Command (m for help): w
The partition table has been altered!

NOTE
• Do not try to use LVM to configure the lock LUN.
• The partition type must be 83.
• Do not create any filesystem on the partition used for the lock LUN.
Building an HA Cluster Configuration Implementing Channel Bonding (Red Hat) Implementing Channel Bonding (Red Hat) This section applies to Red Hat installations. If you are using a SUSE distribution, skip ahead to the next section. Channel bonding of LAN interfaces is implemented by the use of the bonding driver, which is installed in the kernel at boot time.
Building an HA Cluster Configuration Implementing Channel Bonding (Red Hat) Sample Configuration Configure the following files to support LAN redundancy. For a single failover only one bond is needed. 1. Create a bond0 file, ifcfg-bond0. Create the configuration in the /etc/sysconfig/network-scripts directory. For example, in the file, ifcfg-bond0, bond0 is defined as the master (for your installation, substitute the appropriate values for your network instead of 192.168.1).
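The sketch below shows what the master and one slave file might contain; the IP values are placeholders on the 192.168.1 example network, and your interface names may differ:

# /etc/sysconfig/network-scripts/ifcfg-bond0 (illustrative)
DEVICE=bond0
IPADDR=192.168.1.1
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none
USERCTL=no

# /etc/sysconfig/network-scripts/ifcfg-eth0 (slave of bond0, illustrative)
DEVICE=eth0
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none
USERCTL=no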
Building an HA Cluster Configuration Implementing Channel Bonding (Red Hat) Use MASTER=bond1 for bond1 if you have configured a second bonding interface, then add the following after the first bond (bond0):

options bond1 -o bonding1 miimon=100 mode=1

NOTE During configuration, you need to make sure that the active slaves for the same bond on each node are connected to the same hub or switch. You can check on this by examining the file /proc/net/bondx/info on each node.
Building an HA Cluster Configuration Implementing Channel Bonding (Red Hat) Viewing the Configuration You can test the configuration and transmit policy with ifconfig. For example, when run against the configuration created above, the display should look like this:

/sbin/ifconfig

bond0    Link encap:Ethernet  HWaddr 00:C0:F0:1F:37:B4
         inet addr:192.168.1.1  Bcast:192.168.1.255  Mask:255.255.255.
Building an HA Cluster Configuration Implementing Channel Bonding (SUSE) Implementing Channel Bonding (SUSE) If you are using a Red Hat distribution, use the procedures described in the previous section. The following applies only to the SUSE distributions. First run yast/yast2 and configure ethernet devices as DHCP so they create the ifcfg-eth-id- files.
Building an HA Cluster Configuration Implementing Channel Bonding (SUSE)
REMOTE_IPADDR=''
STARTMODE='onboot'
BONDING_MASTER='yes'
BONDING_MODULE_OPTS='miimon=100 mode=1'
BONDING_SLAVE0='eth0'
BONDING_SLAVE1='eth1'

The above example configures bond0 with mii monitor equal to 100 and mode active-backup. Adjust the IP, BROADCAST, NETMASK, and NETWORK values accordingly for your configuration. As you can see, the new configuration options are BONDING_MASTER, BONDING_MODULE_OPTS, and BONDING_SLAVE.
Building an HA Cluster Configuration Implementing Channel Bonding (SUSE) NOTE It is better not to restart the network from outside the cluster subnet, as there is a chance the network could go down before the command can complete. If there was an error in any of the bonding configuration files, the network might not function properly. If this occurs, check each configuration file for errors, then try to restart the network again.
Building an HA Cluster Configuration Creating the Logical Volume Infrastructure Creating the Logical Volume Infrastructure Serviceguard makes use of shared disk storage. This is set up to provide high availability by using redundant data storage and redundant paths to the shared devices. Storage for a Serviceguard package is logically composed of LVM Volume Groups that are activated on a node as part of starting a package on that node. Storage is generally configured on logical units (LUNs).
Building an HA Cluster Configuration Creating the Logical Volume Infrastructure CAUTION The minor numbers used by the LVM volume groups must be the same on all cluster nodes. This means that if there are any non-shared volume groups in the cluster, create the same number of them on all nodes, and create them before you define the shared storage. NOTE Except as noted in the sections that follow, you perform the LVM configuration of shared storage on only one node.
Building an HA Cluster Configuration Creating the Logical Volume Infrastructure Displaying Disk Information To display a list of configured disks, use the following command:

fdisk -l

You will see output such as the following:

Disk /dev/sda: 64 heads, 32 sectors, 8678 cylinders
Units = cylinders of 2048 * 512 bytes

   Device Boot    Start      End    Blocks   Id  System
/dev/sda1   *         1     1001   1025008   83  Linux
/dev/sda2          1002     8678   7861248    5  Extended
/dev/sda5          1002     4002   3073008   83  Linux
/dev/sda6          4003     5003   1025008   82  Linux swap
/dev/sda7          5004     8678   3763184   83  Linux
Building an HA Cluster Configuration Creating the Logical Volume Infrastructure 1. Run fdisk, specifying your device file name in place of /dev/sdc:

# fdisk /dev/sdc

Respond to the prompts as shown in the following table, to define a partition:

Prompt                               Response   Action Performed
1. Command (m for help):             n          Create a new partition
2. Command action
   e extended
   p primary partition (1-4)         p          Create a primary partition
3. Partition number (1-4):           1          Create partition 1
4.
Building an HA Cluster Configuration Creating the Logical Volume Infrastructure
Disk /dev/sdc: 64 heads, 32 sectors, 4067 cylinders
Units = cylinders of 2048 * 512 bytes

   Device Boot    Start      End    Blocks   Id  System
/dev/sdc1              1     4067   4164592   83  Linux

Command (m for help): w
The partition table has been altered!

2. Respond to the prompts as shown in the following table to set a partition type:

Prompt                       Response   Action Performed
1. Command (m for help):     t          Set the partition type
2.
Building an HA Cluster Configuration Creating the Logical Volume Infrastructure Command (m for help): w The partition table has been altered! 3. Repeat this process for each device file that you will use for shared storage. fdisk /dev/sdd fdisk /dev/sdf fdisk /dev/sdg 4. If you will be creating volume groups for internal storage, make sure to create those partitions as well, and create those volume groups before you define the shared storage.
Building an HA Cluster Configuration Creating the Logical Volume Infrastructure NOTE At this point, the setup for VG activation protection is complete. As of Serviceguard for Linux A.11.16.07, the package control script adds a tag matching the value of node (as specified in step 3 above) when it activates the volume group, and deletes the tag when it deactivates it, preventing the volume group from being activated by more than one node at the same time.
Building an HA Cluster Configuration Creating the Logical Volume Infrastructure 1. Update the LVM configuration and create the /etc/lvmtab file. You can omit this step if you have previously created volume groups on this node. vgscan NOTE The files /etc/lvmtab and /etc/lvmtab.d may not exist on some distributions. In that case, ignore references to these files. 2. Create LVM physical volumes on each LUN.
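For example, assuming the shared LUNs were partitioned as /dev/sdc1 and /dev/sdd1 in the previous section (substitute your own device names):

pvcreate /dev/sdc1
pvcreate /dev/sdd1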
Building an HA Cluster Configuration Creating the Logical Volume Infrastructure Figure 5-1 shows these two volume groups as they are constructed for the MSA500 Storage.
Building an HA Cluster Configuration Creating the Logical Volume Infrastructure Building Volume Groups and Logical Volumes Step 1. Use Logical Volume Manager (LVM) to create volume groups that can be activated by Serviceguard packages. For an example showing volume-group creation on LUNs, see “Building Volume Groups: Example for Smart Array Cluster Storage (MSA 500 Series)” on page 169.
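A minimal sketch of the sequence follows, reusing the example volume group name sgvg00 that appears elsewhere in this chapter; the device names, logical volume name, size, and filesystem type are placeholders:

vgcreate sgvg00 /dev/sdc1 /dev/sdd1     # create the volume group on the shared partitions
lvcreate -L 500M -n lvol1 sgvg00        # create a logical volume in the volume group
mkfs -t ext3 /dev/sgvg00/lvol1          # build a filesystem on it (if the package will mount one)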
Building an HA Cluster Configuration Creating the Logical Volume Infrastructure NOTE Be careful if you use YAST or YAST2 to configure volume groups, as that may cause all volume groups on that system to be activated. After running YAST or YAST2, check to make sure that volume groups for Serviceguard packages not currently running have not been activated, and use LVM commands to deactivate any that have. For example, use the command vgchange -a n /dev/sgvg00 to deactivate the volume group sgvg00.
Building an HA Cluster Configuration Creating the Logical Volume Infrastructure reboot The partition table on the rebooted node is then rebuilt using the information placed on the disks when they were partitioned on the other node. NOTE You must reboot at this time. 3. Run vgscan to make the LVM configuration visible on the new node and to create the LVM database on /etc/lvmtab and /etc/lvmtab.d.
Building an HA Cluster Configuration Creating the Logical Volume Infrastructure Step 2.
Building an HA Cluster Configuration Creating the Logical Volume Infrastructure Storing Volume Group Configuration Data When you create volume groups, LVM creates a backup copy of the volume group configuration on the configuration node.
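You can also make a backup explicitly at any time; for example, for the sgvg00 volume group used in the examples above:

vgcfgbackup sgvg00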
Building an HA Cluster Configuration Creating the Logical Volume Infrastructure vgchange -a n /dev/sgvg00 vgchange -a n /dev/sgvg01 The vgchange commands activate the volume groups temporarily, then deactivate them; this is expected behavior.
Building an HA Cluster Configuration Creating the Logical Volume Infrastructure Setting up Disk Monitoring HP Serviceguard for Linux includes a Disk Monitor which you can use to detect problems in disk connectivity. This lets you fail a package from one node to another in the event of a disk link failure. See “Creating a Disk Monitor Configuration” on page 228 for instructions on configuring disk monitoring.
Building an HA Cluster Configuration Configuring the Cluster Configuring the Cluster This section describes how to define the basic cluster configuration. This must be done on a system that is not part of a Serviceguard cluster (that is, on which Serviceguard is installed but not configured). You can do this in Serviceguard Manager, or from the command line as described below.
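From the command line, the first step is to generate a cluster configuration template with cmquerycl. A typical invocation, using the configuration node and target node named in the examples in this chapter, looks like this:

cmquerycl -v -C $SGCONF/clust1.config -n ftsys9 -n ftsys10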
Building an HA Cluster Configuration Configuring the Cluster For more details, see the cmquerycl(1m) man page. The man page for the cmquerycl command lists the definitions of all the parameters that appear in this file. Many are also described in Chapter 4, “Planning and Documenting an HA Cluster,” on page 89. Modify your $SGCONF/clust1.config file to your requirements, using the data on the cluster configuration worksheet. In the file, keywords are separated from definitions by white space.
Building an HA Cluster Configuration Configuring the Cluster Specifying Maximum Number of Configured Packages This value must be equal to or greater than the number of packages currently configured in the cluster. The maximum number of packages per cluster is 150. The default is the maximum.
Building an HA Cluster Configuration Configuring the Cluster Modifying Cluster Timing Parameters The cmquerycl command supplies default cluster timing parameters for HEARTBEAT_INTERVAL and NODE_TIMEOUT. Changing these parameters will directly impact the cluster’s reformation and failover times. It is useful to modify these parameters if the cluster is re-forming occasionally due to heavy system load or heavy network traffic.
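For example, to lengthen both values you would edit the corresponding entries in the cluster configuration file and then re-verify and re-apply it. The figures below are placeholders only; in this release the values are expressed in microseconds, so 2000000 is two seconds:

HEARTBEAT_INTERVAL    2000000
NODE_TIMEOUT          4000000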
Building an HA Cluster Configuration Configuring the Cluster Verifying the Cluster Configuration If you have edited a cluster configuration template file, use the following command to verify the content of the file:

cmcheckconf -v -C $SGCONF/clust1.config

This command checks the following:

• Network addresses and connections.
• Quorum server connection.
• All lock LUN device names on all nodes refer to the same physical disk area.
• One and only one lock LUN device is specified per node.
Building an HA Cluster Configuration Configuring the Cluster • The network interface device files specified are valid LAN device files. • Other configuration parameters for the cluster and packages are valid. If the cluster is online the cmcheckconf command also verifies that all the conditions for the specific change in configuration have been met. Cluster Lock Configuration Messages The cmquerycl, cmcheckconf and cmapplyconf commands will return errors if the cluster lock is not correctly configured.
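Once cmcheckconf reports no errors, apply the configuration; cmapplyconf creates the binary cluster configuration file and distributes it to all nodes. For example:

cmapplyconf -v -C $SGCONF/clust1.config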
Building an HA Cluster Configuration Managing the Running Cluster Managing the Running Cluster This section describes some approaches to routine management of the cluster. Additional tools and suggestions are found in Chapter 7, “Cluster and Package Maintenance,” on page 229. You can manage the cluster from Serviceguard Manager, or by means of Serviceguard commands as described below. Checking Cluster Operation with Serviceguard Commands • cmviewcl checks status of the cluster and many of its components.
Building an HA Cluster Configuration Managing the Running Cluster 2. When the cluster has started, make sure that cluster components are operating correctly: cmviewcl -v Make sure that all nodes and networks are functioning as expected. For more information, refer to the chapter on “Cluster and Package Maintenance.” 3. Verify that nodes leave and enter the cluster as expected using the following steps: • Halt the node. You can use Serviceguard Manager or the cmhaltnode command.
Building an HA Cluster Configuration Managing the Running Cluster Setting up Autostart Features Automatic startup is the process in which each node individually joins a cluster; Serviceguard provides a startup script to control the startup process. If a cluster already exists, the node attempts to join it; if no cluster is running, the node attempts to form a cluster consisting of all configured nodes. Automatic cluster start is the preferred way to start a cluster.
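Autostart is typically controlled by the AUTOSTART_CMCLD flag in the Serviceguard startup configuration file; on many installations this is $SGCONF/cmcluster.rc, but check the location on your distribution. A sketch of the setting:

# $SGCONF/cmcluster.rc (location may vary by distribution)
AUTOSTART_CMCLD=1      # 1 = attempt to join or form the cluster at boot; 0 = do not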
Building an HA Cluster Configuration Managing the Running Cluster Changing the System Message You may find it useful to modify the system's login message to include a statement such as the following: This system is a node in a high availability cluster. Halting this system may cause applications and services to start up on another node in the cluster. You might wish to include a list of all cluster nodes in this message, together with additional cluster-specific information.
Building an HA Cluster Configuration Managing the Running Cluster currently available for package switching. However, you should not try to restart HP Serviceguard, since data corruption might occur if another node were to attempt to start up a new instance of the application that is still running on the single node.
Configuring Packages and Their Services 6 Configuring Packages and Their Services Serviceguard packages group together applications and the services and resources they depend on. The typical Serviceguard package is a failover package that starts on one node but can be moved (“failed over”) to another if necessary. See “What is Serviceguard for Linux?” on page 18, “How the Package Manager Works” on page 46, and“Package Configuration Planning” on page 113 for more information.
Configuring Packages and Their Services Packages created using Serviceguard A.11.16 or earlier are referred to as legacy packages. If you need to reconfigure a legacy package (rather than create a new package), see “Configuring a Legacy Package” on page 262. It is also still possible to create new legacy packages by the method described in “Configuring a Legacy Package”. If you are using a Serviceguard Toolkit, consult the documentation for that product.
Configuring Packages and Their Services Choosing Package Modules Choosing Package Modules IMPORTANT Before you start, you need to do the package-planning tasks described under “Package Configuration Planning” on page 113. To choose the right package modules, you need to decide the following things about the package you are creating: • What type of package it is; see “Types of Package: Failover, Multi-Node, System Multi-Node” on page 193.
Configuring Packages and Their Services Choosing Package Modules Relocatable IP addresses cannot be assigned to multi-node packages. Multi-node packages must either use a clustered file system such as Red Hat GFS, or not use shared storage. IMPORTANT To generate a package configuration file that creates a multi-node package, include -m sg/multi_node on the cmmakepkg command line. See “Generating the Package Configuration File” on page 217. • NOTE System multi-node packages.
Configuring Packages and Their Services Choosing Package Modules Package Modules and Parameters The table that follows shows the package modules and the configuration parameters each module includes. Read this section in conjunction with the discussion under “Package Configuration Planning” on page 113.
Configuring Packages and Their Services Choosing Package Modules Base Package Modules At least one base module (or default or all, which include the base module) must be specified on the cmmakepkg command line. Parameters marked with an asterisk (*) are new or changed as of Serviceguard A.11.18. (S) indicates that the parameter (or its equivalent) has moved from the package control script to the package configuration file for modular packages.
Configuring Packages and Their Services Choosing Package Modules Table 6-1 Base Modules (Continued) Module Name Parameters (page) Comments multi_node package_name (201) * module_name (202) * module_version (202) * package_type (202) node_name (202) auto_run (202) node_fail_fast_enabled (203) run_script_timeout (203) halt_script_timeout (203) successor_halt_timeout (204) * script_log_file (204) operation_sequence (204) * log_level (205) * priority (206) * Base module.
Configuring Packages and Their Services Choosing Package Modules Optional Package Modules Add optional modules to a base module if you need to configure the functions in question. Parameters marked with an asterisk (*) are new or changed as of Serviceguard A.11.18. (S) indicates that the parameter (or its equivalent) has moved from the package control script to the package configuration file for modular packages. See the “Package Parameter Explanations” on page 201 for more information.
Configuring Packages and Their Services Choosing Package Modules Table 6-2 Module Name Optional Modules (Continued) Parameters (page) Comments filesystem concurrent_fsck_operations (211) (S) concurrent_mount_and_umount_ operations (212) (S) fs_mount_retry_count (212) (S) fs_umount_retry_count (212) * (S) fs_name (213) * (S) fs_directory (213) * (S) fs_type (213) (S) fs_mount_opt (214) (S) fs_umount_opt (214) (S) fs_fsck_opt (214) (S) Add to a base module to configure filesystem options for the package.
Configuring Packages and Their Services Choosing Package Modules Table 6-2 Module Name Optional Modules (Continued) Parameters (page) Comments all all parameters Use if you are creating a complex package that requires most or all of the optional parameters; or if you want to see the specifications and comments for all available parameters.
Configuring Packages and Their Services Choosing Package Modules Package Parameter Explanations Brief descriptions of the package configuration parameters follow. NOTE For more information, see the comments in the editable configuration file output by the cmmakepkg command, and the cmmakepkg (1m) manpage.
Configuring Packages and Their Services Choosing Package Modules module_name The module name. Do not change it. Used in the form of a relative path (for example sg/failover) as a parameter to cmmakepkg to specify modules to be used in configuring the package. (The files reside in the $SGCONF/modules directory; see “Understanding the Location of Serviceguard Files” on page 140 for the location of $SGCONF on your version of Linux.) New for modular packages. module_version The module version. Do not change it.
Configuring Packages and Their Services Choosing Package Modules For failover packages, yes allows Serviceguard to start the package (on the first available node listed under node_name) on cluster start-up, and to automatically restart it on an adoptive node if it fails. no prevents Serviceguard from automatically starting the package, and from restarting it on another node.
Configuring Packages and Their Services Choosing Package Modules If the package’s halt process does not complete in the time specified by halt_script_timeout, Serviceguard will terminate the package and prevent it from switching to another node. In this case, if node_fail_fast_enabled (see page 203) is set to yes, the node will be halted (reboot). If a timeout occurs: • • Switching will be disabled. The current node will be disabled from running the package.
Configuring Packages and Their Services Choosing Package Modules This parameter is not configurable; do not change the entries in the configuration file. New for modular packages. log_level Determines the amount of information printed to stdout when the package is validated, and to the script_log_file (see page 204) when the package is started and halted.
Configuring Packages and Their Services Choosing Package Modules • automatic means Serviceguard will move the package to the primary node as soon as that node becomes available, unless doing so would also force a package with a higher priority (see page 206) to move. This parameter can be set for failover packages only. priority Assigns a priority to a failover package whose failover_policy (see page 205) is configured_node. Valid values are 1 through 3000, or no_priority. The default is no_priority.
Configuring Packages and Their Services Choosing Package Modules Configure this parameter, along with dependency_condition and dependency_location (see page 207), and optionally priority, if this package depends on another package; for example, if this package depends on a package named pkg2: dependency_name pkg2dep dependency_condition pkg2 = UP dependency_location same_node For more information about package dependencies, see “About Package Dependencies” on page 115.
Configuring Packages and Their Services Choosing Package Modules monitored_subnet The LAN subnet that is to be monitored for this package. Replaces legacy SUBNET which is still supported in the package configuration file for legacy packages; see “Configuring a Legacy Package” on page 262. Multiple subnets can be specified on separate lines. Specifying a subnet as a monitored_subnet means that the package will not run if the subnet is not up, and will not run on any node not reachable via that subnet.
Configuring Packages and Their Services Choosing Package Modules ip_address A relocatable IP address on a specified ip_subnet (see above). Replaces IP, which is still supported in the package control script for legacy packages. For more information about relocatable IP addresses, see “Stationary and Relocatable IP Addresses and Monitored Subnets” on page 71. This parameter can be set for failover packages only.
Configuring Packages and Their Services Choosing Package Modules service_cmd The command that runs the program or function for this service_name, for example, /usr/bin/X11/xclock -display 15.244.58.208:0 An absolute pathname is required; neither the PATH variable nor any other environment variable is passed to the command. The default shell is /bin/sh. NOTE Be careful when defining service run commands. Each run command is executed in the following way: • The cmrunserv command executes the run command.
Configuring Packages and Their Services Choosing Package Modules service_fail_fast_enabled Specifies whether or not Serviceguard will halt the node (reboot) on which the package is running if the service identified by service_name fails. Valid values are yes and no. Default is no, meaning that failure of this service will not cause the node to halt. service_halt_timeout The length of time, in seconds, Serviceguard will wait for the service to halt before forcing termination of the service’s process.
Configuring Packages and Their Services Choosing Package Modules If the package needs to run fsck on a large number of file systems, you can improve performance by carefully tuning this parameter during testing (increase it a little at a time and monitor performance each time). concurrent_mount_and_umount_operations The number of concurrent mounts and umounts to allow during package startup or shutdown. Legal value is any number greater than zero. The default is 1.
Configuring Packages and Their Services Choosing Package Modules fs_name This parameter, in conjunction with fs_directory, fs_type, fs_mount_opt, fs_umount_opt, and fs_fsck_opt, specifies a filesystem that is to be mounted by the package. fs_name must specify the block device file for a logical volume. Replaces LV, which is still supported in the package control script for legacy packages. File systems are mounted in the order you specify in the package configuration file, and unmounted in the reverse order.
Configuring Packages and Their Services Choosing Package Modules See also concurrent_fsck_operations on page 211, fs_mount_retry_count and fs_umount_retry_count on page 212 and fs_fsck_opt on page 214. See the comments in the configuration file for more information. fs_mount_opt The mount options for the file system specified by fs_name. See the comments in the configuration file for more information. This parameter is in the package control script for legacy packages.
Configuring Packages and Their Services Choosing Package Modules external_pre_script The full pathname of an external script to be executed before volume groups and disk groups are activated during package startup, and after they have been deactivated during package shutdown; that is, effectively the first step in package startup and last step in package shutdown. New for modular packages.
Configuring Packages and Their Services Choosing Package Modules user_name Specifies the name of a user who has permission to administer this package. See also user_host and user_role; these three parameters together define the Access Control Policy for this package (see “Access Roles” on page 145). These parameters must be defined in this order: user_name, user_host, user_role.
Configuring Packages and Their Services Generating the Package Configuration File Generating the Package Configuration File When you have chosen the configuration modules your package needs (see “Choosing Package Modules” on page 193), you are ready to generate a package configuration file that contains those modules. This file will consist of a base module (failover, multi-node or system multi-node) plus the modules that contain the additional parameters you have decided to include.
Configuring Packages and Their Services Generating the Package Configuration File • To generate a configuration file that contains all the optional modules: cmmakepkg $SGCONF/pkg1/pkg1.conf • To create a generic failover package (that could be applied without editing): cmmakepkg -n pkg1 -m sg/failover $SGCONF/pkg1/pkg1.
Configuring Packages and Their Services Editing the Configuration File Editing the Configuration File When you have generated the configuration file that contains the modules your package needs (see “Generating the Package Configuration File” on page 217), you need to edit the file to set the package parameters to the values that will make the package function as you intend. It is a good idea to configure complex failover packages in stages, as follows: 1. Configure volume groups and mount points only. 2.
Configuring Packages and Their Services Editing the Configuration File the surrounding comments in the file, and the explanations in this chapter, to make sure you understand the implications both of accepting and of changing a given default. In all cases, be careful to uncomment each parameter you intend to use and assign it the value you want it to have. • package_name. Enter a unique name for this package. Note that there are stricter formal requirements for the name as of A.11.18. • package_type.
Configuring Packages and Their Services Editing the Configuration File • script_log_file. You can specify a place for the run and halt script to place log messages. If you do not specify a path, Serviceguard will create a file with .log appended to each script path, and put the messages in that file. • log_level. See log_level on page 205. • failover_policy. Enter configured_node if you want Serviceguard to attempt to start the package on the first node (as listed under node_name).
Configuring Packages and Their Services Editing the Configuration File • If your package will use relocatable IP addresses, enter the ip_subnet and ip_address addresses. ip_subnet must be a subnet that is already specified in the cluster configuration.
Configuring Packages and Their Services Editing the Configuration File • If your package uses a large number of volume groups or disk groups, or mounts a large number of file systems, consider increasing the values of the following parameters: — concurrent_fsck_operations—specifies the number of parallel fsck operations that will be allowed at package startup (not used for Red Hat GFS).
Configuring Packages and Their Services Editing the Configuration File The only user role you can configure in the package configuration file is package_admin for the package in question. Cluster-wide roles are defined in the cluster configuration file. See “Access Roles” on page 145 for more information.
Configuring Packages and Their Services Verifying and Applying the Package Configuration Verifying and Applying the Package Configuration Serviceguard checks the configuration you enter and reports any errors. Use a command such as the following to verify the content of the package configuration file you have created, for example: cmcheckconf -v -P $SGCONF/pkg1/pkg1.config Errors are displayed on the standard output.
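When the file verifies cleanly, apply it; cmapplyconf adds the package to the binary cluster configuration and distributes it to all cluster nodes. For example:

cmapplyconf -v -P $SGCONF/pkg1/pkg1.config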
Configuring Packages and Their Services Verifying and Applying the Package Configuration But, if you are accustomed to configuring legacy packages, note that you do not have to create a separate package control script for a modular package, or distribute it manually. (You do still have to do this for legacy packages; see “Configuring a Legacy Package” on page 262.
Configuring Packages and Their Services Adding the Package to the Cluster Adding the Package to the Cluster You can add the new package to the cluster while the cluster is running, subject to the value of max_configured_packages in the cluster configuration file. See “Adding a Package to a Running Cluster” on page 274.
Configuring Packages and Their Services Creating a Disk Monitor Configuration Creating a Disk Monitor Configuration Serviceguard provides disk monitoring for the shared storage that is activated by packages in the cluster. The monitor daemon on each node tracks the status of all the disks on that node that you have configured for monitoring.
Cluster and Package Maintenance 7 Cluster and Package Maintenance This chapter describes the cmviewcl command, then shows how to start and halt a cluster or an individual node, how to perform permanent reconfiguration, and how to start, halt, move, and modify packages during routine maintenance of the cluster.
Cluster and Package Maintenance Reviewing Cluster and Package Status Reviewing Cluster and Package Status You can check status using Serviceguard Manager, or from a cluster node’s command line. Reviewing Cluster and Package Status with the cmviewcl Command Information about cluster status is stored in the status database, which is maintained on each individual node in the cluster.
Cluster and Package Maintenance Reviewing Cluster and Package Status Cluster Status The status of a cluster, as shown by cmviewcl, can be one of the following: • up - At least one node has a running cluster daemon, and reconfiguration is not taking place. • down - No cluster daemons are running on any cluster node. • starting - The cluster is in the process of determining its active membership. At least one cluster daemon is running.
Cluster and Package Maintenance Reviewing Cluster and Package Status Package Status and State The status of a package can be one of the following: • up - The package master control script is active. • down - The package master control script is not active. • start_wait - A cmrunpkg command is in progress for this package. The package is waiting for packages it depends on (predecessors) to start before it can start. • starting - The package is starting. The package master control script is running.
Cluster and Package Maintenance Reviewing Cluster and Package Status • start_wait - A cmrunpkg command is in progress for this package. The package is waiting for packages it depends on (predecessors) to start before it can start. • running - Services are active and being monitored. • halting - A cmhaltpkg command is in progress for this package and the halt script is running. • halt_wait - A cmhaltpkg command is in progress for this package.
Cluster and Package Maintenance Reviewing Cluster and Package Status • Switching Enabled for a Node: For failover packages, enabled means that the package can switch to the specified node. disabled means that the package cannot switch to the specified node until the node is enabled to run the package via the cmmodpkg command. Every failover package is marked enabled or disabled for each node that is either a primary or adoptive node for the package.
Cluster and Package Maintenance Reviewing Cluster and Package Status Failover packages can also be configured with one of two values for the failback_policy parameter (see page 205), and these are also displayed in the output of cmviewcl -v: • automatic: Following a failover, a package returns to its primary node when the primary node becomes available again. • manual: Following a failover, a package will run on the adoptive node until moved back to its original node by a system administrator.
Cluster and Package Maintenance Reviewing Cluster and Package Status
  INTERFACE    STATUS    NAME
  PRIMARY      up        eth0
  PRIMARY      up        eth1

    PACKAGE      STATUS      STATE        AUTO_RUN     NODE
    pkg2         up          running      enabled      ftsys10

    Policy_Parameters:
    POLICY_NAME     CONFIGURED_VALUE
    Failover        configured_node
    Failback        manual

    Script_Parameters:
    ITEM       STATUS    MAX_RESTARTS   RESTARTS   NAME
    Service    up        0              0          service2
    Subnet     up        0              0          15.13.168.

    Node_Switching_Parameters:
    NODE_TYPE    STATUS    SWITCHING    NAME
    Primary      up        enabled      ftsys10
    Alternate    up        enabled      ftsys9
Cluster and Package Maintenance Reviewing Cluster and Package Status Status After Halting a Package After we halt pkg2 with the cmhaltpkg command, the output of cmviewcl -v is as follows:

CLUSTER      STATUS
example      up

  NODE         STATUS       STATE
  ftsys9       up           running

  Network_Parameters:
  INTERFACE    STATUS       NAME
  PRIMARY      up           eth0
  PRIMARY      up           eth1

    PACKAGE      STATUS      STATE        AUTO_RUN     NODE
    pkg1         up          running      enabled      ftsys9

    Policy_Parameters:
    POLICY_NAME     CONFIGURED_VALUE
    Failover        configured_node
    Failback        manual

    Script_Parameters:
    ITEM
Cluster and Package Maintenance Reviewing Cluster and Package Status
    Node_Switching_Parameters:
    NODE_TYPE    STATUS    SWITCHING    NAME
    Primary      up        enabled      ftsys10
    Alternate    up        enabled      ftsys9

pkg2 now has the status down, and it is shown as unowned, with package switching disabled. Note that switching is enabled for both nodes, however. This means that once global switching is re-enabled for the package, it will attempt to start up on the primary node.
Cluster and Package Maintenance Reviewing Cluster and Package Status
    Script_Parameters:
    ITEM       STATUS    MAX_RESTARTS   RESTARTS   NAME
    Service    up        0              0          service2
    Subnet     up                                  15.13.168.

    Node_Switching_Parameters:
    NODE_TYPE    STATUS    SWITCHING
    Primary      up        enabled
    Alternate    up        enabled

  NODE         STATUS
  ftsys10      up
Cluster and Package Maintenance Reviewing Cluster and Package Status Status After Halting a Node After halting ftsys10, with the following command:

cmhaltnode ftsys10

the output of cmviewcl is as follows on ftsys9:

CLUSTER      STATUS
example      up

  NODE         STATUS       STATE
  ftsys9       up           running

    PACKAGE      STATUS      STATE        AUTO_RUN     NODE
    pkg1         up          running      enabled      ftsys9
    pkg2         up          running      enabled      ftsys9

  NODE         STATUS       STATE
  ftsys10      down         halted

This output can be seen on both ftsys9 and ftsys10.
Cluster and Package Maintenance Reviewing Cluster and Package Status Viewing Information about Unowned Packages The following example shows packages that are currently unowned, that is, not running on any configured node.
Cluster and Package Maintenance Managing the Cluster and Nodes Managing the Cluster and Nodes This section describes the following tasks: • “Starting the Cluster When all Nodes are Down” on page 242 • “Adding Previously Configured Nodes to a Running Cluster” on page 244 • “Removing Nodes from Participation in a Running Cluster” on page 245 • “Halting the Entire Cluster” on page 246 • “Automatically Restarting the Cluster” on page 246 Starting the cluster means running the cluster daemon on one or
Cluster and Package Maintenance Managing the Cluster and Nodes The -v option produces the most informative output. The following starts all nodes configured in the cluster without a connectivity check: cmruncl -v The -w option causes cmruncl to perform a full check of LAN connectivity among all the nodes of the cluster. Omitting this option will allow the cluster to start more quickly but will not test connectivity.
Cluster and Package Maintenance Managing the Cluster and Nodes Adding Previously Configured Nodes to a Running Cluster You can use Serviceguard Manager, or HP Serviceguard commands as shown, to bring a configured node up within a running cluster. Use the cmrunnode command to add one or more nodes to an already running cluster. Any node you add must already be a part of the cluster configuration.
Cluster and Package Maintenance Managing the Cluster and Nodes Removing Nodes from Participation in a Running Cluster You can use Serviceguard Manager, or Serviceguard commands as shown below, to remove nodes from operation in a cluster. This operation removes the node from cluster operation by halting the cluster daemon, but it does not modify the cluster configuration. To remove a node from the cluster configuration permanently, you must recreate the cluster configuration file. See the next section.
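For example, to halt the cluster daemon on node ftsys9, using -f so that any packages it is running are halted and can switch to an adoptive node first (the node name is illustrative):

cmhaltnode -f -v ftsys9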
Cluster and Package Maintenance Managing the Cluster and Nodes Halting the Entire Cluster You can use Serviceguard Manager, or Serviceguard commands as shown below, to halt a running cluster. The cmhaltcl command can be used to halt the entire cluster. This command causes all nodes in a configured cluster to halt their HP Serviceguard daemons. You can use the -f option to force the cluster to halt even when packages are running. This command can be issued from any running node.
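For example, to halt every node in the cluster even if packages are currently running:

cmhaltcl -f -v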
Cluster and Package Maintenance Managing Packages and Services Managing Packages and Services This section describes the following tasks: • “Starting a Package” on page 247 • “Halting a Package” on page 248 • “Moving a Failover Package” on page 249 • “Changing Package Switching Behavior” on page 249 Starting a Package Ordinarily, a package configured as part of the cluster will start up on its primary node when the cluster starts up.
Cluster and Package Maintenance Managing Packages and Services You cannot start a package unless all the packages that it depends on are running. If you try, you’ll see a Serviceguard message telling you why the operation failed, and the package will not start. If this happens, you can repeat the run command, this time including the package(s) this package depends on; Serviceguard will start all the packages in the correct order.
Cluster and Package Maintenance Managing Packages and Services Moving a Failover Package You can use Serviceguard Manager to move a failover package from one node to another, or Serviceguard commands as shown below. Before you move a failover package to a new node, it is a good idea to run cmviewcl -v -l package and look at dependencies. If the package has dependencies, be sure they can be met on the new node. To move the package, first halt it where it is running using the cmhaltpkg command.
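A typical sequence looks like this; the package and node names are illustrative:

cmhaltpkg pkg1                # halt the package where it is currently running
cmrunpkg -n ftsys10 pkg1      # start it on the new node
cmmodpkg -e pkg1              # re-enable global switching for the package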
Cluster and Package Maintenance Managing Packages and Services To temporarily disable switching to other nodes for a running package, use the cmmodpkg command. For example, if pkg1 is currently running, and you want to prevent it from starting up on another node, enter the following: cmmodpkg -d pkg1 This does not halt the package, but will prevent it from starting up elsewhere. You can disable package switching to particular nodes by using the -n option of the cmmodpkg command.
Cluster and Package Maintenance Reconfiguring a Cluster Reconfiguring a Cluster You can reconfigure a cluster either when it is halted or while it is still running. Some operations can only be done when the cluster is halted. Table 7-1 shows the required cluster state for many kinds of changes. Table 7-1 Types of Changes to the Cluster Configuration Change to the Cluster Configuration Required Cluster State Add a new node All cluster nodes must be running.
Cluster and Package Maintenance Reconfiguring a Cluster Table 7-1 Types of Changes to the Cluster Configuration (Continued) Change to the Cluster Configuration Required Cluster State Reconfigure IP addresses for a NIC used by the cluster Must delete the interface from the cluster configuration, reconfigure it, then add it back into the cluster configuration. See “What You Must Keep in Mind” on page 256. Cluster can be running throughout Change NETWORK_POLLING_INTERVAL Cluster can be running.
Cluster and Package Maintenance Reconfiguring a Cluster Updating the Cluster Lock LUN Configuration Offline The cluster must be halted before you change the lock LUN configuration. Proceed as follows: Step 1. Halt the cluster. Step 2. In the cluster configuration file, modify the values of CLUSTER_LOCK_LUN for each node. Step 3. Run cmcheckconf to check the configuration. Step 4. Run cmapplyconf to apply the configuration. If you need to replace the physical device, see “Replacing a Lock LUN” on page 288.
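A sketch of the whole sequence, assuming the cluster configuration file is $SGCONF/clust1.config:

cmhaltcl -f                               # halt the cluster
# edit the CLUSTER_LOCK_LUN entries in $SGCONF/clust1.config
cmcheckconf -C $SGCONF/clust1.config      # verify the change
cmapplyconf -C $SGCONF/clust1.config      # apply and distribute the new configuration
cmruncl                                   # restart the cluster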
Cluster and Package Maintenance Reconfiguring a Cluster 1. Use the following command to store a current copy of the existing cluster configuration in a temporary file in case you need to revert to it: cmgetconf -C temp.ascii 2. Specify a new set of nodes to be configured and generate a template of the new configuration: cmquerycl -C clconfig.ascii -c cluster1 \ -n ftsys8 -n ftsys9 -n ftsys10 3. Edit clconfig.ascii to check the information about the new node. 4.
Cluster and Package Maintenance Reconfiguring a Cluster NOTE If you want to remove a node from the cluster, run the cmapplyconf command from another node in the same cluster. If you try to issue the command on the node you want removed, you will get an error message. Step 1. Use the following command to store a current copy of the existing cluster configuration in a temporary file: cmgetconf -c cluster1 temp.ascii Step 2.
Cluster and Package Maintenance Reconfiguring a Cluster Changing the Cluster Networking Configuration while the Cluster Is Running What You Can Do Online operations you can perform include: • Add a network interface and its HEARTBEAT_IP or STATIONARY_IP. • Delete a network interface and its HEARTBEAT_IP or STATIONARY_IP. • Change the designation of an existing interface from HEARTBEAT_IP to STATIONARY_IP, or vice versa. • Change the NETWORK_POLLING_INTERVAL.
Cluster and Package Maintenance Reconfiguring a Cluster • You cannot delete a subnet or IP address from a node while a package that uses it (as a monitored_subnet, ip_subnet, or ip_address) is configured to run on that node. See page 208 for more information about the package networking parameters. • You cannot change the IP configuration of an interface used by the cluster in a single transaction (cmapplyconf).
Cluster and Package Maintenance Reconfiguring a Cluster
NODE_NAME ftsys9
NETWORK_INTERFACE lan1
HEARTBEAT_IP 192.3.17.18
#NETWORK_INTERFACE lan0
#STATIONARY_IP 15.13.170.18
NETWORK_INTERFACE lan3

NODE_NAME ftsys10
NETWORK_INTERFACE lan1
HEARTBEAT_IP 192.3.17.19
#NETWORK_INTERFACE lan0
# STATIONARY_IP 15.13.170.19
NETWORK_INTERFACE lan3

Step 2.
Cluster and Package Maintenance Reconfiguring a Cluster
NODE_NAME ftsys9
NETWORK_INTERFACE lan1
HEARTBEAT_IP 192.3.17.18
NETWORK_INTERFACE lan0
HEARTBEAT_IP 15.13.170.18
NETWORK_INTERFACE lan3

NODE_NAME ftsys10
NETWORK_INTERFACE lan1
HEARTBEAT_IP 192.3.17.19
NETWORK_INTERFACE lan0
HEARTBEAT_IP 15.13.170.19
NETWORK_INTERFACE lan3

Step 3. Verify the new configuration:

cmcheckconf -C clconfig.ascii

Step 4.
Cluster and Package Maintenance Reconfiguring a Cluster Example: Deleting a Subnet Used by a Package In this example, we are deleting subnet 15.13.170.0 (lan0). Proceed as follows. Step 1. Halt any package that uses this subnet and delete the corresponding networking information (monitored_subnet, ip_subnet, ip_address; see page 208). See “Reconfiguring a Package on a Running Cluster” on page 273 for more information. Step 2.
Cluster and Package Maintenance Reconfiguring a Cluster
NODE_NAME ftsys9
NETWORK_INTERFACE lan1
HEARTBEAT_IP 192.3.17.18
# NETWORK_INTERFACE lan0
# STATIONARY_IP 15.13.170.18
# NETWORK_INTERFACE lan3

NODE_NAME ftsys10
NETWORK_INTERFACE lan1
HEARTBEAT_IP 192.3.17.19
# NETWORK_INTERFACE lan0
# STATIONARY_IP 15.13.170.19
# NETWORK_INTERFACE lan3

Step 4. Verify the new configuration:

cmcheckconf -C clconfig.ascii

Step 5.
Cluster and Package Maintenance Configuring a Legacy Package Configuring a Legacy Package IMPORTANT You can still create a new legacy package. If you are using a Serviceguard Toolkit such as Serviceguard NFS Toolkit, consult the documentation for that product. Otherwise, use this section to maintain and re-work existing legacy packages rather than to create new ones.
Cluster and Package Maintenance Configuring a Legacy Package Using Serviceguard Manager to Configure a Package You can create a legacy package and its control script in Serviceguard Manager; use the Help for detailed instructions. Using Serviceguard Commands to Configure a Package Use the following procedure to create a legacy package. Step 1. Create a subdirectory for each package you are configuring in the $SGCONF directory: mkdir $SGCONF/pkg1 You can use any directory names you like.
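The following steps, shown only as a sketch (see the cmmakepkg (1m) manpage for the exact options on your release), generate the legacy configuration file and control script templates in that directory:

cmmakepkg -p $SGCONF/pkg1/pkg1.config     # generate the package configuration file template
cmmakepkg -s $SGCONF/pkg1/pkg1.sh         # generate the package control script template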
Cluster and Package Maintenance Configuring a Legacy Package Configuring a Package in Stages It is a good idea to configure failover packages in stages, as follows: 1. Configure volume groups and mount points only. 2. Distribute the control script to all nodes. 3. Apply the configuration. 4. Run the package and ensure that it can be moved from node to node. 5. Halt the package. 6. Configure package IP addresses and application services in the control script. 7. Distribute the control script to all nodes. 8.
Cluster and Package Maintenance Configuring a Legacy Package • FAILBACK_POLICY. For failover packages, enter the failback_policy (see page 205). • NODE_NAME. Enter the node or nodes on which the package can run; as described under node_name (see page 202). • AUTO_RUN. Configure the package to start up automatically or manually; as described under auto_run (see page 202). • NODE_FAIL_FAST_ENABLED. Enter the policy as described under node_fail_fast_enabled (see page 203).
Cluster and Package Maintenance Configuring a Legacy Package • ACCESS_CONTROL_POLICY. You can grant a non-root user PACKAGE_ADMIN privileges for this package. See the entries for user_name, user_host, and user_role on page 216, and “Access Roles” on page 145, for more information. • If the package will depend on another package, enter values for DEPENDENCY_NAME, DEPENDENCY_CONDITION, and DEPENDENCY_LOCATION.
Cluster and Package Maintenance Configuring a Legacy Package Customizing the Package Control Script Check the definitions and declarations at the beginning of the control script using the information in the Package Configuration worksheet. You need to customize as follows; see the relevant entries under “Package Parameter Explanations” on page 201 for more discussion. • Update the PATH statement to reflect any required paths needed to start your services.
Cluster and Package Maintenance Configuring a Legacy Package Excerpt from Legacy Package Control Script: Remote Data Replication, Software RAID Data Replication and MD RAID Sections # REMOTE DATA REPLICATION DEFINITION # Specify the remote data replication method. # Leave the default, DATA_REP="none", if remote data replication is not used. # # If remote data replication is used for the package application data, set # the variable DATA_REP to the data replication method.
Cluster and Package Maintenance Configuring a Legacy Package # replaced with the appropriate multipath device name. # # For example: # RAIDTAB="/usr/local/cmcluster/conf/raidtab.sg" # #RAIDTAB="" # MD (RAID) COMMANDS # Specify the method of activation and deactivation for md. # Leave the default (RAIDSTART="raidstart", "RAIDSTOP="raidstop") if you want # md to be started and stopped with default methods.
Cluster and Package Maintenance Configuring a Legacy Package
# You should define all actions you want to happen here, before the service is
# halted.

function customer_defined_halt_cmds
{
# ADD customer defined halt commands.
: # do nothing instruction, because a function must contain some command.
date >> /tmp/pkg1.datelog
echo 'Halting pkg1' >> /tmp/pkg1.
Cluster and Package Maintenance Configuring a Legacy Package Verifying the Package Configuration Serviceguard checks the configuration you create and reports any errors. For legacy packages, you can do this in Serviceguard Manager: click Check to verify the package configuration you have entered under any package configuration tab, or to check changes you have made to the control script. Click Apply to verify the package as a whole. See the local Help for more details.
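From the command line, you can run the same check with cmcheckconf; the file name below assumes the package configuration file created earlier in this procedure:

cmcheckconf -v -P $SGCONF/pkg1/pkg1conf.ascii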
Cluster and Package Maintenance Configuring a Legacy Package Distributing the Configuration And Control Script with Serviceguard Manager When you have finished creating a package in Serviceguard Manager, click Apply Configuration. If the package configuration has no errors, it is converted to a binary file and distributed to the cluster nodes.
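If you are working from the command line instead, distribute the control script yourself and then apply the configuration. The node name and file names below are examples; use the nodes and paths of your own cluster:

# Copy the control script to the same directory on every node that can run the package
scp $SGCONF/pkg1/pkg1.cntl ftsys10:$SGCONF/pkg1/pkg1.cntl

# Apply the package configuration; cmapplyconf converts it to a binary file
# and distributes it to all cluster nodes
cmapplyconf -P $SGCONF/pkg1/pkg1conf.ascii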
Cluster and Package Maintenance Reconfiguring a Package Reconfiguring a Package You reconfigure a package in much the same way as you originally configured it; for modular packages, see Chapter 6, “Configuring Packages and Their Services,” on page 191; for older packages, see “Configuring a Legacy Package” on page 262.
Cluster and Package Maintenance Reconfiguring a Package 3. Edit the package configuration file. IMPORTANT Restrictions on package names, dependency names, and service names have become more stringent as of A.11.18. Packages that have or contain names that do not conform to the new rules (spelled out under package_name on page 201) will continue to run, but if you reconfigure these packages, you will need to change the names that do not conform; cmcheckconf and cmapplyconf will enforce the new rules. 4.
Cluster and Package Maintenance Reconfiguring a Package cmapplyconf -P $SGCONF/pkg1/pkg1conf.ascii If this is a legacy package, remember to copy the control script to the $SGCONF/pkg1 directory on all nodes that can run the package. Deleting a Package from a Running Cluster Serviceguard will not allow you to delete a package if any other package is dependent on it. To check for dependencies, use cmviewcl -v -l package. System multi-node packages cannot be deleted from a running cluster.
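For example, to remove a package that is no longer needed, halt it and then delete its configuration. This sketch assumes the -p option names the package to delete; check the cmdeleteconf (1m) manpage for the exact syntax on your release:

cmhaltpkg pkg1
cmdeleteconf -p pkg1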
Cluster and Package Maintenance Reconfiguring a Package Allowable Package States During Reconfiguration In many cases, you can make changes to a package’s configuration while the package is running. Table 7-2 shows exceptions - cases in which the package must not be running, or in which the results might not be what you expect. Parameters not listed in the table can be changed while the package is running.
Cluster and Package Maintenance Reconfiguring a Package
Table 7-2 Types of Changes to Packages (Continued)

Change to the Package                          Required Package State
Change halt script contents (legacy package)   Package should not be running. Timing
                                               problems may occur if the script is
                                               changed while the package is running.
Add a service                                  Package must not be running.
Remove or change a service                     Package must not be running.
Add a subnet                                   Package must not be running.
Cluster and Package Maintenance Reconfiguring a Package
Table 7-2 Types of Changes to Packages (Continued)

Change to the Package                          Required Package State
Service failfast                               Package must not be running.
Package AutoRun                                Package can be either running or halted.
Add or delete a configured dependency          Both packages can be either running or
                                               halted, with one exception: if a running
                                               package adds a package dependency, the
                                               package it is to depend on must already
                                               be running on the same node(s).
Cluster and Package Maintenance Responding to Cluster Events Responding to Cluster Events Serviceguard does not require much ongoing system administration intervention. As long as there are no failures, your cluster will be monitored and protected. In the event of a failure, those packages that you have designated to be transferred to another node will be transferred automatically.
Cluster and Package Maintenance Single-Node Operation Single-Node Operation In a multi-node cluster, you could have a situation in which all but one node has failed, or you have shut down all but one node, leaving your cluster in single-node operation. This remaining node will probably have applications running on it. As long as the Serviceguard daemon cmcld is active, other nodes can rejoin the cluster.
Cluster and Package Maintenance Removing Serviceguard from a System Removing Serviceguard from a System If you want to remove Serviceguard from a node permanently, use the rpm -e command to delete the software. CAUTION Remove the node from the cluster first. If you run the rpm -e command on a server that is still a member of a cluster, it will cause that cluster to halt and the cluster configuration to be deleted. To remove Serviceguard: 1. If the node is an active member of a cluster, halt the node first. 2.
Troubleshooting Your Cluster 8 Troubleshooting Your Cluster This chapter describes how to verify cluster operation, how to review cluster status, how to add and replace hardware, and how to solve some typical cluster problems.
Troubleshooting Your Cluster Testing Cluster Operation Testing Cluster Operation Once you have configured your Serviceguard cluster, you should verify that the various components of the cluster behave correctly in case of a failure. In this section, the following procedures test that the cluster responds properly in the event of a package failure, a node failure, or a LAN failure.
Troubleshooting Your Cluster Testing Cluster Operation Depending on the specific databases you are running, perform the appropriate database recovery. Testing the Cluster Manager To test that the cluster manager is operating correctly, perform the following steps for each node on the cluster: 1. Turn off the power to the node. 2.
Troubleshooting Your Cluster Testing Cluster Operation Testing the Network Manager To test that the Network Manager is operating correctly, do the following for each node in the cluster: 1. Identify the LAN cards on the node: ifconfig and then cmviewcl -v 2. Detach the LAN connection from one card. 3. Use cmviewcl to verify that the network is still functioning through the other cards: cmviewcl -v 4.
Troubleshooting Your Cluster Monitoring Hardware Monitoring Hardware Good standard practice in handling a high availability system includes careful fault monitoring so as to prevent failures if possible or at least to react to them swiftly when they occur. For information about disk monitoring, see “Creating a Disk Monitor Configuration” on page 228.
Troubleshooting Your Cluster Replacing Disks Replacing Disks The procedure for replacing a faulty disk mechanism depends on the type of disk configuration you are using. Refer to your Smart Array documentation for issues related to your Smart Array. Replacing a Faulty Mechanism in a Disk Array You can replace a failed disk mechanism by simply removing it from the array and replacing it with a new mechanism of the same type. The resynchronization is handled by the array itself.
Troubleshooting Your Cluster Replacing Disks CAUTION You are responsible for determining that the device is not being used by LVM or any other subsystem on any node connected to the device before using cmdisklock. If you use cmdisklock without taking this precaution, you could lose data. NOTE cmdisklock is needed only when you are repairing or replacing a lock LUN; see the cmdisklock (1m) manpage for more information. Serviceguard checks the lock disk every 75 seconds.
Troubleshooting Your Cluster Replacing LAN Cards Replacing LAN Cards If you need to replace a LAN card, use the following steps. It is not necessary to bring the cluster down to do this. Step 1. Halt the node using the cmhaltnode command. Step 2. Shut down the system: shutdown -h Then power off the system. Step 3. Remove the defective LAN card. Step 4. Install the new LAN card. The new card must be exactly the same card type, and it must be installed in the same slot as the card you removed. Step 5.
Troubleshooting Your Cluster Replacing LAN Cards 1. Use the cmgetconf command to obtain a fresh ASCII configuration file, as follows: cmgetconf config.ascii 2. Use the cmapplyconf command to apply the configuration and copy the new binary file to all cluster nodes: cmapplyconf -C config.ascii This procedure updates the binary file with the new MAC address and thus avoids data inconsistency between the outputs of the cmviewconf and ifconfig commands.
Troubleshooting Your Cluster Replacing a Failed Quorum Server System Replacing a Failed Quorum Server System When a quorum server fails or becomes unavailable to the clusters it is providing quorum services for, this will not cause a failure on any cluster. However, the loss of the quorum server does increase the vulnerability of the clusters in case there is an additional failure. Use the following procedure to replace a defective quorum server system.
Troubleshooting Your Cluster Replacing a Failed Quorum Server System • Create a package in another cluster for the Quorum Server, as described in the Release Notes for your version of Quorum Server. They can be found at http://docs.hp.com ->High Availability ->Quorum Server. 5. All nodes in all clusters that were using the old quorum server will connect to the new quorum server.
Troubleshooting Your Cluster Replacing a Failed Quorum Server System CAUTION Make sure that the old system does not re-join the network with the old IP address.
Troubleshooting Your Cluster Troubleshooting Approaches Troubleshooting Approaches The following sections offer a few suggestions for troubleshooting by reviewing the state of the running system and by examining cluster status data, log files, and configuration files.
Troubleshooting Your Cluster Troubleshooting Approaches
Reviewing Package IP Addresses
The ifconfig command can be used to examine the LAN configuration. The command, if executed on ftsys9 after the halting of node ftsys10, shows that the package IP addresses are assigned to eth1:1 and eth1:2 along with the heartbeat IP address on eth1.

eth0      Link encap:Ethernet  HWaddr 00:01:02:77:82:75
          inet addr:15.13.169.106  Bcast:15.13.175.255  Mask:255.255.248.0
Troubleshooting Your Cluster Troubleshooting Approaches Reviewing the System Log File Messages from the Cluster Manager and Package Manager are written to the system log file. The default location of the log file may vary according to Linux distribution; the Red Hat default is /var/log/messages. You can use a text editor, such as vi, or the more command to view the log file for historical information on your cluster.
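For example, to pull the most recent Serviceguard entries out of the Red Hat default log file (adjust the path for your distribution), you might use:

grep -E 'cmcld|cmclconfd' /var/log/messages | tail -50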
Troubleshooting Your Cluster Troubleshooting Approaches
Sample System Log Entries
The following sample entries from the syslog file show a package that failed to run because of a problem in the pkg5_run script. You would look at the pkg5_run.log for details.

Dec 14 14:33:48 star04 cmcld[2048]: Starting cluster management protocols.
Troubleshooting Your Cluster Troubleshooting Approaches Reviewing Object Manager Log Files The Serviceguard Object Manager daemon cmomd logs messages to the file /usr/local/cmom/cmomd.log on Red Hat and /var/log/cmmomcmomd.log on SUSE. You can review these messages using the cmreadlog command, for example: /usr/local/cmom/bin/cmreadlog /usr/local/cmom/log/cmomd.log Messages from cmomd include information about the processes that request data from the Object Manager, including type of data, timestamp, etc.
Troubleshooting Your Cluster Troubleshooting Approaches
Using the cmquerycl and cmcheckconf Commands
In addition, cmquerycl and cmcheckconf can be used to troubleshoot your cluster just as they were used to verify its configuration. The following example shows the commands used to verify the existing cluster configuration on ftsys9 and ftsys10:

cmquerycl -v -C $SGCONF/verify.ascii -n ftsys9 -n ftsys10
cmcheckconf -v -C $SGCONF/verify.ascii
Troubleshooting Your Cluster Solving Problems Solving Problems Problems with Serviceguard may be of several types. The following is a list of common categories of problem: • Serviceguard Command Hangs. • Cluster Re-formations. • System Administration Errors. • Package Control Script Hangs. • Package Movement Errors. • Node and Network Failures. • Quorum Server Messages.
Troubleshooting Your Cluster Solving Problems Cluster Re-formations Cluster re-formations may occur from time to time due to current cluster conditions. Some of the causes are as follows: • local switch on an Ethernet LAN if the switch takes longer than the cluster NODE_TIMEOUT value. To prevent this problem, you can increase the cluster NODE_TIMEOUT value. • excessive network traffic on heartbeat LANs. To prevent this, you can use dedicated heartbeat LANs, or LANs with less traffic on them.
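For example, to lengthen the node timeout, dump the current configuration, edit it, and re-apply it. The value below is illustrative, and on this release the cluster timing parameters are expressed in microseconds; check the comments in your own configuration template, and check whether your release requires the cluster to be halted for this change:

cmgetconf config.ascii
# In config.ascii, increase the timeout, for example:
#   NODE_TIMEOUT    4000000      # 4 seconds, expressed in microseconds
cmcheckconf -v -C config.ascii
cmapplyconf -C config.ascii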
Troubleshooting Your Cluster Solving Problems • fdisk -v /dev/sdx - to display information about a disk. Package Control Script Hangs or Failures When a RUN_SCRIPT_TIMEOUT or HALT_SCRIPT_TIMEOUT value is set, and the control script hangs, causing the timeout to be exceeded, Serviceguard kills the script and marks the package “Halted.” Similarly, when a package control script fails, Serviceguard kills the script and marks the package “Halted.
Troubleshooting Your Cluster Solving Problems
where the IP address is the address indicated above and the subnet is the result of masking the IP address with the mask found in the same line as the inet address in the ifconfig output.
3. Ensure that package volume groups are deactivated. First unmount any package logical volumes which are being used for file systems. This is determined by inspecting the output resulting from running the command df -l.
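For example, assuming df -l shows a package logical volume from volume group vg_pkg1 mounted at /mnt/pkg1 (substitute the names from your own output):

umount /mnt/pkg1
vgchange -a n vg_pkg1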
Troubleshooting Your Cluster Solving Problems Package Movement Errors These errors are similar to the system administration errors except they are caused specifically by errors in the package control script. The best way to prevent these errors is to test your package control script before putting your high availability application on line. Adding a “set -x” statement in the second line of your control script will give you details on where your script may be failing.
Troubleshooting Your Cluster Solving Problems Quorum Server Messages The coordinator node in Serviceguard sometimes sends a request to the quorum server to set the lock state. (This is different from a request to obtain the lock in tie-breaking.
Serviceguard Commands A Serviceguard Commands The following is an alphabetical list of commands used for ServiceGuard cluster configuration and maintenance. Man pages for these commands are available on your system after installation. Table A-1 Serviceguard Commands Command cmapplyconf Description Verify and apply ServiceGuard cluster configuration and package configuration files.
Serviceguard Commands Table A-1 Serviceguard Commands (Continued) Command cmapplyconf (continued) Description Run cmgetconf to get either the cluster or package configuration file whenever changes to the existing configuration are required. Note that cmapplyconf will verify and distribute cluster configuration or package files. It will not cause the cluster daemon to start, nor does it remove anything from the cluster configuration.
Serviceguard Commands Table A-1 Serviceguard Commands (Continued) Command cmdeleteconf Description Delete either the cluster or the package configuration. cmdeleteconf deletes either the entire cluster configuration, including all its packages, or only the specified package configuration. If neither cluster_name nor package_name is specified, cmdeleteconf will delete the local cluster’s configuration and all its packages.
Serviceguard Commands Table A-1 Serviceguard Commands (Continued) Command cmhaltcl Description Halt a high availability cluster. cmhaltcl causes all nodes in a configured cluster to stop their cluster daemons, optionally halting all packages or applications in the process. This command will halt all the daemons on all currently running systems. If the user only wants to shutdown a subset of daemons, the cmhaltnode command should be used instead. cmhaltnode Halt a node in a high availability cluster.
Serviceguard Commands Table A-1 Serviceguard Commands (Continued) Command cmhaltserv Description Halt a service from the high availability package halt script. This is not a command-line executable command; it runs only from within the package control script. cmhaltserv is used in the high availability package halt script to halt a service. If any part of the package is marked down, the package halt script is executed as part of the recovery process.
Serviceguard Commands Table A-1 Serviceguard Commands (Continued) Command cmmodnet Description Add or remove an address from a high availability cluster. cmmodnet is used to add or remove a relocatable package IP_address for the current network interface running the given subnet_name. cmmodnet can also be used to enable or disable a LAN_name currently configured in a cluster.
Serviceguard Commands Table A-1 Serviceguard Commands (Continued) Command cmquerycl Description Query cluster or node configuration information. cmquerycl searches all specified nodes for cluster configuration, cluster lock, and Logical Volume Manager (LVM) information. Cluster configuration information includes network information such as LAN interface, IP addresses, bridged networks and possible heartbeat networks.
Serviceguard Commands Table A-1 Serviceguard Commands (Continued) Command Description cmresserviced Request monitoring of a device. This command is used in the SERVICE_CMD parameter of the package control script to define package dependencies on monitored disks. cmruncl Run a high availability cluster. cmruncl causes all nodes in a configured cluster or all nodes specified to start their cluster daemons and form a new cluster.
Serviceguard Commands Table A-1 Serviceguard Commands (Continued) Command cmrunserv Description Run a service from the high availability package run script. This is not a command line executable command, it runs only from within the package control script. cmrunserv is used in the high availability package run script to run a service. If the service process dies, cmrunserv updates the status of the service to down.
Serviceguard Commands Table A-1 Serviceguard Commands (Continued) Command cmscancl Description Gather system configuration information from nodes with ServiceGuard installed. cmscancl is a configuration report and diagnostic tool which gathers system software and hardware configuration information from a list of nodes, or from all the nodes in a cluster.
Serviceguard Commands Table A-1 Serviceguard Commands (Continued) Command cmviewconf Description View Serviceguard cluster configuration information. cmviewconf collects and displays the cluster configuration information, in ASCII format, from the binary configuration file for an existing cluster. Optionally, the output can be written to a file. This command can be used as a troubleshooting tool to identify the configuration of a cluster.
Designing Highly Available Cluster Applications B Designing Highly Available Cluster Applications This appendix describes how to create or port applications for high availability, with emphasis on the following topics: • Automating Application Operation • Controlling the Speed of Application Failover • Designing Applications to Run on Multiple Systems • Restoring Client Connections • Handling Application Failures • Minimizing Planned Downtime Designing for high availability means reducing the
Designing Highly Available Cluster Applications Automating Application Operation Automating Application Operation Can the application be started and stopped automatically or does it require operator intervention? This section describes how to automate application operations to avoid the need for user intervention. One of the first rules of high availability is to avoid manual intervention.
Designing Highly Available Cluster Applications Automating Application Operation Define Application Startup and Shutdown Applications must be restartable without manual intervention. If the application requires a switch to be flipped on a piece of hardware, then automated restart is impossible. Procedures for application startup, shutdown and monitoring must be created so that the HA software can perform these functions automatically.
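As a sketch, the application is usually wrapped in a small non-interactive script that the HA software can call for start, stop, and status. The command, configuration file, and PID file names here are placeholders, not part of Serviceguard itself:

#!/bin/sh
# Non-interactive application wrapper: no prompts, no operator input
case "$1" in
start)
    /opt/myapp/bin/myapp_server --config /etc/myapp.conf &
    echo $! > /var/run/myapp.pid
    ;;
stop)
    kill "$(cat /var/run/myapp.pid)" 2>/dev/null
    rm -f /var/run/myapp.pid
    ;;
status)
    kill -0 "$(cat /var/run/myapp.pid)" 2>/dev/null    # exit 0 only if still running
    ;;
esac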
Designing Highly Available Cluster Applications Controlling the Speed of Application Failover Controlling the Speed of Application Failover What steps can be taken to ensure the fastest failover? If a failure does occur causing the application to be moved (failed over) to another node, there are many things the application can do to reduce the amount of time it takes to get the application back up and running.
Designing Highly Available Cluster Applications Controlling the Speed of Application Failover Evaluate the Use of a Journaled Filesystem (JFS) If a file system must be used, a JFS offers significantly faster file system recovery than an HFS. However, performance of the JFS may vary with the application. An example of an appropriate JFS is the Reiser FS or ext3. Minimize Data Loss Minimize the amount of data that might be lost at the time of an unplanned outage.
Designing Highly Available Cluster Applications Controlling the Speed of Application Failover Keep Logs Small Some databases permit logs to be buffered in memory to increase online performance. Of course, when a failure occurs, any in-flight transaction will be lost. However, minimizing the size of this in-memory log will reduce the amount of completed transaction data that would be lost in case of failure.
Designing Highly Available Cluster Applications Controlling the Speed of Application Failover Another example is an application where a clerk is entering data about a new employee. Suppose this application requires that employee numbers be unique, and that after the name and number of the new employee is entered, a failure occurs.
Designing Highly Available Cluster Applications Controlling the Speed of Application Failover Design for Multiple Servers If you use multiple active servers, multiple service points can provide relatively transparent service to a client. However, this capability requires that the client be smart enough to have knowledge about the multiple servers and the priority for addressing them. It also requires access to the data of the failed server or replicated data.
Designing Highly Available Cluster Applications Designing Applications to Run on Multiple Systems Designing Applications to Run on Multiple Systems If an application can be failed to a backup node, how will it work on that different system? The previous sections discussed methods to ensure that an application can be automatically restarted. This section will discuss some ways to ensure the application can run on multiple systems.
Designing Highly Available Cluster Applications Designing Applications to Run on Multiple Systems Each application or package should be given a unique name as well as a relocatable IP address. Following this rule separates the application from the system on which it runs, thus removing the need for user knowledge of which system the application runs on. It also makes it easier to move the application among different systems in a cluster for load balancing or other reasons.
Designing Highly Available Cluster Applications Designing Applications to Run on Multiple Systems Avoid Using SPU IDs or MAC Addresses Design the application so that it does not rely on the SPU ID or MAC (link-level) addresses. The SPU ID is a unique hardware ID contained in non-volatile memory, which cannot be changed. A MAC address (also known as a NIC id) is a link-specific address associated with the LAN hardware.
Designing Highly Available Cluster Applications Designing Applications to Run on Multiple Systems Assign Unique Names to Applications A unique name should be assigned to each application. This name should then be configured in DNS so that the name can be used as input to gethostbyname(3), as described in the following discussion. Use DNS DNS provides an API which can be used to map hostnames to IP addresses and vice versa.
Designing Highly Available Cluster Applications Designing Applications to Run on Multiple Systems Use uname(2) With Care Related to the hostname issue discussed in the previous section is the application's use of uname(2), which returns the official system name. The system name is unique to a given system whatever the number of LAN cards in the system. By convention, the uname and hostname are the same, but they do not have to be.
Designing Highly Available Cluster Applications Designing Applications to Run on Multiple Systems Bind to a Fixed Port When binding a socket, a port address can be specified or one can be assigned dynamically. One issue with binding to random ports is that a different port may be assigned if the application is later restarted on another cluster node. This may be confusing to clients accessing the application.
Designing Highly Available Cluster Applications Designing Applications to Run on Multiple Systems With UDP datagram sockets, however, there is a problem. The client may connect to multiple servers utilizing the relocatable IP address and sort out the replies based on the source IP address in the server’s response message. However, the source IP address given in this response will be the stationary IP address rather than the relocatable application IP address.
Designing Highly Available Cluster Applications Designing Applications to Run on Multiple Systems Give Each Application its Own Volume Group Use separate volume groups for each application that uses data. If the application doesn't use disk, it is not necessary to assign it a separate volume group. A volume group (group of disks) is the unit of storage that can move between nodes. The greatest flexibility for load balancing exists when each application is confined to its own volume group, i.e.
Designing Highly Available Cluster Applications Designing Applications to Run on Multiple Systems Avoid File Locking In an NFS environment, applications should avoid using file-locking mechanisms, where the file to be locked is on an NFS Server. File locking should be avoided in an application both on local and remote systems. If local file locking is employed and the system fails, the system acting as the backup system will not have any knowledge of the locks maintained by the failed system.
Designing Highly Available Cluster Applications Restoring Client Connections Restoring Client Connections How does a client reconnect to the server after a failure? It is important to write client applications to specifically differentiate between the loss of a connection to the server and other application-oriented errors that might be returned. The application should take special action in case of connection loss.
Designing Highly Available Cluster Applications Restoring Client Connections the retry to the current server should continue for the amount of time it takes to restart the server locally. This will keep the client from having to switch to the second server in the event of a application failure. • Use a transaction processing monitor or message queueing software to increase robustness.
Designing Highly Available Cluster Applications Handling Application Failures Handling Application Failures What happens if part or all of an application fails? All of the preceding sections have assumed the failure in question was not a failure of the application, but of another component of the cluster. This section deals specifically with application problems.
Designing Highly Available Cluster Applications Handling Application Failures Be Able to Monitor Applications All components in a system, including applications, should be able to be monitored for their health. A monitor might be as simple as a display command or as complicated as a SQL query. There must be a way to ensure that the application is behaving correctly.
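A monitor can be as small as a loop that exits when the application stops responding, which is one way a Serviceguard service command can signal failure to the package manager; the process name is a placeholder:

#!/bin/sh
# Simple application monitor: exit non-zero when the application disappears.
while true
do
    if ! pgrep -x myapp_server > /dev/null
    then
        exit 1    # the package manager treats the service as failed
    fi
    sleep 10
done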
Designing Highly Available Cluster Applications Minimizing Planned Downtime Minimizing Planned Downtime Planned downtime (as opposed to unplanned downtime) is scheduled; examples include backups, systems upgrades to new operating system revisions, or hardware replacements. For planned downtime, application designers should consider: • Reducing the time needed for application upgrades/patches.
Designing Highly Available Cluster Applications Minimizing Planned Downtime Reducing Time Needed for Application Upgrades and Patches Once a year or so, a new revision of an application is released. How long does it take for the end-user to upgrade to this new revision? This answer is the amount of planned downtime a user must take to upgrade their application. The following guidelines reduce this time. Provide for Rolling Upgrades Provide for a “rolling upgrade” in a client/server environment.
Designing Highly Available Cluster Applications Minimizing Planned Downtime Do Not Change the Data Layout Between Releases Migration of the data to a new format can be very time intensive. It also almost guarantees that rolling upgrade will not be possible. For example, if a database is running on the first node, ideally, the second node could be upgraded to the new revision of the database.
Integrating HA Applications with Serviceguard C Integrating HA Applications with Serviceguard The following is a summary of the steps you should follow to integrate an application into the Serviceguard environment: 1. Read the rest of this book, including the chapters on cluster and package configuration, and the appendix “Designing Highly Available Cluster Applications.” 2.
Integrating HA Applications with Serviceguard Checklist for Integrating HA Applications Checklist for Integrating HA Applications This section contains a checklist for integrating HA applications in both single and multiple systems. Defining Baseline Application Behavior on a Single System 1. Define a baseline behavior for the application on a standalone system: • Install the application, database, and other required resources on one of the systems.
Integrating HA Applications with Serviceguard Checklist for Integrating HA Applications Integrating HA Applications in Multiple Systems 1. Install the application on a second system. • Create the LVM infrastructure on the second system. • Add the appropriate users to the system. • Install the appropriate executables. • With the application not running on the first system, try to bring it up on the second system. You might use the script you created in the step above.
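For example, once the shared volume group exists, a quick way to confirm that the second node can see and use it is to activate it there by hand; the volume group, logical volume, and mount point names are placeholders:

vgscan                               # refresh the LVM view on the second node
vgchange -a y vg_app                 # activate the shared volume group
mount /dev/vg_app/lv_data /mnt/app   # mount a package file system and test the application
umount /mnt/app
vgchange -a n vg_app                 # deactivate before giving the storage back to node 1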
Integrating HA Applications with Serviceguard Checklist for Integrating HA Applications Testing the Cluster 1. Test the cluster: • Have clients connect. • Provide a normal system load. • Halt the package on the first node and move it to the second node: # cmhaltpkg pkg1 # cmrunpkg -n node2 pkg1 # cmmodpkg -e pkg1 • Move it back. # cmhaltpkg pkg1 # cmrunpkg -n node1 pkg1 # cmmodpkg -e pkg1 • Fail one of the systems. For example, turn off the power on node 1.
Blank Planning Worksheets D Blank Planning Worksheets This appendix reprints blank versions of the planning worksheets described in the “Planning” chapter. You can duplicate any of these worksheets that you find useful and fill them in as a part of the planning process.
Blank Planning Worksheets Hardware Worksheet Hardware Worksheet ============================================================================= SPU Information: Host Name ____________________ Server Series____________ Memory Capacity ____________ Number of I/O Slots ____________ ============================================================================= LAN Information: Name of Master _________ Name of Node IP Traffic Interface __________ Addr________________ Type ________ Name of Name of Node IP Traff
Blank Planning Worksheets Power Supply Worksheet Power Supply Worksheet ============================================================================ SPU Power: Host Name ____________________ Power Supply _____________________ Host Name ____________________ Power Supply _____________________ ============================================================================ Disk Power: Disk Unit __________________________ Power Supply _______________________ Disk Unit __________________________ Power Supply
Blank Planning Worksheets Quorum Server Worksheet Quorum Server Worksheet Quorum Server Data: ============================================================================== QS Hostname: _________________IP Address: ______________________ OR Cluster Name: _________________ Package Name: ____________ Package IP Address: ___________________ Hostname Given to Package by Network Administrator: _________________ ============================================================================== Quorum Services are Pr
Blank Planning Worksheets Volume Group and Physical Volume Worksheet Volume Group and Physical Volume Worksheet ============================================================================== Volume Group Name: ___________________________________ Physical Volume Name: _________________ Physical Volume Name: _________________ Physical Volume Name: _________________ ============================================================================= Volume Group Name: ___________________________________ Physical Vol
Blank Planning Worksheets Cluster Configuration Worksheet Cluster Configuration Worksheet =============================================================================== Name and Nodes: =============================================================================== Cluster Name: ______________________________ Node Names: ________________________________________________ Maximum Configured Packages: ______________ =============================================================================== Cluster Lock Da
Blank Planning Worksheets Cluster Configuration Worksheet
Access Policies
User: ________ Host: ________ Role: ________
User: _________ Host: _________ Role: __________
Blank Planning Worksheets Package Configuration Worksheet Package Configuration Worksheet ============================================================================= Package Configuration File Data: ========================================================================== Package Name: __________________Package Type:______________ Primary Node: ____________________ First Failover Node:__________________ Additional Failover Nodes:__________________________________ Run Script Timeout: _____ Halt Script Ti
Blank Planning Worksheets Package Configuration Worksheet Logical Volumes and File Systems: fs_name___________________ fs_directory________________fs_mount_opt____________ fs_umount_opt______________fs_fsck_opt_________________fs_type_________________ fs_name____________________fs_directory________________fs_mount_opt____________ fs_umount_opt_____________ fs_fsck_opt_________________fs_type_________________ fs_name____________________fs_directory________________fs_mount_opt____________ fs_umount_opt_______
Blank Planning Worksheets Package Control Script Worksheet (Legacy) Package Control Script Worksheet (Legacy) PACKAGE CONTROL SCRIPT WORKSHEET Page ___ of ___ ================================================================================ Package Control Script Data: ================================================================================ PATH______________________________________________________________ VGCHANGE_________________________________ VG[0]__________________LV[0]______________________FS
IPv6 Network Support
E IPv6 Network Support
This appendix describes some of the characteristics of IPv6 network addresses, specifically:
• “IPv6 Address Types” on page 358
• “Network Configuration Restrictions” on page 364
• “Configuring IPv6 on Linux” on page 366
IPv6 Network Support IPv6 Address Types
IPv6 Address Types
Several types of IPv6 addressing schemes are specified in RFC 2373 (IPv6 Addressing Architecture). IPv6 addresses are 128-bit identifiers for interfaces and sets of interfaces, and RFC 2373 defines various address formats for them. IPv6 addresses are broadly classified as unicast, anycast, and multicast. The following table explains the three types.

Table E-1 IPv6 Address Types
    Unicast     An address for a single interface.
IPv6 Network Support IPv6 Address Types multiple groups of 16-bits of zeros. The “::” can appear only once in an address and it can be used to compress the leading, trailing, or contiguous sixteen-bit zeroes in an address. Example: fec0:1:0:0:0:0:0:1234 can be represented as fec0:1::1234. • In a mixed environment of IPv4 and IPv6 nodes, an alternative form of IPv6 address will be used. It is x:x:x:x:x:x:d.d.d.d.
IPv6 Network Support IPv6 Address Types
Unicast Addresses
IPv6 unicast addresses are classified into different types: global aggregatable unicast addresses, site-local addresses, and link-local addresses. Typically a unicast address is logically divided as follows:

Table E-2
    n bits          128-n bits
    Subnet prefix   Interface ID

Interface identifiers in an IPv6 unicast address are used to identify the interfaces on a link. Interface identifiers are required to be unique on that link.
IPv6 Network Support IPv6 Address Types
IPv4 Mapped IPv6 Address
There is a special type of IPv6 address that holds an embedded IPv4 address. This address is used to represent the addresses of IPv4-only nodes as IPv6 addresses. These addresses are used especially by applications that support both IPv6 and IPv4. These addresses are called IPv4 Mapped IPv6 Addresses. The format of these addresses is as follows:

Table E-4
    80 bits    16 bits    32 bits
    zeros      FFFF       IPv4 address

Example: ::ffff:192.168.0.
IPv6 Network Support IPv6 Address Types
Link-Local Addresses
Link-local addresses have the following format:

Table E-6
    10 bits       54 bits    64 bits
    1111111010    0          interface ID

Link-local addresses are supposed to be used for addressing nodes on a single link. Packets originating from or destined to a link-local address will not be forwarded by a router.
IPv6 Network Support IPv6 Address Types “FF” at the beginning of the address identifies the address as a multicast address. The “flags” field is a set of 4 flags “000T”. The higher order 3 bits are reserved and must be zero. The last bit ‘T’ indicates whether it is permanently assigned or not. A value of zero indicates that it is permanently assigned otherwise it is a temporary assignment. The “scop” field is a 4-bit field which is used to limit the scope of the multicast group.
IPv6 Network Support Network Configuration Restrictions
Network Configuration Restrictions
Serviceguard supports IPv6 for data links only: the heartbeat IP address must be IPv4, but the package IP addresses can be IPv4 or IPv6. The restrictions for supporting IPv6 in Serviceguard for Linux are:
• The heartbeat IP address must be IPv4. IPv6-only nodes are not supported in a Serviceguard environment.
• The hostnames in a Serviceguard configuration must be IPv4.
IPv6 Network Support Network Configuration Restrictions Appendix E • Bonding is supported for IPv6 addresses, but only in active-backup mode. • The Quorum server, if used, must be configured on an IPv4 network. It is not IPv6-capable. A quorum server configured on an IPv4 network can still be used by Serviceguard IPv6 clusters that have IPv6 networks as a part of their cluster configuration. • Serviceguard supports IPv6 only on the Ethernet networks, including 10BT, 100BT, and Gigabit Ethernet.
IPv6 Network Support Configuring IPv6 on Linux Configuring IPv6 on Linux Red Hat Enterprise Linux and SUSE Linux Enterprise Server already have the proper IPv6 tools installed, including the /sbin/ip command. This section explains how to configure IPv6 stationary IP addresses on these systems.
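For a quick test before setting up persistent configuration, you can add an IPv6 address to an interface with the ip command. The address below matches the example prefix used later in this section, and the interface name is only an example:

/sbin/ip -6 addr add 3ffe:ffff:0:f101::10/64 dev eth0
/sbin/ip -6 addr show dev eth0

An address added this way does not survive a reboot; for cluster nodes, use the persistent configuration files shown in the following sections.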
IPv6 Network Support Configuring IPv6 on Linux Configuring a Channel Bonding Interface with Persistent IPv6 Addresses on Red Hat Linux Configure the following parameters in /etc/sysconfig/network-scripts/ifcfg-bond0: DEVICE=bond0 IPADDR=12.12.12.12 NETMASK=255.255.255.0 NETWORK=12.12.12.0 BROADCAST=12.12.12.255 IPV6INIT=yes IPV6ADDR=3ffe:ffff:0000:f101::10/64 IPV6ADDR_SECONDARIES=fec0:0:0:1::10/64 IPV6_MTU=1280 ONBOOT=yes BOOTPROTO=none USERCTL=no Add the following two lines to /etc/modprobe.
IPv6 Network Support Configuring IPv6 on Linux Configuring a Channel Bonding Interface with Persistent IPv6 Addresses on SUSE Configure the following parameters in /etc/sysconfig/network/ifcfg-bond0: BOOTPROTO=static BROADCAST=10.0.2.255 IPADDR=10.0.2.10 NETMASK=255.255.0.0 NETWORK=0.0.2.
Index A Access Control Policies, 135, 145 active node, 20 adding a package to a running cluster, 274 adding nodes to a running cluster, 244 adding packages on a running cluster, 227 administration adding nodes to a running cluster, 244 halting a package, 248 halting the entire cluster, 246 moving a package, 249 of packages and services, 247 of the cluster, 242 reconfiguring a package while the cluster is running, 273 reconfiguring a package with the cluster offline, 274 reconfiguring the cluster, 252 removi
Index and cluster reformation, example, 85 and power supplies, 31 storing configuration data, 176 two nodes, 42, 43 use in re-forming a cluster, 42, 43 cluster manager automatic restart of cluster, 41 blank planning worksheet, 352 cluster node parameter, 106, 107, 108 defined, 39 dynamic re-formation, 41 heartbeat interval parameter, 110 heartbeat subnet parameter, 108 initial configuration of the cluster, 39 main functions, 39 maximum configured packages parameter, 112 monitored non-heartbeat subnet, 110
Index designing applications to run on multiple systems, 327 disk data, 29 interfaces, 29 root, 29 sample configurations, 30 disk I/O hardware planning, 97 disk layout planning, 104 disk logical units hardware planning, 97 disk monitoring configuring, 228 disks in Serviceguard, 29 replacing, 288 supported types in Serviceguard, 29 distributing the cluster and package configuration, 225, 271 down time minimizing planned, 340 dynamic cluster re-formation, 41 E enclosure for disks replacing a faulty mechanism,
Index gethostbyname(), 330 H HALT_SCRIPT parameter in package configuration, 138 HALT_SCRIPT_TIMEOUT (halt script timeout) parameter in package configuration, 138 halting a cluster, 246 halting a package, 248 halting the entire cluster, 246 handling application failures, 338 hardware monitoring, 287 power supplies, 31 hardware failures response to, 86 hardware planning blank planning worksheet, 347 Disk I/O Bus Type, 97 disk I/O information for shared disks, 97 host IP address, 93, 102, 103 host name, 92 I/
Index L LAN heartbeat, 39 interface name, 93, 102 LAN failure Serviceguard behavior, 26 LAN interfaces primary and secondary, 27 LAN planning host IP address, 93, 102, 103 traffic type, 93 link-level addresses, 329 load sharing with IP addresses, 72 local switching, 73 lock cluster locks and power supplies, 31 use of the cluster lock, 43 use of the cluster lock disk, 42 lock volume group, reconfiguring, 252 logical volume parameter in package control script, 138 logical volumes creating the infrastructure,
Index node basic concepts, 26 halt (TOC), 85 in Serviceguard cluster, 18 IP addresses, 71 timeout and TOC example, 85 node types active, 20 primary, 20 NODE_FAIL_FAST_ENABLED effect of setting, 87 parameter in package configuration, 128 NODE_NAME parameter in cluster manager configuration, 106, 107, 108 NODE_TIMEOUT and HEARTBEAT_INTERVAL, 84 and node TOC, 84 NODE_TIMEOUT (node timeout) parameter in cluster manager configuration, 110 nodetypes primary, 20 O Object Manager, 299 outages insulating users from,
Index defined, 71 reviewing, 296 package manager blank planning worksheet, 354, 356 testing, 284 package modules, 195 base, 196 optional, 198 package name parameter in package configuration, 127 package switching behavior changing, 249 package type parameter in package configuration, 127 PACKAGE_NAME parameter in package ASCII configuration file, 127 PACKAGE_TYPE parameter in package ASCII configuration file, 127 packages, 191 deciding where and when to run, 47, 48 managed by cmcld, 35 modular, 191 paramete
Index redundant LANS figure, 27 redundant networks for heartbeat, 19 re-formation of cluster, 41 relocatable IP address defined, 71 relocatable IP addresses in Serviceguard packages, 71 remote switching, 77 removing nodes from operation in a running cluster, 245 removing packages on a running cluster, 227 removing Serviceguard from a system, 281 replacing disks, 288 resources disks, 29 responses to cluster events, 279 to package and service failures, 87 responses to failures, 84 responses to hardware failur
Index software planning LVM, 104 solving problems, 301 SPU information planning, 92 standby LAN interfaces defined, 27 starting a package, 247 startup and shutdown defined for applications, 321 startup of cluster manual, 40 stationary IP addresses, 71 STATIONARY_IP parameter in cluster manager configuration, 110 status cmviewcl, 230 package IP address, 296 system log file, 297 stopping a cluster, 246 SUBNET array variable in package control script, 131, 133 in sample package control script, 267 parameter i
Index volume group and physical volume planning, 104 Volume groups in control script, 133 in package configuration script, 132 VXVM_DG in package control script, 267 W What is Serviceguard?, 18 worksheet blanks, 347 cluster configuration, 112, 353 hardware configuration, 99, 347 package configuration, 354, 356 package configuration data, 135 power supply configuration, 100, 349, 350 quorum server configuration, 102 use in planning, 89 378