HP XC System Software Installation Guide Version 3.1
© Copyright 2003, 2004, 2005, 2006 Hewlett-Packard Development Company, L.P. Confidential computer software. Valid license from HP required for possession, use or copying. Consistent with FAR 12.211 and 12.212, Commercial Computer Software, Computer Software Documentation, and Technical Data for Commercial Items are licensed to the U.S. Government under vendor's standard commercial license. The information contained herein is subject to change without notice.
About This Document
This document describes how to install and configure HP XC System Software Version 3.1 on HP Cluster Platforms 3000, 4000, and 6000. An HP XC system is integrated with several open source software components. Some open source components provide underlying technology, and their deployment is transparent.
To avoid duplicating information, this document occasionally refers you to information in other documents in the HP XC System Software Documentation Set. To reduce the size of screen displays and command output in this document, a three- or four-node system was used to generate most of the sample command output shown here.
3 Naming Conventions Used in This Document
Table 2 lists the naming conventions and sample IP addresses used in this document.
Variable: The name of a placeholder in a command, function, or other syntax display that you replace with an actual value.
[ ]: The contents are optional in syntax. If the contents are a list separated by |, you can choose one of the items.
{ }: The contents are required in syntax. If the contents are a list separated by |, you must choose one of the items.
...: The preceding element can be repeated an arbitrary number of times.
|: Separates items in a list of choices.
WARNING, CAUTION, IMPORTANT, NOTE: Labels that introduce alert notices.
HP Message Passing Interface
HP Message Passing Interface (HP-MPI) is an implementation of the MPI standard that has been integrated in HP XC systems. The home page and documentation are located at the following Web site:
http://www.hp.com/go/mpi
HP Serviceguard
HP Serviceguard is a service availability tool supported on an HP XC system. HP Serviceguard enables some system services to continue if a hardware or software failure occurs.
The Platform Computing Corporation LSF manpages are installed by default. The lsf_diff(7) manpage supplied by HP describes LSF command differences when using LSF-HPC with SLURM on an HP XC system. The following documents in the HP XC System Software Documentation Set provide information about administering and using LSF on an HP XC system:
— HP XC System Software Administration Guide
— HP XC System Software User's Guide
• http://www.llnl.
Linux Web Sites • http://www.redhat.com Home page for Red Hat®, distributors of Red Hat Enterprise Linux Advanced Server, a Linux distribution with which the HP XC operating environment is compatible. • http://www.linux.org/docs/index.html This Web site for the Linux Documentation Project (LDP) contains guides covering various aspects of working with Linux, from creating your own Linux system from scratch to bash script writing.
Additional Publications
For more information about standard Linux system administration or other related software topics, consider using one of the following publications, which must be purchased separately:
• Linux Administration Unleashed, by Thomas Schenk, et al.
• Managing NFS and NIS, by Hal Stern, Mike Eisler, and Ricardo Labiaga (O'Reilly)
• MySQL, by Paul DuBois
• MySQL Cookbook, by Paul DuBois
• High Performance MySQL, by Jeremy Zawodny and Derek J. Balling
1 Preparing for a New Installation This chapter describes preinstallation tasks to perform before you install HP XC System Software Version 3.1.
1.3 Task 3: Prepare Existing HP XC Systems
This task applies to anyone who is installing HP XC System Software Version 3.1 on an HP XC system that is already installed with an older version of the HP XC System Software. Bypass this task if you are installing HP XC System Software Version 3.1 on new hardware for the first time. Before using the procedures described in this document to install and configure HP XC System Software Version 3.1 on an existing system, prepare that system as described in this section.
When you contact your network administrator about assigning IP addresses, you must also provide the host names for any system that will have a connection to an external network. Therefore, consider the following: • If you assign a login role to one or more nodes, what name will you use as the cluster alias? This name is the host name by which users will log in to the system. • What name will you use as the host name for any node that has a connection to the external network? See Section B.
1.8 Task 8: Purchase Additional Software from HP and Third-Party Vendors An HP XC system supports the use of several additional HP and third-party software products. Use of these products is optional; the purchase and installation of these components is your decision and depends on your site's requirements. Product licensing requirements differ by product. “Task 4: Install Additional Software from Local Distribution Media” (page 40) describes the HP and third-party products you may need to install.
happens, the availability tool uses the IP alias to start the service on the second server in the availability set. Client nodes do not detect that the first server has gone down. 1.9.2 How to Configure Improved Availability Table 1-1 provides a summary of how to set up and configure improved availability of services. Detailed information or procedures are provided at the appropriate points in the system installation and configuration process, as noted in the second column of the table.
Availability Tools from Other Vendors
If you prefer to use another availability tool, such as the open source Heartbeat Version 1 or Version 2, you must obtain the tool and configure it for use on your own. Third-party vendors are responsible for providing customer support for their tools. Installation and configuration instructions for any third-party availability tools you decide to use are outside the scope of this document. See the vendor documentation for instructions.
In this release, improved availability is supported for the services listed in Table 1-2. Also listed in the table are things to consider about the role assignments if you plan to implement improved availability for one or more of these services. Read Appendix F (page 137) to learn more about default role assignments and the full set of services provided by each role.
Table 1-2 Role and Service Placement for Improved Availability (continued)
Service Name: Nagios master
Service is Delivered in This Role: management_server
Special Considerations for Role Assignment: By default, the management_server role is installed on the head node. If you want improved availability for Nagios, the management_server role must be assigned to two nodes, the head node and one additional node.
Table 1-3 Availability Sets Worksheet
Availability Set Configuration:
First Node Name: _________________________
Second Node Name: _________________________
Availability Tool to Manage This Availability Set: _________________________
Roles to Assign to Nodes in the Availability Set:
First node in the availability set:
• _________________________
• _________________________
• _________________________
• _________________________
• _________________________
Second node in the availability set:
• _________________________
• _________________________
• _________________________
• _________________________
• _________________________
2 Installing Software on the Head Node This chapter contains an overview of the software installation process and describes software installation tasks. These tasks must be performed in the following order: • “Task 1: Gather Information Required for the Installation” (page 35) • “Task 2: Start the Installation Process” (page 37) • “Task 3: Install Additional RPMs from the HP XC DVD” (page 39) • “Task 4: Install Additional Software from Local Distribution Media” (page 40) 2.
Table 2-1 HP XC Software Stack (continued)
Software Product Name: LSF-HPC with SLURM
Description: Platform's High Performance Computing version of LSF, LSF-HPC, has been integrated with SLURM in response to the growing need for a lightweight, powerful workload management system that is scalable and can support parallel, compute-intensive workloads across computing resources.
Table 2-2 Default Values in the ks.cfg File (continued)
Item: Language installed on the system. Default Value: U.S. English
Item: Desktop manager. Default Value: GNOME
You can modify these values after the installation process is complete by using standard Linux system administration procedures.
2.1.4 Default File System Layout and Disk Partition Sizes
Table 2-3 lists the default file system layout that is applied to the head node system disk.
Table 2-5 Information Required for the Kickstart Installation Session
Item: Disk for the installation
Description and User Action: During the installation process, a numbered list of disks discovered on the head node is displayed, and you are prompted to select a disk on which to install the software:
Select the disk for the installation:
The following criteria apply to the disk you select:
• The disk must be 36 GB or larger.
• The disk must not be connected to a SAN device.
Table 2-5 Information Required for the Kickstart Installation Session (continued)
Item: Time zone
Description and User Action: Select the time zone in which the system is located. The default is America/New_York (Eastern Standard Time, which is Greenwich Mean Time minus 5 hours). Use the Tab key to move through the list of time zones, and use the Space bar to highlight the selection. Then, use the Tab key to move to OK, and press the Space bar to select OK.
XC System Software Release Notes at http://www.docs.hp.com/en/highperfcomp.html to make sure no additional command-line options are required for the hardware model.
Table 2-6 Kickstart Boot Command Line
Cluster Platform or Hardware Model: CP3000 and CP4000
Chip Architecture Type: Opteron and Xeon
Boot Command Line: boot: linux ks=cdrom:/ks.cfg
9. Log in as the root user when the login screen is displayed, and enter the root password you previously defined during the software installation process. 10. Open a terminal window when the desktop is displayed: a. Click on the Linux for High Performance Computing splash screen to close it. b. From the Applications menu, select the System Tools menu. c. Scroll down the list of options, and select Terminal to open a terminal window. 2.
# cd /media/cdrom/LNXHPC/RPMS
3. Find the Linux RPM you want to install and issue the appropriate command to install it. Depending on the RPM you want to install, you will issue a command similar to the following:
# rpm -ivh rpm_name_version.noarch.rpm
4. Unmount the DVD:
# cd
# umount /dev/cdrom
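To confirm that an RPM installed successfully, you can query the RPM database afterward. This is a generic illustration rather than a step from the original procedure; rpm_name is a placeholder:
# rpm -q rpm_name
rpm_name-version.noarch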
Deciding on the Method to Achieve Quorum for Serviceguard Clusters In a Serviceguard configuration, each availability set becomes its own two-node Serviceguard cluster, and each Serviceguard cluster requires some form of quorum. The quorum acts as a tie breaker in the Serviceguard cluster running on each availability set. If connectivity is lost between the nodes of the Serviceguard cluster, the node that can access the quorum continues to run the cluster and the other node is considered down.
HP recommends that you install additional software components now before the system is configured (the procedure described in Chapter 3 (page 43)) so that the software is propagated to all nodes during the initial image synchronization. This chapter does not contain product-specific information for third-party software products; see the documentation supplied by the vendor for product-specific information.
3 Configuring and Imaging the System
This chapter contains an overview of the initial system configuration and imaging process and describes system configuration tasks, which must be performed in the following order:
• “Task 1: Prepare for the System Configuration” (page 44)
• “Task 2: Change the Default IP Address Base (Optional)” (page 49)
• “Task 3: Run the cluster_prep Command to Prepare the System” (page 50)
• “Task 4: Install Patches or RPM Updates” (page 52)
• “Task 5: Run the discover Command to Discover System Components”
from root (/). The golden image is stored on the image server, which is also resident on the head node in this release.
Table 3-1 Information Required by the cluster_prep Command
Item: Node name prefix
Description and User Action: During the system discovery process, each node is automatically assigned an internal name. This name is based on a prefix defined by you. All node names consist of the prefix and a number based on the node's topographical location in the system. The default node prefix is the letter n.
Table 3-1 Information Required by the cluster_prep Command (continued)
Item: IPv6 address
Description and User Action: Provide the IPv6 address of the head node's Ethernet connection to the external network, if applicable. Specifying this address is optional and is intended for sites that use IPv6 addresses for the rest of the network.
Table 3-2 Information Required by the discover Command
Item: Total number of nodes in this cluster
Description and User Action: Enter the total number of nodes in the system configuration that are to be discovered at this time. Make sure the number you enter includes the head node and all compute nodes. You are not prompted for this information if you are discovering a multi-region, large-scale system. If the hardware configuration contains HP server blades, you are not prompted for this information.
Table 3-3 Information Required by the cluster_config Utility
Item: Availability sets
Description and User Action: You are prompted to configure availability sets for improved availability of services if you have installed and configured an availability tool (such as Serviceguard) as described in “Task 9: Plan a Service Availability Strategy” (page 26).
Table 3-3 Information Required by the cluster_config Utility (continued)
Item: NAT configuration
Description and User Action: If you assigned the external role to the nodes in an availability set, you are prompted to specify how you want to handle improved availability for the nat service. You can choose between no improved availability or enabling improved availability through an availability tool. You are also prompted to enter an additional external IP address to use as an external alias.
otherBase = 172.23.0
netMask = 255.224.0.0
The following describes the parameters in this file:
Base: The common first octet that is used for all base IP addresses.
nodeBase: The base IP address of all nodes.
cpBase: The base IP address of the console branch.
swBase: The base IP address of the Ethernet switches. The interconnect switch modules are based off of this value, 172.20.66.*.
icBase: The base IP address of the interconnect.
otherBase: The base IP address to be used for other devices.
netMask: The netmask used with these base addresses.
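For illustration, a complete file edited to use these parameters might look like the following sketch. The otherBase and netMask values come from the fragment above; the nodeBase, cpBase, swBase, and icBase values are assumptions inferred from addresses used in examples elsewhere in this guide (Ethernet switches at 172.20.65.*, console ports at 172.21.0.*), so verify them against the file shipped on your system:
Base = 172
nodeBase = 172.20.0
cpBase = 172.21.0
swBase = 172.20.65
icBase = 172.22.0
otherBase = 172.23.0
netMask = 255.224.0.0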
Enter the prefix to assign to internal node names. The prefix can contain up to 6 alphanumeric characters, and the last character must be alphabetic. The default node prefix is the letter "n": Enter node naming prefix [n]: your_prefix 1 Enter the maximum number of nodes in this cluster [ ]: 16 Setting system name to n16 ...
c. d. Click on the Linux for High Performance Computing splash screen to close it. Open a terminal window from the Gnome desktop: i. From the Applications menu, select the System Tools menu. ii. Scroll down the list of options, and select Terminal to open a terminal window. 3.5 Task 4: Install Patches or RPM Updates For each supported version of the HP XC System Software, HP releases all Linux security updates and HP XC software patches on the HP IT Resource Center (ITRC) Web site.
5. 6. From the patch / firmware database page, select Linux under find individual patches. From the search for patches page, in step 1 of the search utility, select vendor and version, select hpxc as the vendor and select the HP XC version that is appropriate for the cluster platform. 7. In step 2 of the search utility, How would you like to search?, select Browse Patch List. 8. In step 4 of the search utility, Results per page?, select all. 9. Click the search>>> button to begin the search. 10.
LD [M] /usr/local/cmcluster/drivers/deadman.ko
# make modules_install
INSTALL /usr/local/cmcluster/drivers/deadman.ko
3. Create a new module dependency list (modules.dep):
# depmod -a
3.6 Task 5: Run the discover Command to Discover System Components
The next step in the configuration process is the discovery of all system components.
5. used; the administration network and the interconnect share the same ports and switches. All other interconnect types (Myrinet, InfiniBand, and QsNetII) are discovered automatically based on the interconnect found on the head node; specific command options are not necessary for those interconnect types. HP recommends that you include the --verbose option because it provides useful feedback and enables you to follow the discovery process.
Opening /etc/hosts.new.XC Opening /etc/powerd.conf Building /etc/powerd.conf ... Querying cp-n13 Querying cp-n14 Querying cp-n15 4 done Attempting to start hpls power daemon ... done Waiting for power daemon ... done switchName necs1-1 switchIP 172.20.65.2 type 2650 switchName nems1-1 switchIP 172.20.65.1 type 2848 Attempting to power on nodes with nodestring 8n[13-15] Powering on all known nodes ... done Discovering Nodes... running port_discover on 172.20.65.
Head Node CP device type set to iLO
Waiting for power daemon to restart... done
1. Enter the MAC address of the switch that is connected to the administration ports. Do not enter the MAC address of the switch connected to the console ports.
2. Enter the password for the Root Administration Switch that you previously defined when you prepared the hardware. If you did not preset a password, press the Enter key.
a. b. Press Esc and Shift+9 to enter the command-line mode. Use the C[hange Password] option to change the console port password. The factory default password is admin; change it to the password of your choice. This password must be the same on every node in your system. Lights-Out> C Type the current password> admin Type the new password (max 16 characters)> your_password Retype the new password (max 16 characters)> your_password New password confirmed. Lights-Out> exit • For BMC Firmware Version 1.
Table 3-4 System Environment Setup Tasks
Required Tasks:
• “Put the License Key File in the Correct Location (Required)” (page 59)
• “Configure Interconnect Switch Monitoring Line Cards (Required)” (page 59)
• “Configure sendmail (Required)” (page 59)
• “Customize the Nagios Environment (Required)” (page 61)
Optional Tasks:
• “Install Additional Software over the Network” (page 61)
• “Create the /hptc_cluster File System” (page 61)
• “Modify Workstation Model Names in the Database” (page 62)
• “Enable Software RAID
sendmail Configuration Requirements on an HP XC System
Although Linux sendmail typically functions correctly as shipped, current XC host naming conventions cause sendmail to identify itself improperly to other mail servers. This improper identification can lead to mail being rejected by the remote server. To remedy this issue, perform the appropriate sendmail configuration on all nodes with an external connection that will send mail.
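A minimal sketch of one common remedy, assuming the stock sendmail m4 configuration layout; the masquerading macros are standard sendmail features, but the domain name is a placeholder and your site's required steps may differ:
1. Add masquerading directives to /etc/mail/sendmail.mc so that outbound mail identifies itself with the externally visible domain:
MASQUERADE_AS(`example.com')dnl
FEATURE(`masquerade_envelope')dnl
2. Regenerate the sendmail configuration file and restart the service:
# m4 /etc/mail/sendmail.mc > /etc/mail/sendmail.cf
# service sendmail restart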
3.7.4 Customize the Nagios Environment (Required) Nagios is a highly customizeable system monitoring tool that you can tailor to specific installation and monitoring requirements. HP recommends that you consider certain aspects of the Nagios environment as part of the initial system setup to optimize the type of system events reported to you as well as the frequency of alerts.
Mount the /hptc_cluster File System on an HP SFS Server
During the HP XC system configuration process, you might have decided to install the /hptc_cluster file system on an HP SFS server to provide failover capabilities. Thus, if you want this file system located on an HP SFS server, you must mount /hptc_cluster now. The SFS Client Installation and User Guide describes how to mount the /hptc_cluster file system on an HP SFS file system.
/etc/systemimager/systemimager.conf
2. Add the following line or lines to the bottom of the file to enable software RAID-0 or RAID-1. Replace node_prefix[n-n] with the node prefix and the range of nodes on which you want to enable software RAID (for example, n[1-4,7,9]). You must include a space before and after the equal sign (=).
SOFTWARE_RAID0_NODES = node_prefix[n-n]
SOFTWARE_RAID1_NODES = node_prefix[n-n]
3. Save your change and exit the file.
3.7.10 Create Local User Accounts
This task is optional.
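Because this task appears before the golden image is created, accounts added now can be captured in the image and propagated to the client nodes. A minimal sketch using standard Linux commands; the account name is hypothetical:
# useradd -m -c "Sample User" juser
# passwd juser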
Table 3-6 Default Client Node Partition Layout
File System Name: One swap partition
Size: Swap space is calculated based on the amount of memory on the node, and it is governed by minimum and maximum values you set (see Appendix E (page 131) for more information). By default, swap space is calculated as 100% x (total memory).
Example 3-1 Sample mcs.ini File
# mcs.ini file for cluster penguin
[global]
mcs_units=mcs1,mcs2,mcs3,mcs4,mcs5
mcs_server_units=mcs5
[mcs1]
name=mcs1
ipaddr=172.23.0.1
location=Cab CBB1
nodes=n[1-36]
status=offline
[mcs2]
name=mcs2
ipaddr=172.23.0.2
location=Cab CBB2
nodes=n[37-72]
status=offline
[mcs3]
name=mcs3
ipaddr=172.23.0.3
location=Cab CBB3
nodes=n[73-108]
status=offline
[mcs4]
name=mcs4
ipaddr=172.23.0.4
location=Cab CBB4
nodes=n[109-144]
status=offline
[mcs5]
name=mcs5
ipaddr=172.23.0.5
# script your_filename 2. Change to the configuration directory: # cd /opt/hptc/config/sbin CAUTION: Make sure that no other terminal windows or shells are actively open in the /hptc_cluster directory. The cluster_config utility does not properly export the /hptc_cluster directory if any processes are using the directory when the cluster_config utility mounts the file system. 3. Begin the cluster configuration process: # ./cluster_config HP recommends that you back up the database before proceeding.
1. When the following prompt is displayed, define availability sets: Cluster Configuration - XC Cluster version HP XC V3.1 Availability tools: serviceguard (selected) Current availability sets: [E]dit Availability Sets, [P]roceed, [Q]uit: In this example, only one availability tool has been installed, Serviceguard. Thus, it is the preselected tool which means that any availability set you configure here will be managed by Serviceguard unless you select another availability tool. 2.
avail> delete n7 Availability Set serviceguard: (n7 n8) deleted. • List all availability sets: avail> list Availability tools: serviceguard (selected) Current availability sets: serviceguard: (n7 n8) 6. When you have finished associating nodes and availability tools into availability sets, enter the letter b or the word back to return to the main availability set menu. avail> back [E]dit Availability Sets, [P]roceed, [Q]uit: 7.
• The a[B]breviated output is similar to the following. Nodes with the same role assignments are condensed into one line item. Node: n16 location: Level 1 Switch 172.20.65.1, Port 42 CURRENT HEAD NODE Roles assigned: compute console_network disk_io external management_hub management_server resource_management External Ethernet name: penguin.southpole.com ipaddr: 192.0.2.0 netmask: 255.255.252.0 gateway: 192.0.2.
3. Do the following to determine whether or not you have to modify the default role assignments: a. See Appendix F (page 137) for a description of the default role assignments based on system size. Read about the special considerations regarding the default role assignments to determine if the role assignments are suitable for your environment. If you do not modify the role assignments, the system will be configured with the default assignments. b.
NOTE: Table 3-3 (page 48) describes each prompt and provides information to help you with your answers. 1. When you are prompted to enter the number of NFS daemons required on the system; accept the default value.
You must now specify the clock source for the server nodes. If the nodes have external connections, you may specify up to 4 external NTP servers. Otherwise, you must use the node's system clock. Enter the IP address or host name of the first external NTP server or leave blank to use the system clock on the NTP server node: IP_address Enter the IP address or host name of the second external NTP server or leave blank if you have no more servers: Enter Renaming previous /etc/ntp.conf to /etc/ntp.conf.bak 4.
2: serviceguard A choice of 'standard' (1) means no improved availability. Enter the number corresponding to the way to configure availability []: 2 8. Enable Web access to the Nagios monitoring application and create a password for the nagiosadmin user. This password does not have to match any other password on the system. In this example, the Nagios service has been configured with improved availability. Executing C50nagios gconfigure Availability can be configured for nagios in one of several ways.
2 or enter 'd' if you want to start again with the (d)efault configuration or leave blank if you want to use the current configuration: Interfaces over which traps will be accepted: loopback Admin [O]k, [R]especify Interfaces: O 11. Optionally configure a self-signed certificate for the Apache server.
13. Configure SLURM: Do you want to configure SLURM? (y/n) [y]: Your answer depends on the type of LSF you plan to install; do one of the following: • • If you intend to install LSF-HPC with SLURM or the Maui Scheduler, enter y. If you intend to install standard LSF do not install SLURM and enter n. If you are installing SLURM, define a SLURM user name and accept all default responses.
Do one of the following: • • To install LSF-HPC with SLURM or standard LSF, enter y or press the Enter key. Proceed to step 15. If you intend to install another job management system, such as PBS Professional (documented in Appendix L (page 177)) or the Maui Scheduler (Appendix M (page 183)) enter n. Proceed to step 20. If at a future time you want to install LSF, rerun the cluster_config utility, and answer y to this question.
17. Provide responses to install and configure LSF. This requires you to supply information about the primary LSF administrator and the administrator's password. The default user name for the primary LSF administrator is lsfadmin. If you accept the default user name and a NIS account exists with the same name, LSF is configured with the existing NIS account, and you are not prompted to supply a password. Otherwise, accept all default answers.
After setting up your LSF server hosts and verifying your cluster "hptclsf" is running correctly, see "/opt/hptc/lsf/top/6.2/lsf_quick_admin.html" to learn more about your new LSF cluster. ***Begin LSF-HPC Post-Processing*** Created '/hptc_cluster/lsf/tmp'... Editing /opt/hptc/lsf/top/conf/lsf.cluster.hptclsf... Moving /opt/hptc/lsf/top/conf/lsf.cluster.hptclsf to /opt/hptc/lsf/top/conf/lsf.cluster.hptclsf.old.7858... Editing /opt/hptc/lsf/top/conf/lsf.conf... Moving /opt/hptc/lsf/top/conf/lsf.
Image replication environment configuration complete.
info: info: info: info: info: info: info: info: info: info: info: info: info: info: Executing C50nat nrestart Executing C51nrpe nrestart Executing C52snmp_traps nrestart Executing C66ibmon nrestart Executing C90slurm nrestart Executing C91swmlogger nrestart Executing C95lsf nrestart Executing C30syslogng_forward crestart Executing C35dhcp crestart Executing C50supermond crestart Executing C90munge crestart Executing C90slurm crestart Executing C95lsf crestart nconfig shut down NOTE: If necessary, see “Tro
(Required)” (page 59) for more information about obtaining and positioning the license key file if you have not already done so. 2. Use the startsys command to turn on power to all nodes, image the nodes, and boot the nodes. As shown in Table 3-9, the command-line options for the initial system image and boot depend upon the size of the system. See startsys( 8) for the complete list of command options.
Imaging: 15 nodes -> n[1-15] Progress: Flamethrower started: nodes waiting: 15 nodes -> n[1-15] *** Thu Sep 28 08:55:19 2006 Current statistics: Imaging: 15 nodes -> n[1-15] Progress: *** Thu Sep 28 08:58:19 2006 Current statistics: Imaging: 15 nodes -> n[1-15] Progress: Thu Sep 28 08:58:34 2006 Imaging completed; will be powered off: 2 nodes -> n[1-2] You must manually power off the following nodes: n1 Press enter after removing power from these nodes. continuing ........
*** Thu Sep 28 09:07:33 2006 Current statistics: Booted and available: 15 nodes -> n[1-15] Progress: Thu Sep 28 09:07:34 2006 startsys process exiting with code 0 6. See “Troubleshoot the Imaging Process” (page 192) if you encounter problems imaging nodes. Proceed to “Task 12: Perform Postconfiguration Tasks for the InfiniBand Interconnect”. 3.13 Task 12: Perform Postconfiguration Tasks for the InfiniBand Interconnect This task applies only if the system is configured with an InfiniBand interconnect.
Info: confirming the list of services to transfer to availability... Info: 'serviceguard' not running anywhere. Info: Starting transfer of services to ServiceGuard... prepForAvail: ========== Executing 'pdsh -S -w n7 '/opt/hptc/etc/nconfig.d/C50nat nxferto n7''... pdsh@n8: n7: RC=0 prepForAvail: ========== 'pdsh -S -w n7 '/opt/hptc/etc/nconfig.d/C50nat nxferto n7'' finished, exited with 0(0) prepForAvail: ========== Executing 'pdsh -S -w n8 '/opt/hptc/etc/nconfig.d/C50nat nxferto n8''...
cmruncl : to verify that no warnings occurred during startup. startAvailTool: ========== '/opt/hptc/availability/serviceguard/start_avail' finished, exited with 0(0) Info: Successful transfer of services to Serviceguard. After services and IP aliases are shut down, each availability tool is started. Then, the availability tool starts up the services and IP aliases it is managing. Proceed to “Task 15: Configure SNMP Trap Destination for Enclosures”. 3.
# grep nh /etc/hosts IP_address n16 nh hplsadm 2. n16.localhost.localdomain Configure the MCS devices to send their SNMP traps to the management server IP alias; use the IP alias obtained in the previous step: # mcs_webtool -H MCS_IP_address -p MCS_Admin_password \ -i -a IP_address -e IP_address Adding trap receiver 1: IP_address Enabling trap receiver 1: IP_address Trap Receiver Status for MCS at MCS_IP_address: Authentication Traps: Disable Trap Receiver 1: IP_address Enable Trap Receiver 2: 0.0.0.
Configured unknown node n14 with 1 CPU and 1 MB of total memory... After the node has been booted up, re-run the spconfig utility to configure the correct settings. 3. 4. If the system is using a QsNetII interconnect, ensure that the number of node entries in the /opt/hptc/libelanhosts/etc/elanhosts file matches the expected number of operational nodes in the cluster. If the number does not match, verify the status of the nodes to ensure that they are all up and running, and re-run the spconfig utility.
# which lsid
/opt/hptc/lsf/top/6.2/linux2.6-glibc2.3-x86_64-slurm/bin/lsid
This sample output was obtained from an HP ProLiant server. Thus, the directory name linux2.6-glibc2.3-x86_64-slurm is included in the path (the string x86_64 signifies a Xeon- or Opteron-based architecture). The string ia64 is included in the directory name for HP Integrity servers. The string slurm exists in the path only if LSF-HPC with SLURM is configured.
4 Verifying the System and Creating a Baseline Record of the Configuration Complete the tasks described in this chapter to verify the successful installation and configuration of the HP XC system components. With the exception of the tasks that are identified as optional, HP recommends that you perform all tasks in this chapter.
# lsid Platform LSF 6.2, LSF_build_date Copyright 1992-2005 Platform Computing Corporation My cluster name is hptclsf My master name is n13 [root@n16 ~]# lshosts HOST_NAME type model n13 LINUX64 Itanium2 n16 LINUX64 Itanium2 n1 LINUX64 Itanium2 n2 LINUX64 Itanium2 n3 LINUX64 Itanium2 n4 LINUX64 Itanium2 n5 LINUX64 Itanium2 n6 LINUX64 Itanium2 n7 LINUX64 Itanium2 n8 LINUX64 Itanium2 n9 LINUX64 Itanium2 n10 LINUX64 Itanium2 n11 LINUX64 Itanium2 n12 LINUX64 Itanium2 2.
Verifying HP Serviceguard Enter the following commands to verify the installation and configuration of HP Serviceguard: 1. List the nodes that are members of availability sets: # shownode config hostgroups hostgroups: headnode: n16 serviceguard:avail1: n14 n16 In this example, one availability set, avail1, has been configured. 2.
• CPU usage on all nodes except the head node (by default).
• Memory usage on all compute nodes but not the head node (by default).
The OVP also runs the following benchmark tests. These tests compare values relative to each node and report results with values more than three standard deviations from the mean:
• LINPACK is a collection of Fortran subroutines that analyze and solve linear equations and linear least-squares problems.
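For example, the full verification with detailed output is run as follows; Appendix J shows complete sample output from this command:
# ovp --verbose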
4.4 Task 4: Use Nagios to View System Health Nagios is a system and network health monitoring application. It watches hosts and services and alerts you when problems occur or are resolved. HP recommends that you start up Nagios now to view the network and ensure it is up and running (that is, all components are in the green state).
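For example, if you enabled Web access to Nagios during cluster_config, you can typically browse to the interface from the head node and log in as the nagiosadmin user with the password you defined earlier. The URL path shown here is the conventional Nagios location and is an assumption, not a value taken from this guide:
http://localhost/nagios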
5 Upgrading an HP XC System This chapter describes how to use the upgrade process to install HP XC System Software Version 3.1 on an HP XC system that is already running a previous version of the HP XC System Software.
5.1.2 Differences Between Major and Minor Upgrades The tasks you perform for a major and minor upgrade are essentially the same. The primary difference occurs in “Task 4: Upgrade Linux and HP XC RPMs” (page 100), where you use the upgraderpms command if you are performing a minor upgrade. The documentation clearly states where differences occur.
NOTE: Enter the following command if you are not sure what version of the HP XC System Software is installed on the system:
# cat /etc/hptc-release
5.1.5 Upgrade Commands
Table 5-4 lists the commands and utilities that are run as part of a software upgrade process.
Table 5-4 Commands Used During the Upgrade Process (columns: Command/Utility Name; Description; Used During Major Upgrade?; Used During Minor Upgrade?)
preupgradesys: Prepares the system for the upgrade.
1. Notify users in advance that a software upgrade is planned.
2. Because all nodes must be shut down, plan the upgrade for a time when there is the least amount of activity on the system.
3. Install all available patch kits for the current release before upgrading to Version 3.1. For example, if your system is currently installed with Version 2.1, make sure you have installed Patch Kit 1 (PK01) and Patch Kit 2 (PK02) before proceeding with the upgrade.
c. Run the fuser command again to make sure no processes are using /hptc_cluster:
# fuser -vm /hptc_cluster
d. Proceed to step 5 when you have verified that no processes are using /hptc_cluster.
5. Stop the SFS service if HP SFS is in use:
# service sfs stop
5.4 Task 3: Install the Upgrade RPM and Prepare the System
Follow this procedure to install the upgrade Red Hat Package Manager (RPM), and run the preupgradesys script, which performs the necessary preprocessing to prepare the system:
The command output looks different depending upon the hardware model and the interconnect type. CAUTION: Do not proceed to the next step in the upgrade process if the output from the preupgradesys script indicates failures. If you cannot determine how to resolve these errors, contact the HP XC Support organization at the following e-mail address: xc_support@hp.com 9.
Table 5-6 Upgrade Boot Command Line Based on Cluster Platform Chip Architecture (continued)
Cluster Platform Chip Architecture: CP6000 Itanium (when the boot device was set through the preconfigured boot option, and the ELILO boot: prompt is displayed)
Boot Command Line: ELILO boot: linux ks=cdrom:/ks_upgrade.cfg
Cluster Platform Chip Architecture: CP6000 Itanium (when the boot device was set through the EFI Shell option and the fs0:> prompt is displayed)
Boot Command Line: fs0:> elilo linux ks=cdrom:/ks_upgrade.cfg
5.6 Task 5: View the Results of the RPM Upgrade Follow this procedure to view the results of the RPM upgrade process and verify that it was successful: 1. View the log files to see the result of the RPM upgrade process. The log files you view depend upon the upgrade type: • Major upgrade a. Use the method of your choice to view the following log file, which contain the results of the Linux RPM upgrade process; this example uses the more command: # more /root/upgrade.
5. You might need to upgrade firmware according to Master Firmware List. The master firmware list for this release of the HP XC System Software is available at the following Web site: http://www.docs.hp.com/en/highperfcomp.html 6. If you want to configure eligible services for improved availability, you must install and configure an availability tool now.
Table 5-7 Files Containing User Customizations (continued)
File Name:
/opt/hptc/systemimager/etc/base_exclude_file.rpmsave
/opt/hptc/systemimager/etc/*.conf.rpmsave
/opt/hptc/config/*.rpmsave
/opt/hptc/config/etc/*.rpmsave
4. Follow the same process to open and search the log files for customizations specific to Linux configuration files. The file you view depends on the type of upgrade you performed. Search the files for configuration files that have either an .rpmsave or an .rpmnew extension.
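To locate every saved configuration file in one pass, a standard find command can help; the directory list here is illustrative:
# find /etc /opt/hptc -name '*.rpmsave' -o -name '*.rpmnew'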
upgradesys output logged to /var/log/upgradesys/upgradesys.log CAUTION: Do not proceed to the next step in the upgrade process if the output from the upgradesys script indicates failures. If you cannot determine how to resolve these errors, contact the HP XC Support organization at the following e-mail address: xc_support@hp.com 2. 3. Review the /opt/hptc/systemimager/etc/base_exclude_file to determine if you want to exclude files from the golden image beyond what is already excluded.
7. Do the following when the cluster_config utility displays the command-line options menu: [L]ist Nodes, [M]odify Nodes, [A]nalyze, [H]elp, [P]roceed, [Q]uit: a. b. If you specified the --init option, use the [M]odify Nodes option to reassign any role assignments you customized in the previous release. For example, if the system configuration had login roles on one or more nodes, you must assign a login role on any node on which you want users to be able to log in.
5.10 Task 9: Image and Boot the System and Start Compute Resources Follow this procedure to image and boot all nodes after the upgrade and start LSF: 1. Use the following startsys command on systems with fewer than 300 nodes to image and boot all nodes. For larger hardware configurations, see the next step. # startsys --image_and_boot 2. Use the following startsys commands on systems with more than 300 nodes.
5.11 Task 10: Start Availability Tools After the Upgrade Perform this task only if you configured improved availability of services, regardless of the availability tool you are using. Bypass this task if you did not configure availability sets. Run the transfer_to_avail command to shut down all services and IP aliases associated with services that are to be managed by an availability tool.
6 Reinstalling Version 3.1
This chapter describes how to reinstall HP XC System Software Version 3.1 on a system that is already running Version 3.1. Reinstalling an HP XC system with the same release may be necessary if you participated as a field test site of an advance development kit (ADK) or an early release candidate kit (RC).
# stopsys n[1-5]
# startsys --image_and_boot n[1-5]
The nodes automatically reboot when the reimaging is complete.
5. If SLURM is configured, reset the job state on nodes n1 through n5:
# scontrol update NodeName=n[1-5] State=IDLE
6.2 Reinstall Systems with HP Integrity Hardware Models
This section describes the following tasks:
• “Reinstall the Entire System” (page 110)
• “Reinstall One or More Nodes” (page 110)
1. Begin this procedure as the root user on the head node.
2. Use the scontrol command to ensure that all jobs are drained from nodes n1 through n5:
# scontrol update NodeName=n[1-5] State=DRAIN Reason="system shutdown"
3. Prepare all client nodes to network boot rather than boot from local disk:
# setnode --resync --all
A Installation and Configuration Checklist Table A-1 provides a list of tasks performed during a new installation. Use this checklist to ensure you complete all installation and configuration tasks in the correct order. Perform all tasks on the head node unless otherwise noted.
Table A-1 Installation and Configuration Checklist
Preparing for the Installation
1. Read related documents, especially the HP XC System Software Release Notes. If the hardware configuration contains HP blade servers and enclosures, download and print the HP XC Systems With HP Server Blades and Enclosures HowTo. Reference: “Task 1: Read Related Documentation” (page 23)
2. Plan for future releases. Reference: “Task 2: Plan for Future HP XC Releases” (page 23)
Table A-1 Installation and Configuration Checklist (continued) Description Reference 19. Perform the following tasks to define and set up the system environment before the golden image is created: • Put the XC.lic license key file in the /opt/hptc/etc/license directory (required). • Configure interconnect switch line monitoring cards (required). • Configure sendmail (required). • Customize the Nagios environment (required). • Set the BMC/IPMI password on HP Integrity servers (required).
B Host Name and Password Guidelines This appendix contains guidelines for making informed decisions about information you are asked to supply during the installation and configuration process. This appendix addresses the following topics: • “Host Name Guidelines” (page 117) • “Password Guidelines” (page 117) B.1 Host Name Guidelines Follow these guidelines when deciding on a host name: • Host names can contain from 2 to 63 alphanumeric uppercase or lowercase characters (a-z, A-Z, 0-9).
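For example, under this rule, names such as penguin2 or xclogin1 are acceptable because they consist of 2 to 63 alphanumeric characters, while a single-character name such as x is too short and a name such as head_node is invalid because the underscore is not an alphanumeric character. These sample names are illustrations, not values from this guide.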
C Enabling telnet on iLO and iLO2 Devices The procedure described in this appendix applies only to HP XC systems with nodes that use Integrated Lights Out (iLO or iLO2) as the console management device. New nodes that are managed with iLO or iLO2 console management connections that have never been installed with HP XC software may have iLO interfaces that have not been configured properly for HP XC operation.
2. 3. 4. Do one of the following: • If you cannot find an entry corresponding to the new node, check the network connections. Make repairs and rerun the discover command. • If you do find an entry corresponding to the new node, note the IP address on the line that begins with the string fixed-address, and proceed to step 3. Open a Web browser on the head node. In the Web address field at the top of the window, enter the IP address you noted in step 2 appended with /ie_index.htm: https://172.20.0.
hardware ethernet 00:11:0a:30:b0:bc; option host-name "cp-n3"; fixed-address 172.21.0.3; # location "Level 2 Switch 172.20.65.4, Port 3"; } host cp-n4 { hardware ethernet 00:11:0a:2f:8d:fc; option host-name "cp-n4"; fixed-address 172.21.0.4; # location "Level 2 Switch 172.20.65.4, Port 4"; } 2. 3. 4. Do one of the following: • If you cannot find an entry corresponding to the new node, check the network connections. Make repairs and rerun the discover command.
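To check quickly whether an entry exists for a given console port name, you can search the DHCP configuration. The path /etc/dhcpd.conf is the conventional location on Red Hat based systems of this era and is an assumption here, as is the cp-n4 name:
# grep -A 4 'host cp-n4' /etc/dhcpd.conf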
D Configuring Interconnect Switch Monitoring Cards You must configure the Quadrics switch controller cards, the InfiniBand switch controller cards, and the Myrinet monitoring line cards on the system interconnect to diagnose and debug problems with the system interconnect.
Table D-2 Quadrics Switch Controller Card Naming Conventions and IP Addresses for Full Bandwidth
Number of Nodes: 1 to 64. Node-Level Switch Name: QR0N00 (P)1, QR0N00_S (S)2. Node-Level IP Address: 172.20.66.1 (P), 172.20.66.2 (S). Top-Level Switch Name: Not applicable. Top-Level Switch IP Address: Not applicable.
Number of Nodes: 65 to 256. Node-Level Switch Name: QR0N00 to QR0N03 (P), QR0N00_S to QR0N03_S (S). Node-Level IP Address: 172.20.66.1 to 172.20.66.4 (P), 172.20.66.5 to 172.20.66.8 (S). Top-Level Switch Name: QR0T00 to QR0T01.
Number of Nodes: 257 to 512. Node-Level Switch Name: QR0N00 to QR0N07 (P). Node-Level IP Address: 172.20.66.
2. STATIC 3. Abort Enter 1,2,3 and press return [2]: 2 Setting to STATIC Enter rail: 0 Enter type (N for Node, S for Supertop, T for Top) (q to abort): N Enter location (0-127) (q to abort): 0 Setting switch name to default: QR0N00 Setting IP address by switchname Enter IP address [default 172.20.66.1] q to abort: Enter Enter netmask address [default 255.255.0.0] q to abort: your_netmask Enter gateway address [default 172.20.0.254], q to abort: Enter Enter TFTP/RIS server IP address [default 172.20.0.
Table D-3 Myrinet Switch Controller Card Naming Conventions and IP Addresses
Node-Level IP Address: 172.20.66.11. Top-Level Switch Name: Not applicable. Top-Level Switch IP Address: Not applicable.
Node-Level Switch Name: MR0N00 to MR0N02. Node-Level IP Address: 172.20.66.1 to 172.20.66.3. Top-Level Switch Name: MR0T00 to MR0T01. Top-Level Switch IP Address: 172.20.66.52 and 172.20.66.6.
Node-Level Switch Name: MR0N00 to MR0N03. Node-Level IP Address: 172.20.66.1 to 172.20.66.4. Top-Level Switch Name: MR0T00 to MR0T01. Top-Level Switch IP Address: 172.20.66.5 and 172.20.66.
} 5. Restart the DHCP service: # service dhcpd restart 6. Use the text editor of your choice to open the /etc/hosts file to include an entry for each monitoring line card, using the data in Table D-3 as a reference: 172.20.66.1 MR0N00 Make the entries above the following line in the file because any entries that follow this line will be deleted if you reconfigure the system: #XC-CLUSTER Do Not Edit Below this Line 7. 8. 9. Save your changes and exit the file.
login# admin Password# 123456 5. Access the enable mode with the default password (voltaire): ISR-9024# enable Password# voltaire 6. Change the default admin password: ISR-9024# password update admin 7. Change the default enable password: ISR-9024# password update enable 8. Access the configuration mode: ISR-9024# config 9. Access the interface fast mode: ISR-9024(config)# interface fast 10. Set the IP address of the switch and the netmask using the data in Table D-4 as a reference.
18. Set the switch time to be closely synchronized with the system time (within 1 minute). Replace MMDDhhmmYYYY with the actual system time: ISR-9024(config)# exit ISR-9024# clock set MMDDhhmmYYYY For example, to set the system date and time to 2:42 p.m. on September 29, 2006, enter the following command: ISR-9024# clock set 092914422006 19. Reset the switch to save the settings you just made: ISR-9024# reload 20. Log in as the root user on the head node. 21.
E Customizing Client Node Disks Use the information in this appendix to customize the disk partition layout on client node disk devices. This appendix addresses the following topics: • “Overview of Client Node Disk Imaging” (page 131) • “Configure Disks Dynamically” (page 131) • “Configure Disks Statically” (page 134) E.1 Overview of Client Node Disk Imaging The HP XC client node imaging process requires a single system disk on each client node on which the operating system is installed.
• /opt/hptc/systemimager/etc/make_partitions.sh Identifies the client disk type and size and creates the default partition table. When changing the default sizes of partitions or swap space, you will edit this file to effect the change. Read the comments in the file for more details. See the example in “Example 1: Changing Default Partition Sizes and Swap Space for All Client Nodes” (page 132) for information about how to make such a change.
# MEM_PERCENTAGE="1" 4. Change the MEM_PERCENTAGE variable to 1.5 to create a swap partition that is 1.5 times the size of physical memory. This will only be effective if the physical memory size is greater than 6 GB and less than 16 GB because swap partition size is bounded by these limits. MEM_PERCENTAGE="1.5" 5. 6. Save your changes to the file. Run the cluster_config utility, choosing the default answers, to create a new master autoinstallation script (/var/lib/systemimager/scripts/base_image.master.
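As a worked example of the rule above: with MEM_PERCENTAGE set to 1.5, a client node with 8 GB of physical memory receives 8 GB x 1.5 = 12 GB of swap because its memory size falls inside the 6 GB to 16 GB window, whereas a node with 4 GB of memory is unaffected because 1.5 x 4 GB = 6 GB, which does not exceed the stated lower bound.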
BOOT_PERCENTAGE=".01" ROOT_PERCENTAGE=".49" VAR_PERCENTAGE=".50" 5. 6. Save your changes to the file. Identify the node names of the login nodes: # shownode servers lvs n[135-136] 7. Create a symbolic link from the node names of the login nodes to the newly created master autoinstallation script. Note that the node name is appended with a .sh extension: for i in n135 n136 do ln -sf login.master.0 $i.sh done 8.
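You can confirm the links afterward with a directory listing; this check is an illustration that reuses the node names from the loop above:
# ls -l /var/lib/systemimager/scripts/n135.sh /var/lib/systemimager/scripts/n136.sh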
By changing the contents of the appropriate .conf file, you can affect the disk configuration for a particular node or group of nodes by linking those nodes to the associated master autoinstallation script. You can also create your own .conf files and associated master autoinstallation script. For information about creating your own master autoinstallation script, see mkautoinstallscript(8).
• If the client nodes were not previously installed with the HP XC System Software, see “Task 11: Run the startsys Utility to Start the System and Propagate the Golden Image” (page 80) to continue the initial installation procedure.
F Node Roles, Services, and the Default Configuration This appendix addresses the following topics: • “Default Node Role Assignments” (page 137) • “Special Considerations for Modifying Default Node Role Assignments” (page 137) • “Role Definitions” (page 138) F.1 Default Node Role Assignments Table F-1 lists the default role assignments. The default assignments are based on the number of total nodes in the system.
F.2.2 Special Considerations for Systems with 63 or Fewer Nodes Before deciding whether or not you want to accept the default configuration for systems with 63 or fewer nodes, consider that a compute role is assigned to the head node by default. Therefore, when LSF users submit jobs, it is possible that the jobs run on the head node. In that situation, less than optimal performance is obtained if interactive users are also on the head node.
Because you cannot assign the node_management role to any other node except the head node, the avail_node_management role was developed to accomplish that task. Never assign the avail_node_management role to the head node. Assign the avail_node_management role only to the second node in an availability set to fail over the database server (dbserver) service. F.3.3 Common Role The common role is automatically assigned to all nodes, and it cannot be removed.
You can assign other roles to a node with this role. However, you must be careful not to overload the node so it can provide adequate NFS service. F.3.7 External Role The external role supplies the NAT server service, which does network address translation within the cluster. This enables applications to access nodes that do not have an external network connection. The configuration and management database name of the service supplied by this role is nat.
F.3.11 NIS Server Role
The nis_server role is not enabled by default. Assigning this role to a node configures the node as a NIS slave server. If you assign this role to a node, you are prompted to enter the name of the NIS master server and the NIS domain name during cluster_config processing. Any node assigned the nis_server role must also have an external Ethernet network connection defined. The configuration and management database name of the service provided by this role is nis.
G Using the cluster_config Command-Line Menu This appendix describes how to use the configuration command-line menu that is displayed by the cluster_config utility. This appendix addresses the following topics: • “cluster_config Command-Line Menu Overview” (page 143) • “List Node Configuration Information” (page 143) • “Modify Node Configuration” (page 143) • “Analyze Current Role Assignments Against HP Recommendations” (page 146) • “Customize Service and Client Configurations” (page 147) G.
[L]ist Nodes, [M]odify Nodes, [A]nalyze, [H]elp, [P]roceed, [Q]uit: m You are prompted to supply the node name of the node you want to modify. All operations you perform from this point are performed on this node until you specify a different node name. Please enter node name or [C]ancel: n15 Current Node: n15 [E]xternal Network Configuration, [R]oles, [H]elp, [B]ack: At this point you have the following options: • Enter the letter e to add, remove, or modify an external Ethernet connection on any node.
[E]dit Network Settings, [D]elete Network Settings, [H]elp, [B]ack: 5. After you have added the Ethernet connections, you have the option to do the following: • Enter the letter e to add an Ethernet connection on another node. • Enter the letter d to remove an Ethernet connection. • Enter the letter b to return to the previous menu. G.5 Modify Node Role Assignments The cluster configuration menu enables you to assign roles to specific nodes.
Roles to be assigned: compute disk_io login [R]eassign Roles, [O]k, [C]ancel : 5. Do one of the following: • Enter the letter o to accept the role assignments you just made to the node. • Enter the letter r to start the role assignment process again on this node. The Reassign Roles option does not apply the role assignments, it simply allows you to return to the list of roles and make adjustments to the role assignments.
1 1 N resource_management Note: n499 does not have external connection recommended by resource_management
Role Rec: Role Recommended
HN Req: Head Node Required
HN Rec: Head Node Recommended
Exc Rec: Exclusivity Recommended
Ext Req: External Connection Required
Ext Rec: External Connection Recommended
Table G-3 Specific Node-By-Node Output of the Analyze Option
Column Heading: Recommend
Description: Displays the number of nodes recommended for a particular role based on the number of nodes in the system.
• Enter the letter s to perform customized services configuration on the nodes in the system. This option is intended for experienced HP XC administrators who want to customize service servers and clients. Intervention like this is typically not required for HP XC systems. See “Services Configuration Commands” (page 148) for information about each services configuration command.
• Enter the letter p to continue with the system configuration process.
Creating and Adding Node Attributes
Using the previous two examples, enter the following commands to create and add node attributes:
svcs> create na_disable_server.cmf
Attribute "na_disable_server.cmf" created
svcs> create na_disable_client.supermond
Attribute "na_disable_client.supermond" created
svcs> add na_disable_server.cmf n3
Attribute "na_disable_server.cmf" added to n3
svcs> add na_disable_client.supermond n1
Attribute "na_disable_client.
Table G-4 Service Configuration Command Descriptions (continued)
Command: [a]dd attribute_name node|node_list
Description: Adds a node attribute to a specific node or node list. The attribute must have been created previously. Node lists can take two forms: explicit (such as n1, n2, n3, n5) or condensed (such as n[1-3,5]). In the node list examples, the node prefix is the letter n. Replace n with the node-naming prefix you chose for your nodes.
Sample use:
svcs> add na_disable_client.
H Determining the Network Type The information in this appendix applies only to cluster platforms with a QsNetII interconnect. During the processing of the cluster_config utility, the swmlogger gconfig script prompts you to supply the network type of the system. The network type reflects the maximum number of ports the switch can support, and the network type is used to create the qsnet diagnostics database.
I LSF Installation Values This appendix lists the LSF values that were configured for the system during the HP XC System Software installation process. For more information about setting LSF environment variables and parameters listed in Table I-1, see the Platform LSF Reference manual.
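After the system is running, you can cross-check some of these values against the live LSF configuration. The following is a minimal check, assuming the LSF commands are in your path (for example, after sourcing the profile.lsf file mentioned elsewhere in this guide):
# lsid
The lsid command prints the LSF version, the cluster name (hptclsf by default), and the master host name, which you can compare against the values listed in Table I-1.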
Table I-1 Default Installation Values for LSF (continued)
LSF_CLUSTERNAME
Value: hptclsf
Description: Defines the name by which LSF knows the HP XC system. You specify this name during the LSF configuration process.
Where is this value stored? The install.config file.
SLURM Configuration
MaxJobCount
Value: 2000
Description: Defines the maximum number of jobs that the slurmctld daemon can store in its memory at any one time.
Where is this value stored? The slurm.conf file.
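If you want to verify the MaxJobCount value on a configured system, one simple approach is to search the shared SLURM configuration file directly. The path below is the one used elsewhere in this guide, and the output line shown is illustrative:
# grep -i MaxJobCount /hptc_cluster/slurm/etc/slurm.conf
MaxJobCount=2000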
J OVP Command Output This appendix provides command output from the OVP utility, which verifies successful installation and configuration of software and hardware components. # ovp --verbose XC CLUSTER VERIFICATION PROCEDURE Fri Sep 29 08:03:03 2006 Verify connectivity: Testing etc_hosts_integrity ... There are 47 IP addresses to ping. A total of 47 addresses were pinged. Test completed successfully. All IP addresses were reachable. +++ PASSED +++ Verify client_nodes: Testing network_boot ...
+++ PASSED +++
Verify license:
Testing file_integrity ...
Checking license file: /opt/hptc/etc/license/XC.lic
+++ PASSED +++
Testing server_status ...
Running verify_server_status
Starting the command: /opt/hptc/sbin/lmstat
Here is the output from the command:
lmstat - Copyright (c) 1989-2004 by Macrovision Corporation. All rights reserved.
Flexible License Manager status on Fri 9/29/2006 08:03
License server status: 27000@n16
License file(s) on n16: /opt/hptc/etc/license/XC.
Here is the output from the command:
Slurmctld(primary/backup) at n14/n16 are UP/UP
Checking output from scontrol.
+++ PASSED +++
Testing partition_state ...
Starting the command: /opt/hptc/bin/sinfo --all
Here is the output from the command:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
lsf up infinite 14 idle n[3-16]
Checking output from command.
+++ PASSED +++
Testing node_state ...
Virtual hostname is lsfhost.localdomain Comparing ncpus from Lsf lshosts to Slurm cpu count. The Lsf and Slurm cpu count are in sync. +++ PASSED +++ Testing hosts_status ... Running 'bhosts -w'. Checking output from bhosts. Running 'controllsf show' to determine virtual hostname. Checking output from controllsf. Virtual hostname is lsfhost.localdomain Comparing MAX job slots from Lsf bhosts to Slurm cpu count. The Lsf MAX job slots and Slurm cpu count are in sync.
Checking contact groups... Checked 1 contact groups. Checking service escalations... Checked 0 service escalations. Checking service dependencies... Checked 168 service dependencies. Checking host escalations... Checked 0 host escalations. Checking host dependencies... Checked 0 host dependencies. Checking commands... Checked 57 commands. Checking time periods... Checked 4 time periods. Checking extended host info definitions... Checked 0 extended host info definitions.
Starting on lsfhost.localdomain
All nodes have memory usage less than 25%.
+++ PASSED +++
Testing cpu ...
The headnode is excluded from the cpu usage test.
Number of nodes allocated for this test is 13
Job 110 is submitted to default queue interactive.
Waiting for dispatch ...
Starting on lsfhost.localdomain
Starting cpu test
cpu test complete
Detailed linpack results for each node can be found in /hptc_cluster/ovp/ovp_n16_092906.tests/tests/100.perf_health/30.cpu if the --keep flag was specified.
Waiting for dispatch ... Starting on lsfhost.localdomain Starting Alltoall test Starting Allgather test Starting Allreduce test Testing complete. AlltoAll results summary (all values in micro seconds): min time: 40664.340000 max time: 41627.830000 median time: 41402.680000 mean time: 41241.710000 range: 963.490000 variance: 169757.663533 std_dev: 412.016582 AllGather results summary (all values in micro seconds): min time: 37469.500000 max time: 38768.920000 median time: 38751.300000 mean time: 38160.
Starting on lsfhost.localdomain [0: n3:1] ping-pong 7718.08 usec/msg 518.26 MB/sec [1: n4:2] ping-pong 7613.24 usec/msg 525.40 MB/sec [2: n5:3] ping-pong 7609.81 usec/msg 525.64 MB/sec [3: n6:4] ping-pong 7529.98 usec/msg 531.21 MB/sec [4: n7:5] ping-pong 7453.28 usec/msg 536.68 MB/sec [5: n8:6] ping-pong 7455.94 usec/msg 536.48 MB/sec [6: n9:7] ping-pong 7454.73 usec/msg 536.57 MB/sec [7: n10:8] ping-pong 7442.87 usec/msg 537.43 MB/sec [8: n11:9] ping-pong 7462.81 usec/msg 535.
K upgraderpms Command Output
This appendix provides command output from the upgraderpms command, which is run during a minor software upgrade. Use the upgraderpms utility only if you are performing a minor upgrade to install the new HP XC release on your system. Before running the upgraderpms utility, you must mount the new XC release DVD on the /mnt/cdrom directory and then use the cd command to go to that directory.
# upgraderpms
Command output is similar to the following:
---> Package hptc_release.noarch 0:1.0-15 set to be updated ---> Downloading header for munge to pack into transaction set. ---> Package munge.ia64 0:0.4.2-1.3hp set to be updated ---> Downloading header for gcc-g77 to pack into transaction set. ---> Package gcc-g77.ia64 0:3.4.5-2 set to be updated ---> Downloading header for fonts-xorg-base to pack into transaction set. ---> Package fonts-xorg-base.noarch 0:6.8.2-1.EL set to be updated ---> Downloading header for gcc-java to pack into transaction set.
---> Downloading header for systemimager-common to pack into transaction set. ---> Package systemimager-common.noarch 0:3.4.1-28hp set to be updated ---> Downloading header for pdsh to pack into transaction set. ---> Package pdsh.ia64 0:2.10-4.1 set to be updated ---> Downloading header for syslog-ng to pack into transaction set. ---> Package syslog-ng.ia64 0:1.6.2-6 set to be updated ---> Downloading header for rpm-devel to pack into transaction set. ---> Package rpm-devel.ia64 0:4.3.
. . .
---> Downloading header for slurm-sched-wiki to pack into transaction set.
---> Package slurm-sched-wiki.ia64 0:1.0.
---> Package bzip2-libs.ia64 0:1.0.2-13.EL4.3 set to be updated ---> Downloading header for sys_check to pack into transaction set. ---> Package sys_check.ia64 0:1.0.2-20 set to be updated ---> Downloading header for libstdc++-devel to pack into transaction set. ---> Package libstdc++-devel.ia64 0:3.4.5-2 set to be updated ---> Downloading header for kickstart to pack into transaction set. ---> Package kickstart.
---> Downloading header for newt-devel to pack into transaction set. ---> Package newt-devel.ia64 0:0.51.6-7.rhel4 set to be updated ---> Downloading header for hptc-qsnet2-diag to pack into transaction set. ---> Package hptc-qsnet2-diag.noarch 0:1-19 set to be updated ---> Downloading header for ypserv to pack into transaction set. ---> Package ypserv.ia64 0:2.13-9.1hptc set to be updated ---> Downloading header for keyutils to pack into transaction set. ---> Package keyutils.ia64 0:1.
---> Package fonts-xorg-100dpi.noarch 0:6.8.2-1.EL set to be updated ---> Downloading header for evolution to pack into transaction set. ---> Package evolution.ia64 0:2.0.2-27.lnxhpc.1 set to be updated --> Running transaction check --> Processing Dependency: perl(Tk::Tree) for package: qsnet2libs --> Processing Dependency: perl(Tk::Label) for package: qsnet2libs --> Processing Dependency: libkeyutils.so.
file
flamethrower
fonts-xorg-100dpi
fonts-xorg-75dpi
fonts-xorg-base
gaim
gcc
gcc-c++
gcc-g77
gcc-java
gdb
gdm
libsoup ia64 2.2.1-4 linuxrpms 156 k
libstdc++ ia64 3.4.5-2 linuxrpms 362 k
libstdc++ i386 3.4.5-2 linuxrpms 279 k
libstdc++-devel ia64 3.4.5-2 linuxrpms 9.7 M
libuser ia64 0.52.5-1.el4.1 linuxrpms 453 k
libuser-devel ia64 0.52.5-1.el4.1 linuxrpms 108 k
linuxwacom ia64 0.7.0-EL4.1 linuxrpms 194 k
lsf ia64 6.2-4hp hpcrpms 105 M
lvm2 ia64 2.02.01-1.3.RHEL4 linuxrpms 1.4 M
mdadm ia64 1.6.0-3.1hp hpcrpms 107 k
module-init-tools ia64 3.1-0.pre5.3.2 linuxrpms 605 k
modulefiles_hptc noarch 1.
(2.6.9-34.1hp.3sp.XCsmp) does not match modules (2.6.9-34.7hp.XCsmp)
reboot with correct kernel to load QsNet modules
warning: /etc/sysconfig/iptables.proto saved as /etc/sysconfig/iptables.proto.rpmorig
Shutting down collectl: [ OK ]
/opt/hptc/lib ia64 2:4.0.3-60.RHEL4 linuxrpms 712 k
slurm ia64 1.0.15-1hp hpcrpms 2.9 M
slurm-auth-munge ia64 1.0.15-1hp hpcrpms 8.7 k
slurm-auth-none ia64 1.0.15-1hp hpcrpms 6.8 k
slurm-devel ia64 1.0.15-1hp hpcrpms 204 k
slurm-sched-wiki ia64 1.0.
Dependency Installed: Tk.ia64 0:804.027-1hp audit.ia64 0:1.0.12-1.EL4 hptc-supermon-modules-source.ia64 0:2-0.18 keyutils-libs.ia64 0:1.0-2 modules.ia64 0:3.1.6-4hptc Updated: IO-Socket-SSL.ia64 0:0.96-98 MAKEDEV.ia64 0:3.15.2-3 OpenIPMI.ia64 0:1.4.14-1.4E.12 OpenIPMI-libs.ia64 0:1.4.14-1.4E.12 autofs.ia64 1:4.1.3-169 binutils.ia64 0:2.15.92.0.2-18 bzip2.ia64 0:1.0.2-13.EL4.3 bzip2-devel.ia64 0:1.0.2-13.EL4.3 bzip2-libs.ia64 0:1.0.2-13.EL4.3 chkconfig.ia64 0:1.3.13.3-2 cpp.ia64 0:3.4.5-2 crash.ia64 0:4.0-2.
system-config-lvm.noarch 0:1.0.16-1.0 system-config-network.noarch 0:1.3.22.0.EL.4.2-1 system-config-network-tui.noarch 0:1.3.22.0.EL.4.2-1 system-config-printer.ia64 0:0.6.116.5-1 system-config-printer-gui.ia64 0:0.6.116.5-1 systemconfigurator.noarch 0:2.2.2-5hp systemimager-client.noarch 0:3.4.1-28hp systemimager-common.noarch 0:3.4.1-28hp systemimager-doc.noarch 0:3.4.1-28hp systemimager-flamethrower.noarch 0:3.4.1-28hp systemimager-ia64boot-standard.noarch 0:3.4.1-28hp systemimager-server.noarch 0:3.4.
---> Downloading header for Text-DHCPparse to pack into transaction set. ---> Package Text-DHCPparse.ia64 0:0.07-2hp set to be updated ---> Downloading header for hptc-avail to pack into transaction set. ---> Package hptc-avail.noarch 0:1.0-1.19 set to be updated ---> Downloading header for hptc-smartd to pack into transaction set. ---> Package hptc-smartd.noarch 0:1-1 set to be updated ---> Downloading header for collectl-utils to pack into transaction set. ---> Package collectl-utils.noarch 0:1.3.
Setting up Install Process Setting up repositories Reading repository metadata in from local files Parsing package install arguments Nothing to do Setting up Install Process Setting up repositories Reading repository metadata in from local files Parsing package install arguments Resolving Dependencies --> Populating transaction set with selected packages. Please wait. ---> Downloading header for qsnetdiags to pack into transaction set. ---> Package qsnetdiags.ia64 0:1.0.2-14.
L Installing and Using PBS Professional
This appendix addresses the following topics:
• “PBS Professional Overview” (page 177)
• “Before You Begin” (page 177)
• “Plan the Installation” (page 177)
• “Perform Installation Actions Specific to HP XC” (page 177)
• “Configure PBS Professional under HP XC” (page 178)
• “Replicate Execution Nodes” (page 179)
• “Enter License Information” (page 179)
• “Start the Service Daemons” (page 180)
• “Set Up PBS Professional at the User Level” (page 180)
• “Run HP MPI Tasks”
a. Accept the default value offered for the PBS_HOME directory, which is /var/spool/PBS.
b. When prompted for the type of PBS installation, select option 1 (Server, execution and commands).
c. If available, enter the license key during the interactive installation.
d. Otherwise, you can execute the script named /usr/pbs/etc/pbs_setlicense on the PBS server node after the installation is complete.
See “Replicate Execution Nodes” (page 179) for more information.
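If you skipped the license key during the interactive installation, you can supply it later by running the script mentioned in item d. This is a sketch only; the script might prompt for the key or accept options, depending on your PBS Professional version, so check its usage output first:
# /usr/pbs/etc/pbs_setlicense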
L.5.2 Remove Nodes from the SLURM or LSF Configuration
To prevent SLURM or LSF from allocating jobs to PBS execution nodes, follow this procedure:
1. Remove the PBS execution nodes from all SLURM partitions specified in the /hptc_cluster/slurm/etc/slurm.conf file (see the sketch at the end of this section). See the HP XC System Software Administration Guide for details on configuring SLURM partitions.
2. Implement the changes:
# scontrol reconfig
# badmin reconfig
L.5.
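The following sketch illustrates step 1 of the preceding procedure. The node names and partition definition are hypothetical: suppose nodes n[45-49] are the PBS execution nodes and the lsf partition in /hptc_cluster/slurm/etc/slurm.conf originally reads:
PartitionName=lsf Nodes=n[3-49]
Removing the PBS execution nodes from the partition changes the line to:
PartitionName=lsf Nodes=n[3-44]
After you edit the file, apply the changes with the scontrol reconfig and badmin reconfig commands shown in step 2.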
L.8 Start the Service Daemons
Enter the following command to start the server, scheduler, and MOM daemons:
# pdsh -w "x[n-n, N]" service pbs start
In the previous command, the node list "x[n-n, N]" specifies the range of execution nodes (n-n) and the PBS server node (N). For example, a valid node list is "n[1-49,100]" (the double quotation marks are required).
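To confirm that the daemons are running on all of those nodes, you can issue the corresponding status request with the same node list. This assumes the pbs init script accepts the status argument, which is typical but worth verifying on your installation; the node list shown reuses the hypothetical example from above:
# pdsh -w "n[1-49,100]" service pbs status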
The PBS Professional documentation for the pbs_mpihp wrapper recommends replacing the HP MPI mpirun command with a symbolic link to pbs_mpihp to make the presence of PBS Professional completely transparent to HP MPI users. On systems where PBS Professional is the only active queuing system, this transparency might be desirable. However, on systems installed and configured with SLURM and PBS Professional, you must first determine whether this configuration is appropriate for your needs.
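If you decide that replacing mpirun is appropriate, the change amounts to preserving the original binary and pointing its name at the wrapper. The following is a minimal sketch; the paths /opt/hpmpi/bin/mpirun and /usr/pbs/bin/pbs_mpihp are assumptions, so verify the actual locations on your system first:
# mv /opt/hpmpi/bin/mpirun /opt/hpmpi/bin/mpirun.real
# ln -s /usr/pbs/bin/pbs_mpihp /opt/hpmpi/bin/mpirun
To undo the change, remove the symbolic link and restore the saved binary.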
M Installing the Maui Scheduler
This appendix describes how to install and configure the Maui Scheduler software tool to interoperate with SLURM on an HP XC system. This appendix addresses the following topics:
• “Maui Scheduler Overview” (page 183)
• “Readiness Criteria” (page 183)
• “Before You Begin” (page 183)
• “Installation Procedure” (page 184)
• “Verify Successful Installation of the Maui Scheduler” (page 186)
M.
Before you install the Maui Scheduler on an HP XC system, ensure that the HP XC version of LSF-HPC with SLURM is not activated on the system. If LSF-HPC with SLURM is activated, you must deactivate it before proceeding. The following procedure describes how to determine whether LSF-HPC with SLURM is activated and running on the system and how to deactivate it. Deactivating LSF consists of first stopping the LSF service and then disabling it. 1.
1. Log in as the root user on the head node.
2. Download the Maui Scheduler kit to a convenient directory on the system. The Maui Scheduler kit is called maui-3.2.6p9, and it is available at:
http://www.clusterresources.com/products/maui/
M.4.2 Task 2: Compile the Maui Scheduler from Its Source Distribution
To compile the Maui Scheduler from its source distribution, go to the directory where you downloaded the Maui Scheduler kit and enter the following commands: 1. .
M.4.4 Task 4: Edit the SLURM Configuration File Uncomment the following lines in the /hptc_cluster/slurm/etc/slurm.conf SLURM configuration file: SchedulerType=sched/wiki SchedulerAuth=42 SchedulerPort=7321 M.4.5 Task 5: Configure the Maui Scheduler After you install and set up the Maui Scheduler to interoperate with SLURM on HP XC, you must perform the standard Maui Scheduler configuration steps. The configuration is complicated and is beyond the scope of this document.
Req[0] TaskCount: 6 Partition: lsf
Network: [NONE] Memory >= 1M Disk >= 1M Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [NONE]
NodeCount: 1
Allocated Nodes: [n16:4][n15:2]
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 1
PartitionMask: [lsf]
Reservation '116' (00:00:00 -> 1:00:00 Duration: 1:00:00)
PE: 6.00 StartPriority: 1
Table M-2 lists several commands that provide diagnostic information about various aspects of resources, workload, and scheduling.
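In addition to checkjob, the showq command is a common way to view the workload from the Maui Scheduler's perspective. This is a minimal illustration; the exact columns vary with the Maui version and your configuration:
# showq
The output summarizes active, idle, and blocked jobs together with the number of processors and nodes currently available.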
N Troubleshooting
This appendix addresses the following topics:
• “Troubleshoot the Discovery Process” (page 189)
• “Troubleshoot the Cluster Configuration Process” (page 192)
• “Troubleshoot the Imaging Process” (page 192)
• “Troubleshoot Licenses” (page 194)
• “Troubleshoot OVP Results” (page 195)
• “Troubleshoot the Software Upgrade Procedure” (page 196)
N.1 Troubleshoot the Discovery Process
Figure N-1 provides a high-level flowchart that illustrates the processing performed by the discover command.
http://www.docs.hp.com/en/highperfcomp.html
The remainder of this section provides troubleshooting hints to help you solve some common problems that might occur during the discovery process.
NOTE: If the --oldmp option was used on the discover command line, it is assumed that all Management Processors (MPs) have their IP addresses set statically and are therefore not subject to this step in the discovery process. If some console ports are not configured to use DHCP, they are not discovered. Therefore, the first item to verify is whether the nondiscovered console ports are configured to use DHCP.
. . . In this case, a node is plugged into port 6 of the Branch Root switch at address 172.20.65.3. To resolve the discovery problem, examine this node to see what actions it is taking during power-on. Is it booting from the network? Is the proper network interface plugged into the switch? After these issues are resolved, run the discover command again. If discover encounters a port where a node is expected to be plugged in but is not found, a message similar to the following is displayed: Switch 172.20.65.
System imaging and node configuration information is stored in the following log files: • /hptc_cluster/adm/logs/imaging.log • /var/log/systemimager/rsyncd • /hptc_cluster/adm/logs/startsys.log Table N-1 (page 193) lists problems you might encounter as the golden image is being propagated to client nodes and describes how to diagnose and resolve the problem.
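While a node is imaging, it can be helpful to watch these logs in real time. A minimal example using the first log file listed above (the same approach works for the other two log files):
# tail -f /hptc_cluster/adm/logs/imaging.log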
Table N-1 Diagnosing System Imaging Problems (continued)
Symptom: An imaged node boots correctly, but the node hangs in the autoinstall script waiting for the first multicast operation.
How To Diagnose: Verify that the node has started imaging by looking for “imaging_started” messages in the rsyncd log file. Verify that no “finished” messages are in the imaging.log file.
Possible Solution: • Ensure that startsys was used to image the nodes.
No errors found Restart LIM on ...... Done N.5 Troubleshoot OVP Results The following list provides suggestions for troubleshooting OVP test failures: • The OVP issues a test failure if CPU usage on one or more nodes is found to be over 10 percent. In that case, use the top command to determine what processes are running and consuming CPU resources. • CPU usage might hit a spike if the system is in a metrics collection phase.
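For example, the following invocation takes one noninteractive snapshot of process activity, which is convenient for comparing several nodes or saving output to a file; the -b (batch mode) and -n 1 (single iteration) options are standard for the Linux top command:
# top -b -n 1 | head -20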
range: 7.390000 variance: 3.365941 std_dev: 1.834650 The following node(s) have values more than 3 standard deviations from the mean: node n16 has a value of 789.250000 --- FAILED --- N.6 Troubleshoot the Software Upgrade Procedure The following list provides suggestions for troubleshooting problems you might encounter when upgrading the HP XC System Software from a previous release to this release: • Look at the upgrade log files to determine if there were any upgrade failures.
To help you troubleshoot major software upgrades, Example N-1 (page 197) and Example N-2 (page 197) provide examples of successful and unsuccessful content in the postinstall.log file. Example N-1 Successful Content in /var/log/postinstall.log File Installing HP value add RPMS: info: Package flamethrower-0.1.6-1.noarch.rpm is already installed info: Package perl-XML-Simple-1.08-1.noarch.rpm is already installed Preparing...
Glossary A administration branch The half (branch) of the administration network that contains all of the general-purpose administration ports to the nodes of the HP XC system. administration network The private network within the HP XC system that is used for administrative operations. availability set An association of two individual nodes so that one node acts as the first server and the other node acts as the second server of a service. See also improved availability, availability tool.
external network node A node that is connected to a network external to the HP XC system. F fairshare An LSF job-scheduling policy that specifies how resources should be shared by competing users. A fairshare policy defines the order in which LSF attempts to place jobs that are in a queue or a host partition. FCFS First-come, first-served.
Integrated Lights Out See iLO. interconnect A hardware component that provides high-speed connectivity between the nodes in the HP XC system. It is used for message passing and remote memory access capabilities for parallel applications. interconnect module A module in an HP BladeSystem server.
MCS An optional integrated system that uses chilled water technology to triple the standard cooling capacity of a single rack. This system helps take the heat out of high-density deployments of servers and blades, enabling greater densities in data centers. Modular Cooling System See MCS. module A package that provides for the dynamic modification of a user's environment by means of modulefiles. See also modulefile.
PXE Preboot Execution Environment. A standard client/server interface that enables networked computers that are not yet installed with an operating system to be configured and booted remotely. PXE booting is configured at the BIOS level. R resource management role Nodes with this role manage the allocation of resources to user applications. role A set of services that are assigned to a node. Root Administration Switch A component of the administration network.
Index A adduser command, 63 administration network testing, 91 using as interconnect network, 54 administrator password ProCurve switch, 47 Anaconda kickstart, 96 Apache self-signed certificate, 49 configuring, 74 avail_node_management role, 138 availability role, 138 availability set defined, 26 availability sets configuring with cluster_config, 66 availability tool, 26 Heartbeat, 28 HP Serviceguard, 27 starting, 83 verifying operation, 90 B back up cmdb, 93 cmdb before cluster_config, 66 SFS server, 24 b
troubleshooting, 189 discover process troubleshooting, 189–190 disk configuration file, 134 disk partition layout, 35 layout on client nodes, 63 on client nodes, 131 size, 35 disk_io role, 139 distribution media, 33 DNS configuration, 24 DNS search path, 46 DNS server, 46 documentation additional publications, 21 compilers, 20 FlexLM, 19 HowTo, 17 HP XC System Software, 17 Linux, 20 LSF, 18 manpages, 21 master firmware list, 17 Modules, 19 MPI, 20 MySQL, 19 Nagios, 19 pdsh, 19 reporting errors in, 21 rrdtoo
I iLO device enabling telnet, 119 iLO2 device enabling telnet, 119 image server, 43 imaging, 43 monitoring, 194 troubleshooting, 192 imaging log file, 44 improved availability associating nodes in availability sets, 28 availability tools, 27 configure availability sets, 66 dbserver service, 29 defined, 26 eligible services, 29 Heartbeat, 28 HP Serviceguard, 27 installing Serviceguard RPM, 39 LVS director service, 29 Nagios master service, 30 NAT service, 30 of /hptc_cluster file system, 30 role assignment,
default installation values, 153 documentation, 18 does not start, 194 failover capability, 30 license troubleshooting, 194 LIM daemon, 87 lsfadmin user account, 63 postconfiguration tasks, 87 profile.lsf file, 87 sendmail, 59 testing, 91 user account, 63 verify configuration, 89 LSF configuration, 49 LSF-HPC with SLURM (see also LSF) defined, 33 verifying operation, 89 lsf.
O R -oldmp option, 47 operation verification program (see OVP) OVP, 91 log file, 92 sample command output, 155 troubleshooting, 195 reinstalling software, 109 release version, 97 reporting documentation errors feedback e-mail address for, 21 resource management role, 141 roles analyze current versus recommended, 146 assigning with cluster_config utility, 68 avail_node_management, 138 availability, 138 common, 139 compute, 139 console_network, 139 default role assignment, 137 defined, 138 disk_io, 139 ext
on enclosures, 85 on MCS devices, 85 snmptrapd service, 49, 73 software installing from local distribution media, 40 patch download site, 52 reinstalling, 109 software development tools, 42 software patches, 52 software RAID, 62 documentation, 20 enabling on client nodes, 62 mdadm utility, 20 mirroring head node, 36 software stack, 33 software upgrade (see upgrade) software version, 97 spconfig utility, 86 ssh configuring on InfiniBand switch, 83 ssh key, 48 standard LSF defined, 33 verify configuration, 89
X XC software version, 97 XC.