Setting up HP SIM 5.x on a Linux-based Serviceguard cluster

Contents:
  What is HP SIM
  HP SIM architecture
  Setup process
What is HP SIM

HP Systems Insight Manager (HP SIM) is an industry-standard tool for managing HP systems, both servers and storage. With HP SIM you can manage various systems, including ProLiant servers running Windows, Linux, and NetWare; HP Integrity and HP 9000 servers running HP-UX; and HP Integrity servers running Windows and Linux; and you can monitor Alpha servers running Tru64 UNIX and OpenVMS.
Figure 1

The two important pieces in HP SIM are the software and the database it maintains. On a Linux CMS, it can use either a PostgreSQL or an Oracle database. The HP SIM package consists of the following elements:

HP SIM-Linux - HP Systems Insight Manager (C.05.0X.02.00), an automatic BIN file installation kit which provides:
  hpsim-C.05.02.00.00-1
  hpsmdb-server-8.2.1-1HPSIM   Programs needed to create user-defined types and functions
  hpsmdb-libs-8.2.1-1HPSIM     Essential shared libraries
  hpsmdb-8.2.
Configuration details

Table 1

  Name                           Public IP Address   Heartbeat IP Address
  SG1hpsimlnx (DL380 G2)         192.168.5.11        10.0.0.31
  SG2hpsimlnx (BL20p G2)         192.168.5.12        10.0.0.32
  SGhpsimlnx (cluster name)      192.168.5.10
  SGQShpsimlnx (quorum server)   192.168.5.13

Each system of the cluster is running RHEL 4 Update 4.

Installing the HP SIM binary on system1 and system2

Install HP SIM on each node in parallel. Download the binary distribution under /tmp, and then execute the following steps.
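To keep name resolution consistent on both nodes, the addresses from Table 1 can be recorded in /etc/hosts. The sketch below writes to a temporary file for illustration; on a real node the target would be /etc/hosts, and the "p"-suffixed heartbeat names follow the entries used later in cmclnodelist.

```shell
# Sketch: hosts entries matching Table 1 so public, heartbeat, cluster,
# and quorum names resolve identically on both nodes. A temp file stands
# in for /etc/hosts here.
hosts_file=$(mktemp)
cat >> "$hosts_file" << 'EOF'
192.168.5.11  SG1hpsimlnx    # node 1 (DL380 G2), public
192.168.5.12  SG2hpsimlnx    # node 2 (BL20p G2), public
192.168.5.10  SGhpsimlnx     # cluster name
192.168.5.13  SGQShpsimlnx   # quorum server
10.0.0.31     SG1hpsimlnxp   # node 1 heartbeat
10.0.0.32     SG2hpsimlnxp   # node 2 heartbeat
EOF
```

Duplicating this file on both nodes before installing Serviceguard avoids resolution mismatches during cluster formation.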
Ok

3. Test the prerequisites by executing the following command:

# /opt/mx/bin/mxinitconfig -l
Listing current status of server components (15):
 1. Check Kernel Parameters ..OK        Status : Unconfigured
 2. Node Security File ..OK             Status : Configured
 3. Server Property File ..OK           Status : Unconfigured
 4. Server Authentication Keys ..OK     Status : Unconfigured
 5. SSH Keys ..OK                       Status : Unconfigured
 6. Status Property File ..OK           Status : Unconfigured
 7. Task Results Output Cleanup ..OK    Status : Unconfigured
 8.
Requisite scan completed successfully.

Configuring Server Components (15):
 1. Check Kernel Parameters ..Done
 2. Node Security File ..Done
 3. Server Property File ..Done
 4. Server Authentication Keys ..Done
 5. SSH Keys ..Done
 6. Status Property File ..Done
 7. Task Results Output Cleanup ..Done
 8. Database Configuration ....Done
 9. Database Content ..Done
10. Web Server ..Done
11. Setup Property File ..Done
12. JBoss Setup ..Done
13. Agent Configuration ..Done
14. Management Services ..Done
15.
[root@SG1hpsimlnx ~]# mke2fs -m0 -j /dev/vgsg/lvsg
mke2fs 1.35 (28-Feb-2004)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
2562240 inodes, 5117952 blocks
0 blocks (0.
# umount /hpsimlnx

13. For the second system, once the shared storage is available to it, do not move the content of the different directories; simply remove their content by running the following commands, then create the symbolic links:

# vgscan
# vgchange -a y
# mkdir /hpsimlnx
# mount /dev/vgsg/lvsg /hpsimlnx
# /etc/init.d/hpsim stop
# /etc/init.d/hpsmdb stop
# rm -f /etc/init.d/hpsim
# ln -sf /hpsimlnx/etc/init.d/hpsim /etc/init.
# cp $HOME/.ssh/id_dsa.pub $HOME/.ssh/authorized_keys
# chmod 600 $HOME/.ssh/authorized_keys

2. Copy the SSH keys from node1 to node2 to ease further copying of configuration files between nodes:

# scp -rp $HOME/.ssh/ SG2hpsimlnx:

3. Install SGLX on system1.

# ls -1 sg
pidentd-3.0.15sg-1.i386.rpm
serviceguard-A.11.18.00-0.product.redhat.i386.rpm
# cd sg
# rpm -ivh pidentd-3.0.15sg-1.i386.rpm serviceguard-A.11.18.00-0.product.redhat.i386.rpm

4. Install SGLX on system2.
a.
Installing Serviceguard Manager on the nodes

Serviceguard Manager requires Serviceguard to run on the node; it cannot be installed on a node where Serviceguard is not installed. Serviceguard Manager also requires Java (jre-1_5_0_12-linux-i586.bin).

Note: The SMH version should be 2.1.7.168 or higher.

# chmod 755 jre-1_5_0_12-linux-i586.bin
# rpm -ivh jre-1_5_0_12-linux-i586.bin
# rpm -ihv hpsmh-tomcat-1.0-11.linux.i386.rpm
Preparing...
Configuring the cluster

Managing SGLX authorizations

# cat > /usr/local/cmcluster/conf/cmclnodelist << EOF
SG1hpsimlnx root
SG2hpsimlnx root
SG1hpsimlnxp root
SG2hpsimlnxp root
EOF
# chmod 600 /usr/local/cmcluster/conf/cmclnodelist

Generating the cluster.conf file

# cmquerycl -v -C $SGCONF/cluster.config -n node1 -n node2 -q Quorum
# cat > $SGCONF/cluster.
HEARTBEAT_IP 10.0.0.32

# Cluster Timing Parameters (microseconds).
# The NODE_TIMEOUT parameter defaults to 2000000 (2 seconds).
# This default setting yields the fastest cluster reformations.
# However, the use of the default value increases the potential
# for spurious reformations due to momentary system hangs or
# network load spikes.
# For a significant portion of installations, a setting of
# 5000000 to 8000000 (5 to 8 seconds) is more appropriate.
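Following the guidance in that comment block, a cluster.config excerpt that raises the timeout into the recommended range might look like the fragment below. The 8000000 value is an illustrative choice within the stated 5-8 second range, not a mandate, and HEARTBEAT_INTERVAL is shown at a typical default.

```
# Cluster Timing Parameters (microseconds) - illustrative values
HEARTBEAT_INTERVAL    1000000    # 1 second between heartbeats
NODE_TIMEOUT          8000000    # 8 seconds: slower reformation, but
                                 # fewer spurious reformations under load
```

A longer NODE_TIMEOUT lengthens failover detection but protects against momentary hangs being mistaken for node failure.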
Check the syslog files on all nodes in the cluster to verify that no warnings occurred during startup.

2. Create the HP SIM package by launching HP Serviceguard Manager within the SMH:

Figure 2

3. Name the package, select the nodes as shown, and leave the default parameters.
Figure 3

4. Select Subnets in Monitored Resources.
Figure 4

5. Edit the control scripts for your package, then save and distribute.

Figure 5

# @(#) A.11.18.00 Date: 03/15/07 $
# **********************************************************************
# *                                                                    *
# *   HIGH AVAILABILITY PACKAGE CONTROL SCRIPT (template)              *
# *                                                                    *
# *   Note: This file MUST be edited before it can be used.            *
# *                                                                    *
# *   You must have bash version 2 installed for this script to work   *
# *   properly. Also required is the arping utility available in the   *
# *   iputils package.                                                 *
# The environment variables PACKAGE, NODE, SG_PACKAGE,
# SG_NODE and SG_SCRIPT_LOG_FILE are set by Serviceguard at
# the time the control script is executed. Do not set these
# environment variables yourself! The package may fail to
# start or halt if the values for these environment variables
# are altered.
#
# NOTE: Starting from 11.17, all environment variables set by
# Serviceguard implicitly at the time the control script is
# executed will contain the prefix "SG_".
# Leave the default, DATA_REP="none", if remote data replication is not used # or if the underlying file system is of type Red Hat GFS (Global File System). # # If remote data replication is used for the package application data, set # the variable DATA_REP to the data replication method. The current supported # methods are "clx", "clxeva" and "xdcmd".
# VG[0]="/dev/vgsg"

# MULTIPLE DEVICES
# Specify which md devices are used by this package. Uncomment
# MD[0]="" and fill in the name of your first multiple device. You must
# begin with MD[0], and increment the list in sequence. The md devices
# are defined in the RAIDTAB file specified above.
#
# For example, if this package uses multiple devices md0 and md1,
# enter:
#   MD[0]=/dev/md0
#   MD[1]=/dev/md1
#
# Multiple devices must not be set if the underlying file system is
# Red Hat GFS.
#
# Specify the filesystems which are used by this package. Uncomment
# LV[0]=""; FS[0]=""; FS_TYPE[0]=""; FS_MOUNT_OPT[0]="" and fill in
# the name of your first pool, filesystem, type and mount options
# for the file system.
# You must begin with LV[0], FS[0], FS_TYPE[0],
# FS_MOUNT_OPT[0] and increment the list in sequence.
#
# Valid types for FS_TYPE are 'gfs'.
#
# For example, if this package uses the following:
# GFS 6.0 uses pool for logical volume management whereas GFS 6.1 uses LVM2.
# tuned carefully, increasing the values a little at a time and observing
# the effect on the performance, and the values should never be set to a
# value where the performance levels off or declines. Additionally, the
# values used should take into account the node with the least resources
# in the cluster, and how many other packages may be running on the node.
# (NFS), Apache Web Server, and SAMBA (CIFS) Server. # # If you plan to use one of the HA server toolkits to run an application server, # you need to set the HA_APP_SERVER value to either "pre-IP" or "post-IP" in # order to enable this control script to check and run the Toolkit Interface # Script (toolkit.sh) in the package directory. The interface script will call # the toolkit main script to verify, start, and stop the server daemons.
test_return 51
}

# This function is a placeholder for customer defined functions.
# You should define all actions you want to happen here, after the service is
# halted.
#
function customer_defined_halt_cmds
{
    # ADD customer defined halt commands.
    : # do nothing instruction, because a function must contain some command.
# The DRCheckDiskStatus script has the following exit values:
#
# 0 - success; package starts
# 1 - global error; package cannot start on any node in the cluster
# 2 - local error; package cannot start on this node but is allowed to
#     start on another node in the cluster
#
exit_val=$?
if [[ $exit_val -ne 0 ]]
then
    if [[ $exit_val -eq 1 ]]
    then
        echo "ERROR: The package cannot $1 data replication on any node in the cluster."
    else
        echo "ERROR: The package cannot $1 data replication on this node.
function lvm_sanity_check
{
    typeset vg=$1

    # If lvm is lvm1 then just return.
    if (( LVM_VER == 1 )); then
        return 1
    fi
    # Using lvm2.
    return 0
}

# Function to add/remove tags for vgs.
function vg_tag
{
    typeset op=$1
    typeset vg=$2
    typeset tag=$3
    typeset result

    if [[ x$op = xaddtag ]] || [[ x$op = xdeltag ]]; then
        echo "Attempting to $op to vg $vg..."
        result=$(vgchange --$op $tag $vg 2>&1)
        case $result in
        *Volume*group*successfully*changed*)
            echo "$op was successful on vg $vg.
# Note: This is an abnormal condition, and the script cannot determine
# why it exists. A person had to have added multiple tags for some
# reason, and as a result Serviceguard cannot assume why and thus
# cannot bring the package up or remove the tags automatically. That
# has to be done by a real person.
if [[ "$hostid" = *,* ]]; then
    printf "ERROR: Volume Group $vg has multiple tags \"($hostid)\" defined.\n"
    printf "       There cannot be more than one tag.
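The multiple-tag check above rests on a simple pattern match: a tag string containing a comma means more than one owner tag is set on the volume group, which the script treats as a condition requiring manual cleanup. A minimal standalone sketch of that test (the function name is illustrative, not part of the template):

```shell
# Mirror the control script's [[ "$hostid" = *,* ]] test: succeed (0)
# when the tag string read from the VG contains a comma, i.e. more than
# one tag is present and the package must not be started automatically.
has_multiple_tags() {
    case "$1" in
        *,*) return 0 ;;   # e.g. "node1,node2" -> abnormal condition
        *)   return 1 ;;   # single tag or empty -> normal
    esac
}
```

In LVM2 setups these tags are what ties a volume group to the node that currently owns it, which is why an ambiguous double tag blocks activation.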
printf "In the event that \"$hostid\" is either powered off\n"
printf "or unable to boot, then \"$vg\" must be forced\n"
printf "to be activated on this node.\n\n"
printf "******************* WARNING ***************************\n\n"
printf "Forcing activation can lead to data corruption if\n"
printf "\"$hostid\" is still running and has \"$vg\"\n"
printf "active. It is imperative to positively determine that\n"
printf "\"$hostid\" is not running prior to performing\n"
printf "this operation.
# and execute the appropriate commands.
if (( LVM_VER == 2 ))
then
    vgcfgrestore -ll -t ${vgname} >/dev/null 2>&1
    test_return 27
    vgcfgrestore ${vgname} >/dev/null 2>&1
else
    vgcfgrestore -n ${vgname} -ll -t >/dev/null 2>&1
    test_return 27
    vgcfgrestore -n ${vgname} -ll | \
        awk '/PV Name/ {print $3}' | while read pvname
    do
        vgcfgrestore -n ${vgname} ${pvname} >/dev/null 2>&1
    done
fi
fi
activation_check $I
test_return 53
echo "$(date '+%b %e %T') - Node \"$(hostname)\": Activating volume group $I .
    (( UM_COUNT = $UM_COUNT + 1 ))
    fuser -kuv ${mount_pt}
    if (( $UM_COUNT == $FS_MOUNT_RETRY_COUNT ))
    then
        mount ${fs_mount_opt} ${vol_to_mount} ${mount_pt}
        test_return 17
    else
        mount ${fs_mount_opt} ${vol_to_mount} ${mount_pt}
        (( RET = $? ))
        if (( $RET == 0 ))
        then
            break
        else
            sleep 1
        fi
    fi
done
}

# For each {file system/device} pair, fsck the file system
# and mount it.
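The mount loop above retries up to FS_MOUNT_RETRY_COUNT times, running fuser between attempts to evict processes holding the mount point. The retry skeleton underneath can be isolated with an injectable command, so the pattern can be exercised outside a cluster; the function name and structure here are illustrative, not the template's own.

```shell
# Generic retry skeleton modeled on the package's mount loop: run the
# given command up to $1 times, sleeping 1 second between failed
# attempts, and report overall success or failure.
retry_cmd() {
    retries=$1; shift
    n=0
    while [ "$n" -lt "$retries" ]; do
        if "$@"; then
            return 0          # command succeeded within the budget
        fi
        n=$((n + 1))
        sleep 1               # brief back-off, as in the mount loop
    done
    return 1                  # exhausted all attempts
}
```

In the real control script the retried command is the mount itself, and the back-off gives fuser time to clear users of the filesystem.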
    reiserfs) fsck -a ${FS_TYPE_ARG[$R]} ${FS_FSCK_OPT[$R]} ${LV[$R]}
        if (( $? > 1 ))
        then
            # this will set $? to 1
            let 0
            test_return 2
        fi
        ;;
    nfs) : # do nothing for nfs
        ;;
    *) fsck -p -T ${FS_TYPE_ARG[$R]} ${LV[$R]}
        if (( $? > 1 ))
        then
            # this will set $? to 1
            let 0
            test_return 2
        fi
        ;;
    esac
    ) &
    # save the process id for monitoring the status
    pids_list[$j]="$!"
    (( j = j + 1 ))
    (( R = R + 1 ))
done

# wait for background fsck's to finish
while (( j > 0 ))
do
    pid=${pids_list[$j-1]}
    wait $pid
    if (( $? != 0 ))
    then
        let
# if there is permission to kill the user, we can
# run fuser to kill the user, on the mount point.
YY=$( ifconfig | awk '$2 == "'addr:${I}'"')
if [[ -z "${YY}" ]]
then
    echo "$XX" >> $log_file
    echo "ERROR: Failed to add IP $I to subnet ${SUBNET[$S]}"
    (( error = 1 ))
else
    echo "WARNING: IP $I is already configured on the subnet ${SUBNET[$S]}"
fi
fi
fi
(( S = $S + 1 ))
done

if (( error != 0 ))
then
    # `let 0` is used to set the value of $? to 1. The function test_return
    # requires $? to be set to 1 if it has to print error message.
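The script decides whether an IP is already configured by scanning ifconfig output with awk, matching the "addr:<ip>" token that appears as the second field of an inet line. A standalone sketch of that match against canned text (the sample line is fabricated for illustration; the awk expression mirrors the one above):

```shell
# Mirror the control script's awk match: select lines whose second field
# equals "addr:<ip>", as in classic `ifconfig` output lines such as
# "inet addr:192.168.5.10 Bcast:... Mask:...". Reads stdin; exits 0
# when the address is found.
ip_configured() {
    ip=$1
    awk -v want="addr:$ip" '$2 == want { found = 1 } END { exit !found }'
}
```

Note this depends on the pre-iproute2 `ifconfig` output format; on systems where `ip addr` is the norm the field layout differs and the match would need adjusting.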
# Halt each service using cmhaltserv(1m).
#
function halt_services
{
    for I in ${SERVICE_NAME[@]}
    do
        echo "$(date '+%b %e %T') - Node \"$(hostname)\": Halting service $I"
        cmhaltserv $I
        test_return 9
    done
}

# For each IP address/subnet pair, remove the IP address from the subnet
# using cmmodnet(1m).
typeset pids_list

while (( L > 0 ))
do
    j=0
    while (( j < CONCURRENT_MOUNT_AND_UMOUNT_OPERATIONS && L > 0 ))
    do
        (( L = L - 1 ))
        K=${FS[$L]}
        I=${LV[$L]}
        mount | grep -e " "$K" " > /dev/null 2>&1
        if (( $? == 0 ))
        then
            echo "$(date '+%b %e %T') - Node \"$(hostname)\": Unmounting filesystem on $K"
            (
            result=$(umount ${FS_UMOUNT_OPT[$L]} $K 2>&1)
            ret=$?
            if (( ret != 0 ))
            then
                case $result in
                *not*mounted*)
                    (( ret = 0 ))
                    ;;
                *)
                    echo "WARNING: Running fuser to remove anyone using the file system directly.
        (( j = j - 1 ))
    done
done
}

# This function deactivates volume groups
#
function deactivate_volume_group
{
    typeset result

    for I in ${VG[@]}
    do
        echo "$(date '+%b %e %T') - Node \"$(hostname)\": Deactivating volume group $I"
        (( repeat=${#LV[*]}*2 ))
        while (( repeat > 0 )); do
            result=$(vgchange --test -a n $I 2>&1)
            case $result in
            *Can*t*deactivate*volume*group*with*open*logical*volume*)
                sleep 1
                (( repeat = repeat - 1 ))
                echo "VG $I is busy, will try deactivation...
# This function will set variables to the required settings if
# GFS is being used.
function check_gfs
{
    typeset -i i=0
    typeset -i num=0

    if [[ ${GFS} == "YES" ]]; then
        DATA_REP="none"
        unset RAIDTAB
        num=${#VG[@]}
        i=0
        while (( i < num ))
        do
            unset VG[$i]
            (( i = i + 1 ))
        done
        num=${#MD[@]}
        i=0
        while (( i < num ))
        do
            unset MD[$i]
            (( i = i + 1 ))
        done
        FS_UMOUNT_COUNT=1
        FS_MOUNT_RETRY_COUNT=0
        CONCURRENT_FSCK_OPERATIONS=1
    fi
}

# END OF HALT FUNCTIONS.

# FUNCTIONS COMMON TO BOTH RUN AND HALT.
    to_exit=1
    ;;
4)  echo "ERROR: Function add_ip_address; Failed to add IP address to subnet"
    remove_ip_address
    if [[ "$HA_APP_SERVER" = "pre-IP" ]] || [[ "$HA_APP_SERVER" = "post-IP" ]]
    then
        verify_ha_server stop
    fi
    umount_fs
    deactivate_volume_group
    deactivate_md
    verify_physical_data_replication stop
    to_exit=1
    ;;
8)  echo "ERROR: Function start_services; Failed to start service ${SERVICE_NAME[$C]}"
    halt_services
    customer_defined_halt_cmds
    remove_ip_address
    if [[ "$HA_APP_SERVER" = "pre-IP" ]] || [[ "$HA_APP_SER
    echo "ERROR: Function activate_md; Failed to activate $I"
    deactivate_md
    verify_physical_data_replication stop
    to_exit=1
    ;;
26) echo "ERROR: Function deactivate_md; Failed to deactivate $I"
    exit_value=1
    ;;
27) echo "ERROR: Function activate_volume_group; Failed to RUN vgcfgrestore $I"
    if [[ "$action" = "start" ]]
    then
        deactivate_md
        verify_physical_data_replication stop
        echo "###### Node \"$(hostname)\": Package start failed at $(date) ######"
        exit 1
    fi
    exit_value=1
    ;;
41) echo "ERROR: Function verify_physica
    echo "ERROR: Function customer_defined_halt_cmds; Failed to HALT customer commands"
    exit_value=1
    ;;
53) echo "ERROR: Function activation_check: Failed to activate $I"
    deactivate_md
    to_exit=1
    ;;
*)  echo "ERROR: Failed; Unknown error.
    activate_volume_group
    check_and_mount
    if [[ "$HA_APP_SERVER" = "pre-IP" ]]
    then
        verify_ha_server $1
    fi
    add_ip_address
    if [[ "$HA_APP_SERVER" = "post-IP" ]]
    then
        verify_ha_server $1
    fi
    customer_defined_run_cmds
    start_services

    # Check exit value
    if (( $exit_value == 1 ))
    then
        echo "###### Node \"$(hostname)\": Package start FAILED at $(date) ######"
        exit 1
    else
        echo "###### Node \"$(hostname)\": Package start completed at $(date) ######"
        exit 0
    fi
elif [[ "$1" = "stop" ]]
then
    echo -e "\n####### Node \"$(host
    else
        echo "###### Node \"$(hostname)\": Package halt completed at $(date) ######"
        exit 0
    fi
fi
Figure 6

The package is created. A quick look in the following directory shows the different files generated by the HP Serviceguard Manager:

# ls -ll
total 188
-rw-r--r--  1 root root  1109 Jan 30 18:02 hpsim.config
-rwx------  1 root root 50320 Jan 30 22:45 hpsim.sh
-rw-r--r--  1 root root  5736 Jan 30 23:37 hpsim.sh.log

hpsim.config is the configuration file for the package.
hpsim.sh is the control script.
hpsim.sh.log is the log file for the control script.

6.
# chkconfig: - 99 01
# description: Starts and stops the HPSIM + Hpsmdb backend daemon
# processname: sgsim
# pidfile: /hpsimlnx/var/opt/hpsmdb/data/sgsim.pid

# Source function library.
INITD=/etc/rc.d/init.d
. $INITD/functions

# Get function listing for cross-distribution logic.
TYPESET=`typeset -f|grep "declare"`
NAME=sgsim

# Check that networking is up.
# Pretty much need it for hpsmdb.
[ "${NETWORKING}" = "no" ] && exit 0

# rebuild the ld.so cache in order to include
# the dyn.
if [ $ret -eq 0 ]; then
    echo_success
else
    echo_failure
    res=1
fi
echo
rm -f /var/lock/subsys/${NAME}
exit $res
}

restart() {
    stop
    start
}

# See how we were called.
case "$1" in
  start)
    start
    ;;
  stop)
    stop
    ;;
  status)
    status sgsim
    ;;
  restart)
    restart
    ;;
  *)
    echo $"Usage: $0 {start|stop|status|restart}"
    exit 1
esac
exit 0

7. For each node, create the following symbolic link:

# ln -sf /hpsimlnx/etc/init.d/sgsim /etc/init.d

The package can be started:

1. Select Package and then Administration / run package.
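Step 7 replaces each node's local init script with a symbolic link into the shared volume, the same relocation applied earlier to hpsim and hpsmdb. A small sketch of that operation, exercised against temporary directories rather than /etc (the function name and paths are illustrative):

```shell
# Replace a local init script with a symlink to the copy kept on shared
# storage, so both cluster nodes run the identical script. E.g.:
#   link_shared_initscript /hpsimlnx/etc/init.d/sgsim /etc/init.d
link_shared_initscript() {
    shared=$1      # full path of the script on the shared volume
    local_dir=$2   # directory holding the node-local init scripts
    name=$(basename "$shared")
    rm -f "$local_dir/$name"           # drop any stale local copy
    ln -sf "$shared" "$local_dir/$name"
}
```

Keeping the authoritative copy on the shared volume means a fix to the script is picked up by whichever node the package fails over to.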
Figure 7 2. Sign in to HP SIM. Figure 8 3. Change SSL Certificate to match Cluster Name.
Figure 9

4. Change the certificate by clicking on New.

Figure 10

a. For the primary node:

# /etc/init.
b. For the second system, once the shared storage is available to it, do not move the content of the sslshare directory; simply remove its content by running the following commands and create the symbolic link:

# rm -rf /etc/opt/hp/sslshare
# ln -sf /hpsimlnx/etc/opt/hp/sslshare /etc/opt/hp

5. Restart HP SIM.

# /etc/init.
Troubleshooting

Reminder: Processes like mxdomainmgr can take some time to start (up to two minutes on a machine meeting only the minimum platform requirements). Consequently, do not try to log on to HP SIM right after Serviceguard Manager shows the package as up, and take these minutes into account when testing failover scenarios. Note also that the sleep timer within the sgsim script might need adjusting to fit the failover process.
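Rather than logging in blindly after a fixed sleep, a small polling loop can wait until the daemon is actually ready, up to a timeout sized to the two-minute startup window. Since mxdomainmgr exposes no standard readiness probe documented here, the check is left injectable; the function and its arguments are an illustrative sketch.

```shell
# Poll a readiness predicate once per interval until it succeeds or the
# timeout (in seconds) expires. Illustrative use after a failover:
#   wait_ready 120 5 pgrep -f mxdomainmgr
wait_ready() {
    timeout=$1; interval=$2; shift 2
    waited=0
    while [ "$waited" -lt "$timeout" ]; do
        if "$@"; then
            return 0                    # service answered in time
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 1                            # still not ready: investigate
}
```

The same loop can drive failover tests: run it after `cmrunpkg` and treat a timeout as a failed scenario instead of retrying logins by hand.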
For more information

http://www.hp.com/go/hpsim
http://www.hp.com/go/sglx
http://docs.hp.com/en/B9903-90054/ch01s01.html
http://docs.hp.com/en/B9903-90055/ch01s03.html#bgefghei (Serviceguard Manager Installation)

© 2006-2008 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. The only warranties for HP products and services are set forth in the express warranty statements accompanying such products and services.