Dell HPC NFS Storage Solution
High Availability Configurations
A Dell Technical White Paper
Garima Kochhar, Xin Chen, Onur Celebioglu
Dell HPC Engineering
Version 1.
THIS WHITE PAPER IS FOR INFORMATIONAL PURPOSES ONLY, AND MAY CONTAIN TYPOGRAPHICAL ERRORS AND TECHNICAL INACCURACIES. THE CONTENT IS PROVIDED AS IS, WITHOUT EXPRESS OR IMPLIED WARRANTIES OF ANY KIND.
© 2011 Dell Inc. All rights reserved. Reproduction of this material in any manner whatsoever without the express written permission of Dell Inc. is strictly forbidden. For more information, contact Dell.
Contents
Executive Summary (updated May 2011)
1. Introduction
1.1. NSS Solution Offerings From Dell
2. Dell NFS Storage Solution Technical Overview
A.9. Useful commands and references
A.10. Performance tuning on clients (updated May 2011)
A.11. Example scripts and configuration files
Appendix B: Medium to Large Configuration Upgrade
Executive Summary (updated May 2011)
This solution guide describes the Highly Available configurations of the Dell HPC NFS Storage Solution (NSS). Guaranteeing high availability (HA) of user data is becoming a common requirement in HPC environments. The HA configurations of Dell NSS improve availability of data by using a pair of NFS gateway servers in an active-passive configuration to provide data access to the HPC compute cluster.
1. Introduction
Clusters have become one of the most popular architectures for High Performance Computing (HPC) today.(1) Processors and networks can read and write data much faster than storage devices can, which in most cases makes the storage subsystem of a cluster a bottleneck that negatively impacts overall system performance.
Small
• 20 TB of usable space.
• QDR InfiniBand or 10Gb Ethernet connectivity.
• NFS Gateway Server: Dell
Medium
• 40 TB of usable space.
• QDR InfiniBand or 10Gb Ethernet connectivity.
• NFS Gateway Server: Dell
Medium-HA
• 40 TB of usable space.
• QDR InfiniBand or 10Gb Ethernet connectivity.
• NFS Gateway Servers: Two Dell PowerEdge R710 running Red Hat Enterprise Linux 5.5 and XFS File System.
The NSS is available in three configurations: Small, Medium and Large. These correspond to 20TB, 40TB and 80TB solutions respectively. Figure 1 shows an NSS Medium configuration with two PowerVault MD1200s.
Figure 1 - NSS-Medium Configuration
The NSS offerings with HA (NSS-HA) extend the NSS solution by introducing a high availability feature.
Server Redundancy
NSS-HA contains a pair of PowerEdge R710 servers. The two servers are configured in active/passive mode using the Red Hat Cluster Suite, which is described in later sections. In this mode, when one server fails the other automatically takes over the service that was running on the failed server. Thus a single server failure does not cause loss of service, although a brief interruption (refer to Section 4.
Network redundancy
The servers are connected to a Gigabit Ethernet switch that is used as the private network for communication between the active and passive servers. This network is used to monitor the heartbeat between the active and passive servers. It is also used to communicate with the iDRAC and the PDUs. The servers are also connected to an InfiniBand or 10 Gigabit Ethernet network. This is referred to as the public network.
I/O path redundancy
Each server is connected to both controllers on the MD3200. Each server has two SAS cables directly connected to the MD3200, which eliminates a single point of failure in the server-to-storage I/O path. A redundant path from the MD3200 to the MD1200 array is deployed to enhance the availability of the storage I/O path. The storage cabling is shown in Figure 3.
3.2.
service impacting the availability of the entire system. In order to achieve better system availability, the NSS-HA solution extends NSS using the Red Hat Cluster Suite (RHCS). RHCS-based clustering includes a high availability feature. A cluster service is configured such that a failure of any cluster member or cluster component does not interrupt the service provided by the cluster.
can first determine that the “passive” server is not providing the service. This is done by rebooting or “fencing” the “passive” server. Since fencing is a critical component for the operation of the HA cluster, the NSS-HA solution includes two fence devices, the iDRAC and managed power distribution units (PDUs), as previously described in the section on NSS-HA Hardware.
FAILURE TYPE | MECHANISM TO HANDLE FAILURE
Private switch failure | Cluster service continues on the active server. If there is an additional component failure, service is stopped and system administrator intervention is required.
Heartbeat network interface failure | Monitored by the cluster service. Service fails over to the passive server.
RAID controller failure on MD3200 storage array | Dual controllers in MD3200.
There are two methods to upgrade the solution:
1) Add capacity, same performance. In this method user data is preserved during the upgrade. The final configuration provides 80TB of capacity, but the performance is similar to a Medium configuration.
2) Add capacity, improved performance. In this second method all user data must be backed up. The upgrade will wipe out the existing Medium configuration and create a Large configuration.
4.2. Test Bed (updated May 2011)
The test bed used to evaluate the functionality and performance of the NSS-HA solution is described here. Figure 6 shows the test bed used in this study.
Figure 6 - Test Bed Configuration
Two PowerEdge R710 servers were used as the NFS gateway servers. Both servers were connected to PowerVault MD3200 storage extended with PowerVault MD1200 arrays.
The HPC compute cluster consisted of 64 PowerEdge R410 servers deployed using Platform Cluster Manager – Dell Edition version 2.0.1(6). Table 2, Table 3, Table 4, and Table 5 give details of the configuration. In this test bed, flow control was disabled on the PowerConnect 8024 switch and two PowerConnect 6248 switches listed in Table 5.
Table 4 - NSS-HA Firmware and Driver Configuration Details
FIRMWARE AND DRIVERS
PowerEdge R710 BIOS | 2.2.10
PowerEdge R710 iDRAC | 1.54
InfiniBand firmware | 2.8.00
InfiniBand driver | Mellanox OFED 1.5.1
10 Gigabit Ethernet driver | ixgbe 2.0.44-k2
PERC H700 firmware | 12.10.0-0025
PERC H700 driver | megaraid_sas 00.00.04.17-RH1
6Gbps SAS firmware | 07.01.33.00
6Gbps SAS driver | mpt2sas 01.101.06.
2) Virtual disks are created using RAID 6, with 10 data disks and 2 parity disks. This RAID configuration provides a good balance between capacity and reliability, and each virtual disk can tolerate up to two simultaneous disk failures. (7)
3) Virtual disks are created with a segment size of 512k to maximize performance (7). This value should be set based on the expected application I/O profile for the cluster.
4) Cache block size is set to 32k to maximize performance.
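As a rough worked example of what the RAID 6 and segment size settings imply, the sketch below computes the full-stripe size and the share of raw capacity used by parity for one virtual disk. It uses only the values stated above (10 data disks, 2 parity disks, 512k segments) and assumes that 512k means 512 KB; the variable names are illustrative.

```bash
#!/bin/bash
# Full-stripe size and parity overhead for one RAID 6 virtual disk.
# Values from the text: 10 data disks, 2 parity disks, 512 KB segments.
data_disks=10
parity_disks=2
segment_kb=512

# A full stripe spans one segment on every data disk.
full_stripe_kb=$(( data_disks * segment_kb ))

# Share of raw capacity consumed by parity, in tenths of a percent.
# Shell arithmetic is integer only, so scale up before dividing.
parity_permille=$(( parity_disks * 1000 / (data_disks + parity_disks) ))

echo "full stripe: ${full_stripe_kb} KB"
echo "parity overhead: $(( parity_permille / 10 )).$(( parity_permille % 10 ))%"
```

With these values a full stripe is 5120 KB (5 MB) and parity takes roughly one sixth of the raw capacity. Writes that cover a whole stripe generally let a RAID controller compute parity without first reading old data, which is the usual argument for a large segment size when the workload is dominated by large sequential I/O.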
4) Private switch failure
5) Fence device failure
6) One SAS link failure
7) Multiple SAS link failures
This section describes the NSS-HA response to failures. Details on how to configure the solution to handle these failure scenarios are provided in Appendix A: NSS-HA Recipe.
Server response to a failure
The server response describes how the HA cluster reacts to a failure event. Time to recover from a failure was used as a performance metric.
server by the Red Hat Service (resource group) Manager, rgmanager. Clients cannot access the data until the failover process is complete. When the active server boots up, it rejoins the cluster and the HA service remains running on the passive server.
2) Heartbeat link failure - simulated by disconnecting the private network link on the active server.
Once both servers are manually power cycled, they rejoin the cluster and one server takes ownership of the HA service to provide the file system to the clients.
5) Fence device failure - simulated by disconnecting the iDRAC cable from the server. If the iDRAC on a server fails, the server will be fenced via the network PDUs, which are defined as secondary fence devices in the cluster configuration files.
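The primary/secondary fence ordering described in item 5 can be pictured with a small sketch. This is an illustration of the fallback logic only: in the real solution the cluster software applies the ordering from the cluster configuration files, and the function and stub names below are hypothetical.

```bash
#!/bin/bash
# Illustrative only: the cluster software performs this ordering itself.
# fence_with_fallback and the two stubs are hypothetical names.

# Try each fence method in order; stop at the first one that succeeds.
# Arguments: names of commands or shell functions, primary first.
fence_with_fallback() {
    local method
    for method in "$@"; do
        if "$method"; then
            echo "fenced via $method"
            return 0
        fi
        echo "fence method $method failed, trying next" >&2
    done
    echo "all fence methods failed" >&2
    return 1
}

# Stubs standing in for the real fence agents.
fence_idrac_stub() { return 1; }   # simulate a disconnected iDRAC cable
fence_pdu_stub()   { return 0; }   # simulate a working PDU

fence_with_fallback fence_idrac_stub fence_pdu_stub
```

The point of the sketch is that the secondary device is consulted only after the primary one has failed, so the PDUs stay idle as long as the iDRAC path works.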
longer to complete. In the 10 Gigabit Ethernet case, the client process takes 5-10% longer to complete. The actual additional time taken by the client process is on the order of one to three minutes. During the failover period when the data share is temporarily unavailable, the client process was observed to be in an uninterruptible sleep state.
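One way to observe the uninterruptible sleep state on a client is to look for processes whose ps state starts with D. The sketch below is an assumption about how this might be checked, not the procedure used in the study; list_d_state is a hypothetical helper and the sample process list is made up.

```bash
#!/bin/bash
# Sketch only: list_d_state is a hypothetical helper. It filters
# "pid stat command" lines and keeps processes whose state starts with D
# (uninterruptible sleep, typically waiting on I/O).
list_d_state() {
    awk '$2 ~ /^D/ { print $1, $3 }'
}

# On an NFS client during failover this could be fed from:
#   ps -e -o pid= -o stat= -o comm= | list_d_state
# Canned sample so the sketch runs anywhere:
printf '%s\n' '4101 Ss bash' '4177 D+ dd' '4180 R+ ps' | list_d_state
```

Processes listed this way should return to a running or interruptible state once the failover completes and the share is reachable again.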
As mentioned before, all performance benchmarking was done in a failure-free situation to understand the maximum capability of the solution.
5.1. InfiniBand Sequential Reads and Writes
The results of the IPoIB sequential write tests are shown in Figure 8. The figure shows the aggregate throughput that can be achieved when a number of clients are simultaneously writing to the storage over the InfiniBand fabric.
Figure 9 - IPoIB Large Sequential Read Performance
[Chart: IPoIB large sequential reads. Throughput (MB/sec) versus number of clients (1, 2, 4, 8, 16, 32, 48, 64) for the Large and Medium configurations.]
5.2. 10 Gigabit Ethernet Sequential Reads and Writes (updated May 2011)
For the 10 Gigabit Ethernet tests, flow control was disabled on the PowerConnect 8024 switch and two PowerConnect 6248 switches.
Figure 10 - 10GbE Large Sequential Write Performance
[Chart: 10GbE large sequential writes. Throughput (MB/sec) versus number of clients (1, 2, 4, 8, 16, 32, 48, 64) for the Large and Medium configurations.]
Figure 11 - 10GbE Large Sequential Read Performance
[Chart: 10GbE large sequential reads. Throughput (MB/sec) versus number of clients (1, 2, 4, 8, 16, 32, 48, 64) for the Large and Medium configurations.]
A single NFS client with a 10 Gigabit Ethernet conn
switch. Note that the network switch setting might vary for single client performance tests as opposed to multiple concurrent clients accessing the storage. Hence the network switch options should be tuned accordingly. These results are shown in Figure 12. The large sequential read throughput was 864 MB/sec for the Large configuration and 700 MB/sec for the Medium configuration from this 10GbE client.
Figure 13 - IPoIB Random Write Performance
[Chart: IPoIB random write performance. IOPS versus number of clients (1, 2, 4, 8, 16, 32, 48, 64) for the Large and Medium configurations.]
Figure 13 shows the IPoIB random write performance. The figure shows the aggregate I/O operations per second when a number of clients are simultaneously writing to the storage over the InfiniBand fabric.
Figure 14 - IPoIB Random Read Performance
[Chart: IPoIB random read performance. IOPS versus number of clients (1, 2, 4, 8, 16, 32, 48, 64) for the Large and Medium configurations.]
5.4. Metadata tests
The results for the mdtest (file create, stat, remove) were very close for InfiniBand and 10 Gigabit Ethernet, with the average difference being less than 5%. Only the InfiniBand results are presented and discussed in this section.
Figure 15 - IPoIB File Create Performance
[Chart: IPoIB file create performance. Number of create() per sec versus number of clients (1, 2, 4, 8, 16, 32, 48, 64) for the Large and Medium configurations.]
Figure 16 - IPoIB File Stat Performance
[Chart: IPoIB file stat performance. Number of stat() per sec versus number of clients (1, 2, 4, 8, 16, 32, 48, 64) for the Large and Medium configurations.]
Figure 17 - IPoIB File Remove Performance
[Chart: IPoIB file remove performance. Number of remove() per sec versus number of clients (1, 2, 4, 8, 16, 32, 48, 64) for the Large and Medium configurations.]
6. Comparison of the NSS Solution Offerings
This document describes a method to build on the existing non-HA NFS Storage Solution offered by Dell to provide High Availability features.
HARDWARE COMPONENT | NSS without HA | NSS with HA
network
Storage controller for local disks | PERC H700, 5 local disks
Public network to clients | InfiniBand or 10 Gigabit Ethernet
Table 7 - Comparison of Software Configuration
SOFTWARE COMPONENT | NSS without HA | NSS with HA
Operating System | Red Hat Enterprise Linux 5.5 x86_64
HA Cluster Software | None required | Red Hat Cluster Suite for RHEL 5.
Systems Management
Storage Management
2) Dell | Terascala HPC Storage Solution (DT-HSS)
http://content.dell.com/us/en/enterprise/d/business~solutions~hpcc~en/Documents~Dellterascala-dt-hss2.pdf.aspx
3) Dell NFS Storage Solution for HPC (NSS)
http://i.dell.com/sites/content/business/solutions/hpcc/en/Documents/Dell-NSS-NFS-Storagesolution-final.pdf
4) Red Hat Enterprise Linux 5 Cluster Suite Overview
http://docs.redhat.
Appendix A: NSS-HA Recipe (updated May 2011)
Sections
A.1. Pre-install preparation
A.2. Server side hardware set-up
A.3. Initial software configuration on each PowerEdge R710
A.4. Performance tuning on the server
A.2. Server side hardware set-up
1) Prepare two PowerEdge R710 servers (called “active” and “passive”). Configure each server as follows.
• One PERC H700 and 5 local disks each of 146 GB. Configure 2 disks in RAID 1 with 1 disk designated as the hot spare. This will be used for the operating system. Configure 2 disks in RAID 0; this will be used as swap.
• 10 Gigabit Ethernet card OR InfiniBand card in slot 4, a PCI-E x8 slot.
A.3. Initial software configuration on each PowerEdge R710
1) Install the RHEL5.5 x86_64 operating system on the RAID1 virtual disk. Make sure MD storage is not attached to the servers during the OS install.
2) After the OS is installed, set up swap on the RAID0 device.
Server Server repolist: 3,258 enabled: 3,116
4) Obtain the XFS packages from the Red Hat Network (http://rhn.redhat.com) and install them.
xfsdump-2.2.48-3.el5.x86_64.rpm
xfsprogs-devel-2.10.2-7.el5.x86_64.rpm
xfsprogs-2.10.2-7.el5.x86_64.rpm
5) Install Dell OpenManage Server Administrator (OM-SrvAdmin-Dell-Web-LX-6.4.01266.RHEL5.x86_64_A00.21.tar.
# that require network functionality will fail.
127.0.0.1 localhost.localdomain localhost
15.15.10.1 active.hpc.com active
15.15.10.2 passive.hpc.com passive
13) Set up password-less ssh between the active and passive servers.
active> ssh-keygen -t rsa
active> ssh-copy-id -i ~/.ssh/id_rsa.pub passive
passive> ssh-keygen -t rsa
passive> ssh-copy-id -i ~/.ssh/id_rsa.
Make a backup of /etc/sysconfig/nfs and change the number of threads.
# cp /etc/sysconfig/nfs{,.orig}
# sed -i 's/#RPCNFSDCOUNT=8/RPCNFSDCOUNT=256/' /etc/sysconfig/nfs
Restart the NFS service.
# service nfs restart
Reference - http://i.dell.com/sites/content/business/solutions/whitepapers/en/Documents/hpcpv-md1200-nfs.pdf
3) On both the active and the passive server, change the OS I/O scheduler to “deadline”.
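The NFS thread-count change shown earlier in this section matches only the stock line #RPCNFSDCOUNT=8, so it does nothing if the file has already been edited. A slightly more tolerant variant could look like the sketch below. set_nfs_threads is a hypothetical helper that is not part of the original recipe, and the demonstration runs against a scratch file rather than /etc/sysconfig/nfs.

```bash
#!/bin/bash
# Sketch only: set_nfs_threads is a hypothetical helper, not part of the
# original recipe. It edits whichever file is passed in, so it can be
# tried on a scratch copy before touching /etc/sysconfig/nfs.
set_nfs_threads() {
    local conf="$1" threads="$2"

    # Keep the untouched file, as the recipe does with nfs.orig.
    [ -e "${conf}.orig" ] || cp "$conf" "${conf}.orig"

    if grep -q '^#\{0,1\}RPCNFSDCOUNT=' "$conf"; then
        # Rewrite the existing line whether or not it is commented out.
        sed -i "s/^#\{0,1\}RPCNFSDCOUNT=.*/RPCNFSDCOUNT=${threads}/" "$conf"
    else
        # No such line yet: append one.
        echo "RPCNFSDCOUNT=${threads}" >> "$conf"
    fi
}

# Demonstration on a scratch file standing in for /etc/sysconfig/nfs.
scratch="$(mktemp)"
printf '%s\n' '# Number of nfs server processes to be started.' \
    '#RPCNFSDCOUNT=8' > "$scratch"
set_nfs_threads "$scratch" 256
grep '^RPCNFSDCOUNT=' "$scratch"
rm -f "$scratch" "${scratch}.orig"
```

As in the recipe, the NFS service still has to be restarted afterwards for the new thread count to take effect.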
6) On each R710, cat /proc/partitions and multipath -ll should show all the LUNs on the storage.
Reference - http://support.dell.com/support/edocs/systems/md3200/en/OM/HTML/config_n.htm
A.7. NSS HA Cluster setup
In this recipe the term “cluster” refers to the active-passive NSS-HA Red Hat cluster.
1) On both R710s install the cluster software packages.
Set up HA LVM on the other (passive) server:
lvdisplay should show DATA_LV
Edit /etc/lvm/lvm.conf and set locking_type to 1
Edit /etc/lvm/lvm.conf and change the volume list to
volume_list = [ "VolGroup00", "@passive" ]
where VolGroup00 is the volume group that contains the “/” file system for the OS. “passive” is the name of the server as defined in the cluster.conf file that will be used for the cluster configuration. The quotation marks in lvm.conf must be plain ASCII double quotes.
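Because the volume_list entry differs between the two servers only in the name after the @ sign, it can help to generate the line rather than type it. The sketch below prints the line for a given volume group and server name. print_volume_list is a hypothetical helper, and using "active" as the name on the other server is an assumption that follows the same pattern as the passive entry above.

```bash
#!/bin/bash
# Sketch only: print_volume_list is a hypothetical helper. It prints the
# lvm.conf line for one server and does not edit any file.
print_volume_list() {
    local root_vg="$1" node_name="$2"
    # lvm.conf needs plain ASCII double quotes around each entry.
    printf 'volume_list = [ "%s", "@%s" ]\n' "$root_vg" "$node_name"
}

print_volume_list VolGroup00 active    # assumed entry for the active server
print_volume_list VolGroup00 passive   # entry given in the text above
```

The printed line can then be pasted into /etc/lvm/lvm.conf on the matching server.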
Figure 18 - Create a New Cluster
7) On both servers, check that the cman service is running.
# service cman status
8) On both servers, check the cluster status:
# clustat
Cluster status should show both servers online. If the cluster is up and running, go on to the next step.
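The check in step 8 can be scripted so that it fails loudly when a server is missing. The sketch below is one possible approach under stated assumptions: members_online is a hypothetical helper, and it assumes clustat prints one line per member with the member name in the first column and the word Online in the status column. The sample text is a hand-written stand-in and should be compared against real clustat output before relying on the parsing.

```bash
#!/bin/bash
# Sketch only: members_online is a hypothetical helper. It assumes one
# line per member, name in the first column, "Online" in the status.

# Succeeds only if every named member is reported Online on stdin.
members_online() {
    local status name
    status="$(cat)"
    for name in "$@"; do
        if ! printf '%s\n' "$status" |
             awk -v n="$name" '$1 == n && /Online/ { found = 1 } END { exit !found }'; then
            echo "$name is not online" >&2
            return 1
        fi
    done
    return 0
}

# On a cluster node this would be: clustat | members_online active passive
# Canned sample so the sketch runs anywhere:
sample=' Member Name        ID   Status
 ------ ----        ---- ------
 active                1 Online, Local, rgmanager
 passive               2 Online, rgmanager'

printf '%s\n' "$sample" | members_online active passive && echo "both servers online"
```

A non-zero exit status from the helper is the signal to stop and investigate before moving on to the next step.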
10) For InfiniBand clusters, copy the ibstat-script.sh file provided in Section A.11 to /root/ibstatscript.sh on both servers. This script checks the InfiniBand link status using the ibstat command. It is included as a resource of the cluster service to ensure that the InfiniBand link is monitored. (For 10 Gigabit Ethernet clusters, RHCS monitors the 10GbE link and no additional scripts are needed.)
11) Copy the sas_path_check_script.
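To give a feel for the kind of test the InfiniBand script in step 10 performs, the sketch below checks ibstat-style text for a port in the Active state. It is not the ibstat-script.sh from Section A.11. ib_port_active is a hypothetical helper, and the sample text is a shortened stand-in for real ibstat output.

```bash
#!/bin/bash
# Sketch only, not the ibstat-script.sh from Section A.11. ib_port_active
# is a hypothetical helper that reads ibstat-style text on stdin and
# succeeds if at least one port reports "State: Active".
ib_port_active() {
    # Anchored so that "Physical state:" lines are not matched.
    grep -q '^[[:space:]]*State:[[:space:]]*Active'
}

# On a server with the InfiniBand stack installed: ibstat | ib_port_active
# Canned sample so the sketch runs anywhere:
sample='Port 1:
        State: Active
        Physical state: LinkUp'

if printf '%s\n' "$sample" | ib_port_active; then
    echo "InfiniBand link is up"
else
    echo "InfiniBand link is down"
fi
```

A script registered as a cluster resource would additionally need to handle the start, stop and status actions that the cluster manager passes to script resources; the sketch covers only the link test itself.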
Make this change in three places in the xml file for each host.
Make this change in one place in the xml for each PDU.
g) PDU ports for all four power supplies
Make this change in two places in the xml for each of the four PDU ports.
k) NFS export options
Make this change in one place in the xml. “sync” is a required option for NSS-HA.
l) For InfiniBand clusters, location of the ibstat_script file. This location must be the same on both the servers. Make this change in one place in the xml.