Improving NFS Performance on HPC Clusters with Dell Fluid Cache for DAS

This Dell technical white paper explains how to improve Network File System I/O performance by using Dell Fluid Cache for Direct Attached Storage in a High Performance Computing cluster.

Garima Kochhar
Dell HPC Engineering
March 2013, Version 1.
This document is for informational purposes only and may contain typographical errors and technical inaccuracies. The content is provided as is, without express or implied warranties of any kind. © 2013 Dell Inc. All rights reserved. Dell and its affiliates cannot be responsible for errors or omissions in typography or photography. Dell, the Dell logo, PowerVault, and PowerEdge are trademarks of Dell Inc.
Contents
Executive Summary
1. Introduction
1.1.
2. Solution design and architecture
2.1. NFS storage solution (baseline)
Tables
Table 1. NFS server and storage hardware configuration
Table 2. NFS server software and firmware configuration
Table 3. Hardware configuration for DFC
Table 4. Software and firmware configuration for DFC
Executive Summary
Most High Performance Computing clusters use some form of a Network File System (NFS) based storage solution for user data. Easy to configure and administer, free with virtually all Linux distributions, and well-tested and reliable, NFS has many advantages.
1. Introduction
A Network File System (NFS) based storage solution is a popular choice for High Performance Computing (HPC) clusters. Most HPC clusters use some form of NFS irrespective of the size of the cluster. NFS is simple to configure and administer, free with virtually all Linux distributions, well-tested, and can provide reliable storage for user home directories and application data.
sections provide details on each of these components as well as information on tuning and monitoring the solution.

2.1. NFS storage solution (baseline)
The baseline in this study is an NFS configuration. One PowerEdge R720 is used as the NFS server. PowerVault™ MD1200 storage arrays are direct-attached to the PowerEdge R720 and provide the storage. The attached storage is formatted as a Red Hat Scalable File System (XFS).
Figure 2. NFS server

Table 1. NFS server and storage hardware configuration

Server configuration
NFS SERVER: PowerEdge R720
PROCESSORS: Dual Intel(R) Xeon(R) CPU E5-2680 @ 2.70 GHz
MEMORY: 128 GB
Table 2. NFS server software and firmware configuration

Software
OPERATING SYSTEM: Red Hat Enterprise Linux (RHEL) 6.3z
KERNEL VERSION: 2.6.32-279.14.1.el6.x86_64
FILE SYSTEM: Red Hat Scalable File System (XFS) 3.1.1-7
SYSTEMS MANAGEMENT: Dell OpenManage Server Administrator 7.1.2

Firmware and Drivers
BIOS: 1.3.6
iDRAC: 1.23.23 (Build 1)
PERC H710/PERC H810 FIRMWARE: 21.1.0-0007
PERC DRIVER: megasas 00.00.06.
Table 3. Hardware configuration for DFC

Server configuration
NFS SERVER: PowerEdge R720
CACHE POOL: Two 350 GB Dell PowerEdge Express Flash PCIe SSDs
SSD CONTROLLER: Internal (slot 4)
Rest of the configuration is the same as the baseline, as described in Table 1.

Storage configuration
Same as baseline, as described in Table 1.

Table 4. Software and firmware configuration for DFC

Software
CACHING SOFTWARE: Dell Fluid Cache for DAS v1.
Table 5.
OFED: Mellanox OFED 1.5.3-3.0.0

Figure 3. Test bed

2.4. Solution tuning
The NFS server and the attached storage arrays are configured and tuned for optimal performance. The tuning options were selected based on extensive studies done by the Dell HPC team. Results of these studies and the tradeoffs of the tuning options are available in [4]. Additionally, the DFC configuration was tuned based on experience gained from this study.
2.4.1. Storage
• 3 TB NL SAS disks are selected for large capacity at a cost-effective price point.
• Virtual disks are created using a RAID 60 layout. Each RAID 6 span comprises 10 data disks and 2 parity disks, and the stripe runs across all four storage enclosures. This RAID configuration provides a good balance between capacity, reliability (tolerance of multiple disk failures), and performance.
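The usable capacity implied by this layout can be sanity-checked with a little shell arithmetic. The numbers below come from the configuration described in this paper: four RAID 6 spans, each with 10 data disks of 3 TB.

```shell
# RAID 60 usable capacity: spans x data disks per span x disk size.
# The 2 parity disks per span do not contribute to usable capacity.
spans=4
data_disks_per_span=10
disk_tb=3
echo "$(( spans * data_disks_per_span * disk_tb )) TB usable"
```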
• NFSv3 is recommended over NFSv4 based on the performance results of a previous study [4]. That study found that metadata create operations have significantly lower performance when using NFSv4. For environments where the security enhancements in NFSv4 are more important than performance considerations, NFSv4 can be used instead.

2.4.3.
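A minimal sketch of forcing NFSv3 on a client mount; the hostname and paths below are placeholders, not taken from this paper:

```shell
# Explicitly request NFSv3; without vers=3 newer clients may negotiate NFSv4.
mount -t nfs -o vers=3 nfs-server:/home/xfs /mnt/xfs
```

The same option can be persisted in /etc/fstab so the protocol version survives reboots.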
The warranty of the device is expressed as a number of years and a number of petabytes written (PBW). For the recommended 350 GB SSD drive, the standard warranty is 3 years, 25 PBW. The health of the device can be monitored using Dell OMSA utilities. OMSA reports the SSD "Device Life Remaining" and "Failure Predicted" attributes.
2.5.3. Dell Fluid Cache for DAS health and monitoring
DFC provides a simple command-line utility, /opt/dell/fluidcache/bin/fldc, that can be used for configuration and management. Alternatively, the DFC configuration can be accomplished using the OMSA GUI, where DFC is a component under the storage sub-section.
• DFC in Write-Back mode (DFC-WB) – This configuration builds on the baseline by adding DFC as described in Section 2.2, and DFC is configured to operate in Write-Back (WB) mode. WB mode allows the caching of writes on the cache pool. WB mode requires the data to be written to a minimum of two PCIe SSDs. Both re-reads and writes are accelerated.
Figure 5. Large sequential write performance (sequential writes; throughput in MiB/s vs. number of concurrent clients, 1 to 64; baseline)

Figure 6.
better than the baseline since the data is already in the DFC cache. As expected on read operations, the WB and WT tests have similar performance and can reach a peak throughput of ~3,050 MiB/s.

3.2. Random writes and reads
Figure 7 plots the aggregate IOPS when a number of clients are simultaneously issuing random writes to their files. The baseline configuration can sustain ~1,600 IOPS on writes.
Figure 8. Random read performance (random reads; IOPS vs. number of concurrent clients, 1 to 64; baseline, DFC-WB, DFC-WT)

3.3. Metadata tests
This section presents the results of metadata tests using the mdtest benchmark. In separate tests, one million files were created, stat()ed, and unlinked concurrently from multiple NFS clients on the NFS server.
Figure 9. Metadata file create performance (create() operations per second vs. number of concurrent clients, 1 to 512; baseline, DFC-WB, DFC-WT)

File create and file remove tests show similar results, with the baseline out-performing the DFC configuration.
Figure 11. Metadata file remove performance (remove() operations per second vs. number of concurrent clients, 1 to 512; baseline, DFC-WB, DFC-WT)

3.4.
Figure 12 shows that on a cold-cache read for the sequential tests, the throughput of the DFC configuration drops from a peak of ~3,050 MiB/s to ~1,050 MiB/s. Data must be pulled from the back-end storage, hence the drop in performance. This is lower than the baseline throughput.

Figure 12. Cold-cache sequential reads (throughput in MiB/s vs. number of concurrent clients; DFC-WB)

Figure 13.
Figure 13 shows that on a cold-cache read for the random tests, the peak IOPS of the DFC configurations drops from ~123,000 IOPS to ~80,000 IOPS. Interestingly, this is still higher than the baseline of ~9,300 IOPS, as explained below. Figure 14 helps explain why the cold-cache read behavior is different for sequential and random I/O. It uses the output of fldcstat, a utility provided by DFC that displays statistics for the DFC configuration.
4. Conclusion
This Dell technical white paper describes a method to improve NFS performance using Dell Fluid Cache for DAS in an HPC environment. It presents measured cluster-level results of several different I/O patterns to quantify the performance of a tuned NFS solution and measure the performance boost provided by DFC.
http://en.community.dell.com/techcenter/systems-management/w/wiki/1760.openmanage-server-administrator-omsa.aspx
8. Dell PowerEdge Express Flash PCIe SSD
www.dell.com/poweredge/expressflash
http://support.dell.com/support/edocs/storage/Storlink/PCIe%20SSD/UG/en/index.htm
http://content.dell.com/us/en/home/d/solutions/limited-hardware-warranties.
Appendix A: Step-by-step configuration of Dell Fluid Cache for NFS
This appendix provides detailed step-by-step instructions for configuring the storage solution described in this white paper. Readers familiar with Dell's NSS line of solutions will find the configuration steps very similar to the NSS recipe.

Contents
A.1. Hardware checklist and cabling
Figure 15. Solution cabling

A.2. NFS server set up
After the PowerEdge R720 server is ready and cabled to the PowerVault MD1200 arrays, check Table 2 and Table 4 for details on the software used in the solution.
1. Create two virtual disks on the five internal disks of the PowerEdge R720. This can be done through the Ctrl+R menu on server boot-up.
• One RAID 1 virtual disk on two drives. This will be used for the operating system.
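As an alternative to the Ctrl+R BIOS utility, the same virtual disks can be created from the running OS with OMSA's omconfig. This is a sketch only; the controller ID and physical disk IDs below are assumptions and must be replaced with the values reported by omreport on the actual system.

```shell
# Identify the internal controller and physical disk IDs first.
omreport storage controller
omreport storage pdisk controller=0

# RAID 1 virtual disk on two internal drives (disk IDs are placeholders).
omconfig storage controller action=createvdisk controller=0 raid=r1 \
    size=max pdisk=0:1:0,0:1:1
```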
5. Install the Red Hat Scalable File System (XFS) packages that are part of the RHEL 6.3 add-on: xfsprogs-3.1.1-7.el6.x86_64 and xfsdump-3.0.4-2.el6.x86_64.
6. Install at a minimum the "Server Instrumentation", "Server Administrator Web Server" and "Storage Management" components of Dell OpenManage Server Administrator (OMSA) v7.1.2 on the PowerEdge R720. Note that only v7.1.2 supports DFC at the time of writing.
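Assuming the Scalable File System add-on channel is available to yum, step 5 can be performed as follows:

```shell
# Install the XFS userspace tools from the RHEL 6.3 Scalable File System add-on.
yum install xfsprogs xfsdump

# Verify the installed versions match those listed in Table 2.
rpm -q xfsprogs xfsdump
```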
2. Change the OS I/O scheduler to "deadline" by adding elevator=deadline to the end of the kernel line in /etc/grub.conf for the .14.1 errata kernel.
3. To work around a known error message with the PCIe SSDs, add pci=nocrs to the end of the same kernel line in /etc/grub.conf. More details are available at https://bugzilla.kernel.org/show_bug.cgi?id=42606 and https://bugzilla.
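Steps 2 and 3 both edit the same kernel line. A sketch of the resulting /etc/grub.conf entry; the root device and the other boot parameters are illustrative, only the two appended options come from this paper:

```
title Red Hat Enterprise Linux (2.6.32-279.14.1.el6.x86_64)
    root (hd0,0)
    kernel /vmlinuz-2.6.32-279.14.1.el6.x86_64 ro root=/dev/sda2 elevator=deadline pci=nocrs
    initrd /initramfs-2.6.32-279.14.1.el6.x86_64.img
```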
Partitions: Available
Hot Spare Policy violated: Not Applicable
Encrypted: No
Layout: RAID-0
Size: 557.75 GB (598879502336 bytes)
Associated Fluid Cache State: Not enabled
Device Name: /dev/sdb
Bus Protocol: SAS
Media: HDD
Read Policy: Adaptive Read Ahead
Write Policy: Write Back
Cache Policy: Not Applicable
Stripe Element Size: 64 KB
Disk Cache Policy: Disabled

Now create a swap space on the RAID 0.
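With the RAID 0 virtual disk visible as /dev/sdb (per the omreport output above), the swap space can be created along these lines; the fstab entry is a sketch of the usual approach:

```shell
# Format the RAID 0 virtual disk as swap, enable it, and persist across reboots.
mkswap /dev/sdb
swapon /dev/sdb
echo "/dev/sdb  swap  swap  defaults  0 0" >> /etc/fstab

# Confirm the swap space is active.
swapon -s
```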
Status: Ok
Name: PERC H810 Adapter
Slot ID: PCI Slot 7
State: Ready
Firmware Version: 21.1.0-0007
Minimum Required Firmware Version: Not Applicable
Driver Version: 00.00.06.14-rh1
Minimum Required Driver Version: Not Applicable
Storport Driver Version: Not Applicable
Minimum Required Storport Driver Version: Not Applicable
<…snip…>

2. Check that the PERC H810 Adapter has 48 3TB disks available.
Stripe Element Size: 512 KB
Disk Cache Policy: Disabled

A.5. XFS and DFC configuration
In this final step of the configuration on the server, the XFS file system is created, DFC is configured, and the storage is exported to the I/O clients via NFS.
1. Create the XFS file system on the RAID 60 virtual disk attached to the PERC H810 adapter. Note the stripe unit (su) and stripe width (sw).
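A sketch of the mkfs.xfs invocation for step 1. The device name /dev/sdc is an assumption; su matches the 512 KB stripe element size shown above and sw the 10 data disks per RAID 6 span:

```shell
# su = RAID stripe element size, sw = number of data disks per RAID 6 span.
mkfs.xfs -d su=512k,sw=10 /dev/sdc
```

Aligning the XFS stripe geometry with the RAID layout lets the file system issue full-stripe writes and avoid read-modify-write cycles on the parity disks.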
Example on the client:
[root@compute-0-0 ~]# mount -o vers=3 <NFS server>:/home/xfs

A.6. Useful commands and references
DFC is installed in /opt/dell/fluidcache.
1. fldc is the command-line utility to configure DFC. Use fldc -h for the available flags.
2. To check status, use fldc --status.
3. Check for fldc events with fldc --events. Use fldc --num=<n> --events to see more than the last 10 events.
4.
Appendix B: Benchmarks and tests
The iozone benchmark was used to measure sequential read and write throughput (MiB/sec) as well as random read and write I/O operations per second (IOPS). The mdtest benchmark was used to test metadata operation performance.

B.1. IOzone
IOzone can be downloaded from http://www.iozone.org/. Version 3.4.08 was used for these tests and was installed on the compute nodes.
IOzone Argument: Description
-t: Number of threads
-+m: Location of clients to run IOzone on when in clustered mode
-w: Does not unlink (delete) temporary file
-I: Use O_DIRECT, bypass client cache
-O: Give results in ops/sec

For the sequential tests, the file size was varied along with the number of clients such that the total amount of data written was 256G (number of clients * file size per client = 256G).
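For illustration, a sequential-write invocation consistent with the arguments above might look like the following; the record size and the client list file name are assumptions, not taken from this paper:

```shell
# 64 clients, 4 GB file each (64 * 4G = 256G total), sequential write test.
# -I bypasses the client cache via O_DIRECT; ./clientlist names the clients.
iozone -i 0 -c -e -w -I -r 1024k -s 4g -t 64 -+m ./clientlist
```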
B.2. mdtest
mdtest can be downloaded from http://sourceforge.net/projects/mdtest/. Version 1.8.3 was used in these tests. It was compiled and installed on an NFS share accessible by the compute nodes. mdtest is launched with mpirun; for these tests, Intel MPI version 4.1.0 was used. One million files were created, stat()ed, and unlinked concurrently from multiple NFS clients on the NFS server.
Metadata file and directory creation test:
# mpirun -np 32 -rr --hostfile ./hosts /nfs/share/mdtest -d /nfs/share/filedir -i 6 -b 320 -z 1 -L -I 3000 -y -u -t -C

Metadata file and directory stat test:
# mpirun -np 32 -rr --hostfile ./hosts /nfs/share/mdtest -d /nfs/share/filedir -i 6 -b 320 -z 1 -L -I 3000 -y -u -t -R -T

Metadata file and directory removal test:
# mpirun -np 32 -rr --hostfile .