Technical White Paper

Dell EMC Ready Solutions for HPC BeeGFS High Performance Storage

Abstract
This Dell EMC technical white paper discusses the architecture, scalability, and performance of the Dell EMC Ready Solutions for HPC BeeGFS Storage.
Revisions

Date            Description
January 2020    Initial release
November 2020   Updated to address language standards

Acknowledgments
This paper was produced by the following:
Author: Nirmala Sundararajan, HPC and AI Innovation Lab

The information in this publication is provided "as is." Dell Inc.
Table of contents
1 Introduction
2 BeeGFS File System
3 Dell EMC BeeGFS Storage Solution Reference Architecture
1 Introduction
High Performance Computing (HPC) is undergoing a transformation in which storage technology is a driving component. The first major leap in drive technology was the SSD which, owing to its lack of moving parts, delivers significantly better random-access and read performance than HDDs.
2 BeeGFS File System
BeeGFS3 is an open-source parallel cluster file system. The software can be downloaded from www.beegfs.io. The file system software also includes enterprise features such as high availability, quota enforcement, and access control lists. BeeGFS is a parallel file system that distributes user data across multiple storage nodes.
BeeGFS Architecture Overview4
3 Dell EMC BeeGFS Storage Solution Reference Architecture
Figure 2 shows the reference architecture of the solution. The management server is connected to the metadata and storage servers only by Ethernet. Each metadata and storage server has two InfiniBand links and is connected to the internal management network via Ethernet. The clients have one InfiniBand link and are connected to the internal management network using Ethernet.
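Once the servers and clients are cabled as shown in Figure 2, connectivity can be checked from any client with the standard BeeGFS utilities. The commands below are a brief illustration rather than the validated deployment procedure:

# Confirm that the management, metadata, and storage services are reachable
beegfs-check-servers

# Show the established connections (RDMA or TCP) from this client to each service
beegfs-net

# Report free space and inodes per metadata and storage target
beegfs-df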
3.2 Metadata Server
A PowerEdge R740xd with 24x Intel P4600 1.6 TB NVMe drives is used for metadata storage. Because the storage capacity requirements for BeeGFS metadata are small, a dedicated metadata server is not used; instead, the 12 drives in NUMA zone 0 host the Metadata Targets (MDTs), while the remaining 12 drives in NUMA zone 1 host Storage Targets (STs). Figure 3 shows the metadata server.
3.3 Storage Server
Figure 5 shows the 5x PowerEdge R740xd servers used as storage servers.
Dedicated Storage Servers
Each storage server has six storage targets, three per NUMA zone. Together with the three STs hosted on the metadata server, there are 33 storage targets in the configuration. The targets are configured like the STs on the metadata server: the NVMe drives are grouped into RAID 0 disk groups of four drives each, and XFS is used as the underlying file system for the beegfs-storage services.
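As an illustration, one way such a target could be created is sketched below using mdadm and the beegfs-setup-storage helper. The device names, paths, service and target IDs, and management host name are placeholders, and the exact mkfs and mount options of the validated solution are not reproduced here:

# Build a RAID 0 device from four NVMe drives (device names are illustrative)
mdadm --create /dev/md10 --level=0 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1

# Create the XFS file system and mount it
mkfs.xfs -f /dev/md10
mkdir -p /data/beegfs/storage_tgt_101
mount /dev/md10 /data/beegfs/storage_tgt_101

# Register the directory as a BeeGFS storage target (IDs and management host are placeholders)
/opt/beegfs/sbin/beegfs-setup-storage -p /data/beegfs/storage_tgt_101 -s 1 -i 101 -m mgmt-server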
3.4 Clients
Thirty-two C6420 servers were used as clients. The BeeGFS client module must be loaded on all hosts that need to access the BeeGFS file system. When the beegfs-client service is loaded, it mounts the file systems defined in the /etc/beegfs/beegfs-mounts.conf file instead of the usual approach based on /etc/fstab. With this approach, the beegfs-client is started like any other Linux service through the service startup script.
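For reference, each line of beegfs-mounts.conf pairs a mount point with the client configuration file to use for it; the mount point below is an assumed example:

# /etc/beegfs/beegfs-mounts.conf : one "<mountpoint> <client config file>" pair per line
/mnt/beegfs /etc/beegfs/beegfs-client.conf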
Component                  Details
Mellanox EDR card          2x Mellanox ConnectX-5 EDR card (Slots 1 and 8)
Out of Band Management     iDRAC9 Enterprise with Lifecycle Controller

Software Configuration (Metadata and Storage Servers)
Component                  Details
BIOS                       2.2.11
CPLD                       1.1.3
Operating System           CentOS 7.6
Kernel Version             3.10.0-957.el7.x86_64
iDRAC                      3.34.34.34
Systems Management Tool    OpenManage Server Administrator 9.3.0-3407_A00
Mellanox OFED              4.5-1.0.1.
3.6 Advantages of using NVMe devices in the R740xd servers
Figure 7 shows the storage layout in the Dell EMC PowerEdge enterprise servers.
Storage Layout in the Dell EMC Enterprise Servers
Use of the NVMe devices in the solution provides the following distinct advantages: 1.
3.7 R740xd, 24x NVMe Drives, Details on CPU Mapping
In the 24x NVMe configuration of the PowerEdge R740xd server, two x16 NVMe extender cards connect to PCIe switches on the backplane, which in turn connect to the drives. Each NVMe drive is connected to 4x PCIe lanes.
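Because the drives hang off PCIe switches attached to both sockets, it is useful to confirm which NUMA node owns each drive before assigning targets to zones. One simple check is shown below (sysfs paths can vary slightly between kernel versions):

# Print the NUMA node of every NVMe controller in the system
for c in /sys/class/nvme/nvme*; do
    echo "$(basename $c): NUMA node $(cat $c/device/numa_node)"
done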
4 Performance Characterization
To characterize the performance of the Ready Solution, the testbed shown in Figure 9 was used. In the figure, the InfiniBand connections to the NUMA zones are highlighted. Each server has two InfiniBand links: traffic through NUMA zone 0 is handled by interface IB0, while traffic through NUMA zone 1 is handled by interface IB1.
Client Configuration
Component          Details
Clients            32x Dell EMC PowerEdge C6420 Compute Nodes
BIOS               2.2.9
Processor          2x Intel Xeon Gold 6148 CPU @ 2.40 GHz, 20 cores
Memory             12x 16 GB DDR4 2666 MT/s DIMMs (192 GB)
BOSS Card          2x 120 GB M.2 boot drives in RAID 1 for OS
Operating System   Red Hat Enterprise Linux Server release 7.6
Kernel Version     3.10.0-957.el7.x86_64
Interconnect       1x Mellanox ConnectX-4 EDR card
OFED Version       4.5-1.0.1.
To minimize the effects of caching, OS caches were dropped on the client nodes between iterations, and between the write and read tests, by running the following command:
# sync && echo 3 > /proc/sys/vm/drop_caches
The default stripe count for BeeGFS is 4; however, the chunk size and the number of targets per file (stripe count) can be configured on a per-directory basis.
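For example, the stripe pattern of a directory can be changed and inspected with beegfs-ctl; the directory path below is an assumption, not one of the paths used in the tests:

# Stripe new files in this directory across 33 targets with a 2 MB chunk size
beegfs-ctl --mode=setpattern --numtargets=33 --chunksize=2m /mnt/beegfs/testdir

# Verify the effective stripe pattern
beegfs-ctl --mode=getentryinfo /mnt/beegfs/testdir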
The IOzone sequential write and read tests were performed three times, and the mean value is plotted in Figure 10 above. Peak read performance is 132 GB/s at 1024 threads, and peak write performance is 121 GB/s at 256 threads. According to the technical specifications of the Intel P4600 1.6 TB NVMe SSDs, each drive can provide 3.2 GB/s peak read and 1.3 GB/s peak write performance, which gives a theoretical peak of 422 GB/s for reads (132 drives x 3.2 GB/s) and 172 GB/s for writes (132 drives x 1.3 GB/s).
The benchmark was executed for writes and reads, with the number of threads incremented in powers of two. The transfer size is 2 MiB, and each thread wrote to or read 128 GiB from a single shared file striped over all 33 targets. Three iterations of each test were run, and the mean value was recorded. Figure 11 shows the N-to-1 sequential I/O performance.
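For illustration, a representative IOR invocation consistent with these parameters is sketched below; the binary location, file path, and iteration count are assumptions rather than the exact values used:

# N-to-1 sequential test: every MPI rank writes to and reads from one shared file
mpirun -machinefile $hostlist --map-by node -np $threads \
    ~/bin/ior -w -r -i 3 -t 2m -b 128G -g -o /mnt/beegfs/test.ior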
4.3 Random writes and reads N-N
To evaluate random I/O performance, IOzone was used in random mode. Tests were conducted at thread counts from 4 up to 1024. The direct I/O option (-I) was used so that all operations bypass the buffer cache and go directly to the devices. A BeeGFS stripe count of 3 and a chunk size of 2 MB were used, with a 4 KiB request size in IOzone. Performance is measured in I/O operations per second (IOPS).
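A sketch of such an invocation is shown below; the client list file and the per-thread file size are assumptions, not the exact values used in the tests:

# Distributed random I/O: -i 0 creates the files, -i 2 runs random read/write,
# -I uses O_DIRECT, -r sets the 4 KiB request size, -+m lists the client nodes
iozone -i 0 -i 2 -I -r 4K -s 8G -t $threads -+m ./clientlist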
Each thread was allocated to a different node, using round robin to spread them homogeneously across the 32 compute nodes.
mpirun -machinefile $hostlist --map-by node -np $threads ~/bin/mdtest -i 3 -b $Directories -z 1 -L -I 1024 -y -u -t -F
Since performance results can be affected by the total number of IOPS, the number of files per directory, and the number of threads, a consistent number of files across tests was chosen so that results could be compared on similar grounds.
Metadata performance with MDtest using empty files (Create, Stat, Removal, and Read IOPS versus the number of concurrent threads, 1 to 2048)
The system gets very good results
5 Scalability of Dell EMC Ready Solutions for HPC BeeGFS Storage
The Dell EMC solution uses dedicated storage servers and a dual-purpose metadata-and-storage server to provide a high-performance, scalable storage solution7. The system can be scaled by adding storage/metadata servers to an existing configuration.
Capacity and Performance Details of Base Configurations
Component                              Small               Medium
Total U (MDS+SS)                       6U                  12U
# of Dedicated Storage Servers         2                   5
# of NVMe Drives for data storage      60                  132
Capacity with 1.6 TB drives            86 TiB              190 TiB
Capacity with 3.2 TB drives            173 TiB             380 TiB
Capacity with 6.4 TB drives            346 TiB             761 TiB
Peak Sequential Read                   60.1 GB/s           132.4 GB/s
Peak Sequential Write                  57.7 GB/s           120.7 GB/s
Random Read                            1.80 Million IOPS   3.54 Million IOPS
Random Write                           1.84 Million IOPS   3.
Base Configurations and Scalable Configurations
Scalable Configurations (Small, Small + 1, Small + 2, Medium, Medium + 1; MDS = metadata server, SS = storage server)
The metadata portion of the stack remains the same for all the configurations described above. This is because the storage capacity requirements for BeeGFS metadata are typically 0.5% to 1% of the total storage capacity.
The testing methodology adopted is the same as that described in Section 4.1.
Note: The storage pools referred to here were created only for the explicit purpose of characterizing the performance of the different configurations. During the performance evaluation of the medium configuration detailed in Section 4.1, all 33 targets were in the "Default Pool" only.
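For reference, storage pools of this kind can be created and listed with beegfs-ctl. The pool description and target IDs below are placeholders, not the actual pools used in the evaluation:

# Group a subset of storage targets into a named pool (target IDs are placeholders)
beegfs-ctl --mode=addstoragepool --desc="small-config" --targets=1,2,3,4,5,6,7,8,9,10,11,12

# List all pools and their member targets
beegfs-ctl --mode=liststoragepools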
A Benchmarks and test tools
1) The IOzone benchmark tool was used to measure sequential N-to-N read and write throughput (GB/s) and random read and write I/O operations per second (IOPS).
2) The IOR benchmark tool was used to measure sequential N-to-1 read and write throughput (GB/s).
3) The MDtest benchmark was used for files only (no directory metadata) to measure the number of creates, stats, reads, and removes the solution can handle when using empty files.
IOR Parameter    Description
-O               String of IOR directives
-o               Path to the file for the test
-s               Segment count
-g               intraTestBarriers: use barriers between open, write/read, and close
-t               Transfer size
-b               Block size (amount of data for a process)

A.3 MDtest
MDtest is used with mpirun. For these tests, OpenMPI version < > was used.
A.4 Intel Data Center Tool
The Intel Data Center Tool (isdct) was used to reformat the Intel P4600 NVMe devices with 512 B blocks for metadata and 4 KB blocks for storage, as shown below:
# On servers with metadata on NUMA zone 0 and storage on NUMA zone 1
for i in 0 11 {16..23} 1 2 ; do isdct start -f -intelssd $i -nvmeformat LBAFormat=0 SecureEraseSetting=0 ; done
for i in {3..10} {12..
Benchmarks and test tools nvme19 nvme20 nvme21 nvme22 nvme23 31 0000:ba:00.0 0000:bb:00.0 0000:bc:00.0 0000:bd:00.0 0000:be:00.
B Technical support and resources
1. NVM Express Explained: https://nvmexpress.org/wp-content/uploads/2013/04/NVM_whitepaper.pdf
2. Dell EMC Ready Solutions for HPC BeeGFS Storage: https://www.dell.com/support/article/sln319381/
3. BeeGFS Documentation: https://www.beegfs.io/wiki/
4. General Architecture of BeeGFS File System: https://www.beegfs.io/docs/whitepapers/Introduction_to_BeeGFS_by_ThinkParQ.pdf
5. ext4 file system for metadata targets: https://www.beegfs.