Dell EMC Ready Solutions for HPC BeeGFS High Capacity Storage
Technical White Paper

Abstract
This white paper describes the architecture, tuning guidelines, and performance of a high-capacity, high-throughput, scalable BeeGFS file system solution.
Revisions
Date         Description
July 2020    Initial release

Acknowledgements
Author: Nirmala Sundararajan

The information in this publication is provided “as is.” Dell Inc. makes no representations or warranties of any kind with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose. Use, copying, and distribution of any software described in this publication requires an applicable software license.
Executive summary
In high-performance computing (HPC), designing a well-balanced storage system that achieves optimal performance presents significant challenges. A typical storage system involves a variety of design decisions, including the choice of file system, file system tuning, and the selection of disk drives, storage controllers, IO cards, network cards, and switches.
1 BeeGFS High Capacity Storage solution overview
The Dell EMC Ready Solutions for HPC BeeGFS Storage is available in three base configurations: small, medium, and large. These base configurations can be used as building blocks to create additional flexible configurations that meet different capacity and performance goals, as illustrated in Figure 1. Contact a Dell EMC sales representative to discuss which offering works best in your environment and how to order.
The metadata component of the solution, which includes a pair of metadata servers (MDS) and a metadata target storage array, remains the same across all the configurations, as shown in Figure 1. The storage component of the solution includes a pair of storage servers (SS); the small configuration uses a single storage array, the medium configuration uses two storage arrays, and the large configuration uses four storage arrays.
2 BeeGFS file system
This storage solution is based on BeeGFS, an open-source parallel file system that offers flexibility and easy scalability. The general architecture of BeeGFS consists of four main services: management, metadata, storage, and client. The server components are user-space daemons, and the client is a patchless kernel module. An additional monitoring service is also available.
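Once the daemons are running, their registration with the management service can be verified from any client by using the standard BeeGFS command-line utilities, for example:

# List the registered metadata and storage nodes, including their network interfaces
beegfs-ctl --listnodes --nodetype=meta --nicdetails
beegfs-ctl --listnodes --nodetype=storage --nicdetails

# Show capacity and inode usage per metadata and storage target
beegfs-df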
3 BeeGFS High Capacity Storage solution architecture
Figure 2 shows the large configuration architecture with four PowerVault ME4084 storage arrays.
Solution reference architecture—large configuration
In Figure 2, the management server (the topmost server) is a PowerEdge R640. The MDS function is provided by two PowerEdge R740 servers. The MDS pair is attached to a PowerVault ME4024 through 12 Gb/s SAS links.
3.1 Management server
The single management server is connected to the MDS pair and the SS pair through an internal 1 GbE network. BeeGFS provides a tool called beegfs-mon that collects usage and performance data from the BeeGFS services and stores it in a time-series database called InfluxDB. beegfs-mon provides predefined Grafana panels that can be used out of the box to extract and visualize this data.
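A minimal sketch of bringing up the monitoring stack on the management server follows; the package-manager call assumes a RHEL/CentOS-based node with the BeeGFS, InfluxDB, and Grafana repositories already configured, and the exact beegfs-mon.conf parameter names should be checked against the file shipped with your BeeGFS version:

# Install the monitoring components on the management server
yum install -y beegfs-mon influxdb grafana

# Point beegfs-mon at the BeeGFS management service and the local InfluxDB
# instance by editing /etc/beegfs/beegfs-mon.conf (sysMgmtdHost and the
# database connection settings), then start the services
systemctl enable --now influxdb grafana-server beegfs-mon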
3.3 Metadata targets
The ME4024 array is fully populated with 24 x 960 GB SAS SSDs. An optimal way to configure the 24 drives for metadata is to create twelve MDTs, where each MDT is a RAID 1 disk group of two drives. Figure 4 shows how the MDTs are configured.
Configuration of metadata targets in the ME4024 storage array
The metadata targets are formatted with the ext4 file system because ext4 performs well with small files and small file operations.
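As an illustration only, the following commands follow the generic BeeGFS metadata-target tuning guidance for ext4; the device and mount-point names are placeholders, and the exact options used in this solution may differ:

# Format one MDT with a high inode density and large inodes so that BeeGFS
# extended attributes can be stored inside the inode
mkfs.ext4 -i 2048 -I 512 -J size=400 -Odir_index,filetype /dev/mapper/mdt01

# Mount with access-time updates disabled
mount -o noatime,nodiratime /dev/mapper/mdt01 /data/mdt01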
3.4 Storage servers
Each SS in the SS pair is equipped with four dual-port 12 Gb/s SAS host bus adapters and one Mellanox InfiniBand HDR100 adapter to handle storage requests. Figure 5 shows the recommended slot assignments for the SAS HBAs: slots 1, 2, 4, and 5. This distributes the SAS HBAs evenly across the two processors for load balancing. The Mellanox InfiniBand HDR100 HCA is installed in slot 8, which is a PCIe x16 slot.
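To confirm that the adapters end up attached to the intended processors, the PCIe-to-NUMA mapping can be checked from the operating system; the device name and PCI address below are examples:

# NUMA node of the InfiniBand HCA (example device name)
cat /sys/class/infiniband/mlx5_0/device/numa_node

# Find the SAS HBA PCI addresses, then check the NUMA node of each one
lspci | grep -i sas
cat /sys/bus/pci/devices/0000:3b:00.0/numa_node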
3.5 Storage targets
Figure 6 illustrates how each storage array is divided into eight linear RAID 6 disk groups, with eight data and two parity disks per virtual disk.
RAID 6 (8+2) LUN layout on one ME4084
Each OST provides about 64 TB of formatted object storage space when populated with 8 TB HDDs.
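Each RAID 6 (8+2) disk group has eight data drives, so an OST built from 8 TB HDDs offers roughly 8 x 8 TB = 64 TB of usable space before file system overhead. As a minimal sketch, assuming XFS is used and the RAID chunk size is 512 KiB (the device name and chunk size are placeholders, not the exact values used in this solution), the target could be formatted and mounted as follows:

# Align XFS to the RAID geometry: su = chunk size, sw = number of data disks
mkfs.xfs -d su=512k,sw=8 -l version=2 /dev/mapper/ost01

# Mount options commonly recommended for BeeGFS storage targets
mount -o noatime,nodiratime,logbufs=8,logbsize=256k,largeio,inode64,swalloc /dev/mapper/ost01 /data/ost01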
Component                          Specification
Local disks and RAID controller    Management server: PERC H740P integrated RAID, 8 GB NV cache, 6 x 300 GB 15K SAS HDDs configured in RAID 10. MDS and SS servers: PERC H330+ integrated RAID, 2 x 300 GB 15K SAS HDDs configured in RAID 1 for the OS.
InfiniBand HCA                     Mellanox ConnectX-6 HDR100 InfiniBand adapter
External storage controllers       On each MDS: 2 x Dell 12 Gb/s SAS HBAs. On each SS: 4 x Dell 12 Gb/s SAS HBAs.
Object storage
4 Performance evaluation
Our performance studies of the solution use Mellanox HDR100 data networks. Performance testing objectives were to quantify the capabilities of the solution, identify performance peaks, and determine the most appropriate methods for scaling. We ran multiple performance studies, stressed the configuration with different types of workloads to determine the limitations of performance, and defined the sustainability of that performance.
To prevent inflated results due to caching effects, we ran the tests with a cold cache. Before each test started, the BeeGFS file system under test was remounted. A sync was performed, and the kernel was instructed to drop caches on all the clients and BeeGFS servers (MDS and SS) with the following command:

sync && echo 3 > /proc/sys/vm/drop_caches

In measuring the solution performance, we performed all tests with similar initial conditions.
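As a minimal sketch of how the cache drop above can be issued on every node at once (pdsh and the host-file path are assumptions; any parallel shell or a plain ssh loop works equally well):

# Drop caches on all clients and BeeGFS servers listed in a hosts file
pdsh -w ^/root/beegfs_hosts 'sync && echo 3 > /proc/sys/vm/drop_caches'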
Figure 7 shows the sequential N-N performance of the solution:
Sequential N-N read and write
As the figure shows, the peak read throughput of 23.70 GB/s was attained at 128 threads, and the peak write throughput of 22.07 GB/s was attained at 512 threads. The single-thread write performance was 623 MB/s and the single-thread read performance was 717 MB/s. Read and write performance scale linearly with the number of threads until the system reaches its peak.
The following figure shows the random read and write performance.
Random N-N reads and writes
As the figure shows, the write performance reaches around 31K IOPS and remains stable from 32 threads to 512 threads. In contrast, the read performance increases with the number of IO requests, reaching a maximum of around 47K IOPS at 512 threads, which is the maximum number of threads tested for the solution.
N-1 sequential performance
The results show that performance rises with the number of clients used and then reaches a plateau that is semi-stable for both reads and writes, all the way to the maximum number of threads used in this test. Therefore, large single-shared-file sequential performance is stable even for 512 concurrent clients. The maximum read performance was 22.23 GB/s at 256 threads, and the maximum write performance of 16.54 GB/s was reached at 16 threads.
MDtest files and directory distribution across threads:

# of threads   # of files per directory   # of directories per thread   Total number of files
1              1024                       2048                          2,097,152
2              1024                       1024                          2,097,152
4              1024                       512                           2,097,152
8              1024                       256                           2,097,152
16             1024                       128                           2,097,152
32             1024                       64                            2,097,152
64             1024                       32                            2,097,152
128            1024                       16                            2,097,152
256            1024                       8                             2,097,152
512            1024                       4                             2,097,152

Figure 10 shows file metadata statistics for empty files:
Metadata performance for empty files
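In the distribution table above, the total file count is held constant across rows: threads x directories per thread x files per directory, for example 16 x 128 x 1,024 = 2,097,152 files.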
4.2 Base configurations
Figure 11 shows the measured sequential read and write performance of the small, medium, and large configurations (base configurations) of the Dell EMC Ready Solution for HPC BeeGFS High Capacity Storage.
BeeGFS IB sequential write vs. read throughput (GB/s) for the small, medium, and large configurations at 1 to 512 threads
4.3 Scalable configurations
For the rest of the configurations, the performance numbers shown in Table 4 are estimates or extrapolations: scaling up is linear with the addition of ME4084 arrays, and scaling down by removing arrays is assumed to be linear as well.
4.4 Performance tuning
Multiple parameters can be configured to achieve optimal system performance, depending on the intended workload patterns. This section shows the tuning parameters that we configured on the BeeGFS testbed system in the Dell HPC and AI Innovation Lab.
• Set the number of processes for the superuser to 50000 to improve performance; a sketch of one way to apply this limit follows the configuration parameters below.
• The following BeeGFS-specific tuning parameters were used in the metadata, storage, and client configuration files:

beegfs-meta.conf
  connMaxInternodeNum = 64
  tuneNumWorkers = 12
  tuneUsePerUserMsgQueues = true   # Optional
  tuneTargetChooser = roundrobin   # benchmarking

beegfs-storage.conf
  connMaxInternodeNum = 64
  tuneNumWorkers = 12
  tuneUsePerTargetWorkers = true
  tuneUsePerUserMsgQueues = true   # Optional
  tuneBindToNumaZone = 0
  tuneFileReadAheadSize = 2m

beegfs-client.conf
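The process limit from the first bullet is typically applied through the PAM limits mechanism. A minimal sketch, assuming the limit is set in /etc/security/limits.conf (an assumption about how the limit was applied, not a statement of the exact method used on the testbed):

# /etc/security/limits.conf -- raise the process limit for the superuser
root soft nproc 50000
root hard nproc 50000

After the BeeGFS configuration files are changed, the affected services must be restarted for the new values to take effect, for example:

systemctl restart beegfs-meta      # on the metadata servers
systemctl restart beegfs-storage   # on the storage servers
systemctl restart beegfs-client    # on the clients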
5 Conclusion
The Dell EMC Ready Solution for HPC BeeGFS High Capacity Storage is a high-performance clustered file system solution that is easy to manage, fully supported, and capable of scaling both throughput and capacity. The solution combines the PowerEdge server platform, PowerVault ME4 storage products, and BeeGFS, the leading open-source parallel file system technology. The large size solution stack with 2.
6 References
The following Dell EMC documentation provides additional and relevant information. Access to these documents depends on your login credentials. If you do not have access to a document, contact your Dell EMC representative.
A Appendix A Storage array cabling
This section describes how the metadata and storage servers are cabled to the PowerVault ME storage arrays.
A.1 ME4024 cabling
Table 5 shows the 12 Gb/s SAS cable connections between the MDS pair and the ME4024 array that hosts the MDTs.
Figure 13 shows how the storage servers are cabled in the small configuration, with a single SS pair (a pair of Dell EMC PowerEdge R740 servers) attached to a single fully populated ME4084 array.
Storage cabling of the small configuration with 1x ME4084 array
Medium configuration
The next size up from the small configuration is the medium configuration, which uses a pair of storage servers (a pair of R740s) attached to two fully populated ME4084 arrays, as shown in Figure 14.
The following table details the 12 Gb/s SAS cable connections between the SS pair and the four ME4084 arrays:

Cabling of storage servers to the ME4084 arrays
Server     SAS PCI slot   SAS port   ME4084 array   ME4084 controller   Controller port
StorageA   Slot 1         Port 0     ME4084 #1      Controller 0        Port 3
StorageA   Slot 1         Port 1     ME4084 #2      Controller 0        Port 3
StorageA   Slot 2         Port 0     ME4084 #2      Controller 1        Port 3
StorageA   Slot 2         Port 1     ME4084 #1      Controller 1        Port 3
Stor
A simplified version of the storage cabling for the large base configuration is shown in Figure 16:
Simplified depiction of cabling storage servers to the 4x ME4084 storage arrays
B Appendix B Benchmark command reference
This section describes the commands that were used to benchmark the Dell HPC BeeGFS Storage solution.
B.1 IOzone N-1 sequential and random IO
We used the following commands to run sequential and random IOzone tests, the results of which are recorded in the performance evaluation section of this paper.
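As an illustration only, a representative IOzone throughput-mode invocation is sketched below; the thread count, file size, record size, and client-list path are placeholders rather than the exact values used in the study:

# Distributed sequential write (-i 0) and read (-i 1) across the clients in ./clientlist,
# with 1 MiB records, 8 GiB per thread, and close/fsync times included in the timing
iozone -i 0 -i 1 -c -e -w -r 1m -s 8g -t 64 -+m ./clientlist

# Random read/write variant (-i 2) with small records and direct IO
iozone -i 2 -I -c -e -w -r 4k -s 8g -t 64 -+m ./clientlist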
B.2 MDtest: Metadata file operations
We used the following command to run the metadata tests, the results of which are recorded in the performance evaluation section of this paper.

mpirun --allow-run-as-root -machinefile $hostlist --map-by node -np $threads $mdtest -v -d $working_dir -i ${repetitions} -b $nd -z 1 -L -I $nf -y -u -t -F
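As a usage example, the shell variables in the mdtest command map directly to the distribution table in the performance evaluation section. The values below correspond to the 16-thread row; the host file, working directory, and repetition count are placeholders:

# 16 threads x 128 directories per thread x 1024 files per directory = 2,097,152 files
threads=16
nd=128                          # passed to -b (directories per thread)
nf=1024                         # passed to -I (files per directory)
hostlist=./hosts                # placeholder machine file
working_dir=/mnt/beegfs/mdtest  # placeholder path on the BeeGFS mount
repetitions=3                   # placeholder iteration count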
The following table describes the MDtest command line options:
MDtest command line options
B.
-e fsync—perform fsync upon POSIX write close
-E useExistingTestFile—do not remove test file before write access
-k keepFile—don’t remove the test file(s) on program exit
-o testFile—full name for test
-s segmentCount—number of segments