Technical White Paper
Dell EMC Ready Solution for HPC PixStor Storage
Dell EMC HPC Solutions
Abstract
This white paper describes the architecture of the PixStor solution, including its optional components for capacity expansion, NVMe tier, and gateways, along with performance characterization of the different components.
Revisions
Date        Description
July 2020   Initial release
Acknowledgements
Author: J. Mario Gallegos – HPC and AI Innovation Lab
The information in this publication is provided “as is.” Dell Inc. makes no representations or warranties of any kind with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose. Use, copying, and distribution of any software described in this publication requires an applicable software license. Copyright © 2020 Dell Inc.
Table of contents
Revisions
Acknowledgements
Table of contents
Executive summary
Solution Architecture
Performance characterization
Random small blocks IOzone Performance N clients to N files
Metadata performance with MDtest using 4 KiB files
Summary
Conclusion and Future Work
References
Executive summary
In High-Performance Computing (HPC), all the components of the system need to be balanced to maintain optimal performance and avoid bottlenecks. The evolution of compute nodes can turn storage into a bottleneck, which is normally avoided by using Parallel File Systems (PFS) that can scale out to meet such demands. With recent advances in storage technologies such as Non-Volatile Memory Express (NVMe) SSDs, more options are available and can be added to a PFS storage system.
Solution Architecture
Introduction
Today’s HPC environments have increased demands for very high-speed storage; with higher core-count CPUs, faster networks, and bigger and faster memory, storage has become the bottleneck in many workloads.
Figure 1 Reference Architecture
The main components of the PixStor solution are:
Network Shared Disks (NSDs): Back-end block devices (e.g., RAID LUNs from ME4 arrays, RAID 10 NVMeoF devices) that store the file system's data and metadata.
Stores the file system data (ME4084) or metadata (ME4024). ME4084s are part of the Data Module and ME4024s are part of the optional High Demand Metadata Module.
Expansion Storage: Part of the optional Capacity Expansions (inside the dotted orange square in Figure 1). ME484s are connected behind ME4084s via SAS cables to expand the capacity of a Storage Module.
Mellanox SB7800 switches to provide high-speed access via InfiniBand (IB) EDR or 100 GbE.
Solution Components
This solution was planned to be released with the latest 2nd Generation Intel Xeon Scalable CPUs (Cascade Lake), and some of the servers use the fastest RAM available to them (2933 MT/s).
Solution Component At Release Test Bed Storage Node 2x Intel Xeon Gold 6230 @ 2.1GHz, 20 cores Memory NVMe Node 2x Intel Xeon Gold 5220 2.2G, Management Node 18C/36T, 10.4GT/s, 24.75M Cache, Turbo, HT (125W) DDR4-2666 Gateway/Ngenea High Demand Metadata 12 x 16GiB 2933 MT/s RDIMMs Storage Node (192 GiB) 2x Intel Xeon Gold 5118 @ 2.
Gateways/Ngenea nodes. Similarly, slots 2 & 7 (x8) are only used for the Storage Servers for Large configurations (4 ME4084s) or for High Demand Metadata servers that require 4 ME4024s.
Figure 3 SAS & High-speed cable diagram
Storage configuration on ME4 arrays
The Dell EMC Ready Solution for HPC PixStor Storage has two variants: the standard configuration and the one that includes the High Demand Metadata module. In the standard configuration, the same pair of R740 servers use their ME4084 arrays to store data on NLS SAS3 HDDs and metadata on SAS3 SSDs. Figure 4 shows this ME4084 configuration and how the drives are assigned to the different LUNs.
operational and only a single SAS cable remains connected to each ME4084, the solution can still provide access to all data stored in those arrays.
Figure 4 ME4084 drives assigned to LUNs for the Standard Configuration
When the optional High Demand Metadata module is used, the eight RAID 6 virtual disks are assigned just like in the standard configuration and are likewise used only to store data.
All the virtual disks on the Storage Module and HDMD module are exported as volumes that are accessible to any HBA port of the two R740s connected to them, and each R740 has one HBA port connected to each ME4 controller of its storage arrays. Therefore, even if only one server is operational and a single SAS cable remains connected to each ME4, the solution can still provide access to all data (or metadata) stored in those arrays.
adapters, at least one of those adapters must be connected to the PixStor solution to get access to the file system and any information it has stored (two connections if redundancy is required on a single gateway). In addition, the gateways can be connected to other networks by adding NICs supported by the PowerEdge R740 in the four x8 slots available (one x8 slot is used by a PERC adapter to manage the local SSDs for the OS).
Figure 7 PixStor Analytics - Capacity view
Figure 8 provides a file count view with two very useful ways to find problems. The first half of the screen shows the top ten users in a pie chart, and the top ten file types and top ten filesets (think projects) in Pareto charts, all based on the number of files. This information can be used to answer some important questions.
Figure 8 PixStor Analytics - File count view
Performance characterization
Benchmarks selected and test beds
To characterize the different components of this Ready Solution, we used the hardware specified in the last column of Table 1, including the optional High Demand Metadata Module.
Number of client nodes: 16
Client node: Different 13G models with different CPUs and DIMMs
Cores per client node: 10-22, Total = 492
Memory per client node: 8 x 128 GiB & 8 x 256 GiB, Total = 3 TiB (for testing, all nodes were counted as having 256 GiB, for a total of 4 TiB)
OS: CentOS 8.1
OS Kernel: 4.18.0-147.el8.x86_64
PixStor Software: 5.1.3.1
Spectrum Scale (GPFS): 5.0.4-3
OFED Version: Mellanox OFED 5.0-2.1.8
Figure 9 N to N Sequential Performance
From the results we can observe that performance rises very quickly with the number of clients used and then reaches a plateau that remains stable up to the maximum number of threads that IOzone allows; therefore, large file sequential performance is stable even for 1024 concurrent clients.
The following commands were used to execute the benchmark for writes and reads, where $Threads was the variable with the number of threads used (1 to 1024, incremented in powers of two), and my_hosts.$Threads is the corresponding file that allocated each thread on a different node, using round robin to spread them homogeneously across the 16 compute nodes. mpirun --allow-run-as-root -np $Threads --hostfile my_hosts.
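The command lines above are cut off at the page break. A minimal sketch of complete write and read passes, hedged and based on the IOzone flags shown later in this document for the NVMe tier tests (the ${Size} variable is an assumption here), would be:
./iozone -i0 -c -e -w -r 8M -s ${Size}G -t $Threads -+n -+m my_hosts.$Threads   # sequential writes, 8 MiB transfers
./iozone -i1 -c -e -w -r 8M -s ${Size}G -t $Threads -+n -+m my_hosts.$Threads   # sequential reads of the same files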
Random small blocks IOzone Performance N clients to N files
Random N clients to N files performance was measured with IOzone version 3.487. The tests executed varied from a single thread up to 1024 threads. This benchmark used 4 KiB blocks to emulate small-block traffic. Caching effects were minimized by setting the GPFS page pool tunable to 16 GiB and using files two times that size.
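A minimal sketch of the random-I/O invocation, again a hedged reconstruction based on the flags shown for the NVMe tier tests (the ${Size} variable is an assumption here):
./iozone -i2 -I -c -O -w -r 4K -s ${Size}G -t $Threads -+n -+m my_hosts.$Threads   # random 4 KiB reads and writes, direct I/O, results reported in operations per second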
Metadata performance with MDtest using empty files
Metadata performance was measured with MDtest version 3.3.0, assisted by OpenMPI v4.0.1 to run the benchmark over the 16 compute nodes. The tests executed varied from a single thread up to 512 threads. The benchmark was used for files only (no directory metadata), measuring the number of creates, stats, reads, and removes the solution can handle.
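The full command line presumably resembled the MDtest command listed in the Benchmark Reference section at the end of this document, without the -w and -e options since the files are empty; a hedged sketch (the host file, directory, and variable names are placeholders):
mpirun --allow-run-as-root -machinefile my_hosts.$Threads --map-by node -np $Threads ~/bin/mdtest -v -d $working_dir -i ${repetitions} -b $n_directories -z 1 -L -I $n_files -y -u -t -F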
Figure 12 Metadata Performance - Empty Files
First, notice that the scale chosen is logarithmic with base 10, to allow comparing operations that differ by several orders of magnitude; otherwise, some of the operations would look like a flat line close to 0 on a linear graph.
Figure: Metadata Performance (MDtest) - 4K Files. IOPS (logarithmic scale) versus number of threads (1 to 512) for the Create, Stat, Read, and Removal operations.
mpirun --allow-run-as-root -np $Threads --hostfile my_hosts.
more storage modules are added, and a similar performance increase can be expected from the optional high demand metadata module.
Table 5 Peak & Sustained Performance
Benchmark                                          Peak Write   Peak Read   Sustained Write   Sustained Read
Large Sequential N clients to N files              16.7 GB/s    23 GB/s     16.5 GB/s         20.5 GB/s
Large Sequential N clients to single shared file   16.5 GB/s    23.8 GB/s   16.2 GB/s         20.5 GB/s
Random Small blocks N clients to N files           15.8 KIOps   20.
Caching effects were minimized by setting the GPFS page pool tunable to 16 GiB and using files bigger than two times that size. It is important to notice that for GPFS that tunable sets the maximum amount of memory used for caching data, regardless of the amount of RAM installed and free.
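For reference, the page pool is a regular Spectrum Scale configuration attribute; a minimal sketch of limiting it to the value used here, assuming hypothetical client node names, is:
mmchconfig pagepool=16G -N client01,client02   # cap the GPFS data cache at 16 GiB on the listed client nodes
mmlsconfig pagepool                            # verify the value currently configured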
Here it is important to remember that GPFS's preferred mode of operation is scatter allocation, and the solution was formatted to use that mode. In this mode, blocks are allocated from the very beginning of operation in a pseudo-random fashion, spreading data across the whole surface of each HDD. While the obvious disadvantage is a smaller initial maximum performance, that performance is maintained fairly constant regardless of how much space is used on the file system.
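The block allocation map type is chosen when the file system is created and can be checked afterwards; a minimal sketch, assuming a file system device named gpfs0 and a hypothetical NSD stanza file:
mmcrfs gpfs0 -F nsd_stanzas.txt -j scatter   # create the file system with the scatter block allocation map
mmlsfs gpfs0 -j                              # display the block allocation type of an existing file system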
Figure 16 N to 1 Sequential Performance
From the results we can observe again that the extra drives benefit read and write performance. Performance rises again very quickly with the number of clients used and then reaches a plateau that is fairly stable for reads and writes all the way to the maximum number of threads used in this test. Notice that the maximum read performance was 24.
Figure 17 N to N Random Performance
From the results we can observe that write performance starts at a high value of 29.1K IOps and rises steadily up to 64 threads, where it seems to reach a plateau at around 40K IOps. Read performance, on the other hand, starts at 1.4K IOps and increases almost linearly with the number of clients used (keep in mind that the number of threads is doubled for each data point), reaching the maximum performance of 25.
mpirun --allow-run-as-root -np $Threads --hostfile my_hosts.
Metadata performance with MDtest using 4 KiB files
This test is almost identical to the previous one, except that instead of empty files, small files of 4 KiB were used. The following command was used to execute the benchmark, where $Threads was the variable with the number of threads used (1 to 512, incremented in powers of two), and my_hosts.
limiting the performance to some degree. Since the inode size is 4 KiB and it still needs to store metadata, only files of around 3 KiB will fit inside the inode, and any file bigger than that will use data targets.
PixStor Solution – NVMe Tier
This benchmarking was performed on four R640 NVMe nodes, each with eight Intel P4610 NVMe SSDs, arranged as eight NVMe-over-Fabrics RAID 10 devices using NVMesh, as previously described in this document. Those RAID 10 devices were used as block devices to create NSDs for data only, while the optional HDMD module (two R740s, but with a single ME4024 array) was used to store all the metadata.
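Mapping such devices to data-only NSDs is done with a standard Spectrum Scale NSD stanza file; the following is a minimal sketch in which the device paths, NSD names, server names, pool name, and file name are all assumptions rather than values taken from the solution:
# nvme_nsd_stanzas.txt - one stanza per NVMesh RAID 10 volume
%nsd: device=/dev/nvmesh/raid10_0 nsd=nsd_nvme_0 servers=nvme01,nvme02 usage=dataOnly failureGroup=1 pool=nvmedata
%nsd: device=/dev/nvmesh/raid10_1 nsd=nsd_nvme_1 servers=nvme02,nvme01 usage=dataOnly failureGroup=2 pool=nvmedata
mmcrnsd -F nvme_nsd_stanzas.txt   # define the NSDs so they can be added to the file system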
Figure 20 N to N Sequential Performance
From the results we can observe that write performance rises with the number of threads used and reaches a plateau at around 64 threads, while reads reach their plateau at around 128 threads. Read performance also rises quickly with the number of threads and then stays stable until the maximum number of threads that IOzone allows is reached; therefore, large file sequential performance is stable even for 1024 concurrent clients.
Sequential IOR Performance N clients to 1 file
Sequential N clients to a single shared file performance was measured with IOR version 3.3.0, assisted by OpenMPI v4.0.1 to run the benchmark over the 16 compute nodes. The tests executed varied from one thread up to 512 threads, since there were not enough cores for 1024 or more threads. This benchmark used 8 MiB blocks for optimal performance.
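The invocation presumably matched the IOR command listed in the Benchmark Reference section at the end of this document; a hedged sketch with the 8 MiB transfer size (the host file and variable names are placeholders):
mpirun --allow-run-as-root -np $Threads --hostfile my_hosts.$Threads --map-by node ~/bin/ior -a POSIX -v -w -r -i 3 -t 8m -b $file_size -g -d 3 -e -E -k -o $working_dir/ior.out -s 1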
Figure 21 N to 1 Sequential Performance
From the results we can observe that read and write performance are high regardless of the implicit need for locking mechanisms, since all threads access the same file. Performance rises again very quickly with the number of threads used and then reaches a plateau that is relatively stable for reads and writes all the way to the maximum number of threads used in this test. Notice that the maximum read performance was 51.
./iozone -i0 -I -c -e -w -r 8M -s ${Size}G -t $Threads -+n -+m ./nvme_threadlist
./iozone -i2 -I -c -O -w -r 4k -s ${Size}G -t $Threads -+n -+m ./nvme_threadlist
Figure 22 N to N Random Performance
From the results we can observe that write performance starts at a high value of 6K IOps and rises steadily up to 1024 threads, where it seems it would reach a plateau of over 5M IOPS if more threads could be used.
Since 4 KiB files cannot fit into an inode along with the metadata information, the NVMe nodes will be used to store the data for each file. Therefore, MDtest can give a rough idea of small file performance for reads and for the rest of the metadata operations. The following command was used to execute the benchmark, where $Threads was the variable with the number of threads used (1 to 512, incremented in powers of two), and my_hosts.
The system achieves very good results, as previously reported, with Stat operations reaching the peak value at 64 threads with almost 6.9M op/s and then decreasing at higher thread counts, reaching a plateau. Create operations reach the maximum of 113K op/s at 512 threads, so they are expected to continue increasing if more client nodes (and cores) are used.
server had either 128 GiB or 256 GiB, with a total of 3 TiB. However, to simplify testing and avoid caching effects, all clients were counted as having 256 GiB. Regarding network connectivity, all clients have a Mellanox CX4 VPI adapter configured for 100 Gb Ethernet and connected via a Dell EMC Z9100 switch, so that the limited number of clients could provide the highest possible load for the gateways.
implies each thread was running on a different client), the file size was fixed at twice the amount of memory per client, or 512 GiB. Even though for the PixStor native client the optimum block transfer size is 8 MiB, the block size for large sequential transfers was set to 1 MiB, since that is the maximum size used by NFS for reads and writes.
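A hedged sketch of how a client could mount the export with that 1 MiB limit in place (the gateway name, export path, and mount point are placeholders, not values taken from the solution):
mount -t nfs -o rsize=1048576,wsize=1048576 pixstor-gw1:/pixstor /mnt/pixstor   # 1 MiB NFS read and write transfer sizes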
Clients were mounted manually against the two gateways (odd-numbered clients to the first gateway and even-numbered clients to the second gateway). This manual configuration was used to deterministically have half of the clients mounted from each gateway.
Sequential IOzone Performance N clients to N files
Sequential N clients to N files performance was measured with IOzone version 3.487.
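A minimal sketch of the write and read passes over the NFS mounts, assuming the same IOzone flags used for the native-client tests, the fixed 512 GiB file size discussed above, and a hypothetical client-list file name:
./iozone -i0 -c -e -w -r 1M -s 512G -t $Threads -+n -+m ./nfs_threadlist   # sequential writes with 1 MiB records
./iozone -i1 -c -e -w -r 1M -s 512G -t $Threads -+n -+m ./nfs_threadlist   # sequential reads of the same files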
Figure 25 N to N Sequential Performance - SMB
From the results we can observe that write performance rises steadily with the number of threads, almost reaching a plateau of about 19 GB/s at 16 threads and attaining the maximum performance of 20.2 GB/s at 512 threads. Read performance, on the other hand, does not rise immediately as expected, but stays almost flat at 2 threads.
Conclusion and Future Work
The Dell EMC Ready Solution for HPC PixStor Storage is a high-performance file system solution that is very efficient, easy to manage, fully supported, and multi-tier; it is scalable in throughput and in capacity, and it includes components that allow connecting to it via standard protocols like NFS and SMB, or to the cloud.
References
Dell EMC Ready Solution for HPC PixStor Storage
Dell EMC Ready Solution for HPC PixStor Storage - Capacity Expansion
Dell EMC Ready Solution for HPC PixStor Storage - NVMe Tier
Dell EMC Ready Solution for HPC PixStor Storage - Gateway Nodes
Spectrum Scale (GPFS) Overview
NVMesh Datasheet
Clustered Trivial Data-Base (CTDB)
NFS-Ganesha
PowerVault ME4 Support Matrix
Dell EMC ME4 Series Storage System Administrator's Guide
IOzone Benchmark
IOR & MDtest Benchmarks
Open-MPI Software
Benchmark Reference
This section describes the commands that were used to benchmark the Dell EMC HPC Storage solutions.
IOzone
The following commands are examples of how to run the sequential and random IOzone tests whose results are reported in the performance characterization sections.
IOR (N to 1)
The following command is an example to run the IOR N-1 performance tests:
mpirun --allow-run-as-root --hostfile $hostlist --map-by node -np $threads ~/bin/ior -a POSIX -v -w -r -i 3 -t 8m -b $file_size -g -d 3 -e -E -k -o $working_dir/ior.out -s 1
Table 13 describes the IOR command line options.
MDtest
The following command is an example to run the metadata tests:
mpirun --allow-run-as-root -machinefile $hostlist --map-by node -np $threads $mdtest -v -d $working_dir -i ${repetitions} -b $n_directories -z 1 -L -I $n_files -y -u -t -F -w 4K -e 4K
Table 14 describes the MDtest command line options.