HP IO Accelerator Performance Tuning Guide

Abstract

This guide is designed to help verify and improve HP IO Accelerator performance.
© Copyright 2011 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. The only warranties for HP products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. HP shall not be liable for technical or editorial errors or omissions contained herein. Microsoft, Windows, and Windows Server are U.S. registered trademarks of Microsoft Corporation.
Contents

Introduction
About the Performance and Tuning Guide
System performance
Verifying Linux system performance
Debugging performance issues
General tuning techniques
Tuning techniques for writes
Linux filesystem tuning
Using the IO Accelerator as swap space
fio benchmark
Verifying IO Accelerator performance on Windows operating systems
Programming using direct I/O
Windows driver affinity
Setting Windows driver affinity
Acronyms and abbreviations
Introduction

About the Performance and Tuning Guide

Welcome to the Performance and Tuning Guide for the HP IO Accelerator. This guide is designed to help you achieve the following objectives:

• Verify IO Accelerator performance on Linux, including using sample benchmarks and solving common performance issues.
• Verify IO Accelerator performance on the Windows® operating system, including using sample benchmarks and solving common performance issues.
System performance

Verifying Linux system performance

To verify Linux system performance with an IO Accelerator, HP recommends using the fio benchmark. This benchmark was developed by Jens Axboe, a Linux kernel developer. Fio is included in many distributions, or it can be compiled from source. The latest source distribution (http://freshmeat.net/projects/fio) requires having the libaio development headers in place.
$ fio --filename=/dev/fioa --direct=1 --rw=randwrite --bs=4k --size=5G --numjobs=64 --runtime=10 --group_reporting --name=file1

These tests are also available as fio job input files, which can be requested from HP support (http://www.hp.com/go/support). The latest expected performance numbers for your card type can be found in the HP PCIe IO Accelerator for ProLiant Servers Data Sheet (http://h20195.www2.hp.com/V2/GetPDF.aspx/4AA0-4235ENW.pdf). Your results should exceed those on the data sheet.
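For reference, fio job-file parameters mirror the command-line options, so the random-write command above corresponds to a job file along these lines (a sketch, not one of the HP-supplied job files):

[file1]
filename=/dev/fioa
direct=1
rw=randwrite
bs=4k
size=5G
numjobs=64
runtime=10
group_reporting

Such a file is run with a command like $ fio file1.fio.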
READ: io=4,058MiB, aggrb=414MiB/s, minb=414MiB/s, maxb=414MiB/s, mint=10036msec, maxt=10036msec
Disk stats (read/write):
fioa: ios=1038929/0, merge=0/0, ticks=389591/0, in_queue=61732674, util=99.12%

The output below shows the test achieving 779 MiB/s, with random reads done on 1MB blocks.

Read bandwidth test on Linux:

file1: (g=0): rw=randread, bs=1M-1M/1M-1M, ioengine=sync, iodepth=1
...
file1: (g=0): rw=randread, bs=1M-1M/1M-1M, ioengine=sync, iodepth=1
Starting 4 processes
Jobs: 4 (f=4): [rrrr] [100.0% done]
Starting 64 processes
Jobs: 64 (f=64): [wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww] [100.0% done] [0/422981 kb/s] [eta 00m:00s]
file1: (groupid=0, jobs=64): err= 0: pid=28964
write : io=4,138MiB, bw=424MiB/s, iops=106K, runt=10001msec
clat (usec): min=31, max=68,803, avg=589.54, stdev=67.72
bw (KiB/s) : min=0, max=17752, per=1.44%, avg=6231.17, stdev=182.09
cpu : usr=0.37%, sys=2.97%, ctx=10599830, majf=0, minf=576
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, >=64=0.0%
Debugging performance issues

Improperly configured benchmark

Issue

The most common obstacle to achieving full performance with the IO Accelerator is an improperly configured micro-benchmark.

Solution

Start with the benchmarks described in the previous sections to ensure that the system is performing properly with a known benchmark.
Send the bundle to HP support (http://www.hp.com/go/support), and request assistance in debugging the performance issue. Below is an example of a bandwidth error reported by fio-pci-check, caused by an over-subscription condition. Any line marked with an asterisk indicates a possible problem detected by the fio-pci-check utility.

fio-pci-check
Root Bridge PCIe 3000 MB/sec
Bridge 00:02.00 (01-05)
*Needed 3000 MB/sec Avail 2000 MB/sec
Bridge 01:00.
IMPORTANT: Some PCI Express chips do not properly report PCIe errors, or they might report errors when none exist. In most cases, this occurs on a bridge chip. This failure typically appears under the following conditions:

• Multiple rapid executions of the fio-pci-check utility were issued.
• No data is passing over the bus that is reporting errors.
• All drivers for attached peripherals are unloaded.

Below is an example of PCI Express errors captured on a system with an IO Accelerator.
Correctable Error Reporting: enabled
Non-fatal Error Reporting: enabled
Unsupported Request Reporting: enabled
Current status: 0x0000
Correctable Error(s): None
Non-Fatal Error(s): None
Fatal Error(s): None
Unsupported Type(s): None
Current link capabilities: 0x02014501
Maximum link speed: 2.5 Gb/s
Maximum link width: 16 lanes
Current link status: 0x00001041
Link speed: 2.5 Gb/s
Link width is 4 lanes
ioDrive 01:00.
Benchmarking through a filesystem

Issue

Although using a filesystem is necessary for most storage deployments, it involves additional work to access the data stored on the IO Accelerator. These additional lookups decrease maximum system performance compared to the results achieved by benchmarking directly against the block device.

Solution

When you are running micro-benchmarks to vet system performance, benchmark by accessing the block device directly.
Use a kernel that contains the fix to avoid this issue. For more information, see the patch (http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=b90f687018e6d6 ) and the bug report (https://bugzilla.kernel.org/show_bug.cgi?id=15579). The fix is included in the RHEL6 pre-release kernel kernel-2.6.32-23.el6, so the released RHEL6 kernel is not affected by this issue. Discard support was added to the kernel.org mainline ext4 filesystem in version 2.6.28 and was enabled by default.
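On kernels where discard is not enabled by default, it can be requested explicitly at mount time. A minimal sketch (the mount point /mnt/iodrive is an assumption):

$ mount -t ext4 -o discard /dev/fioa /mnt/iodrive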
General tuning techniques

Using direct, unbuffered, or zero-copy I/O

Traditional I/O paths include the page cache, a DRAM cache of data stored on the disk. The IO Accelerator is fast enough that this and other traditional optimizations, such as I/O merging and reordering, are actually detrimental to performance. I/O merging and reordering are eliminated naturally by the IO Accelerator, but the page cache must be bypassed at the application level. Direct I/O bypasses the page cache.
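As a quick illustration (a sketch, assuming GNU coreutils dd and an IO Accelerator block device at /dev/fioa), a read that bypasses the page cache can be issued with the iflag=direct option:

$ dd if=/dev/fioa of=/dev/null bs=1M count=1024 iflag=direct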
Unlike storage interconnects that rely on network protocols and have long pipelines, the IO Accelerator does not suffer from major latency increases as the number of outstanding I/Os increases. The primary methods for generating outstanding I/Os are:

• Using multiple threads
• Using multiple processes
• Using AIO

For small-packet, IOPS-geared applications, having multiple threads or outstanding AIO requests generally yields a significant performance improvement over a single thread.
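For example, outstanding I/Os can be generated from a single thread with the fio tool introduced earlier, using its libaio engine and a queue depth of 32 (a sketch; /dev/fioa and the job name aio-test are assumptions):

$ fio --filename=/dev/fioa --direct=1 --rw=randread --bs=4k --ioengine=libaio --iodepth=32 --numjobs=1 --runtime=10 --group_reporting --name=aio-test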
• Using the fio-format command to re-initialize the drive
• Performing large sequential writes to the drive

For more details on using the fio-format command, see the IO Accelerator User Guide for your operating system.

Increasing outstanding requests allowed by the kernel (Linux only)

This section applies to running the 2.0 driver series and later on Linux, with the default value (3) for the use_workqueue driver option.
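A sketch of raising the limit, assuming the device appears as /dev/fioa and exposes the standard Linux block-layer queue attributes:

$ echo 4096 > /sys/block/fioa/queue/nr_requests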
NOTE: HP recommends using this option only with a 4 KiB sector size.

preallocate_memory
Description: Indicates the devices that need memory pre-allocated. A null list disables the pre-allocation of memory.

preallocate_mb
Description: Specifies the number of megabytes of memory to pre-allocate.

The sector size is set with the fio-format command-line utility using the -b option. Larger sector sizes result in less memory utilization.
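For example, a device can be formatted with a 4 KiB sector size as follows (a sketch; the control device node name, shown here as /dev/fct0, varies by installation — see the IO Accelerator User Guide):

$ fio-format -b 4096 /dev/fct0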
Tuning techniques for writes

Increased steady-state write performance with fio-format

Under a sufficiently long sustained workload, write performance decreases. This lower, sustained level of performance, known as steady-state write performance, is common to all solid-state storage technologies. Even though most enterprise workloads induce steady-state write behavior in IO Accelerator devices, the devices have high enough native performance that most applications see little to no performance penalty.
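One common way to raise steady-state write performance is to format the device below its full capacity, leaving additional reserve space. A sketch only (the -s size parameter and the /dev/fct0 node are assumptions — consult the IO Accelerator User Guide for the exact fio-format syntax):

$ fio-format -s 70% /dev/fct0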
Linux filesystem tuning

ext2/3/4 tuning

XFS is currently the recommended filesystem. It can achieve up to three times the performance of a tuned ext2/ext3 solution. At this time, there is no known additional tuning for running XFS in a single- or multi-IO Accelerator configuration.

Setting stride size and stripe width for ext2/3 (extN) when using RAID

The extN filesystem family has create-time options of stride and stripe width.
For example, for a RAID 5 array of five devices with a 256 KiB chunk size and an 8 KiB filesystem block size:

stride = chunk size / block size = 256K / 8K = 32
data-bearing disks (dbd) = 5 - 1 = 4
stripe_width = dbd * stride = 4 * 32 = 128

This results in the following command:

$ mkfs.ext3 -b 8192 -E stride=32,stripe_width=128 /dev/md0

For more information on setting the stripe size, see the article "Optimizing the EXT3 file system on CentOS" (http://wiki.centos.org/HowTos/Disk_Optimization).

Using the IO Accelerator as swap space

NOTE: Adding multiple swap partitions usually slows down swap performance, even if they are set at the same priority.
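Given that a single swap area is preferred (see the NOTE above), a minimal sketch of enabling swap on the device follows, assuming a dedicated partition /dev/fioa1 has already been created:

$ mkswap /dev/fioa1
$ swapon /dev/fioa1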
fio benchmark

Compiling the fio benchmark

The fio benchmarking utility is used to verify Linux system performance with an IO Accelerator. To compile the fio utility, get the latest version of the fio source from the project page (http://freshmeat.net/projects/fio). For tarballs of all fio releases, see the Index of /snaps (http://brick.kernel.dk/snaps).

1. Install the necessary standard dependencies, for example gcc:
$ yum -y install gcc
2. Extract the fio source tarball and compile it, as shown below.
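A typical build sequence looks like the following (a sketch; substitute the actual version number of the tarball you downloaded for X.Y):

$ tar xjf fio-X.Y.tar.bz2
$ cd fio-X.Y
$ make
$ make install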
Verifying IO Accelerator performance on Windows operating systems

Using Iometer to verify IO Accelerator performance on Windows operating systems

To set up an IO Accelerator to work with Iometer:

1. Ensure that you have the latest driver and firmware for the target IO Accelerator. For best results, be sure you are using a capable quad-core or higher CPU.
2. Load Iometer, and add at least eight threads and 64 outstanding I/Os per target.
3. Select the drive to test for each thread.
4. Define an access specification for the test, and then run it.
Programming using direct I/O

Using direct I/O on Linux

Under Linux, the best way to enable direct I/O is on a per-file basis, by passing the O_DIRECT flag to the open() system call.
The following sample program writes to the device with direct I/O until the device is full:

#define _GNU_SOURCE /* required for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#define FILENAME "/dev/fioa" /* example target device */

int main(void)
{
    int fd, ret = -1;
    long long bytes_written = 0;
    size_t ps = sysconf(_SC_PAGESIZE);
    void *buf;

    /* O_DIRECT transfers require page-aligned buffers */
    if (posix_memalign(&buf, ps, ps*256) != 0) {
        perror("Allocation failed");
        exit(ret);
    }
    memset(buf, 0xaa, ps*256);
    if ((fd = open(FILENAME, O_WRONLY | O_DIRECT)) < 0) {
        perror("Open failed");
        exit(ret);
    }
    /* write until the device is full or an error occurs */
    while ((ret = pwrite(fd, buf, ps*256, bytes_written)) == ps*256) {
        bytes_written += ret;
    }
    printf("Wrote %lld GB\n", bytes_written/1000/1000/1000);
    close(fd);
    free(buf);
    return 0;
}

Using direct I/O on Windows

Direct I/O on Windows operating systems is set up through the CreateFile() call, using the FILE_FLAG_NO_BUFFERING flag.
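A minimal sketch of opening a device for unbuffered writes on Windows follows (the device path is an assumption; with FILE_FLAG_NO_BUFFERING, buffer addresses and transfer sizes must be sector-aligned):

#include <windows.h>

int main(void)
{
    /* open the device with buffering disabled */
    HANDLE h = CreateFileA("\\\\.\\PhysicalDrive1", /* hypothetical device path */
                           GENERIC_WRITE,
                           0,                       /* no sharing */
                           NULL,
                           OPEN_EXISTING,
                           FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH,
                           NULL);
    if (h == INVALID_HANDLE_VALUE)
        return 1; /* inspect GetLastError() for details */
    /* ... issue sector-aligned WriteFile() calls here ... */
    CloseHandle(h);
    return 0;
}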
The following sample C++ class wraps Linux direct I/O:

#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cerrno>
#include <cstdlib>
#include <cstring>
#include <iostream>
using namespace std;

#define FILENAME "/dev/fioa" /* example target device */

class DirectFile {
public:
    int open(char* filename);
    int write(char* buf, int bufsize);
    int close();
    size_t gbytesWritten() { return bytes_written / 1000 / 1000 / 1000; }
private:
    int fd;
    size_t bytes_written;
};

int DirectFile::open(char* filename)
{
    int ret = 0;
    fd = ::open(filename, O_DIRECT | O_WRONLY, S_IRUSR | S_IWUSR | S_IRGRP);
    if (fd == -1) {
        ret = -1;
    }
    bytes_written = 0;
    return ret;
}

int DirectFile::write(char* buf, int bufsize)
{
    int written = pwrite(fd, buf, bufsize, bytes_written);
    if (written >= 0)
        bytes_written += written;
    return written;
}

int DirectFile::close()
{
    return ::close(fd);
}
int main()
{
    DirectFile file;
    size_t ps = sysconf(_SC_PAGESIZE);
    int bufsize = ps * 256;
    char* buf;
    int ret = -1;

    /* O_DIRECT requires an aligned buffer */
    if (posix_memalign((void**)&buf, ps, bufsize) != 0) {
        cerr << "Allocation failed" << endl;
        exit(ret);
    }
    memset(buf, 0xaa, bufsize);
    if (file.open(FILENAME) < 0) {
        cerr << "Open of " << FILENAME << " failed" << endl;
        exit(ret);
    }
    do {
        ret = file.write(buf, bufsize);
        if (ret < 0) {
            cerr << endl << "Error writing bytes to " << FILENAME << ". written=" << file.gbytesWritten() << "GBytes";
            cerr << ", errno=" << errno << " " << strerror(errno) << endl;
        } else {
            cout << ".";
        }
    } while (ret > 0);
    if (ret >= 0)
        cout << endl << "Wrote " << file.gbytesWritten() << " GBytes" << endl;
    file.close();
    free(buf);
    return 0;
}
Windows driver affinity

Setting Windows driver affinity

On a multiprocessor system, the operating system routes an I/O request through as efficient a path as its programming permits. Often this path is not the optimal performance path, primarily due to system architecture. A user who is aware of the particular hardware layout of a system can maximize driver performance by specifying the routing of its I/O.
• Buses 36 and 37: Node 1 (processors 2 and 3)
• Buses 82 and 86: Node 2 (processors 4 and 5)
• Buses 169 and 170: Node 3 (processors 6 and 7)

Because the Windows Server® operating system designates each processor according to its node numbering, it assigns the two processors of node 0 as processor 0 and processor 1. These are represented internally by a bitmask whose offsets correspond to the processor numbers.
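For example, node 0 holds processors 0 and 1, so an affinity bitmask targeting node 0 sets bits 0 and 1: (1 << 0) | (1 << 1) = 0x3. Likewise, node 3 holds processors 6 and 7, giving a mask of (1 << 6) | (1 << 7) = 0xC0. (This illustrates the bitmask arithmetic only, not any particular configuration syntax.)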
Acronyms and abbreviations

AIO
asynchronous input/output

CPU
central processing unit

DMA
direct memory access

DRAM
dynamic random access memory

I/O
input/output

IOPS
input/output operations per second

LBA
logical block addressing

NUMA
Non-Uniform Memory Architecture

PCIe
peripheral component interconnect express

QDR
quad data rate

SSS
solid state storage