HP-UX VxFS tuning and performance
Technical white paper
Executive summary
File system performance is critical to overall system performance. While memory latencies are measured in nanoseconds, I/O latencies are measured in milliseconds. To maximize the performance of your systems, the file system must be as fast as possible: performing efficient I/O, eliminating unnecessary I/O, and reducing file system overhead. In the past several years many changes have been made to the Veritas File System (VxFS) as well as to HP-UX. This paper is based on VxFS 3.
Understanding VxFS
To fully understand the various file system creation options, mount options, and tunables, a brief overview of VxFS is provided.

Software versions vs. disk layout versions
VxFS supports several different disk layout versions (DLVs). The default disk layout version can be overridden when the file system is created using mkfs(1M). Also, the disk layout can be upgraded online using the vxupgrade(1M) command. Table 1.
Extent allocation
When a file is initially opened for writing, VxFS is unaware of how much data the application will write before the file is closed. The application may write 1 KB of data or 500 MB of data. The size of the initial extent is the smallest power of 2 greater than or equal to the size of the write, with a minimum extent size of 8 KB. Fragmentation will limit the extent size as well. If the current extent fills up, the extent will be extended if neighboring free space is available.
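The initial extent sizing policy described above can be sketched as follows. This is a simplified illustration of the policy as stated, not VxFS source code; it ignores the fragmentation and free-space limits mentioned above.

```python
MIN_EXTENT = 8 * 1024  # minimum initial extent size: 8 KB

def initial_extent_size(write_size):
    """Smallest power of 2 that covers the first write, with an 8 KB floor."""
    size = MIN_EXTENT
    while size < write_size:
        size *= 2
    return size

print(initial_extent_size(1024))        # a 1 KB write still gets an 8 KB extent
print(initial_extent_size(100 * 1024))  # a 100 KB write gets a 128 KB extent
```

So small files waste little space (8 KB floor), while a large first write immediately claims one large contiguous extent instead of many small ones.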
File system reorganization attempts to collect large areas of free space by moving various file extents, and attempts to defragment individual files by copying the data from small extents into larger extents. The reorganization is not a compaction utility and does not try to move all the data to the front of the file system.

Note
fsadm extent reorganization may fail if there are no sufficiently large free areas to perform the reorganization.
Data access methods Buffered/cached I/O By default, most access to files in a VxFS file system is through the cache. In HP-UX 11i v2 and earlier, the HP-UX Buffer Cache provided the cache resources for file access. In HP-UX 11i v3 and later, the Unified File Cache is used. Cached I/O allows for many features, including asynchronous prefetching known as read ahead, and asynchronous or delayed writes known as flush behind.
Read ahead with VxVM stripes The default value for read_pref_io is 64 KB, and the default value for read_nstream is 1, except when the file system is mounted on a VxVM striped volume where the tunables are defaulted to match the striping attributes. For most applications, the 64 KB read ahead size is good (remember, VxFS attempts to maintain 4 segments of read ahead size).
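As a rough model of the sizing arithmetic (the tunable names and the 4-segment figure come from the text above; this is an illustration, not the VxFS implementation):

```python
def read_ahead_window(read_pref_io=64 * 1024, read_nstream=1, segments=4):
    """Approximate total read ahead VxFS tries to keep in flight:
    4 segments of read_pref_io * read_nstream."""
    return segments * read_pref_io * read_nstream

# With the defaults (64 KB preferred I/O size, 1 stream), the window is 256 KB.
print(read_ahead_window() // 1024)  # 256
```

On a VxVM striped volume where the defaults track the striping attributes (for example, a 64 KB stripe unit across 4 columns), the window scales accordingly.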
For example, two threads can be reading from the same file sequentially, and both threads can benefit from the configured read ahead size.

Figure 3. Enhanced read ahead (a patterned read: 64K, 128K, 64K, 128K, 64K)

Enhanced read ahead can be enabled on a per file system basis by setting the read_ahead tunable to 2 with vxtunefs(1M).

Read ahead on 11i v3
With the introduction of the Unified File Cache (UFC) on HP-UX 11i v3, significant changes were needed for VxFS.
Figure 4. Flush behind during a sequential write

As data is written to a VxFS file, VxFS will perform "flush behind" on the file. In other words, it will issue asynchronous I/O to flush the buffers from the buffer cache to disk. The flush behind amount is calculated by multiplying the write_pref_io and write_nstream file system tunables. The default flush behind amount is 64 KB.

Flush behind on HP-UX 11i v3
By default, flush behind is disabled on HP-UX 11i v3.
Figure 5. I/O throttling during a sequential write

Flush throttling (max_diskq)
The amount of dirty data being flushed per file cannot exceed the max_diskq tunable. A process performing a write() system call will skip the flush behind if the amount of outstanding I/O exceeds max_diskq.
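The two tunables interact as sketched below. This is an illustration of the rules stated above, assuming a hypothetical per-file count of outstanding dirty I/O; max_diskq is a caller-supplied value here rather than a hard-coded default.

```python
def flush_behind_amount(write_pref_io=64 * 1024, write_nstream=1):
    """Flush behind amount = write_pref_io * write_nstream (64 KB by default)."""
    return write_pref_io * write_nstream

def should_flush_behind(outstanding_io, max_diskq):
    """Skip the flush behind when outstanding I/O for the file exceeds max_diskq."""
    return outstanding_io <= max_diskq

print(flush_behind_amount())  # 65536
```

Raising max_diskq lets more dirty data accumulate per file before writers skip the flush behind; lowering it throttles a single file's dirty data sooner.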
The read flush behind feature has the advantage of preventing a read of a large file (such as during a backup, file copy, or gzip) from consuming large amounts of file cache. However, the disadvantage of read flush behind is that a file may not be able to reside entirely in cache. For example, if the file cache is 6 GB in size and a 100 MB file is read sequentially, the file will likely not reside entirely in cache.
Note that buffer merging was not implemented initially on IA-64 systems with VxFS 3.5 on 11i v2, so using the default max_buf_data_size of 8 KB would result in a maximum physical I/O size of 8 KB. Buffer merging was implemented in 11i v2 0409, released in the fall of 2004.

Page sizes on HP-UX 11i v3 and later
On HP-UX 11i v3 and later, VxFS performs cached I/O through the Unified File Cache. The UFC is page-based, and the max_buf_data_size tunable has no effect on 11i v3. The default page size is 4 KB.
Note
The OnlineJFS license is required to perform direct I/O when using VxFS 5.0 or earlier. Beginning with VxFS 5.0.1, direct I/O is available with the Base JFS product.

Discovered direct I/O
With HP OnlineJFS, direct I/O will be enabled if the read or write size is greater than or equal to the file system tunable discovered_direct_iosz. The default discovered_direct_iosz is 256 KB. As with direct I/O, all discovered direct I/O will be synchronous.
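The cutover is a simple size comparison, sketched here from the rule stated above:

```python
DISCOVERED_DIRECT_IOSZ = 256 * 1024  # default discovered_direct_iosz

def io_path(request_size):
    """Requests of discovered_direct_iosz bytes or more bypass the cache."""
    if request_size >= DISCOVERED_DIRECT_IOSZ:
        return "discovered direct I/O (synchronous)"
    return "buffered I/O"

print(io_path(8 * 1024))    # buffered I/O
print(io_path(256 * 1024))  # discovered direct I/O (synchronous)
```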
Figure 8. Example of unaligned direct I/O: a 16 KB random write starting mid-block spans three 8 KB file system blocks and is split into a 4 KB buffered I/O, an 8 KB direct I/O, and a 4 KB buffered I/O.

Using a smaller block size, such as 1 KB, will improve the chances of doing more optimal direct I/O. However, even with a 1 KB file system block size, unaligned I/Os can occur. Also, when doing direct I/O writes, a file's buffers must be searched to locate any buffers that overlap with the direct I/O request.
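The split shown in Figure 8 can be reproduced with a small helper. The function name and return format are ours, purely for illustration of how an unaligned request decomposes into buffered and block-aligned pieces:

```python
def split_direct_write(offset, length, bsize):
    """Split a direct I/O request into buffered (unaligned) and direct
    (block-aligned) pieces: (kind, offset, length) tuples in bytes."""
    end = offset + length
    head_end = min(((offset + bsize - 1) // bsize) * bsize, end)  # round up
    tail_start = max((end // bsize) * bsize, head_end)            # round down
    parts = []
    if head_end > offset:                       # unaligned leading piece
        parts.append(("buffered", offset, head_end - offset))
    if tail_start > head_end:                   # aligned middle
        parts.append(("direct", head_end, tail_start - head_end))
    if end > tail_start:                        # unaligned trailing piece
        parts.append(("buffered", tail_start, end - tail_start))
    return parts

# The 16 KB random write from Figure 8: offset 4 KB, 8 KB file system blocks
print(split_direct_write(4096, 16384, 8192))
```

With a 1 KB block size the same 16 KB write at a 4 KB offset is fully aligned and goes entirely through direct I/O.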
Concurrent I/O The main problem addressed by concurrent I/O is VxFS inode lock contention. During a read() system call, VxFS will acquire the inode lock in shared mode, allowing many processes to read a single file concurrently without lock contention. However, when a write() system call is made, VxFS will attempt to acquire the lock in exclusive mode. The exclusive lock allows only one write per file to be in progress at a time, and also blocks other processes reading the file.
Figure 10. Locking with concurrent I/O (cio)

To enable a file system for concurrent I/O, the file system simply needs to be mounted with the cio mount option, for example:

# mount -F vxfs -o cio,delaylog /dev/vgora5/lvol1 /oracledb/s05

Concurrent I/O was introduced with VxFS 3.5; however, a separate license was needed to enable it. With the introduction of VxFS 5.0.1 on HP-UX 11i v3, the concurrent I/O feature of VxFS is now available with the OnlineJFS license.
Zero Filled On Demand (ZFOD) extents are new with VxFS 5.0 and are created by the VX_SETEXT ioctl() with the VX_GROWFILE allocation flag, or with the setext(1M) growfile option.

Not all applications support concurrent I/O
By using a shared lock for write operations, concurrent I/O breaks some POSIX standards. If two processes are writing to the same block, there is no coordination between them, and data modified by one process can be overwritten by data modified by the other.
Creating your file system
When you create a file system using newfs(1M) or mkfs(1M), you need to be aware of several options that could affect performance.

Block size
The block size (bsize) is the smallest amount of space that can be allocated to a file extent. Most applications will perform better with an 8 KB block size. Extent allocations are easier, and file systems with an 8 KB block size are less likely to be impacted by fragmentation, since each extent has a minimum size of 8 KB.
Table 3. Default intent log size

  FS size      VxFS 3.5/4.1, or DLV <= 5    VxFS 5.0 or later and DLV >= 6
  <= 8 MB      1 MB                         1 MB
  <= 512 MB    16 MB                        16 MB
  <= 16 GB     16 MB                        64 MB
  > 16 GB      16 MB                        256 MB

For most applications, the default log size is sufficient. However, large file systems with heavy simultaneous structural changes by multiple threads, or heavy synchronous write operations with datainlog, may need a larger intent log to prevent transaction stalls when the log is full.
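Table 3 can be expressed as a simple lookup (sizes in MB; "new layout" here means VxFS 5.0 or later with DLV >= 6; the function is ours, for illustration):

```python
def default_intent_log_size_mb(fs_size_mb, new_layout=True):
    """Default intent log size in MB, per Table 3."""
    if fs_size_mb <= 8:
        return 1
    if not new_layout:           # VxFS 3.5/4.1, or DLV <= 5: capped at 16 MB
        return 16
    if fs_size_mb <= 512:
        return 16
    if fs_size_mb <= 16 * 1024:  # up to 16 GB
        return 64
    return 256

print(default_intent_log_size_mb(100 * 1024))  # a 100 GB file system: 256 MB log
```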
Note
To improve your large directory performance, upgrade to VxFS 5.0 or later and run vxupgrade to upgrade the disk layout to version 7.

Mount options
blkclear
When new extents are allocated to a file, the extents will contain whatever was last written on disk until new data overwrites the old uninitialized data. Accessing the uninitialized data could cause a security problem, as sensitive data may remain in the extents.
Using datainlog has no effect on normal asynchronous writes or on synchronous writes performed with direct I/O. The option nodatainlog is the default for systems without HP OnlineJFS, while datainlog is the default for systems that do have HP OnlineJFS.

mincache
By default, I/O operations are cached in memory using the HP-UX buffer cache or Unified File Cache (UFC), which allows for asynchronous operations such as read ahead and flush behind.
cio
The cio option enables the file system for concurrent I/O. Prior to VxFS 5.0.1, a separate license was needed to use the concurrent I/O feature. Beginning with VxFS 5.0.1, concurrent I/O is available with the OnlineJFS license. Concurrent I/O is recommended for applications that support its use.

remount
The remount option allows a file system to be remounted online with different mount options. The mount -o remount can be done online without taking down the application accessing the file system.
Table 4. Intent log flush behavior with VxFS 3.
qio
The qio option enables Veritas Quick I/O for Oracle databases.

Dynamic file system tunables
File system performance can be impacted by a number of dynamic file system tunables. These values can be changed online using the vxtunefs(1M) command, or they can be set when the file system is mounted by placing the values in the /etc/vx/tunefstab file (see tunefstab(4)). Dynamic tunables that are not mentioned below should be left at their defaults.
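For example, to make tunable settings persist across mounts, entries can be placed in /etc/vx/tunefstab. The device name below is the one from the mount example earlier, and the values are illustrative only; consult tunefstab(4) for the exact syntax supported on your release.

```
# /etc/vx/tunefstab
/dev/vgora5/lvol1  read_pref_io=65536,read_nstream=4
```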
read_ahead
The default read_ahead value is 1 and is sufficient for most file systems. Some file systems with nonsequential patterns may work best if enhanced read ahead is enabled by setting read_ahead to 2. Setting read_ahead to 0 disables read ahead. This tunable does not affect direct I/O or concurrent I/O.

discovered_direct_iosz
Read and write requests greater than or equal to discovered_direct_iosz are performed as direct I/O.
The buffer cache grows quickly as new buffers are needed, but it is slow to shrink: memory pressure must be present for the buffer cache to shrink.

Figure 13. HP-UX 11i v2 buffer cache tunables (dbc_min_pct, dbc_max_pct)

The buffer cache should be configured large enough to contain the most frequently accessed data. However, processes often read large files once (for example, during a file copy), causing more frequently accessed pages to be flushed or invalidated from the buffer cache.
Unified File Cache on HP-UX 11i v3
filecache_min / filecache_max
The Unified File Cache provides a file caching function similar to the HP-UX buffer cache, but it is managed much differently. As mentioned earlier, the UFC is page-based rather than buffer-based. HP-UX 11i v2 maintained separate caches: a buffer cache for standard file access and a page cache for memory-mapped file access. The UFC caches both normal file data and memory-mapped file data, resulting in a unified file and page cache.
The kernel tunable vxfs_bc_bufhwm specifies the maximum amount of memory in kilobytes (the high water mark) to allow for the buffer pages. By default, vxfs_bc_bufhwm is set to zero, which means the default maximum size is based on the physical memory size (see Table 5). The vxfsstat(1M) command with the -b option can be used to verify the size of the VxFS metadata buffer cache. For example:

# vxfsstat -b /
buffer cache statistics
    120320 Kbyte current    544768 maximum
  98674531 lookups  99.
Note
When setting vx_ninode to reduce the JFS inode cache, use the -h option with kctune(1M) to hold the change until the next reboot, to prevent temporary hangs or Serviceguard TOC events as the vxfsd daemon becomes very active shrinking the JFS inode cache.

vxfs_ifree_timelag
By default, the VxFS inode cache is dynamically sized. The inode cache typically expands very rapidly, then shrinks over time.
VxFS ioctl() options
Cache advisories
While the mount options allow you to change the cache advisories on a per-file-system basis, the VX_SETCACHE ioctl() allows an application to change the cache advisory on a per-file basis. The following options are available with the VX_SETCACHE ioctl():

VX_RANDOM - Treat reads as random I/O and do not perform read ahead
VX_SEQ - Treat reads as sequential and perform the maximum amount of read ahead
VX_DIRECT - Bypass the buffer cache; all I/O to the file is synchronous
Patches
Performance problems are often resolved by patches to the VxFS subsystem, such as the patches for the VxFS read ahead issues on HP-UX 11i v3. Be sure to check the latest patches for fixes to various performance-related problems.

Summary
There is no single set of tunable values that applies to every system. You must understand how your application accesses data in the file system to decide which options and tunables can be changed to maximize the performance of your file system.
For additional information
For additional reading, please refer to the following documents:

HP-UX VxFS mount options for Oracle Database environments
http://h20195.www2.hp.com/V2/GetDocument.aspx?docname=4AA1-9839ENW&cc=us&lc=en

Common Misconfigured HP-UX Resources
http://bizsupport1.austin.hp.com/bc/docs/support/SupportManual/c01920394/c01920394.pdf

Veritas™ File System 5.0.1 Administrator's Guide (HP-UX 11i v3)
http://h20000.www2.hp.com/bc/docs/support/SupportManual/c02220689/c02220689.