HP-UX 11i TCP/IP Performance White Paper

Contents
1 Introduction
1.1 Intended Audience
1.2 Organization of the document
1.3 Related Documents
...
4.2.2 Socket Caching for TCP Connections
4.2.3 Tuning tcphashsz
4.2.4 Tuning the listen queue limit
4.2.5 Using MSG_EOF flag for TCP Applications
...
1 Introduction
This white paper is intended as a guide to tuning networking performance at the network and transport layers, including IPv4, IPv6, TCP, UDP, and related protocols. Some topics touch on other areas, including sockets interfaces, network interface drivers, and application protocols; however, these are not the focus of this paper. Other information is available for these subsystems, as referenced below.
• RFC 2018: TCP Selective Acknowledgement Options
• RFC 2861: TCP Congestion Window Validation
• RFC 3042: Enhancing TCP's Loss Recovery Using Limited Transmit
• RFC 3390: Increasing TCP's Initial Window
• RFC 3782: The NewReno Modification to TCP's Fast Recovery Algorithm
• RFC 4138: Forward RTO-Recovery (F-RTO): An Algorithm for Detecting Spurious Retransmission Timeouts
2 Out of the Box TCP/IP Performance Features for HP-UX Servers
The HP-UX Networking Stack is especially engineered and tested for optimum performance in an enterprise mission-critical environment. HP-UX 11i v3 exhibits excellent NFS server performance and excellent results in the TPC-C benchmark, a measurement of intensive online transaction processing (OLTP) in a database environment. Typically, OLTP includes a mixture of read-only or update, short or long, and interactive or deferred database transactions.
sufficiently large for a given bandwidth-delay product, the transport is better positioned to take full advantage of the remote TCP's advertised window.
2.2 Selective Acknowledgement (RFC 2018)
TCP may experience poor performance when multiple packets are lost from one window of data. Selective Acknowledgment (SACK), described in RFC 2018, is effective in recovering from the loss of multiple segments in a window.
The large initial congestion window (RFC 3390) increases the permitted initial window from one or two segments to four segments or 4380 bytes, whichever is less. For example, when MSS is 1460 bytes, the TCP connection starts with three segments (3*1460=4380). By default, HP-UX uses the large initial congestion window. This is configured by the ndd tunable tcp_cwnd_initial. The large initial congestion window is especially effective for connections that need to send a small quantity of data.
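The permitted initial window can be checked with quick arithmetic. The min/max form below is the RFC 3390 formula; the function name is illustrative:

```python
# RFC 3390 initial congestion window, as described above: at most four
# segments or 4380 bytes, whichever is less
# (min(4*MSS, max(2*MSS, 4380)) in the RFC's formulation).
def initial_cwnd(mss):
    return min(4 * mss, max(2 * mss, 4380))

# For a standard Ethernet MSS of 1460 bytes, the connection may start
# with 4380 bytes in flight, i.e. three full-sized segments.
cwnd = initial_cwnd(1460)
segments = cwnd // 1460
```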
A single fragment dropped by the driver will cause an entire datagram to be unrecoverable. When the remote machine picks up the remaining fragments, they will be queued in its reassembly queue, according to the IP protocol. If this happens frequently, the entire IP reassembly queue on the receiving side will be exhausted. This, in turn, would result in good packets being dropped because of the full buffer on the receiving side. To mitigate this problem, HP-UX uses Packet Trains.
3 Advanced Out of the Box Scalability and Performance Features
The HP-UX Networking Stack has been engineered for best scalability and performance on high-end servers. It can gracefully scale up from a few processors to 256 processors, and from 10BaseT to 10 Gigabit Ethernet.
tunable will be provided in a future patch. This may be useful in cases described below where specific conditions make the TOPS default less than optimal. Refer to Table 2 (at the end of Appendix B) for the patch level information for the ndd tunable socket_enable_tops. It should not be necessary to disable TOPS. However, there are cases where the scalability issue addressed by TOPS does not exist.
modules pushed on the DLPI stream create or modify the modules to operate at the NOSYNC synchronization level so that the NOSYNC performance gain is not lost. For more details about writing a NOSYNC module/driver refer to: STREAMS/UX Programmer's Guide, available at http://docs.hp.com/en/netcom.html#STREAMS/UX Patch level information for the NOSYNC feature: 11i v1: • STREAMS: PHNE_35453 or higher • ARPA Transport: PHNE_35351 or higher • DLPI: PHNE_33704 or higher • IPFilter: A.03.05.
eliminating points of contention to allow more parallelism in TCP/IP processing, HP-UX has eliminated many causes of delay in the kernel, even when the system is under extreme load. In addition, the Detect and Strobe feature will be activated if the incoming traffic is more than the system can handle.
3.3.3 Responsiveness Tuning
The cost of providing responsiveness for the overall system in the case of packet storms is that incoming network interrupts can be delayed or even dropped.
3.4.1 Configuration Scenario for Interrupt Migration A significant amount of network protocol processing for inbound packets is done as part of the interrupt from the network interface. In order to avoid a CPU bottleneck when there is heavy network traffic, Interrupt Migration can be used to move interrupts away from heavily-loaded processors.
4 Improving HP-UX Server Performance
4.1 Tuning Application and Database Servers
Many enterprise applications today are architected and built using the J2EE framework, which is designed for the mainframe-scale computing typical of large enterprises. The J2EE framework provides a way to architect solutions that are distributed, multi-tiered, and scalable. The diagram below shows an overview of the multi-tiered J2EE application architecture.
[Diagram: multi-tiered J2EE architecture in an enterprise data center. Traffic from the Internet or leased lines (http/https) passes through the Open Zone, a firewall and load balancer, to Web servers in the DMZ, and on to application and database servers in the MZ.]
4.1.1 Tuning Application Servers
The network traffic characteristics of a physical server used as an application server vary based on its usage context and the nature of the applications (business logic) it runs. Web applications are normally implemented using technologies such as servlets and JSP scripts.
number of TCP connections, as in the case with Web servers, may result in a large number of connections staying in the TIME_WAIT state before getting closed. Application server vendors typically suggest tuning the parameter related to TCP's TIME_WAIT timer. With the default value of 60 seconds for tcp_time_wait_interval on HP-UX, the HP-UX stack can track literally millions of TIME_WAIT connections with no particular decrease in performance and only a slight cost in terms of memory.
4.1.1.6 tcphashsz
tcphashsz controls the size of several hash tables maintained within the kernel. Larger tables give better performance when there are a large number of concurrent connections in the system, at the expense of more memory. On modern-day servers, memory may not be a major constraint. When Web servers and application servers are run on the same physical machine, the suggested value for this tunable parameter is 32768.
4.2 Tuning Web Servers
As the demand for faster and more scalable web service increases, it is desirable to improve web server performance and scalability by integrating web server functionality into the operating system. Web server workloads are characterized by many short-lived connections, opened and closed at a very fast rate. In a busy web server environment, there could be tens of thousands of TCP connections per second.
4.2.1.1.2 Multiple Web Servers with Partitioned Content High-traffic Web sites typically feature multiple servers that are dedicated for specific purposes. A given set of servers, for example, may serve specific content such as images, advertisements, audio, or video. Dedicating servers to specific content types limits the total working set that must be delivered by any single server and allows the server's hardware configuration to be tailored to its content.
The max_uri_page_size is specified in bytes. For example, the command nsahttp -m 2097152 causes NSA HTTP to cache only web pages of 2 MB or less.
4.2.1.3 Performance Data
A simulated web server environment was used to measure the performance of NSA HTTP. The workload was a mix of static content (70%) and dynamic content (30%). The measurements were taken using Web servers that implement copy avoidance when servicing static requests. The performance improvement was about 13-17%.
socket_caching_tcp tunable controls both IPv4 and IPv6 TCP connections. Please refer to the ndd help text for more information. The ndd help text for socket_caching_tcp may be obtained by executing the following command: # ndd -h socket_caching_tcp The number of elements to be cached for optimal performance depends upon the frequency of open/close and on how the number of simultaneous connections changes over time.
has been changed to 0. A value of 0 (default) will auto-tune tcphashsz in proportion to the number of cores in the system during system bootup. The minimum value is 0, and the maximum is 65536. If this tunable is set in the range of 1 to 255, it will be increased to 256. It is recommended that tcphashsz be set to 0 so that the system can choose an optimal value.
4.2.4 Tuning the listen queue limit
The listen queue limit can affect Web server performance.
On the other hand, if the system is found to be operating at full capacity, and yet clients are not being turned away, and if a high proportion of those clients experience long waiting times before service, then one may consider one of the following courses of action:
• Limit the number of requests that are accepted into the listen queue, and cause the excess requests to be turned away.
• Upgrade the system hardware to provide extra capacity.
Note however that the depth of the listen queue and the rate of dropped requests represent the balance between the rate at which requests arrive and the rate at which the server is able to accept connections. For information on implementing server programs to make effective use of the listen backlog, please refer to section 5.4.
4.2.5 Using MSG_EOF flag for TCP Applications
The MSG_EOF feature improves TCP application performance by piggybacking the FIN segment on the last data.
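MSG_EOF is an HP-UX sockets extension and is not portable. As a rough sketch of the effect, the portable pattern below sends the final data and then shuts down the sending side, which generates the FIN as a separate step rather than piggybacked on the last data segment:

```python
import socket

# Portable illustration (not HP-UX C): MSG_EOF would attach the FIN to the
# final send(). The portable equivalent is a send of the last data followed
# by shutdown(SHUT_WR), which emits the FIN separately.
a, b = socket.socketpair()          # stands in for a connected TCP pair

a.sendall(b"last chunk of data")    # with MSG_EOF this send would carry FIN
a.shutdown(socket.SHUT_WR)          # portable: FIN sent as a separate action

data = b.recv(4096)                 # peer reads the final data...
eof = b.recv(4096)                  # ...then sees end-of-file (b"")
a.close(); b.close()
```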
4.3 Tuning Servers in Wireless Networks
Cellular mobile wireless networks are characterized by long latency, a large bandwidth-delay product, and volatile delays. Because cellular wireless networks have a long latency of a few hundred milliseconds to a few seconds, their bandwidth-delay product is large, especially for 3G wireless and beyond. Therefore, the TCP window size needs to be set sufficiently large in a wireless network environment.
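As a hedged illustration with assumed numbers (a 384 kbit/s 3G-class link and a 1 second RTT; real figures vary widely), the bandwidth-delay product already exceeds a 32768-byte window:

```python
# Bandwidth-delay product: the amount of data that must be "in flight"
# to keep the link full. The numbers below are assumptions for illustration.
bandwidth_bps = 384_000   # assumed 3G-class link rate, bits per second
rtt_s = 1.0               # assumed round-trip time, seconds

bdp_bytes = bandwidth_bps / 8 * rtt_s
# 48000 bytes must be in flight, more than a 32768-byte default window
```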
avoids additional unnecessary retransmissions and accelerates the recovery of the cwnd that is shrunk to one segment by the timeout. F-RTO is disabled by default. To enable F-RTO, the following ndd tunable is provided: tcp_frto_enable. The valid values for tcp_frto_enable are:
0  The local system does not use F-RTO. This is the default value.
1  The local system uses the F-RTO algorithm for detecting and responding to spurious timeouts.
5 Tuning Applications Using Programmatic Interfaces
5.1 sendfile()
The sendfile() system call allows the contents of a file to be transmitted directly over a TCP connection, without the need to copy data to and from the calling application's buffers. This provides a zero-copy path for sending data to the remote side. Refer to the sendfile(2) manpage for details on the syntax and usage of sendfile.
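For illustration, Python exposes the same zero-copy idea through socket.sendfile(). This portable sketch is not the HP-UX sendfile(2) interface, but it shows the call pattern of transmitting file contents over a socket without copying them through application buffers:

```python
import os, socket, tempfile

# Create a small file to transmit (contents are illustrative).
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"hello from a file\n")
    path = f.name

a, b = socket.socketpair()          # stands in for a connected TCP pair
with open(path, "rb") as src:
    sent = a.sendfile(src)          # kernel moves file data to the socket
a.shutdown(socket.SHUT_WR)

received = b.recv(4096)             # peer reads the transmitted contents
a.close(); b.close()
os.unlink(path)
```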
/dev/poll driver and registers a set of file descriptors that it wants to monitor, along with the set of events it wants to monitor for those file descriptors. Then it can issue a DP_POLL ioctl() call on the event port driver to check which events have occurred on the registered file descriptors. On return, the DP_POLL ioctl() specifies the file descriptors (if any) that have events pending and which events have occurred.
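The register-then-wait structure of /dev/poll and DP_POLL can be sketched with Python's portable select.poll interface; the HP-UX-specific device and ioctl names do not appear in this illustration:

```python
import os, select

# Portable sketch of the /dev/poll usage pattern: register descriptors
# once, then repeatedly ask which of them have pending events.
r, w = os.pipe()

poller = select.poll()
poller.register(r, select.POLLIN)   # analogous to writing pollfds to /dev/poll

os.write(w, b"x")                   # make the read end readable
events = poller.poll(1000)          # analogous to the DP_POLL ioctl

# events is a list of (fd, eventmask) pairs for ready descriptors
os.close(r); os.close(w)
```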
Generally speaking, the value set by the application for the SO_SNDBUF size will be an approximate limit on the sum of the values (a) and (b) above, and the SO_RCVBUF size will be an approximate limit on the sum of the values of (b) and (c) above.
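A minimal sketch of how an application sets these per-socket limits follows. Note that the kernel may round or clamp the requested sizes (Linux, for instance, doubles them), so the values read back are approximate:

```python
import socket

# Set the socket send and receive buffer limits discussed above.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 65536)
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 65536)

# Read back the effective values; the kernel may have adjusted them.
sndbuf = s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
rcvbuf = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
s.close()
```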
used over a range of workload conditions, over a range of systems with various performance characteristics. Therefore it is highly beneficial to design such programs so that the choice of listen backlog value can either be configured directly, or adjusted indirectly as a result of some parameters which are based on the workload to be handled by the application.
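A sketch of that recommendation: the backlog is taken from a configurable parameter (the BACKLOG name and its value are illustrative) rather than hard-coded:

```python
import socket

# Take the listen backlog from configuration instead of hard-coding it,
# so deployments can tune it to the workload. BACKLOG is a hypothetical
# setting that would normally come from a config file or environment.
BACKLOG = 4096

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("127.0.0.1", 0))     # ephemeral port for the illustration
srv.listen(BACKLOG)            # the kernel may silently cap this value
port = srv.getsockname()[1]
srv.close()
```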
6 Monitoring Network Performance There are many tools available on HP-UX for monitoring network statistics.
6.1.1 Monitoring TCP connections with netstat -an
Each socket results in a network connection. Use the netstat -an command to determine the state of your existing network connections.
0 embryonic connections dropped
4985775 segments updated rtt (of 4985775 attempts)
0 retransmit timeouts
0 connections dropped by rexmit timeout
0 persist timeouts
2 keepalive timeouts
0 keepalive probe sent
0 connections dropped by keepalive
0 connect requests dropped due to full queue
0 connect requests dropped due to no listener
0 suspect connect requests dropped due to aging
0 suspect connect requests dropped due to rate
Retransmitted segments are an indication of loss, delay, or reordering of segments
Important fields for the netstat -p ip command include the statistics for dropped fragments, which should have low values.
Inbound Discards             = 0
Inbound Errors               = 0
Inbound Unknown Protocols    = 881
Outbound Octets              = 10832833
Outbound Unicast Packets     = 132612
Outbound Non-Unicast Packets = 140
Outbound Discards            = 0
Outbound Errors              = 0
Outbound Queue Length        = 2
Specific                     = 655367

Ethernet-like Statistics Group

Index                        = 1
Alignment Errors             = 0
FCS Errors                   = 0
Single Collision Frames      = 326
Multiple Collision Frames    = 742
Deferred Transmissions       = 0
Late Collisions              = 0
Excessive Collisions         = 0
Internal MAC Transmit Errors = 0
If any CPU is saturated, it could potentially become a bottleneck for network performance, especially if the saturated CPU is handling the interrupts from the network interface. Additional profiling can be done using Caliper to identify hot spots in CPU utilization. 6.2.1.
• l1dcache: Provides miss rate information for the L1 data cache.
• l2cache: Provides miss rate information for the L2 cache.
• tlb: Provides metrics related to translation lookaside buffer (TLB) misses.
6.2.2.
aspect of paging is that it can cause external fragmentation of memory. This fragmentation leaves fewer large pages available for use by applications.
6.2.4 Monitoring Memory utilization using vmstat
The vmstat(1) command reports certain statistics kept about process, virtual memory, trap, and CPU activity.
6.3.1 Measuring Throughput with Netperf Bulk Data transfer
The netperf bulk data transfer feature is useful for verifying throughput and measuring available bandwidth between two hosts. On remote host B, start the netserver:
# netserver
Starting netserver at port 12865
Run netperf from host A:
# netperf -H hpipxprlan2 -l 60 -t TCP_STREAM -- -m 32768
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to hpipxprlan2 (192.168.138.
6.3.3 Key issues for throughput with Netperf traffic Common issues for throughput for netperf transfer are the following: • Throughput is limited by the speed of the slowest link in the path from host A to host B. • The transaction rate is limited by the network latency between host A and host B. • High CPU utilization can also be the limiting factor for bulk transfer throughput, or can result in additional latency for request/response transactions. 6.
Appendix A: Annotated output of netstat -s (TCP, UDP, IP, ICMP)
This chapter contains the annotated output of the netstat -s command run on an HP-UX 11i system. The following annotated output is from a test system, for illustration only; the output shown does not necessarily indicate any particular problem or issue.
TCP: The TCP statistics can be retrieved by themselves with the command netstat -p tcp.
tcp:
612755429 packets sent
This is the total number of packets sent by TCP since boot.
milliseconds. A 1% retransmission rate implies that one packet out of 100 is being retransmitted. An ssh session is not necessarily going to have "fast" retransmits, and the default value for tcp_rexmit_interval_min is 500 milliseconds. This means that the weighted average for keystroke round-trip time (i.e. echo) is going to be: 0.99 * 10 milliseconds + 0.01 * 500 milliseconds, which works out to 14.9 milliseconds; that 1% retransmission rate is adding nearly 50% to the average response time.
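The weighted-average arithmetic above can be reproduced directly:

```python
# 99% of echoes complete at the 10 ms network RTT; 1% are delayed by the
# 500 ms tcp_rexmit_interval_min floor before being retransmitted.
avg_ms = 0.99 * 10 + 0.01 * 500
increase = (avg_ms - 10) / 10     # fractional increase over the raw RTT
# avg_ms is 14.9 ms, a 49% increase over the 10 ms base
```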
66 window probe packets A window probe is sent after a remote TCP has advertised a window of zero bytes for some period of time. The intent of the window probe is to solicit a window-update from the receiving TCP. If the receiving TCP advertised a zero window, it means that the receiving application stopped reading data from its end of the connection.
As with the sending side, you can compute the average inbound segment size by dividing the octet count by the segment count. 2 completely duplicate packets (2920 bytes) A completely duplicate segment, while not fatal, does indicate a slight problem with either the network or the remote stack. One of the likely scenarios of a completely duplicate segment is a failure in the remote stack's Round Trip Time (RTT)/Retransmission Timeout (RTO) mechanisms.
If the cause is application handshake failure then the applications should be inspected to make sure that the handshake is either repaired or the effects of the handshake will not cause data loss or corruption. 0 segments discarded for bad checksum On a private network, or intranet, this value should be very, very small. It may be larger for an Internet connected system. Even then however, the value should be a very small fraction of the total number of TCP segments received.
6144 retransmit timeouts This is the number of times TCP's retransmission timer has expired on a connection or connections. When this timer expires a TCP segment will be retransmitted. In general, the Retransmission Timeout (RTO) estimator in HP-UX 11i is rather robust, which means the only time there would be spurious timeouts would be when someone has "tuned" tcp_rexmit_interval_* to bogus values.
It is also possible that those settings are good, but one or more applications have stopped calling accept() against their listen socket(s) - perhaps they are saturated, or perhaps they are caught in an infinite loop. 1585 connect requests dropped due to no listener This is the number of connection requests (SYN segments) dropped because there was no local endpoint in the LISTEN state.
There are two most common ways to identify the overflowing UDP socket:
• ndd -get /dev/ip ip_udp_status lists the UDP fanout table, and one of the columns is "overflows". If the UDP port is long-lived, this will tell which UDP socket is experiencing the overflows.
• Collect a network trace during one of the overflows and look for an ICMP SOURCE QUENCH packet. If the SOURCE QUENCH was generated because of a UDP socket overflow, the SOURCE QUENCH packet will tell you which UDP socket is overflowing.
The only real "fix" to this is to migrate to IPv6 where the ID field is much larger. 0 packets forwarded This is the number of IP datagrams the system has forwarded. If the system is not intended to be an IP router, and this value is non-zero, it means that one or more other systems are using this system as a router, and that someone forgot to set ip_forwarding to zero with ndd on this system. Two configuration problems need to be fixed.
ICMP:
219220 calls to generate an ICMP error message
215278 ICMP messages dropped
Output histogram:
echo reply: 1134
An "echo reply" will be sent in response to an ICMP Echo Request, aka the type of packet traditionally sent by the ping utility.
destination unreachable: 2804
source quench: 0
One can use ndd to set ip_send_source_quench to zero; see Appendix B.
routing redirect: 0
echo: 0
This is the number of "ping" (aka ICMP Echo) requests the system has sent.
ICMPv6:
33 calls to generate an ICMPv6 error message
8 ICMPv6 messages dropped
12 ICMPv6 error messages dropped for rate control
In IPv6, in order to limit the bandwidth and forwarding costs incurred by originating ICMPv6 error messages, an IPv6 node limits the rate of ICMPv6 error messages it originates. This situation may occur when a source sending a stream of erroneous packets fails to heed the resulting ICMPv6 error messages.
router solicitation: 4
router advertisements: 0
neighbor solicitation: 48
neighbor advertisement: 20
These are the statistics related to the Neighbor Discovery (ND) protocol.
redirect: 0
An ICMPv6 redirect message is sent by a router to inform a host of a better first-hop node to reach a particular destination.
group query: 0
group response: 72
group reduction: 2
These are the statistics related to the Multicast Listener Discovery protocol.
Appendix B: Annotated output of ndd -h and discussions of the TCP/IP tunables
Proper tuning can result in the operating system using network bandwidth and system resources like CPU and memory more efficiently, providing more of these resources to applications. In addition, in many cases, network performance can be greatly improved by removing conditions which lead to long protocol-based delays.
ip_forward_directed_broadcasts: Set to 1 to have IP forward subnet broadcasts. Set to 0 to inhibit forwarding. [0,1] Default: 1 A directed broadcast datagram has the broadcast IP address of a remote IP subnet as its destination IP address. Directed broadcasts will only be forwarded if ip_forward_directed_broadcasts is set to one and ip_forwarding is set to one or two. If either ip_forward_directed_broadcasts or ip_forwarding is set to zero, directed broadcasts will not be forwarded.
exacerbate the problem. If the parameter is set to a value that is large enough that IP wraps packet sequence numbers (IP starts to re-use its sequence numbers) while holding fragments for reassembly, it is possible that IP will assemble a packet with fragments from different packets. In this case, the problem will be detected only if the upper-layer is validating data integrity (using checksums). With a 10 MBit/second link and a 1500-byte MTU, IP sequence numbers may wrap within approximately 80 seconds.
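The approximately-80-second figure follows from the 16-bit IPv4 identification field and the packet rate of the link:

```python
# The 16-bit IPv4 identification field wraps after 65536 packets.
# A 10 Mbit/s link moves one 1500-byte packet every 1.2 ms.
packet_time_s = 1500 * 8 / 10e6        # seconds per full-MTU packet
wrap_time_s = 65536 * packet_time_s
# roughly 78.6 seconds, i.e. approximately 80 seconds
```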
ip_ipif_status: Display a report of all allocated logical interfaces. A logical interface is created whenever one adds IP addresses to the one already assigned to a "physical" interface. These are also sometimes referred to as aliased addresses and are given names such as lan0:1, lan0:2 and so on. These exist only in the "mind" of IP. Neither the NIC driver nor DLPI knows of its existence.
ip_ire_gw_probe_interval: Controls the probe interval for Dead Gateway Detection. IP periodically probes active and dead gateways. ip_ire_gw_probe_interval controls the frequency of probing. With retries, the maximum time to detect a dead gateway is ip_ire_gw_probe_interval + 10000 milliseconds. Maximum time to detect that a dead gateway has come back to life is ip_ire_gw_probe_interval.
recommendations for gateways in RFC 1191, then the next hop MTU will be included in the "Fragmentation Needed" message, and IP will use it. If the gateway does not provide next hop information, then IP will reduce the MTU to the next lower value taken from a table of "popular" media MTUs. [0,3] Default: 1 Setting the value to one will mean that IP datagrams will always have the DF bit set.
If the system is forwarding IP datagrams, and it is asked to forward a datagram when it knows there is a "better" route, it will send an ICMP "Redirect" message to the source of the datagram to tell it the better route. The system will still forward the datagram. If the value of ip_send_redirects is set to zero, the system will still forward the wayward datagram, but it will not tell the source that there is a better way for it to send its datagrams.
IPv6 Tunables ip6_def_hop_limit: Sets the default value of the Hop Limit field in the IPv6 header. [1,255] Default: 64 The Hop-Limit field of the IPv6 header is used to ensure that an IPv6 datagram eventually "dies" on the network. Each time an IPv6 datagram goes through an IPv6 router, the Hop-Limit is decremented by one hop. When an IPv6 datagram's Hop-Limit reaches zero, it is discarded.
possible that IPv6 will assemble a packet with fragments from different packets. In this case, the problem will be detected only if the upper layer is validating data integrity (using checksums). The fragment identifier field in IPv6 is 32 bits, allowing a much larger range of values than the 16-bit field in IPv4, and greatly reducing the possibility of wrap within the reassembly timeout interval. With a Gigabit Ethernet link, IPv6 fragment identifiers may wrap within approximately 52000 seconds.
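The approximately-52000-second figure follows the same arithmetic with the 32-bit IPv6 fragment identifier:

```python
# The 32-bit IPv6 fragment identifier wraps after 2**32 packets.
# Gigabit Ethernet moves a 1500-byte packet in 12 microseconds.
packet_time_s = 1500 * 8 / 1e9
wrap_time_s = 2**32 * packet_time_s
# roughly 51540 seconds (over 14 hours), i.e. approximately 52000 seconds
```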
in IPv6 for all non-local destinations to improve routing lookup. None of the routes from the netstat -rn output will be deleted.
ip6_ire_hash: Displays a report of all routing table entries, in the order searched when resolving an IPv6 address. The IPv6 Internet Routing Entry (IRE6) is the primary data structure that links IPv6 addresses with particular interfaces, attached networks, gateways, and local and remote hosts. The corresponding data structure for IPv4 is an IRE.
datagram fragments being dropped, you might consider increasing this value. Before you do, check the frequency of such drops so you can compare it with the frequency after you make the change. If the frequency remains the same after you have increased the value, it implies that the IPv6 fragment drops were the result of perceived duplicates and not because there was not enough space.
to set the random timer interval for flushing neighbor cache entries. The Reachable Time is the time a neighbor is considered reachable after receiving a reachability confirmation. The Reachable time value is a uniformly-distributed random value between "ip6_min_random_factor" and "ip6_max_random_factor" times "ip6_ire_reachable_interval" milliseconds. [5000, -] Default: 30000 (30 sec) It is rare that the following IPv6 values should need to be changed.
expire in "ip6_nd_probe_delay" seconds. If the entry is still in the DELAY state when the timer expires, the entry's state changes to PROBE. If reachability confirmation is received, the entry's state changes to REACHABLE. [5000, -] Default: 5000 (5 sec) ip6_nd_transmit_interval: The Neighbor Discovery constant RETRANS_TIMER, Section 10 of RFC 2461.
Sockets Tunables socket_buf_max: Specifies the maximum socket buffer size for AF_UNIX sockets. [1024,2147483647] Default: 262144 bytes socket_caching_tcp: Enables or disables socket caching for TCP sockets for AF_INET and AF_INET6 address families. This value determines how many data structures for TCP sockets the system caches per CPU for each address family. Enabling this feature can improve system performance considerably if the system uses many short-lived connections.
socket_qlimit_max: Sets maximum number of connection requests for non-AF_INET sockets. [1-2147483647] Default: 4096 socket_udp_rcvbuf_default: Sets the default receive buffer size for UDP sockets. The value of this tunable parameter should not exceed the value of the ndd parameter udp_recv_hiwater_max. Otherwise a socket() call to create UDP socket will fail and return the errno value EINVAL. [1-2147483647] Default: 65535 socket_udp_sndbuf_default: Sets the default send buffer size for UDP sockets.
following formula: min((4 * MSS), max(2 * MSS, 4380)) where MSS is the maximum segment size for the underlying link. With the new congestion window formula, it is possible for TCP to send a large, initial block of data without waiting for acknowledgements. This is useful in networks with large bandwidth and low error rates and particularly useful for short-lived connections that only need to send ~4Kbytes of data or less.
The other events that control ACK generation are the following:
• Arrival of an out-of-order segment
• Time to send a window update because the application has consumed enough data
Using delayed ACKs has very little impact on transmission reliability since ACKs are cumulative. Furthermore, delayed ACKs conserve resources by decreasing the load on the network and on the CPUs that must generate and process these ACK segments. Delayed ACKs have been found to have a positive impact on the performance of bulk transfer.
According to the TCP protocol specification, the remote TCP should flush its receive queue when it receives the RESET. This may cause data to be lost. [0-2147483647 Milliseconds] Default: 0 (indefinite) The tcp_fin_wait_2_timeout parameter controls a timer that can be used to terminate connections in the FIN_WAIT_2 state. This should only be used in those cases where the tcp_keepalive_detached_interval mechanism is known to not work.
tcp_ip_abort_interval: Second threshold timer for established connections. When it must retransmit packets because a timer has expired, TCP first compares the total time it has waited against two thresholds, as described in RFC 1122, 4.2.3.5. If it has waited longer than the second threshold, TCP aborts the connection. For best results, do not set this parameter lower than tcp_time_wait_interval.
tcp_ip_ttl: TTL value inserted into IP header for TCP packets only. [1, 255] Default: 64 A default value of 64 means that TCP will not communicate with any system that is more than 64 hops (routers) away. This should be sufficient for 99.999% of all cases. However, increasing this value to 255 would have no downside in a non-error case.
tcp_keepalive_interval: Interval for sending keep-alive probes. If any activity has occurred on the connection or if there is any unacknowledged data when the time-out period expires, the timer is simply restarted. If the remote system has crashed and rebooted, it will presumably know nothing about this connection, and it will issue an RST in response to the ACK. Receipt of the RST will terminate the connection.
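Keep-alive probing is opted into per socket with SO_KEEPALIVE; the probe interval itself remains the system-wide tcp_keepalive_interval tunable. A minimal sketch:

```python
import socket

# Opt a socket into the keep-alive mechanism described above. The probe
# timing is controlled system-wide (on HP-UX, by tcp_keepalive_interval);
# SO_KEEPALIVE merely enables probing for this socket.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
enabled = s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)
s.close()
```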
A window size of 32768 bytes is enough to allow 10 Mbit/s of throughput out to an RTT of roughly 25 milliseconds. It would allow 100 Mbit/s of throughput out to roughly 2.5 milliseconds, and 1000 Mbit/s of throughput out to roughly 0.25 milliseconds. Typically, the round-trip time on a local LAN is <= 1 millisecond. For a terrestrial (no satellites) link across the continental US, the RTT is anywhere between 30 and 100 milliseconds, though it can be higher.
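These window/RTT figures can be reproduced with the usual throughput = window / RTT arithmetic:

```python
# Maximum throughput sustainable by a fixed window: window / RTT.
def window_throughput_mbps(window_bytes, rtt_s):
    return window_bytes * 8 / rtt_s / 1e6

t10 = window_throughput_mbps(32768, 0.025)     # ~10.5 Mbit/s at 25 ms RTT
t100 = window_throughput_mbps(32768, 0.0025)   # ~105 Mbit/s at 2.5 ms RTT
```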
preventing TCP from being able to back-off its retransmission timer far enough to be at or above the actual round-trip time of the network. tcp_rexmit_interval_min: Lower limit for computed round trip time-out. Unless you know that all TCP connections from the system are going through links where the RTT is greater than 500 milliseconds, and are also _highly_ variable in their RTTs you should not increase this value.
It is very unlikely that a value of zero (0) would ever be indicated. One unlikely case would be when one knows that severely bandwidth-constrained links are in use and the additional bytes of the SACK option would limit effective bandwidth. tcp_sth_rcv_hiwat: If nonzero, sets the Stream-head flow control high water mark. [0,128000] Default: 0 The stream head flow control high water mark is set to the larger of tcp_sth_rcv_hiwat or the receive window of the connection.
The HP-UX TCP stack can track literally millions of TIME_WAIT connections with no particular decrease in performance and only a slight cost in terms of memory. So, it should almost never be the case that you need to decrease this value from its default of 60 seconds. tcp_ts_enable: RFC 1323 defines a timestamps option that can be sent with every segment.
sequence number space within 2MSL. The RFC-suggested value for 2MSL is four minutes, so timestamps should be used whenever a TCP connection will run at sustained rates of more than 1 GB per minute, which is ~17 MB/s or ~145 Mbit/s. Admittedly, this is rather conservative. tcp_tw_cleanup_interval: Interval in milliseconds between checks to see whether TCP connections in TIME_WAIT have reached or exceeded tcp_time_wait_interval.
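The 1 GB-per-minute threshold falls out of the 32-bit sequence space: without timestamps, a connection must not wrap 2^32 bytes of sequence numbers within 2MSL (four minutes by the RFC's suggestion). A rough check of that arithmetic:

```python
SEQ_SPACE = 2 ** 32   # bytes covered by TCP's 32-bit sequence numbers
TWO_MSL = 4 * 60      # RFC-suggested 2MSL, in seconds

# Sustained rate at which sequence numbers would wrap within 2MSL.
wrap_rate_mb_s = SEQ_SPACE / TWO_MSL / 1e6
wrap_rate_mbit_s = wrap_rate_mb_s * 8

print(round(wrap_rate_mb_s, 1))    # ~17.9 MB/s
print(round(wrap_rate_mbit_s))     # ~143 Mbit/s
```

Any connection expected to sustain rates near or above this figure should have timestamps (and thus PAWS) enabled.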
A setsockopt() call with an SO_SNDBUF option that exceeds the corresponding kernel parameter value will fail and return the errno value EINVAL. A t_optmgmt() call with an XTI_SNDBUF option that exceeds the corresponding kernel parameter value will fail and return the t_errno value TBADOPT. [1024-2147483647] Default: 2147483647 bytes This tunable can be used to limit the maximum value for tcp_xmit_hiwater_* and for the value passed in via setsockopt() for SO_SNDBUF (similarly XTI_SNDBUF for t_optmgmt()).
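The failure mode described above can be probed from any sockets program. A minimal Python sketch, assuming the HP-UX behavior described here; note that the EINVAL failure for oversized requests is stack-specific, and other implementations may silently clamp the request instead:

```python
import errno
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    # Request a 64 KB send buffer; on HP-UX a request above the kernel
    # limit (tcp_xmit_hiwater_max) fails with EINVAL instead of clamping.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 65536)
except OSError as e:
    if e.errno == errno.EINVAL:
        print("SO_SNDBUF request exceeds the kernel limit")
    raise
else:
    # Read back what the kernel actually granted.
    print("granted:", s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
finally:
    s.close()
```

Checking the granted value with getsockopt() after a successful call is good practice on any platform, since the kernel may round or scale the request.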
udp_largest_anon_port: Largest port number to use for anonymous bind requests. [1024,65535] Default: 65535 This is analogous to the tcp_largest_anon_port tunable. udp_status: Obtain a UDP information report similar to "netstat -an". Requests for this report through concurrent executions of ndd are serialized through a semaphore. Hence, invoking the udp_status report through ndd may appear to hang if another ndd instance is already generating a tcp_status or udp_status report on the system.
Table 1: Summary of TCP/IP Tunables
The following TCP/IP tunables may be queried or set using ndd(1M). All tunables are global, i.e., they affect all TCP/IP connections. Note that some tunables take effect immediately, while others, used to initialize TCP/IP connections, affect only newly opened connections. See Appendix B for the detailed description of these tunables.

Tunable Name           Description                                     Reference
tcp_conn_request_max   Max number of outstanding connection requests   4.1.1.2, 4.2
Tunable Name              Description                                       Reference
tcp_rexmit_interval_max   Upper limit for computed round trip timeout       Appendix A
tcp_rexmit_interval_min   Lower limit for computed round trip timeout       4.3.1, Appendix A
tcp_sack_enable           Enable TCP Selective Acknowledgement (RFC 2018)   2.
Tunable Name                Description                                      Reference
…searched when resolving an address
ip_ire_status               Displays all routing table entries
ip_ire_cleanup_interval     Timeout interval for purging routing entries
ip_ire_flush_interval       Routing entries deleted after this interval
ip_ire_gw_probe             Enable dead gateway probes
ip_ire_gw_probe_interval    Probe interval for Dead Gateway Detection
ip_ire_pathmtu_interval     Controls the probe interval for PMTU
ip_pmtu_strategy            Controls the Path MTU Discovery strategy         Appendix A
Tunable Name                     Description
ip6_nd_advertise_count           Controls the ND MAX_NEIGHBOR_ADVERTISEMENT
ip6_nd_dad_solicit_count         Controls the number of duplicate address detection
ip6_nd_multicast_solicit_count   Controls the ND MAX_MULTICAST_SOLICIT
ip6_nd_probe_delay               Controls the ND DELAY_FIRST_PROBE_TIME
ip6_nd_transmit_interval         Controls the ND RETRANS_TIMER
ip6_nd_unicast_solicit_count     Controls the ND MAX_UNICAST_SOLICIT
ip6_rd_solicit_count             Controls the ND MAX_RTR_SOLICITATIONS
ip6_rd_solicit_delay             Con
Table 2: Operating System Support for TCP/IP Tunables
Table 2 indicates which versions of the operating system support the TCP/IP tunables described in this document.
• An * (asterisk) specifies that the version supports the tunable and does not require any patch.
• A - (dash) specifies that the version does not support the tunable.
Tunable Name              11i v1   11i v2   11i v3
tcp_rexmit_interval_min   *        *        *
tcp_sack_enable           *        *        *
tcp_smoothed_rtt          *        *        *
tcp_sth_rcv_hiwat         *        *        *
tcp_sth_rcv_lowat         *        *        *
tcp_syn_rcvd_max          *        *        *
tcp_status                *        *        *
tcp_time_wait_interval    *        *        *
tcp_ts_enable             *        *        *
tcp_tw_cleanup_interval   *        *        *
tcp_xmit_hiwater_def      *        *        *
tcp_xmit_hiwater_lfp      *        *        *
tcp_xmit_hiwater_lnp      *        *        *
tcp_xmit_hiwater_max      *        *        *
tcp_xmit_lowater_def      *        *        *
tcp_xmit_lowater_lfp      *        *        *
tcp_xmit_l
Tunable Name               11i v1             11i v2   11i v3
ip_pmtu_strategy           *                  *        *
ip_reass_mem_limit         *                  *        *
ip_send_redirects          *                  *        *
ip_send_source_quench      *                  *        *
ip_strong_es_model         *                  *        *
ip6_def_hop_limit          IPv6NCF11i depot   *        *
ip6_fragment_timeout       IPv6NCF11i depot   *        *
ip6_icmp_interval          IPv6NCF11i depot   *        *
ip6_ill_status             IPv6NCF11i depot   *        *
ip6_ipif_status            IPv6NCF11i depot   *        *
ip6_ire_cleanup_interval   IPv6NCF11i depot   *        *
ip6_ire_hash               IPv6NCF11i depot   *        *
ip6_ire_pathmtu_interval   IPv6NC
Tunable Name                11i v1                             11i v2                             11i v3
socket_buf_max              *                                  *                                  *
socket_caching_tcp          *                                  *                                  *
socket_enable_tops          Patch Level PHNE_33159 or higher   Patch Level PHNE_33798 or higher   Patch Level PHNE_36281 or higher
socket_msgeof               -                                  -                                  Patch Level PHNE_36281 or higher
socket_qlimit_max           *                                  *                                  *
socket_udp_rcvbuf_default   *                                  *                                  *
socket_udp_sndbuf_default   *                                  *                                  *
Revision History
Periodically, this document is updated as new information becomes available. This document's revision history is as follows:
Version 1.0 – August, 2007
Version 1.1 – March, 2008

© 2008 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. The only warranties for HP products and services are set forth in the express warranty statements accompanying such products and services.