HP-UX TCP/IP Performance White Paper, March 2008

3.4.1 Configuration Scenario for Interrupt Migration

A significant amount of network protocol processing for inbound packets is done as part of the interrupt

from the network interface. In order to avoid a CPU bottleneck when there is heavy network traffic,

Interrupt Migration can be used to move interrupts away from heavily-loaded processors. Examples of this

load balancing could be to configure two busy network interfaces to interrupt separate processors, or to

schedule network interrupts away from a processor which is busy with unrelated application processing.

In the case of an IP subnet configured using Auto Port Aggregation (APA), maximum throughput can be

achieved by assigning interrupts for each interface in the aggregate to a separate processor.
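
The assignment described above can be sketched as a small planning helper. This is only an illustration of the idea, not an HP-UX tool: the function name, the interface names (lan0, lan1, lan2) and the processor numbers are hypothetical, and the actual migration would still be carried out with the system's interrupt configuration facilities.

```python
def assign_interrupt_cpus(interfaces, cpus, busy_cpus=()):
    """Give each interface its own processor, skipping busy processors.

    Hypothetical helper: it only computes a plan, it does not move interrupts.
    """
    busy = set(busy_cpus)
    candidates = [cpu for cpu in cpus if cpu not in busy]
    if len(candidates) < len(interfaces):
        raise ValueError("not enough idle processors for one per interface")
    # zip() pairs each interface with a distinct candidate processor.
    return dict(zip(interfaces, candidates))


# Three aggregated interfaces, four processors, processor 1 busy with an
# unrelated application.
plan = assign_interrupt_cpus(["lan0", "lan1", "lan2"], cpus=[0, 1, 2, 3], busy_cpus=[1])
print(plan)
```

With these example inputs, every interface in the aggregate ends up on a separate processor and the busy processor receives no network interrupts.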

The 10 Gigabit Ethernet driver (ixgbe) for HP-UX provides load balancing through the destination-port-based

multiqueue feature. This allows multiple processors to be interrupted by the 10 Gigabit card, and

the incoming traffic can be separated into multiple flows based on the TCP destination port. Only TCP is

supported by the destination port multiqueue feature. This increases the maximum throughput of the 10

Gigabit card, which would otherwise be limited by the interrupt processing speed of a single CPU. The

"10GigEthr-00 (ixgbe) 10 Gigabit Ethernet Driver" release notes (http://docs.hp.com/en/J6379-90003/J6379-90003.pdf) explain the configuration of the multiqueue feature.
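
To make the idea of separating traffic by TCP destination port concrete, the following is a minimal sketch. The mapping used here (destination port modulo the number of queues) and the queue count are assumptions chosen for illustration; this paper does not state how the ixgbe driver actually maps ports to queues, so refer to the release notes for the real behavior.

```python
from collections import Counter

NUM_QUEUES = 4  # hypothetical number of receive queues, one per processor


def select_queue(dst_port, num_queues=NUM_QUEUES):
    """Pick a receive queue from the TCP destination port.

    Assumed mapping for illustration only; not the driver's documented algorithm.
    """
    return dst_port % num_queues


# All packets for one destination port land on the same queue, so a given
# flow is always processed by the same processor.
ports = [80, 443, 5001, 5002, 5003, 8080]
queue_load = Counter(select_queue(port) for port in ports)
for queue, flows in sorted(queue_load.items()):
    print(f"queue {queue}: {flows} flow(s)")
```

The point of the sketch is that traffic for different destination ports is spread over several queues, and therefore several processors, while each individual port stays on one queue.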

3.4.2 Cache Affinity Improvement

Network protocols are layered, and data and control structures are shared between these layers. When

these structures are brought into a processor's cache, less time is spent stalling for cache misses as the

remaining protocol layers process the packet. Since interrupts for a NIC are bound to a processor, there is

even a good possibility that some structures will still be in the correct processor's cache when the next

packet for a given connection arrives.

However, when an application receives the data, there is the possibility of additional cache misses, as the

HP-UX scheduler assigns application threads to processors independently of the interrupt bindings. To get

the most efficient operation from a cache standpoint, it is beneficial to have the interrupt assigned where

the busiest applications are consuming the data. Using mpctl(2) on a per-application basis, and

optionally defining processor sets, applications can be restricted to run on specific processors.

If restricting applications in this way does not result in a CPU bottleneck, then it is the most efficient arrangement both for the application and from a system-wide perspective.
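
The placement described above can be sketched as a planning step that pairs each application with the processor that handles interrupts for its network interface. The function, application names, interface names and processor numbers below are all hypothetical; the sketch only computes the desired placement, and the binding itself would be done with mpctl(2) or processor sets.

```python
from collections import Counter


def plan_bindings(interrupt_cpu, app_nic):
    """Return application -> processor, matching each application to the
    processor that takes interrupts for the interface it reads from."""
    return {app: interrupt_cpu[nic] for app, nic in app_nic.items()}


# Hypothetical interrupt bindings and application-to-interface usage.
interrupt_cpu = {"lan0": 2, "lan1": 3}
app_nic = {"webserver": "lan0", "backup": "lan1", "logger": "lan0"}

bindings = plan_bindings(interrupt_cpu, app_nic)
apps_per_cpu = Counter(bindings.values())

print(bindings)
# Processors hosting more than one application are candidates for a CPU
# bottleneck and may need to be rebalanced.
print([cpu for cpu, count in apps_per_cpu.items() if count > 1])
```

A processor that appears in the second list carries both interrupt processing and several applications, which is the case where cache affinity has to be weighed against the risk of a CPU bottleneck.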

On the other hand, there is little cache sharing between network interfaces, so there will be little benefit

from cache affinity if multiple network interfaces interrupt the same processor.