HP-UX TCP/IP Performance White Paper, March 2008

3.4.1 Configuration Scenario for Interrupt Migration
A significant amount of network protocol processing for inbound packets is done as part of the interrupt
from the network interface. In order to avoid a CPU bottleneck when there is heavy network traffic,
Interrupt Migration can be used to move interrupts away from heavily-loaded processors. Examples of this
load balancing could be to configure two busy network interfaces to interrupt separate processors, or to
schedule network interrupts away from a processor which is busy with unrelated application processing.
In the case of an IP subnet configured using Auto Port Aggregation (APA), maximum throughput can be
achieved by assigning interrupts for each interface in the aggregate to a separate processor.
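The load-balancing idea described above can be sketched as a small planning exercise. The following is a minimal illustration only, not an HP-UX tool: the function name, the interface names, and the load figures are hypothetical, and the greedy rule (busiest interface goes to the currently least-loaded processor) is an assumption chosen for simplicity.

```python
def assign_interrupts(interface_load, cpu_load):
    """Plan which processor each network interface should interrupt.

    interface_load: dict of interface name -> estimated interrupt load
                    (hypothetical units, e.g. percent of one CPU).
    cpu_load:       dict of CPU id -> existing application load (same units).
    Returns (plan, projected): plan maps interface -> CPU id, and
    projected maps CPU id -> load after the interrupts are placed.
    """
    projected = dict(cpu_load)  # do not modify the caller's data
    plan = {}
    # Place the busiest interfaces first; break ties by name for determinism.
    for nic, load in sorted(interface_load.items(), key=lambda kv: (-kv[1], kv[0])):
        # Pick the processor with the lowest projected load so far.
        cpu = min(projected, key=lambda c: (projected[c], c))
        plan[nic] = cpu
        projected[cpu] += load
    return plan, projected


if __name__ == "__main__":
    # Hypothetical system: CPU 0 is busy with unrelated application work.
    plan, projected = assign_interrupts(
        {"lan0": 40, "lan1": 35},
        {0: 70, 1: 5, 2: 10, 3: 0},
    )
    print(plan)
    print(projected)
```

In this hypothetical case the two busy interfaces end up on separate processors and neither is placed on the processor that is already busy with application work, which matches the two examples given in the text.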
The 10 Gigabit Ethernet driver (ixgbe) for HP-UX provides load balancing through the destination-port-based
multiqueue feature. This allows multiple processors to be interrupted by the 10 Gigabit card, with
the incoming traffic separated into multiple flows based on the TCP destination port. Spreading the flows
across processors increases the maximum throughput of the 10 Gigabit card, which would otherwise be limited
by the interrupt processing speed of a single CPU. Only TCP is supported by the destination-port multiqueue
feature. The "10GigEthr-00 (ixgbe) 10 Gigabit Ethernet Driver" release notes
(http://docs.hp.com/en/J6379-90003/J6379-90003.pdf) explain the configuration of the multiqueue feature.
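To illustrate how separating traffic by TCP destination port spreads interrupt work, the sketch below models the idea in a few lines. It is not the ixgbe driver's algorithm: the modulo mapping, the function names, and the port numbers are assumptions made for illustration, and the real mapping is described only in the release notes.

```python
def queue_for_segment(dst_port, num_queues):
    """Map a TCP destination port to a receive queue.

    ASSUMPTION: a plain modulo mapping, used here only to show the idea.
    The actual driver may map ports to queues differently.
    """
    if not 0 <= dst_port <= 65535:
        raise ValueError("TCP ports are 16-bit values")
    if num_queues < 1:
        raise ValueError("at least one queue is required")
    return dst_port % num_queues


def distribute(dst_ports, num_queues):
    """Count how many inbound segments land on each queue."""
    counts = [0] * num_queues
    for port in dst_ports:
        counts[queue_for_segment(port, num_queues)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical inbound segments, identified by destination port.
    segments = [5001, 5002, 5003, 5004, 5001]
    print(distribute(segments, 4))  # four queues, each interrupting its own CPU
    print(distribute(segments, 1))  # single queue: one CPU handles everything
```

Under this assumed mapping, all segments for a given destination port always reach the same queue, so one connection is never split across processors, while connections to different ports can be processed in parallel.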
3.4.2 Cache Affinity Improvement
Network protocols are layered, and data and control structures are shared between these layers. When
these structures are brought into a processor's cache, less time is spent stalling for cache misses as the
remaining protocol layers process the packet. Since interrupts for a NIC are bound to a processor, there is
even a good possibility that some structures will still be in the correct processor's cache when the next
packet for a given connection arrives.
However, when an application receives the data, there is the possibility of additional cache misses, as the
HP-UX scheduler assigns application threads to processors independently of the interrupt bindings. To get
the most efficient operation from a cache standpoint, it is beneficial to have the interrupt assigned where
the busiest applications are consuming the data. Using mpctl(2) on a per-application basis, and
optionally defining processor sets, applications can be restricted to run on specific processors.
If placing the interrupt and the application on the same processor does not result in a CPU bottleneck, then
this arrangement is the most efficient both for the application and from a system-wide perspective.
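The trade-off between cache affinity and CPU load can be expressed as a simple selection rule. The sketch below is a hypothetical model, not an HP-UX interface: it assumes the applications have already been bound to processors (for example with mpctl(2) or processor sets), and the function name, the load figures, and the capacity limit of 100 are all illustrative assumptions.

```python
def choose_interrupt_cpu(consumers, cpu_load, nic_load, capacity=100.0):
    """Pick the processor a network interface should interrupt.

    consumers: list of (app_name, cpu_id, bytes_per_sec) describing
               applications already bound to processors (hypothetical data).
    cpu_load:  dict of CPU id -> current load, in percent of one CPU.
    nic_load:  estimated interrupt load of the interface, same units.
    capacity:  assumed load above which a CPU counts as a bottleneck.
    """
    # Total the data consumed by the applications on each processor.
    consumed = {}
    for _name, cpu, rate in consumers:
        consumed[cpu] = consumed.get(cpu, 0) + rate

    # Prefer the processor where the busiest consumers run (best cache
    # affinity), but skip any processor the interrupt would overload.
    for cpu in sorted(consumed, key=lambda c: (-consumed[c], c)):
        if cpu_load.get(cpu, 0.0) + nic_load <= capacity:
            return cpu

    # No consuming processor has headroom: fall back to the least-loaded CPU.
    return min(cpu_load, key=lambda c: (cpu_load[c], c))


if __name__ == "__main__":
    apps = [("db", 2, 800), ("web", 1, 300), ("log", 2, 50)]
    load = {0: 10, 1: 40, 2: 60, 3: 5}
    print(choose_interrupt_cpu(apps, load, nic_load=30))
    print(choose_interrupt_cpu(apps, load, nic_load=50))
```

With a light interrupt load the rule selects the processor running the busiest consumer; with a heavier load it falls back to the next-best processor, giving up some cache affinity to avoid a CPU bottleneck.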
On the other hand, there is little cache sharing between network interfaces, so there will be little benefit
from cache affinity if multiple network interfaces interrupt the same processor.