10 Gb Ethernet
Mark Wagner
Senior Software Engineer, Red Hat
10 Gb Ethernet - Overview
This presentation is about 10Gb Ethernet performance, tuning, and functionality in RHEL5.X
● RHEL4.
Some Quick Disclaimers
Test results based on two different platforms
● Platforms from two vendors: Intel, AMD
● Cards supplied by three different vendors: Chelsio, Intel, Neterion
Red Hat supports all of the devices used for this presentation
● We do not recommend one over the other
Testing based on "performance mode"
● Maximize one particular metric at the expense of others
● Not recommended for production
Don't assume the settings shown will work for you without some tweaks
Take Aways
Hopefully, you will be able to leave this talk with:
● An understanding of the tools available to help you evaluate your network performance
● An understanding of 10GbE performance under RHEL5
Use this talk as suggestions of things to try
● My testing was based on a local network – a wide area network will be different
● Do not assume all settings will work for you without some tweaks
Take Aways - continued
Read the vendor's Release Notes, tuning guides, etc.
● Visit their website
● Install and read the source
● /usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.
A Quick Example
An Internet search for "linux tcp_window_scaling performance" will show some sites say to set it to 0, others say to set it to 1

[root@perf12 np2.4]# sysctl -w net.ipv4.tcp_window_scaling=0
[root@perf12 np2.4]# ./netperf -P1 -l 30 -H 192.168.10.100
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec
87380  16384   16384    30.00    2106.40
[root@perf12 np2.4]# sysctl -w net.ipv4.tcp_window_scaling=1
[root@perf12 np2.4]# .
Platform Features
Multiple processors
Fast memory
All my testing has been done on PCIe
● PCI-X not really fast enough for full duplex
● Look for a width of 8 lanes (x8)
● PCIe is typically 250 MB/sec per lane
Look for support of MSI-X interrupts
● Server
● OS (RHEL does this :)
● Driver
Tools
Monitor / debug tools
● mpstat – reveals per-CPU stats, hard/soft interrupt usage
● vmstat – VM page info, context switches, total ints/s, CPU
● netstat – per-NIC status, errors, statistics at the driver level
● lspci – list the devices on PCI, in-depth device flags
● oprofile – system-level profiling, kernel/driver code
● modinfo – list information about drivers: version, options
● sar – collect, report, save system activity information
Many others available – iptraf, wireshark, etc.
Tools (cont) Tuning tools ● ethtool – View and change Ethernet card settings ● sysctl – View and set /proc/sys settings ● ifconfig – View and set ethX variables ● setpci – View and set pci bus params for device ● netperf – Can run a bunch of different network tests ● /proc – OS info, place for changing device tunables
ethtool
Works mostly at the HW level
● ethtool -S – provides HW-level stats
● Counters are cumulative since boot time; create scripts to calculate diffs
● ethtool -c – interrupt coalescing settings
● ethtool -g – provides ring buffer information
● ethtool -k – provides HW assist (offload) information
● ethtool -i – provides the driver information
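Because the `ethtool -S` counters are cumulative since boot, a small script is needed to turn two samples into per-interval deltas. A minimal sketch, shown as a dry run on canned "name: value" samples rather than live NIC output (a real run would capture `ethtool -S eth2` twice):

```shell
# Diff two "name: value" counter captures, printing only counters that moved.
diff_stats() {
    awk -F': *' 'NR==FNR { prev[$1] = $2; next }
                 $1 in prev && $2 != prev[$1] { print $1, $2 - prev[$1] }' "$1" "$2"
}

# Canned sample data standing in for two ethtool -S snapshots.
cat > /tmp/before <<'EOF'
rx_packets: 100
tx_packets: 50
EOF
cat > /tmp/after <<'EOF'
rx_packets: 175
tx_packets: 50
EOF

diff_stats /tmp/before /tmp/after    # prints: rx_packets 75
```

On a live system the two input files would come from `ethtool -S` run a known interval apart, giving counts per interval.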
ethtool -c interrupt coalesce
[root@perf10 ~]# ethtool -c eth2
Coalesce parameters for eth2:
Adaptive RX: off  TX: off
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0
rx-usecs: 5
rx-frames: 0
rx-usecs-irq: 0
rx-frames-irq: 0
tx-usecs: 0
tx-frames: 0
tx-usecs-irq: 0
tx-frames-irq: 0
ethtool -g HW Ring Buffers
[root@perf10 ~]# ethtool -g eth2
Ring parameters for eth2:
Pre-set maximums:
RX: 16384
RX Mini: 0
RX Jumbo: 16384
TX: 16384
Current hardware settings:
RX: 1024
RX Mini: 1024
RX Jumbo: 512
TX: 1024
Typically these numbers correspond to the number of buffers, not the size of each buffer
With some NICs, creating more buffers decreases the size of each buffer, which could add overhead
ethtool -k HW Offload Settings
[root@perf10 ~]# ethtool -k eth2
Offload parameters for eth2:
Cannot get device udp large send offload settings: Operation not supported
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
udp fragmentation offload: off
generic segmentation offload: off
These offloads relieve the CPU of calculating the checksums, etc.
ethtool -i driver information
[root@perf10 ~]# ethtool -i eth2
driver: cxgb3
version: 1.0ko
firmware-version: T 5.0.0 TP 1.1.0
bus-info: 0000:06:00.0
[root@perf10 ~]# ethtool -i eth3
driver: ixgbe
version: 1.1.18
[root@dhcp47154 ~]# ethtool -i eth2
driver: Neterion (ed. note s2io)
version: 2.0.25.
sysctl
sysctl is a mechanism to view and control the entries under the /proc/sys tree
● sysctl -a – lists all variables
● sysctl -q – queries a variable
● sysctl -w – writes a variable
● When setting values, spaces are not allowed
● sysctl -w net.ipv4.conf.lo.arp_filter=0
Setting a variable via sysctl on the command line is not persistent – the change is only valid until the next reboot
● Write entries into /etc/sysctl.conf to make them persistent
Some Important settings for sysctl
Already showed the tcp_window_scaling issue
By default, Linux networking is tuned more for reliability than for max performance
● Buffers are especially not tuned for local 10GbE traffic
● Remember that Linux "autotunes" buffers for connections
● Don't forget UDP!
Try settings via the command line
● When you are happy with the results, add them to /etc/sysctl.conf
Look at the documentation in /usr/src
● /usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.
Some Important settings for sysctl
Misc TCP protocol
● net.ipv4.tcp_window_scaling – toggles window scaling
● net.ipv4.tcp_timestamps – toggles TCP timestamp support
● net.ipv4.tcp_sack – toggles SACK (Selective ACK) support
TCP Memory Allocations – min/pressure/max
● net.ipv4.tcp_rmem – TCP read buffer, in bytes
● overridden by net.core.rmem_max
● net.ipv4.tcp_wmem – TCP write buffer, in bytes
● overridden by net.core.wmem_max
● net.ipv4.
Some Important settings for sysctl
CORE memory settings
● net.core.rmem_max – max size of rx socket buffer
● net.core.wmem_max – max size of tx socket buffer
● net.core.rmem_default – default rx size of socket buffer
● net.core.wmem_default – default tx size of socket buffer
● net.core.optmem_max – maximum amount of option memory buffers
● net.core.
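Pulling the CORE and TCP buffer settings together, a hypothetical /etc/sysctl.conf fragment for local 10GbE testing might look like the following. The values are illustrative starting points to experiment with, not recommendations:

```shell
# Illustrative /etc/sysctl.conf fragment for local 10GbE testing.
# Values are starting points, not recommendations -- measure before deploying.
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 262144
net.core.wmem_default = 262144
# min / default / max, in bytes; max is capped by net.core.*mem_max above
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
```

Apply without rebooting via `sysctl -p`.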
netperf http://netperf.
Know what you are testing
Linux has several automatic features that may cause unanticipated side effects
● Message delivery – Linux does its best to get messages from A to B
● Packets may get from A to B via a different path than you think
● Check arp_filter settings – sysctl -a | grep arp_filter
● Automatic buffer sizing
● Be explicit if it matters to you
Control your network route
Check arp_filter settings with sysctl
● sysctl -a | grep arp_filter
● A setting of 0 says use any path
● If there is more than one path between machines, set arp_filter=1
Look for increasing interrupt counts in /proc/interrupts or increasing counters via ifconfig or netstat
(Diagram: machines A and B connected through a lab switch by both a 1GbE and a 10GbE link)
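For the two-path setup in the diagram, a sketch of the arp_filter configuration; "eth2" here is a hypothetical name for the 10GbE interface, and the commands need root:

```shell
# With arp_filter=0, either NIC may answer ARP for either address, so
# traffic between A and B can end up on the 1GbE link.  Setting
# arp_filter=1 ties ARP replies to the interface that owns the address.
sysctl -w net.ipv4.conf.all.arp_filter=1
sysctl -w net.ipv4.conf.eth2.arp_filter=1    # eth2 = hypothetical 10GbE NIC

# Then confirm the 10GbE path is actually carrying the traffic:
# watch -d 'grep eth2 /proc/interrupts'
```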
Know what you are testing - Hardware
Did the PCIe bus get negotiated correctly?
● Use lspci
Did the interrupts come up as expected?
● MSI-X can make a big difference
● On some cards it's not on by default
Several vendors have information on changing the default PCI-E settings via setpci
● Read the Release Notes / README!
lspci – validate your slot setting for each NIC
lspci -vv -s 09:00.0
09:00.
Some General System Tuning Guidelines
To maximize network throughput, let's:
● Disable irqbalance
– service irqbalance stop
– chkconfig irqbalance off
● Disable cpuspeed
● default governor is "ondemand"; set the governor to "performance"
Use affinity to maximize what WE want
● Process affinity – use taskset or MRG's "Tuna"
● Interrupt affinity
– grep eth2 /proc/interrupts
– echo 80 > /proc/irq/177/smp_affinity
Performance Tuning Outline
IRQ Affinity / Processor Affinity – no magic formula
● experiment to get the best results
● *My* experience is that chip architectures play a big role
Interrupt coalescing
Try to match TX and RX on the same socket / data caches
sysctl.
Actual Tuning Example You just got those new 10GbE cards that you told the CIO would greatly improve performance You plug them in and run a quick netperf to verify your choice
New Boards, first Run
# ./netperf -P1 -l 60 -H 192.168.10.10
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec
87380  16384   16384    60.00    5012.
New Boards, first run – mpstat -P ALL 5

Transmit
CPU  %sys   %iowait  %irq  %soft  %steal  %idle
all   2.17  0.00     0.35  1.17   0.00     96.23
0    17.40  0.00     2.80  9.20   0.00     70.00
1     0.00  0.00     0.00  0.00   0.00    100.00
2     0.00  0.00     0.00  0.00   0.00    100.00
3     0.00  0.00     0.00  0.00   0.00    100.00
4     0.00  0.00     0.00  0.00   0.00    100.00
5     0.00  0.00     0.00  0.00   0.00    100.00
6     0.00  0.00     0.00  0.00   0.00    100.00
7     0.00  0.00     0.00  0.00   0.00    100.00

Receive
CPU  %sys   %iowait
all   4.86  0.00
0    38.90  0.00
1     0.00  0.00
2     0.00  0.00
3     0.00  0.00
4     0.00  0.00
5     0.00  0.
Tuning – Identify the Bottlenecks
Run "mpstat -P ALL 5" while running netperf
Review of the output shows core0 on the receive side is pegged
So let's try to set some IRQ affinity
● grep for the NIC in /proc/interrupts
● echo the desired value into /proc/irq/XXX/smp_affinity
● Does NOT persist across reboots
Setting IRQ Affinity
CPU cores are designated by a bitmap
cat /proc/cpuinfo to determine how the BIOS presented the CPUs to the system
● Some go socket0 core0, socket1 core0, ...
● Others go socket0 core0, socket0 core1, ...
Understand the layout of the L2 cache in relation to the cores
Remember these values do not persist across reboots!
Set IRQ affinity
● echo 80 > /proc/irq/192/smp_affinity
● Or use "TUNA"
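The value echoed into smp_affinity is a hex bitmask of CPU cores (bit N set = core N allowed). A small helper makes the slide's "80" concrete; it is a sketch that assumes fewer than 32 cores, since wider systems use comma-separated mask words:

```shell
# Convert a core number to the hex bitmask /proc/irq/<n>/smp_affinity expects.
# Assumes core numbers below 32.
core_to_mask() {
    printf "%x\n" $((1 << $1))
}

core_to_mask 7    # core 7 -> bit 7 set -> prints 80
core_to_mask 6    # core 6 -> bit 6 set -> prints 40

# Usage as root, e.g. pin IRQ 192 to core 7:
# core_to_mask 7 > /proc/irq/192/smp_affinity
```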
Know Your CPU core layout
# cat /proc/cpuinfo
processor: 0   physical id: 0   core id: 0
processor: 1   physical id: 1   core id: 0
processor: 2   physical id: 0   core id: 1
processor: 3   physical id: 1   core id: 1
processor: 4   physical id: 0   core id: 2
processor: 5   physical id: 1   core id: 2
processor: 6   physical id: 0   core id: 3
processor: 7   physical id: 1   core id: 3
Even-numbered processors are on Socket 0, odd-numbered on Socket 1
Setting IRQ Affinity
Now let's move the interrupts – remember, your core mapping is important
Note the separate IRQs for TX and RX

Transmit
# grep eth2 /proc/interrupts
     CPU0    CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
74:  360345  0    0    0    0    0    0    0    PCI-MSI-X  eth2-tx-0
82:  647960  0    0    0    0    0    0    0    PCI-MSI-X  eth2-rx-0
90:  0       0    0    0    0    0    0    0    PCI-MSI-X  eth2-lsc
# echo 40 > /proc/irq/74/smp_affinity
# echo 80 > /proc/irq/82/smp_affinity

Receive
# grep eth2 /proc/interrupts
(similar output: PCI-MSI-X eth2, PCI-MSI-X eth2 (queue 0))
Tuning – Run 2, IRQ Affinity
# ./netperf -P1 -l 30 -H 192.168.10.10
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec
87380  16384   16384    30.00    5149.
Run 2 – mpstat -P ALL 5 outputs

Transmit
CPU  %sys   %iowait  %irq  %soft  %steal  %idle   intr/s
all   2.35  0.00     0.43  1.43   0.00     95.75  12387.80
0     0.00  0.00     0.00  0.00   0.00    100.00   1018.00
1     0.00  0.00     0.00  0.00   0.00    100.00      0.00
2     0.00  0.00     0.00  0.00   0.00    100.00      0.00
3     0.00  0.00     0.00  0.00   0.00    100.00      0.00
4     0.00  0.00     0.00  0.00   0.00    100.00      0.00
5     0.00  0.00     2.80  2.00   0.00     95.20   4003.80
6    18.60  0.00     0.60  9.40   0.00     70.80   7366.40
7     0.00  0.00     0.00  0.00   0.00    100.00      0.00

Receive
CPU  %sys  %iowait
all  4.67  0.00
0    0.00  0.
Run 2 – review data – next steps
We moved interrupts and reran the test
● Saw a very slight improvement in throughput
● Not really expecting much...yet
Looking at the mpstat output we can see that we are still bottlenecked on the receive side
Run 3 – Add Process Affinity
# ./netperf -P1 -l 30 -H 192.168.10.10 -T 5,5
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec
87380  16384   16384    30.00    4927.
Run 3 Bottlenecks

Transmit
CPU  %sys   %iowait  %irq  %soft  %steal  %idle
all   4.35  0.00     0.35  16.00  0.00     79.25
0     0.00  0.00     0.00   0.00  0.00    100.00
1     0.00  0.00     0.00   0.00  0.00    100.00
2     0.00  0.00     0.00   0.00  0.00    100.00
3     0.00  0.00     0.00   0.00  0.00    100.00
4     0.00  0.00     0.00   0.00  0.00    100.00
5    34.73  0.00     2.59  62.08  0.00      0.00
6     0.00  0.00     0.40  65.60  0.00     34.00
7     0.00  0.00     0.00   0.00  0.00    100.00

Receive
CPU  %sys   %iowait
all   3.85  0.00
0     0.00  0.00
1     0.00  0.00
2     0.00  0.00
3     0.00  0.00
4     0.00  0.00
5    30.90  0.00
6     0.00  0.00
7     0.
Run 3, Analysis + next steps
By adding process affinity, things have changed
● The bottleneck is now on the transmit side
● core5 on TX is 100%; core5 on the RX side is handling the load
Try moving process affinities around (already done)
Change the code (if you can)
● The default netperf method uses send(), which copies data around
● Try TCP_SENDFILE, which uses the sendfile() system call
Try a bigger MTU
● Currently at 1500
Run 4 – Change send() to sendfile()
# ./netperf -P1 -l 30 -H 192.168.10.10 -T 5,5 -t TCP_SENDFILE -F /data.file
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec
87380  16384   16384    30.00    6689.
Run 4 – mpstat output – sendfile option

Transmit
CPU  %sys   %iowait  %irq  %soft  %steal  %idle   intr/s
all   1.55  0.00     0.38   7.30  0.00     90.70  12645.00
0     0.00  0.00     0.00   0.00  0.00    100.00   1018.20
1     0.00  0.00     0.00   0.00  0.00    100.00      0.00
2     0.00  0.00     0.00   0.00  0.00    100.00      0.00
3     0.00  0.00     0.00   0.00  0.00    100.00      0.00
4     0.00  0.00     0.00   0.00  0.00    100.00      0.00
5    12.38  0.00     2.20  14.17  0.00     70.66   3973.40
6     0.00  0.00     0.80  44.20  0.00     55.00   7653.20
7     0.00  0.00     0.00   0.00  0.00    100.00      0.00

Receive
CPU  %sys  %iowait
all  5.73  0.31
0    0.
Tuning - Identifying the Bottlenecks Wow, our core with the netperf process on TX went from 0% idle to 70% idle by switching the system call ● Overall the system reclaimed 10% Nice jump in throughput, but we are still not near 10Gb Let's try a larger MTU ● ifconfig eth2 mtu 9000 up ● Note this causes a temporary drop in the connection
Run 5 – Kick up MTU = 9000
# ./netperf -P1 -l 30 -H 192.168.10.10 -T 5,5 -t TCP_SENDFILE -F /data.file
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec
87380  16384   16384    30.00    9888.
Run 5 – Kick up MTU = 9000

TX
CPU  %sys   %iowait  %irq  %soft  %steal  %idle   intr/s
all   1.40  0.00     0.32   4.27  0.00     93.90  13025.80
0     0.00  0.00     0.00   0.00  0.00    100.00   1015.00
1     0.00  0.00     0.00   0.00  0.00    100.00      0.00
2     0.00  0.00     0.00   0.00  0.00    100.00      0.00
3     0.00  0.00     0.00   0.00  0.00    100.00      0.00
4     0.00  0.00     0.00   0.00  0.00    100.00      0.00
5    11.00  0.00     2.20   7.40  0.00     78.80   4003.80
6     0.00  0.00     0.40  26.60  0.00     73.00   8007.20
7     0.00  0.00     0.00   0.00  0.00    100.00      0.00

RX
CPU  %sys  %iowait
all  6.63  0.00
0    0.00  0.
Features - Multi-queue
RHEL5 has support for several vendors' RX multi-queue
● Typically enabled via script or module parameters
Still no TX multi-queue (that I know of at least)
RX multi-queue is the big gainer when there are multiple applications using the network
● Use affinity for queues and match queue to task (taskset)
● Potential advantage with a single instance of netperf if you have slower CPUs / memory
Single RX Queue – Multiple netperf 1016.95 819.41 898.93 961.87 3696 CPU all 0 1 2 3 4 5 6 7 192.168.10.37 192.168.10.12 192.168.10.17 192.168.10.16 %sys %iowait 3.90 0.00 0.00 0.00 0.20 0.00 8.20 0.00 8.22 0.00 0.00 0.00 7.40 0.00 7.21 0.00 0.00 0.00 %irq 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 %soft 23.63 %steal 0.00 100.00 0.00 0.20 0.00 19.00 0.00 24.05 0.00 0.00 0.00 24.80 0.00 21.24 0.00 0.00 0.00 %idle 72.44 0.00 99.60 72.80 67.74 100.00 67.80 71.54 100.00 intr/s 1054.
Multiple RX Queues, multiple netperfs 1382.25 2127.18 1726.71 1986.31 7171 CPU all 0 1 2 3 4 5 6 7 192.168.10.37 192.168.10.17 192.168.10.16 192.168.10.12 %sys %iowait 6.55 0.00 2.40 0.00 11.45 0.00 2.00 0.00 0.00 0.00 10.60 0.00 0.00 0.00 11.22 0.00 14.77 0.00 %irq 0.18 0.00 0.40 0.20 0.00 0.00 0.00 0.00 0.80 %soft 42.84 6.60 83.13 97.80 0.00 36.20 0.00 35.87 83.03 %steal 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 %idle 50.44 91.00 5.02 0.00 100.00 53.20 100.00 52.91 1.20 intr/s 28648.
Interrupt distribution w/ MultiQueue []# grep eth2 /proc/interrupts CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 130: 5 0 0 0 0 0 0 0 138:24 11032798 0 0 0 0 0 0 146: 1 0 1821803 0 0 0 0 0 154: 1 0 0 5242518 0 0 0 0 162: 1 0 0 0 1849812 0 0 0 170: 1 0 0 0 0 7301950 0 0 178: 1 0 0 0 0 0 8426940 0 186: 1 0 0 0 0 0 0 1809018 eth2 eth2 eth2 eth2 eth2 eth2 eth2 eth2 q0 q1 q2 q3 q4 q5 q6
MRG and 10GbE A different ballgame Latency often trumps throughput ● But most people want both Messaging tends to leverage UDP and small packets ● Predictability is important
New Tuning Tools w/ RH MRG
MRG Tuning – using TUNA to dynamically control:
● Device IRQ properties
● Process affinity / parent and threads
● CPU affinity / parent and threads
● Scheduling policy
Tuning Network Apps
(Chart: messages/sec with 10 Gbit NICs, Stoakley 2.67 to Bensley 3.)
Latency
The interrupt coalescing settings are vital
● ethtool -c eth3 to read, -C to set
● rx-usecs tells how often to service interrupts
● NAPI may help, since multiple packets can be handled per interrupt
Also look at using the TCP_NODELAY option
● May help with latency but hurt throughput
● "Nagle's Algorithm" tries to fill packets in order to avoid the overhead of sending many small ones
Lower RX Latency with ethtool -C
# ethtool -c eth6
Coalesce parameters for eth6:
rx-usecs: 125
rx-frames: 0
rx-usecs-irq: 0
rx-frames-irq: 0

# ./netperf -H 192.168.10.12 -t TCP_RR
Local /Remote
Socket Size   Request  Resp.  Elapsed  Trans.
Send   Recv   Size     Size   Time     Rate
bytes  Bytes  bytes    bytes  secs.    per sec
16384  87380  1        1      10.00    8000.27

Lower rx-usecs on the receiver and rerun
# ethtool -C eth6 rx-usecs 100
# ./netperf -H 192.168.10.12 -t TCP_RR
16384  87380  1        1      10.00    10009.
10GbE in a Virtual World Single guest w/ 1Gb interface – Good Single guest w/ 10 Gb interface - Better than 1Gb ● Several copies needed for security ● cpu/memory BW limits 10Gbit performance ● Better than 1Gb network but not wire speed. ● Same speed as using a dummy network Use RHEL5.
10 GbE Scaling w/Xen on RHEL5.
10GbE in a Virtual World - cont
Going forward:
● PCI_passthru looks promising for performance
● Guest has direct control over NIC
● Network throughput for fully-virt/HVM guests will be limited to 100Mb/s.
Wrap up There are lots of knobs to use, the trick is finding them and learning how to use them Learn to use the tools to help investigate issues Full-Duplex (20Gb/sec) is possible under RHEL5 Questions ?
Credits The following helped me along the way in preparing this presentation, to them a special thank you ● Andy Gospo ● Don Dutile ● D John Shakshober ● JanMark Holzer