I spent a lot of time on TCP tuning for SL6, kernel 2.6.32-358.14.1.el6.x86_64, and I have kept using most of these parameters since then, up through 6.5, so I am sharing the experience here.

In the past, I played with several types of 10G NICs, all on SL5. Only some of them survived my tests; the rest failed either on poor performance or on data corruption during multi-stream transfers.


To be noted: my test is a multi-stream test for storage nodes that receive and deliver data over a wide range of RTTs (0.1 to 300 ms), with clients mixing 1G and 10G NICs.


In my recent test, I used a node with 32GB of memory, a Mellanox 10G NIC, and 12 CPUs; it was an SL5 node just upgraded to SL6. I mounted two LUNs so it had enough I/O bandwidth for the test.

The first driver I tested was 2.0, which came with SL6.4. It was not successful: it crashed the kernel in 3 minutes with the following error in the kernel log.

kernel: swapper: page allocation failure. order:2, mode:0x4020
kernel: Pid: 0, comm: swapper Not tainted 2.6.32-358.14.1.el6.x86_64 #1
kernel: Call Trace:
kernel: <IRQ>  [<ffffffff8112c197>] ? __alloc_pages_nodemask+0x757/0x8d0
kernel: [<ffffffff8147fa38>] ? ip_local_deliver+0x98/0xa0
kernel: [<ffffffff811609ea>] ? alloc_pages_current+0xaa/0x110
kernel: [<ffffffffa01efaa7>] ? mlx4_en_alloc_frags+0x57/0x330 [mlx4_en]
kernel: [<ffffffff8144aada>] ? napi_frags_finish+0x9a/0xb0
kernel: [<ffffffffa01f02df>] ? mlx4_en_process_rx_cq+0x55f/0x990 [mlx4_en]
kernel: [<ffffffffa01f074f>] ? mlx4_en_poll_rx_cq+0x3f/0x80
...

Then I tried version 1.5.10, which also generated some memory allocation errors, but with some further tuning it passed my stress tests. Performance is also very good.


Tuning in sysctl.conf

# Extra tuning starts here:
#
vm.min_free_kbytes = 524288
#
net.ipv4.tcp_sack = 1
net.ipv4.tcp_timestamps = 1
net.core.netdev_max_backlog = 250000
net.ipv4.tcp_low_latency = 1
net.ipv4.tcp_max_syn_backlog = 8192

net.core.optmem_max = 33554432
net.core.rmem_max = 33554432
net.core.wmem_max = 33554432
net.core.rmem_default = 80370
net.core.wmem_default = 65536
net.ipv4.tcp_rmem = 4096 87380 33554432
net.ipv4.tcp_wmem = 4096 65536 33554432
net.ipv4.tcp_mem = 6672016 6682016 7185248

To be noted: I increased tcp_mem to let TCP have more memory, because my data server is used mainly for data transfer. So if your server is also doing something else, you should probably lower the number to leave room for other applications.
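
One thing that is easy to trip over: tcp_mem is counted in pages, not bytes. With the usual 4096-byte page, the max value above works out to roughly 27 GiB of the 32GB node, which is intentional here. A minimal sketch of the check and the arithmetic (the values are simply the ones from my sysctl.conf above):

# getconf PAGESIZE
4096
# cat /proc/sys/net/ipv4/tcp_mem
6672016 6682016 7185248
# echo $(( 7185248 * 4096 / 1024 / 1024 / 1024 ))
27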


For SACK and timestamps

They were blamed for costing too much CPU. However, since kernel 2.6.25 there have been lots of patches for SACK to avoid excessive CPU usage, and I did not see significant CPU usage under my stress tests.
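
If you want to confirm that SACK is actually being exercised under your workload (rather than just enabled), the TCP statistics give a rough idea; a hedged example, since the exact counter wording varies by kernel version:

# sysctl net.ipv4.tcp_sack net.ipv4.tcp_timestamps
# netstat -s | grep -i sack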


Higher syn_backlog, max_backlog

I set syn_backlog and max_backlog to higher numbers mainly because there can be short bursts of high-rate data taking. I did not raise txqueuelen (the default is 1000), and I did not turn Mellanox adaptive-rx off either. Try these settings if your server's traffic pattern changes all the time, for example quiet for a while and then suddenly very busy.
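
Before (or after) raising these, it is worth checking whether the kernel is actually dropping anything; a quick sketch, noting that the column layout of softnet_stat is kernel specific:

# netstat -s | grep -i listen
# cat /proc/net/softnet_stat

In softnet_stat, the second column (hex) is the per-CPU count of packets dropped because netdev_max_backlog was full.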


General suggestion

Carefully choose the driver (the native or the latest driver may not yield good results), then leave sysctl.conf empty, check the numbers the kernel picks, and start from there.
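
For example, to see what the kernel picked for the memory-related parameters before overriding anything:

# sysctl net.core.rmem_max net.core.wmem_max net.core.rmem_default net.core.wmem_default
# sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem net.ipv4.tcp_mem
# sysctl vm.min_free_kbytes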

On different hardware the memory settings may be very different; however, the following parameters are always good for a high-throughput environment (switch low_latency to off if it is not applicable to your environment).

vm.min_free_kbytes = 524288
#
net.ipv4.tcp_sack = 1
net.ipv4.tcp_timestamps = 1
net.core.netdev_max_backlog = 250000
net.ipv4.tcp_low_latency = 1
net.ipv4.tcp_max_syn_backlog = 8192

In most cases the kernel is capable of picking the best numbers for the memory-related parameters; if not, tweak them a bit.

Note: you definitely need to increase min_free_kbytes if you see memory allocation errors in /var/log/messages.
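
A quick way to check and fix this (the order:2 failure at the top of this post is exactly this pattern); 524288 is what worked for my 32GB node, not a universal number:

# grep -i 'page allocation failure' /var/log/messages
# sysctl vm.min_free_kbytes
# sysctl -w vm.min_free_kbytes=524288

Add the same line to /etc/sysctl.conf so it survives a reboot.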


Some other options that may relate to 10G NIC performance

Note: the following options are managed/controlled by the 10G network card driver, and nowadays most drivers are already tuned according to your host configuration.


Network card driver option -- Adaptive RX/TX

The network driver uses adaptive interrupt moderation for the receive path, which adjusts the moderation time to the traffic pattern.

To query, run:

# ethtool -c eth<x>

To change it run:

# ethtool -C eth<x> adaptive-rx on|off

rx-usecs and rx-frames

To set interrupt coalescing settings when adaptive moderation is disabled, use:

# ethtool -C eth<x> [rx-usecs N] [rx-frames N]

Note: usec settings correspond to the time to wait after the *last* packet is sent/received before triggering an interrupt.
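
For example, to turn adaptive moderation off and pin the coalescing values (eth2 and the numbers are illustrative placeholders, not recommendations; check the driver's accepted range with ethtool -c first):

# ethtool -C eth2 adaptive-rx off
# ethtool -C eth2 rx-usecs 30 rx-frames 32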

Offload feature

To query stateless offload status run:

# ethtool -k eth<x> 

To set stateless offload status run:

# ethtool -K eth<x> [rx on|off] [tx on|off] [sg on|off] [tso on|off]
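
A concrete example (eth2 is a placeholder; on a bulk-transfer node you normally want TSO and scatter-gather left on):

# ethtool -k eth2
# ethtool -K eth2 tso on sg on rx on tx on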

Ring size

To query ring size values run:

# ethtool -g eth<x>

To modify rings size run:

# ethtool -G eth<x> [rx <N>] [tx <N>]
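
For example, read the pre-set maximums first, then raise the RX ring toward that maximum if you see drops during traffic bursts (8192 is an illustrative value; use whatever "Pre-set maximums" reports for your card):

# ethtool -g eth2
# ethtool -G eth2 rx 8192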

Number of rings (channels)

To query the number of rings run:

# ethtool -l eth<x>

To set the number of rings run:

# ethtool -L eth<x> [rx N] [tx N] [other N] [combined N]

Note: some network card drivers don't support the 'number of rings' and 'ring size' operations. Also, when you change both options, usually both values cannot be set to their maximum at the same time, due to hardware and driver limits.
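
As a rough sketch only: on the 12-CPU node above, a common starting point is to match the number of rings to the CPU count, assuming the driver exposes that knob at all (some only expose rx/tx, others only combined):

# ethtool -l eth2
# ethtool -L eth2 rx 12 tx 12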

Good references

Tuning can be very different depending on the application; here are some good references with very good explanations.

http://www.acc.umu.se/~maswan/linux-netperf.txt

http://fasterdata.es.net/host-tuning/linux/
http://en.wikipedia.org/wiki/TCP_window_scale_option
http://www.psc.edu/index.php/networking/641-tcp-tune

http://man7.org/linux/man-pages/man7/tcp.7.html
https://www.frozentux.net/ipsysctl-tutorial/ipsysctl-tutorial.html#TCPVARIABLES
http://en.wikipedia.org/wiki/Transmission_Control_Protocol

http://www.linuxvox.com/2009/11/what-is-the-linux-kernel-parameter-tcp_low_latency
http://www.ibm.com/developerworks/library/l-tcp-sack/

More references for other platforms

AIX:  For more information, see section 4.6 in the http://www.redbooks.ibm.com/redbooks/SG247347/wwhelp/wwhimpl/js/html/wwhelp.htm document.

In addition, see the http://publib.boulder.ibm.com/infocenter/pseries/v5r3/index.jsp?topic=/com.ibm.aix.prftungd/doc/prftungd/tcp_streaming_workload_tuning.htm document.


HP-UX: For more information, see the ndd command information in the following documents:

    http://docs.hp.com/en/B2355-91020/B2355-91020.pdf
    http://docs.hp.com/en/TKP-90203/index.html

HP-UX: Also, see the _recv_hiwater_def and tcp_xmit_hiwater_def parameter information in the following document: http://docs.hp.com/en/11890/perf-whitepaper-tcpip-v1_1.pdf


Linux: For more information, see the following documents:

    http://www.ibm.com/developerworks/linux/library/l-hisock.html
    http://fasterdata.es.net/TCP-tuning/linux.html
    http://www.onlamp.com/pub/a/onlamp/2005/11/17/tcp_tuning.html?page=2

Solaris: For more information, see section 2.2 in the following document: http://www.redbooks.ibm.com/redbooks/SG247584/wwhelp/wwhimpl/java/html/wwhelp.htm