I'm not a big fan of Linux software RAID (md); I mostly use it for local disks, not for disks backing cluster file systems.

Recently I got to try ZFS on Linux. We have a lot of retired Sun Fire X4540s and we want to reuse them until they die, so we reinstalled them with Linux. ZFS is the primary option, but md is also an option.

Here is the HW information:

Each node has 48 disks (1 TB or 2 TB each) connected to 6 SAS controllers, plus 12 CPU cores and 32 or 64 GB of memory. Another article has detailed info about the server's ZFS configuration and a tuning example for the Sun Fire X4540.

 

Here is the controller information: a pretty good internal SAS controller that supports 3 Gb/sec throughput.

# lspci | grep SAS
02:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS (rev 04)
03:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS (rev 04)
04:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS (rev 04)
41:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS (rev 04)
42:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS (rev 04)
43:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS (rev 04)

Note: In the case below I focused on throughput testing, not on DB applications or other random-access scenarios; still, you may get some benefit from the ideas and from knowing where to start tuning your own node.

 

Create the array (RAID6).

I use 12 disks in one array, spreading 2 disks across each controller, so that no single controller failure can break the array (RAID6 tolerates two failed members, which is exactly what losing one controller would cost).

 

# mdadm -Cv /dev/md5 -l6 -n12 -c128 /dev/disk/by-vdev/c0t6 /dev/disk/by-vdev/c0t7 \
/dev/disk/by-vdev/c1t6 /dev/disk/by-vdev/c1t7 /dev/disk/by-vdev/c2t6 /dev/disk/by-vdev/c2t7 \
/dev/disk/by-vdev/c3t6 /dev/disk/by-vdev/c3t7 /dev/disk/by-vdev/c4t6 /dev/disk/by-vdev/c4t7 \
/dev/disk/by-vdev/c5t6 /dev/disk/by-vdev/c5t7
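After the array is created, the initial build/resync can be watched in /proc/mdstat and the member layout double-checked with mdadm; both are standard md tools:

# cat /proc/mdstat
# mdadm --detail /dev/md5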

The device mapping comes from the Linux vdev naming configuration for the Sun Fire X4540.
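For reference, the /dev/disk/by-vdev names are generated by the vdev_id udev helper that ships with ZFS on Linux, driven by /etc/zfs/vdev_id.conf. A minimal sketch of the alias style (the by-path names below are placeholders, not the real x4540 cabling):

alias c0t6   /dev/disk/by-path/pci-0000:02:00.0-sas-phy6-lun-0
alias c0t7   /dev/disk/by-path/pci-0000:02:00.0-sas-phy7-lun-0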

 

Initial performance result.

    Command line used: /usr/bin/iozone -s96g -i 0 -i 1 -j 10 -r128k -t 1 -b /home/trteam/iozonetest/iozone_zpool12_4_2014-10-17-12-21.xls
    Output is in Kbytes/sec
    Time Resolution = 0.000001 seconds.
    Processor cache size set to 1024 Kbytes.
    Processor cache line size set to 32 bytes.
    File stride size set to 10 * record size.
    Throughput test with 1 process
    Each process writes a 100663296 Kbyte file in 128 Kbyte records

 ...

"Throughput report Y-axis is type of test X-axis is number of processes"
"Record size = 128 Kbytes "
"Output is in Kbytes/sec"

"  Initial write "  352233.91
"        Rewrite "  365875.84
"           Read "  478708.47
"        Re-read "  484263.88

 

Tuning for writing: set stripe_cache_size.

 

This is only available for raid5 and raid6 arrays, and it doesn't just boost writes: it can also improve read and resync performance. The value is the number of entries in the stripe cache used for writes, reads, and resync (each entry covers one page per member device). The default is 256 and valid values are 17 to 32768. Choose it carefully, since a larger cache consumes more memory.

Back to my case, I set it to 32768 for the array. I also tried other values: a smaller cache cost a bit of performance, and a larger one bought only a tiny bit more, so 32768 seems a good setting.

# echo 32768 > /sys/block/<md-dev>/md/stripe_cache_size
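The memory cost is worth estimating before picking a value. Using the commonly cited formula memory = page_size * nr_disks * stripe_cache_size (an estimate, assuming 4 KiB pages and the 12-disk array above):

# echo $(( 4096 * 12 * 32768 / 1024 / 1024 )) MiB
1536 MiB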

 

Another option I really wanted to set is max_sectors_kb on the md device, but the parameter would not accept my value; the section below shows what I was trying to do.

 

Tuning disk I/O size for the md device: max_sectors_kb.

# echo 1280 > /sys/block/<md-dev>/queue/max_sectors_kb

Why 1280?

The array has 12 disks in one RAID6 with a 128 KB chunk size; two of the twelve hold parity, so a full stripe of data is (12 - 2) * 128 KB = 1280 KB.
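In my case the write was rejected. One common reason is asking for more than the kernel reports in max_hw_sectors_kb, which max_sectors_kb cannot exceed; checking both first shows the ceiling (md5 used as the example device):

# cat /sys/block/md5/queue/max_hw_sectors_kb
# cat /sys/block/md5/queue/max_sectors_kb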

Tuning for reading: set the read-ahead buffer.

# blockdev --setra 65536 /dev/md5

This sets the read-ahead buffer to 32 MB. I also tried other numbers, but this one seems to be the best.
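The --setra value is given in 512-byte sectors, so 65536 sectors * 512 bytes = 33,554,432 bytes = 32 MiB. The current value can be read back as a sanity check:

# blockdev --getra /dev/md5
65536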

Tuning result.

    Command line used: /usr/bin/iozone -s96g -i 0 -i 1 -j 10 -r128k -t 1 -b /home/trteam/iozonetest/iozone_zpool12_4_2014-10-20-13-24.xls
    Output is in Kbytes/sec
    Time Resolution = 0.000001 seconds.
    Processor cache size set to 1024 Kbytes.
    Processor cache line size set to 32 bytes.
    File stride size set to 10 * record size.
    Throughput test with 1 process
    Each process writes a 100663296 Kbyte file in 128 Kbyte records

  ...

"Throughput report Y-axis is type of test X-axis is number of processes"
"Record size = 128 Kbytes "
"Output is in Kbytes/sec"

"  Initial write "  663349.75
"        Rewrite "  663594.38
"           Read "  798043.06
"        Re-read "  844228.69

 

Speed up rebuilding: set the speed limits.

The /proc/sys/dev/raid/speed_limit_min file reflects the current "goal" rebuild speed for times when there is non-rebuild activity on the array. The speed is in kibibytes per second (1 kibibyte = 2^10 bytes = 1024 bytes) and is a per-device rate, not a per-array rate. The default is 1000.

The /proc/sys/dev/raid/speed_limit_max file reflects the current "goal" rebuild speed for times when there is no non-rebuild activity on the array. The default is 200,000, as the output below shows.

# cat /proc/sys/dev/raid/speed_limit_max
200000
# cat /proc/sys/dev/raid/speed_limit_min
1000

To override the defaults, add the lines below to /etc/sysctl.conf:

dev.raid.speed_limit_min = 50000
## good for a 4-5 disk array ##
dev.raid.speed_limit_max = 2000000
## good for a larger 6-12 disk array (keep only one speed_limit_max line) ##
dev.raid.speed_limit_max = 50000000
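To apply the new values without a reboot, reload sysctl and confirm the result:

# sysctl -p
# sysctl dev.raid.speed_limit_min dev.raid.speed_limit_max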

I also noticed that some articles mention the following options. They didn't help much on my test bed, but they may help in other setups.

Disable NCQ on all disks

NCQ (Native Command Queuing) has a default queue depth of 64 here. Theoretically there is no need to reorder commands while rebuilding, so disabling the queue makes sense.

To disable NCQ for a device, run:

# echo 1 > /sys/block/<dev>/device/queue_depth
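To hit every member disk in one go, a small loop works; the sd* glob below is illustrative, so adjust it to your members, and note the original depths first if you want to restore NCQ afterwards:

# for q in /sys/block/sd*/device/queue_depth; do echo 1 > "$q"; done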

Bitmap Option

Bitmaps optimize rebuild time after a crash, or after removing and re-adding a device. Turn it on by typing the following command:
# mdadm --grow --bitmap=internal /dev/<md-dev>
Once the array is rebuilt or fully synced, disable the bitmap:
# mdadm --grow --bitmap=none /dev/<md-dev>
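Whether the internal bitmap is active can be confirmed from /proc/mdstat, which shows a "bitmap:" line for the array while one is present (md5 used as the example):

# grep -A 3 md5 /proc/mdstat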