I did ZFS tuning 5 years ago (see the link ZFS configuration and tuning example on SUN Fire X4540). Now that the warranty has expired I am converting these machines to Linux, so it is time to do tuning again.

ZFS on Linux (ZOL) is a bit different from Solaris ZFS, and it is still focused on functionality rather than performance (as of Oct. 2014), so there is some room to play.

Overall, the fundamentals of ZFS are the same, so before you get started on your own tuning, I suggest you also read ZFS tunable parameters.

 

I'm not going to go through the theory again; instead, I'll use a real case to show how each tuning step works.

Test environment:

Hardware: Sun Fire X4540, 12 CPUs, 32 GB memory, 6 internal SAS controllers with 8 hard drives attached to each, 48 drives in total.

Using ZFS vdev naming on Linux, I created 4 ZFS pools: 2 with 12 hard drives and 2 with 11 hard drives (the remaining 2 drives hold the OS on a Linux software RAID array).

Here is the command that created one of the pools:

zpool create -f -m /zpool_3 zpool_3 raidz2 \
    disk/by-vdev/c0t4 disk/by-vdev/c0t5 disk/by-vdev/c1t4 \
    disk/by-vdev/c1t5 disk/by-vdev/c2t4 disk/by-vdev/c2t5 \
    disk/by-vdev/c3t4 disk/by-vdev/c3t5 disk/by-vdev/c4t4 \
    disk/by-vdev/c4t5 disk/by-vdev/c5t4 disk/by-vdev/c5t5
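
If you also use the by-vdev naming, here is a minimal sketch of how to sanity-check the setup, assuming the aliases come from /etc/zfs/vdev_id.conf as in a standard ZOL install:

udevadm trigger               # re-run the udev rules so /dev/disk/by-vdev/ reflects vdev_id.conf
ls /dev/disk/by-vdev/         # the c0t4 ... c5t5 aliases should show up here
zpool status zpool_3          # confirm the raidz2 vdev came up with the intended 12 disks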

ARC tuning

This setting is done in the /etc/modprobe.d/zfs.conf file.

For example, should you want the ARC never to exceed 24 GB, add this line:

options zfs zfs_arc_max=25769803776

This number suits my server (32 GB of memory), which is a dedicated file server with no other applications running on it, so I leave 8 GB to the OS and the cluster file system.

I have another dedicated article describing how to set and monitor the ZFS ARC.
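
As a quick illustration (a minimal sketch, not a substitute for that article), on my ZOL build the same limit can also be applied at runtime and the current ARC size watched through the kstat interface:

echo 25769803776 > /sys/module/zfs/parameters/zfs_arc_max   # 24 GB = 24 * 1024^3 bytes, takes effect without reloading the module
grep -E '^(size|c_max) ' /proc/spl/kstat/zfs/arcstats       # current ARC size and the cap it honors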

ZFS File level prefetch tuning

There are not many parameters you can play with here; the main one is set in /etc/modprobe.d/zfs.conf:

options zfs zfs_prefetch_disable=1
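
Before baking this into zfs.conf, it is worth checking whether the prefetcher actually helps your workload. A small sketch, assuming your build exposes the zfetchstats kstat:

cat /proc/spl/kstat/zfs/zfetchstats                        # hits vs. misses of the file-level prefetcher under real load
echo 1 > /sys/module/zfs/parameters/zfs_prefetch_disable   # the switch can also be flipped at runtime for an A/B test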

ZFS Device level prefetch tuning

options zfs zfs_vdev_cache_size=1310720
options zfs zfs_vdev_cache_max=131072
options zfs zfs_vdev_cache_bshift=17

In my test case the ZFS record size is set to 128k, so zfs_vdev_cache_max is 131072 (128k). With the settings above, any I/O smaller than zfs_vdev_cache_max is inflated to a read of 2^zfs_vdev_cache_bshift bytes; bshift is the base-2 logarithm of the size used to read the disks, and I raised it from the default of 16 (64k) to 17 (128k).

The inflated I/O data is stored in a per-vdev LRU cache (zfs_vdev_cache_size), which can short-circuit the ZIO pipeline when the data is already present in the cache. The default size is 10 MB; in my tests I did not see a big improvement from making it larger.
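
Whether the vdev cache is worth its memory can be judged from its own kstat; a quick check, assuming the vdev_cache_stats kstat exists on your build:

cat /proc/spl/kstat/zfs/vdev_cache_stats   # delegations / hits / misses of the per-vdev read cache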

Chunk size tuning

options zfs zfs_read_chunk_size=131072

131072 = 128k, which equals the ZFS record size.
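
A minimal sketch of how to line this up against the dataset before making it permanent (zpool_3 is the pool from the example above):

zfs get recordsize zpool_3                                     # confirm the dataset record size (128K here)
echo 131072 > /sys/module/zfs/parameters/zfs_read_chunk_size   # try the value on a live system before editing zfs.conf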

Max pending I/Os on a device

For example, zfs_vdev_sync_read_max_active sets the number of synchronous read I/Os that can be pending on a device. If there are more than max_active operations already queued up, beyond those already issued to the vdev, then zone-based scheduling is used to pick the next synchronous zio.

So here is my tuning (the default for zfs_vdev_sync_read_max_active is 10):

options zfs zfs_vdev_async_read_max_active=12
options zfs zfs_vdev_async_read_min_active=12
options zfs zfs_vdev_async_write_max_active=12
options zfs zfs_vdev_async_write_min_active=12
options zfs zfs_vdev_sync_read_max_active=12
options zfs zfs_vdev_sync_read_min_active=12
options zfs zfs_vdev_sync_write_max_active=12
options zfs zfs_vdev_sync_write_min_active=12

The reason is that my application is focused on high I/O throughput, with a single file system on each ZFS pool, so I chose to set these to a higher number. Setting them too high may hurt write performance.

In older ZFS versions this value was set to 35, which allowed common SCSI and SATA disks to reach their maximum throughput under ZFS. However, 35 concurrent I/Os also means that service times can be inflated for read workloads.
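
After reloading the module (or rebooting), here is a small sketch to confirm that all eight knobs took the new value:

for f in /sys/module/zfs/parameters/zfs_vdev_*_active; do
    printf '%-40s %s\n' "$(basename "$f")" "$(cat "$f")"   # each should report 12 after the change
done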

Test results

Iozone, before tuning, with the default 4k record size:

    Command line used: /usr/bin/iozone -s96g -i 0 -i 1 -j 10 -t 1
    "Throughput report Y-axis is type of test X-axis is number of processes"
    "Record size = 4 Kbytes"
    "Output is in Kbytes/sec"
    "  Initial write "  342702.56
    "        Rewrite "   19251.95
    "           Read "  290888.47
    "        Re-read "  294238.59


Before tuning, with a 128k record size:

    Command line used: /usr/bin/iozone -s96g -i 0 -i 1 -j 10 -r128 -t 1
    "Throughput report Y-axis is type of test X-axis is number of processes"
    "Record size = 128 Kbytes"
    "Output is in Kbytes/sec"
    "  Initial write "  734676.94
    "        Rewrite "  755937.00
    "           Read "  369469.12
    "        Re-read "  372136.12

Tuned performance result, same 128k record size:

    Command line used: /usr/bin/iozone -s96g -i 0 -i 1 -j 10 -r128 -t 1
    "Throughput report Y-axis is type of test X-axis is number of processes"
    "Record size = 128 Kbytes"
    "Output is in Kbytes/sec"
    "  Initial write "  726571.69
    "        Rewrite "  740502.06
    "           Read "  575197.38
    "        Re-read "  562190.00

 

More tunable parameters

On Linux you can explore and tune many more parameters; see /sys/module/zfs/parameters (a quick way to look up what each one means follows the listing below):

# ls /sys/module/zfs/parameters/
l2arc_feed_again                 zfs_prefetch_disable
l2arc_feed_min_ms                zfs_read_chunk_size
l2arc_feed_secs                  zfs_read_history
l2arc_headroom                   zfs_read_history_hits
l2arc_headroom_boost             zfs_recover
l2arc_nocompress                 zfs_resilver_delay
l2arc_noprefetch                 zfs_resilver_min_time_ms
l2arc_norw                       zfs_scan_idle
l2arc_write_boost                zfs_scan_min_time_ms
l2arc_write_max                  zfs_scrub_delay
metaslab_debug_load              zfs_send_corrupt_data
metaslab_debug_unload            zfs_sync_pass_deferred_free
spa_asize_inflation              zfs_sync_pass_dont_compress
spa_config_path                  zfs_sync_pass_rewrite
zfetch_array_rd_sz               zfs_top_maxinflight
zfetch_block_cap                 zfs_txg_history
zfetch_max_streams               zfs_txg_timeout
zfetch_min_sec_reap              zfs_vdev_aggregation_limit
zfs_arc_grow_retry               zfs_vdev_async_read_max_active
zfs_arc_max                      zfs_vdev_async_read_min_active
zfs_arc_memory_throttle_disable  zfs_vdev_async_write_active_max_dirty_percent
zfs_arc_meta_limit               zfs_vdev_async_write_active_min_dirty_percent
zfs_arc_meta_prune               zfs_vdev_async_write_max_active
zfs_arc_min                      zfs_vdev_async_write_min_active
zfs_arc_min_prefetch_lifespan    zfs_vdev_cache_bshift
zfs_arc_p_aggressive_disable     zfs_vdev_cache_max
zfs_arc_p_dampener_disable       zfs_vdev_cache_size
zfs_arc_shrink_shift             zfs_vdev_max_active
zfs_autoimport_disable           zfs_vdev_mirror_switch_us
zfs_dbuf_state_index             zfs_vdev_read_gap_limit
zfs_deadman_enabled              zfs_vdev_scheduler
zfs_deadman_synctime_ms          zfs_vdev_scrub_max_active
zfs_dedup_prefetch               zfs_vdev_scrub_min_active
zfs_delay_min_dirty_percent      zfs_vdev_sync_read_max_active
zfs_delay_scale                  zfs_vdev_sync_read_min_active
zfs_dirty_data_max               zfs_vdev_sync_write_max_active
zfs_dirty_data_max_max           zfs_vdev_sync_write_min_active
zfs_dirty_data_max_max_percent   zfs_vdev_write_gap_limit
zfs_dirty_data_max_percent       zfs_zevent_cols
zfs_dirty_data_sync              zfs_zevent_console
zfs_disable_dup_eviction         zfs_zevent_len_max
zfs_expire_snapshot              zil_replay_disable
zfs_flags                        zil_slog_limit
zfs_free_min_time_ms             zio_bulk_flags
zfs_immediate_write_sz           zio_delay_max
zfs_mdcomp_disable               zio_injection_enabled
zfs_no_scrub_io                  zio_requeue_io_start_cut_in_line
zfs_no_scrub_prefetch            zvol_inhibit_dev
zfs_nocacheflush                 zvol_major
zfs_nopwrite_enabled             zvol_max_discard_blocks
zfs_pd_blks_max                  zvol_threads
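
To see what an unfamiliar entry does, the module's own parameter descriptions are usually enough; a quick sketch (zfs_txg_timeout is just an arbitrary example from the list):

modinfo -p zfs | sort | less                       # one-line description of every module parameter
cat /sys/module/zfs/parameters/zfs_txg_timeout     # read any current value directly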