I found this article really good, so I am sharing it here.

Interpreting iostat Output

Kyle Brandt

In this post I am going to explore how extended iostat statistics can be useful to a system administrator beyond a binary “Disk is bottleneck / Disk is not bottleneck.” Before we can get to any of that, however, we must make sure we have some basic background knowledge of the disk IO subsystem.

 

Linux Disk IO Subsystem:

 

I am not a kernel hacker, so this overview might be flawed in parts but hopefully it is accurate enough to give the background needed for analyzing the output of iostat.

 

Layer                              Unit                                                  Typical Unit Size
User Space System Calls            read(), write()
Virtual File System Switch (VFS)   Block                                                 4096 Bytes
Disk Caches                        Page
Filesystem (for example ext3)      Blocks                                                4096 Bytes (can be set at FS creation)
Generic Block Layer                Page Frames / Block IO Operations (bio)
I/O Scheduler Layer                bios per block device (which this layer may combine)
Block Device Driver                Segment                                               512 Bytes
Hard Disk                          Sector                                                512 Bytes

There are two basic system calls, read() and write(), that a user process can make to read data from and write data to a file system. In the kernel these are handled by the Linux Virtual Filesystem Switch (VFS). VFS is an abstraction over all file systems so that they look the same to user space, and it also handles the interface between the file system and the block device layer. The caching layer provides caching of disk reads and writes in terms of memory pages. The generic block layer breaks down IO operations that might involve many different non-contiguous blocks into multiple IO operations. The I/O scheduling layer takes these IO operations and schedules them based on order on disk, priority, and/or direction. Lastly, the device driver handles interfacing with the hardware for the actual operations in terms of disk sectors, which are usually 512 bytes.

 

A Little Bit on Page Caching:

 

The page cache caches pages of data that do or will reside on disk. Therefore, before the kernel writes data to disk it puts it in memory, and before it reads data from disk it checks to see if it is in memory already (with the exception of Direct IO). Writing pages out to disk actually gets deferred. This is done to increase performance so writes can be grouped together more efficiently. When a page of disk data gets changed and needs to be written out to disk it is called “dirty”. Since it is dangerous to keep dirty pages in memory for too long in case of a system shutdown, the kernel’s pdflush threads scan for dirty pages and then flush them out to disk. Linux will actually try to use as much memory as it can for caching files, which is why the top command usually shows so much used memory. When you want to see how much memory is free for processes you can run the free command and look at the ‘-/+ buffers/cache’ line.
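As a rough sketch of the arithmetic behind that line, memory that is free for processes can be approximated as free memory plus buffers plus page cache. The example below parses text in the format of /proc/meminfo. The function names and the sample values are made up for illustration, and the sum somewhat overstates what can really be reclaimed, since not every cached page can be dropped.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: value in kB}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts:
            values[name.strip()] = int(parts[0])
    return values


def free_for_processes_kb(meminfo):
    """Free memory plus buffers and page cache, in kB."""
    return meminfo["MemFree"] + meminfo["Buffers"] + meminfo["Cached"]


# Made-up sample values, for illustration only.
SAMPLE = """MemTotal:        8000000 kB
MemFree:          500000 kB
Buffers:          300000 kB
Cached:          4200000 kB
"""

if __name__ == "__main__":
    info = parse_meminfo(SAMPLE)
    print(free_for_processes_kb(info), "kB free for processes")
```

On a live system the same functions could be fed the contents of /proc/meminfo instead of the sample string.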

 

iostat output:

 

So with this background let’s look at some of the output of iostat and tie it together with our background knowledge. iostat can break down the statistics at both the partition level and the device level; however, in this post I am going to focus on the device level.

 

The Overview Statistics: “Is it Saturated or Not?”

 

iostat gives two summary statistics, both expressed as percentages: Input/Output CPU wait time (iowait) and device utilization.

 

iowait is from the CPU’s perspective and it is the percentage of time that the CPU spent waiting for an IO device to be ready. Another way to look at iowait is the amount of time that the CPU could have been doing something but couldn’t because all the processes were waiting on the disk or the network devices.
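To make the percentage concrete, here is a minimal sketch of how an iowait figure can be derived from two samples of the aggregate "cpu" line in /proc/stat. The field order (user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice) is the kernel's; the function names and the two sample lines are hypothetical, and this is not necessarily how iostat itself does the calculation.

```python
def cpu_times(stat_line):
    """Return the numeric fields of an aggregate 'cpu' line from /proc/stat."""
    fields = stat_line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(x) for x in fields[1:]]


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two samples."""
    a, b = cpu_times(before), cpu_times(after)
    deltas = [y - x for x, y in zip(a, b)]
    # Sum only user..steal (the first eight fields); guest time is
    # already counted inside user and nice.
    total = sum(deltas[:8])
    if total == 0:
        return 0.0
    return 100.0 * deltas[4] / total  # index 4 is iowait


# Made-up samples, for illustration only.
BEFORE = "cpu  1000 0 500 8000 400 0 100 0 0 0"
AFTER = "cpu  1100 0 550 8600 650 0 100 0 0 0"

if __name__ == "__main__":
    print(f"iowait: {iowait_percent(BEFORE, AFTER):.1f}%")
```

In the sample, 250 of the 1000 ticks that elapsed between the two readings were spent in iowait.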

 

Device utilization is covered thoroughly by Alex Gorbachev in Basic I/O Monitoring on Linux. He summarizes it as “The percentage of time the device spent servicing requests as opposed to being idle.”

 

iostat and caching:

 

It is important to note that iostat shows requests to the device (or partition) and not read and write requests from user space. So in the table above iostat is reading below the disk cache layer. Therefore, iostat says nothing about your cache hit ratio for block devices, and it is possible that disk IO problems might be resolved by memory upgrades. From my research there is no easy way to pull a cache hit/miss ratio out of Linux when it comes to block devices, which is a bit disappointing. One suggestion from serverfault is to install a kernel with debugging symbols and use SystemTap to trace the VFS events and tie them together with the block layer events. I intend to explore this but I would prefer to see a way to get this data from /proc or /sys.

 

iostat Output for Random and Sequential Reads:

 

One of the main things to do when examining disk IO is to determine if the disk access patterns are sequential or random. This information can aid in our disk choices. When operations are random the seek time of the disk becomes more important. This is because physically the drive head has to jump around. Seek time is the measurement of the speed at which the heads can do this. For small random reads solid state disks can be a huge advantage.

 

So in fio I have created two different simple tests to run. The first is sequential reading, and the second is random reading. During these tests I ran iostat -x 3 throughout the test.

 

Snapshot of Random Read Test:

 

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.00  172.67    0.00  1381.33     0.00     8.00     0.99    5.76   5.76  99.47

 

Snapshot of Sequential Read Test:

 

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda              13.00     0.00  367.00    0.00 151893.33     0.00   413.88     2.46    6.71   2.72 100.00

 

What is more important to me is not just what these numbers are but what they mean in the context of random vs sequential reading and in the context of the IO subsystem.

 

The first two columns, rrqm/s and wrqm/s, are read and write requests merged per second. In my table of the Linux block IO subsystem above I mentioned that the scheduler can combine operations. This can be done when multiple operations are physically adjacent to each other on the device. So in sequential operation it would make sense to often see a large number of merges. In the snapshot of the random reads, we see no merges. However, the merging layer feels a little bit like “magic” and I don’t believe it is the best indicator of whether the patterns are random or sequential.
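The idea of merging physically adjacent operations can be illustrated with a toy model. This is only a sketch under simplifying assumptions: requests are (start_sector, sector_count) tuples, everything is sorted and merged in one pass, and there are no limits on request size. It is not the kernel's actual algorithm, and the function name and sample requests are made up.

```python
def merge_adjacent(requests):
    """Merge requests that sit back to back on the device.

    Each request is a (start_sector, sector_count) tuple. Returns the
    merged request list and the number of merges performed.
    """
    merged = []
    merges = 0
    for start, count in sorted(requests):
        if merged and merged[-1][0] + merged[-1][1] == start:
            # This request begins exactly where the previous one ends.
            prev_start, prev_count = merged[-1]
            merged[-1] = (prev_start, prev_count + count)
            merges += 1
        else:
            merged.append((start, count))
    return merged, merges


# Four 8-sector requests: one contiguous run and one scattered set.
sequential = [(0, 8), (8, 8), (16, 8), (24, 8)]
scattered = [(0, 8), (5000, 8), (120, 8), (90000, 8)]

if __name__ == "__main__":
    print(merge_adjacent(sequential))
    print(merge_adjacent(scattered))
```

The contiguous run collapses into a single 32-sector request with three merges, while the scattered requests cannot be merged at all, which matches the zero in the rrqm/s column of the random test.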

 

The next 5 columns are read and write requests to the device (r/s, w/s), followed by the number of sectors read from and written to the device (rsec/s, wsec/s), and then the size of each request (avgrq-sz). In the random test there are 172 reads per second that result in 1,381 sectors being read in. In the sequential test there are 367 read requests per second resulting in 151,893 sectors being read. So in the random test we get about 8 sectors per request and in the sequential test we get about 413 sectors per read. If you look closely, this happens to be the same number as avgrq-sz, which does this math for us (Sectors Read / Read Operations). It is worth noting how this is calculated, however, because the average request size does not differentiate between reads and writes. From these tests, a low ratio of sectors read or written to requests, in other words a small request size, seems to indicate a random IO profile. I believe this to be a better indicator than the number of merges as to whether the disk patterns are random or sequential.
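The arithmetic can be checked directly against the two snapshots. The sketch below uses the values from the iostat output above and combines reads and writes, since avgrq-sz does not differentiate between them. The function name is my own, and the 512-byte sector size is the one assumed throughout this post.

```python
SECTOR_BYTES = 512  # sector size assumed throughout the article


def avg_request_size(rsec_s, wsec_s, r_s, w_s):
    """Average request size in sectors, reads and writes combined."""
    ops = r_s + w_s
    if ops == 0:
        return 0.0
    return (rsec_s + wsec_s) / ops


# Values taken from the two iostat snapshots above.
random_sz = avg_request_size(1381.33, 0.00, 172.67, 0.00)
sequential_sz = avg_request_size(151893.33, 0.00, 367.00, 0.00)

if __name__ == "__main__":
    print(f"random:     {random_sz:.2f} sectors, "
          f"{random_sz * SECTOR_BYTES:.0f} bytes")
    print(f"sequential: {sequential_sz:.2f} sectors, "
          f"{sequential_sz * SECTOR_BYTES / 1024:.0f} KiB")
```

Eight sectors of 512 bytes is 4096 bytes, which is the block size from the table at the top of the post, so each random request is reading a single block.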

 

The final 4 columns are the average queue length of requests to the device (avgqu-sz), how long requests took to be serviced including their time in the queue (await), how long requests took to be serviced by the device after they left the queue (svctm), and lastly the utilization percentage which I already mentioned in the overview statistics section. In the above example random requests take longer for the disk to service, as expected because of the seek time. However, the queue itself ends up being shorter, which I am unable to explain. Utilization, in more detail, is the service time in ms * total IO operations per second / 1000 ms. This gives the percentage of how busy the single disk was during the given time slice. I believe for a given utilization level a higher number of operations is probably indicative of a sequential pattern.
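The utilization formula can be checked against the snapshots with a short sketch. The second function is an addition of mine rather than something from the tests: it applies Little's law (average number in the system = arrival rate * time in the system) to await, which happens to reproduce the avgqu-sz column. The function names are hypothetical.

```python
def utilization_percent(svctm_ms, r_s, w_s):
    """Utilization: service time in ms * operations per second / 1000 ms."""
    return svctm_ms * (r_s + w_s) / 1000.0 * 100.0


def queue_length(await_ms, r_s, w_s):
    """Average requests outstanding, by Little's law (rate * time in system)."""
    return (r_s + w_s) * await_ms / 1000.0


# (name, r/s, w/s, await, svctm) taken from the two snapshots above.
SNAPSHOTS = [
    ("random", 172.67, 0.00, 5.76, 5.76),
    ("sequential", 367.00, 0.00, 6.71, 2.72),
]

if __name__ == "__main__":
    for name, r_s, w_s, await_ms, svctm_ms in SNAPSHOTS:
        util = utilization_percent(svctm_ms, r_s, w_s)
        queue = queue_length(await_ms, r_s, w_s)
        print(f"{name}: util {util:.2f}%, queue {queue:.2f}")
```

Both results land within rounding distance of what iostat reported (99.47 and 100.00 for %util, 0.99 and 2.46 for avgqu-sz). This does not explain why fewer requests were in flight during the random test, only that the reported queue length is consistent with the request rate and await.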

 

I have run several variations on the above. They include a mixture of reads and writes for both random and sequential data as well as sequential and random writes. For the writes I got similar results as far as the ratios were concerned, and queue and service times were higher.

 

In the end it seems average request size is the key to showing whether the disk usage patterns are random or not, since it is measured after merging. Taken in the context of the layers above, this might not mirror what an application is doing. This is because a read or write operation coming from user space might operate on a fragmented file, in which case the generic block layer will break it up and it will appear as random disk activity.

 

Conclusion:

 

As far as I am concerned this is only a start in interpreting IO statistics. I think these tests need to be repeated, perhaps with different tools to generate the disk IO, as my interpretations might just be totally off. Also, a pretty big limitation of what I did is that my work was all on a single disk, and these numbers might come out differently under various RAID configurations. I feel the inability to measure the cache hit ratio of reads on a block device is a significant shortcoming that I would love to see addressed, since from a system administrator’s perspective the solution to certain IO problems might be to throw more memory at the problem.

Lastly, I want to make a point about these sorts of low level statistics in general. Everything needs to be monitored from an application perspective as well. These statistics can be misleading and are most useful when they can be correlated with the data that actually matters to users of the applications, for example, response time from the user perspective. They also need to be monitored over time, because you want to be able to see changes for capacity planning as well as to compare against past performance when problems arise.

Further Reading:
http://www.igvita.com/2009/06/23/measuring-optimizing-io-performance/
Understanding The Linux Kernel, Chapter 14
http://www.ufsdump.org/papers/io-tuning.pdf
http://bhavin.directi.com/iostat-and-disk-utilization-monitoring-nirvana/
http://www.kernel.org/doc/Documentation/iostats.txt