blktrace is a block layer I/O tracing mechanism which provides detailed information about request queue operations up to user space.

 

There are three major components:

A kernel component
A utility to record the i/o trace information from the kernel to user space
Utilities to analyse and view the trace information

I'm going to show you how to use blktrace to identify disk drive performance problems and bottlenecks.

How to use blktrace to analyze i/o statistics

In the first test bed, I have a host with 4 CPU cores (single thread per core), 8GB of memory, and 4 SAN storage LUNs connected via dual FC connections, running kernel 3.10.0-327.10.1.el7.x86_64.

So, I started two rsync processes copying files from two different LUNs to one LUN, and used blktrace to capture i/o details for two of the LUNs.

LUN11 is one of the source LUNs; rsync copies large amounts of data from it to LUN14.

Step 1, use blktrace to trace i/o

blktrace -d /dev/mapper/lun11 -w 600 &
blktrace -d /dev/mapper/lun14 -w 600 &

-w is the option that sets the period of time, in seconds, that blktrace traces i/o for the specified block device.

The default output files are lun11.blktrace.[0-3] and lun14.blktrace.[0-3], one per CPU; if you have more CPUs, you will get more output files.
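
Once the capture finishes, it is worth a quick check that the per-CPU files are actually there (the names below simply follow the LUN names used in this test bed):

ls -lh lun11.blktrace.* lun14.blktrace.*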

Once this capture is done, we can use blkparse to parse the output files.

Step 2, Parse blktrace output

The trace is stored in a binary format, which obviously doesn't make for convenient reading. The tool for that job is blkparse, a simple interface for analyzing the IO traces dumped by blktrace. They are packaged together, so you'll have blkparse if you have blktrace.

blkparse -i lun11.blktrace.* -d lun11.bin
blkparse -i lun14.blktrace.* -d lun14.bin
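
If you just want to eyeball the individual trace events (and the summary blkparse prints at the end) rather than feed btt, the same input can also be parsed straight to stdout; this step is optional:

blkparse -i lun11.blktrace.* | less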

Step 3, use btt to generate a report from the i/o trace.

Before using btt to get the i/o tracing report, here are a few things you need to know first.

I/O time and key points

 Q------->G------------>I--------->M------------------->D----------------------------->C
|-Q time-|-Insert time-|
|--------- merge time ------------|-merge with other IO|
|----------------scheduler time------------------------|---driver, adapter, storage time--|

|----------------------- await time in iostat output ----------------------------------|
Here are the details:

Q2Q — time between requests sent to the block layer
Q2G — time from when a block I/O is queued to when a request is allocated for it
G2I — time from when a request is allocated to when it is inserted into the device's queue
Q2M — time from when a block I/O is queued to when it is merged with an existing request
I2D — time from when a request is inserted into the device's queue to when it is actually issued to the device
M2D — time from when a block I/O is merged with an existing request until the request is issued to the device
D2C — service time of the request by the device
Q2C — total time spent in the block layer for a request
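
As a cross-check, the await column in iostat's extended output should roughly track the Q2C figure btt reports for the same device, since both cover queue-to-completion time (iostat comes from the sysstat package; run it while the workload is active):

iostat -x 5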

For lun11, run

btt -i lun11.bin 
==================== All Devices ====================

            ALL           MIN           AVG           MAX           N
--------------- ------------- ------------- ------------- -----------
Q2Qdm             0.000001227   0.001820250   1.428553495      329570
Q2Cdm             0.000094190   0.002175531   1.427059604      329569
Q2G               0.000000286   0.000001139   0.000843182     1318284
G2I               0.000000517   0.000001187   0.000039631     1318284
I2D               0.000000439   0.000000971   0.000297697     1318284
D2C               0.000087763   0.002172234   1.427056223      329569

==================== Device Overhead ====================

       DEV |       Q2G       G2I       Q2M       I2D       D2C
---------- | --------- --------- --------- --------- ---------

...

So, how do we interpret these numbers?

Let's start with these two time values

Q2C — total time spent in the block layer for a request
D2C — service time of the request by the device

Q2C is the total IO time (time from being queued to being completed, just like in the diagram above), and D2C is the IO time spent in the device. The values are in seconds, and N is the number of IOs observed. The queue time isn't very high, so most of what is going on with performance is the device (D2C) time.

D2C/Q2C = 0.93

This indicates that most of the i/o time is spent on the device; only about 7% is spent on request queuing, merging, etc. This is a really good number.
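
If you don't want to do the division by hand, a small awk sketch like the one below pulls the AVG column out of the btt report and prints the ratio. It assumes the report layout shown above (label in the first column, AVG in the third); on plain block devices the row is labelled Q2C rather than Q2Cdm, so adjust the pattern accordingly.

btt -i lun11.bin > lun11.btt.txt
awk '/^Q2Cdm /{q=$3} /^D2C /{d=$3} END {printf "D2C/Q2C = %.2f\n", d/q}' lun11.btt.txt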

On the other hand, if the time spent on external storage (D2C) is too high, you should also check the storage components to see if there is room for tuning. This includes FC channels, switches, storage boxes, etc. If the number you are getting is close to the storage's limit, well done, your system is well tuned.

Some other conditions also lead to high D2C, for example contention in external storage components (FC links, switches, etc.). In this case you can analyze the i/o latency.

How to use btt to get i/o latency

The latency data files which can be optionally produced by btt provide per-IO latency information, one for total IO time (Q2C) and one for latencies induced by lower layer drivers and devices (D2C).

To get Q2C latency

btt -i lun14.bin -q lun14.q2c_latency

To get D2C latency

btt -i lun14.bin -l lun14.d2c_latency

Here is some of the example output:

1.831078 0.071712
1.831130 0.072169
1.838799 0.079814
1.838843 0.079864
1.838872 0.079358
1.838928 0.079391
1.839568 0.080023
1.839758 0.080620
1.839843 0.080321
1.839897 0.080366
1.843325 0.084338

In both cases, the first column (X values) represents runtime (seconds), while the second column (Y values) shows the actual latency for a command at that time (either Q2C or D2C). Using these numbers, you can make graphs of the finest details of a system's IO performance.
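
Beyond graphing, a couple of quick summary numbers can be pulled straight from the latency file with awk. This is only a sketch: it assumes the two-column layout shown above, and the file name should be whatever btt actually wrote on your system (some versions append a device ID and a .dat suffix to the name you pass in).

awk '{sum+=$2; if ($2>max) max=$2} END {printf "avg %.6f s  max %.6f s  over %d IOs\n", sum/NR, max, NR}' lun14.d2c_latency*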

Use btt to extract the offsets of IOs and seek times

Another useful capability of btt is extracting the offsets of IOs (with the -B flag). The offsets can then be plotted, showing an amazing amount of detail about the current IO workload.

btt -i lun14.bin -B lun14_dump_blocknos

This '-B' option will output absolute block numbers to three files prefixed by the specified output name:

prefix_device_r.dat
       All read block numbers are output, first column is time (seconds), second is the block  number,  and the third column is the ending block number.

prefix_device_w.dat
       All  write block numbers are output, first column is time (seconds), second is the block number, and the third column is the ending block number.

prefix_device_c.dat
       All block numbers (read and write) are output, first column is time (seconds), second is  the  block number, and the third column is the ending block number.

So, as you can see, for each i/o you get its timestamp, starting block number, and ending block number. Here is part of an example output:

    0.000106166 1610680551 1610680581
    0.000106166 1610680551 1610680581
    1.758736586 2767774808 2767775832
    1.758736586 2767774808 2767775832
    1.758736586 2767774808 2767775832
    1.758736586 2767774808 2767775832
    1.758749832 2767775832 2767776856
    1.758749832 2767775832 2767776856
    1.758749832 2767775832 2767776856
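
Since each line carries a start and an end block, a rough sequential-vs-random split falls out of comparing every start block with the previous request's end block. The sketch below runs over the combined read+write file; the glob is there because btt embeds the device ID in the file name, so adjust it to match what was actually produced. If your blktrace build also ships bno_plot.py, it can turn these same files into gnuplot graphs.

awk 'NR>1 { if ($2 == prev) seq++; else rnd++ } { prev = $3 } END { printf "sequential %d  random %d\n", seq, rnd }' lun14_dump_blocknos*_c.dat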

Measuring the Workload on Drives on a Linux System Using the blktrace Tool

On each hard drive, each block of stored data is given a logical block address (LBA). When the host system sends a read or write command to the drive, it tells the drive to read or write the data located at one or more LBAs. Using the blktrace tool, it is possible to record the starting LBA, transfer size, timestamp, and many other attributes for each request sent to the drive. The recorded data can be used to compute the read/write percentages, sequential percentage, and transfer size distribution.

Generating and Recording Drive Workloads

For the purpose of a blktrace walkthrough, imagine an example where a 10GB file will be written to a disk. Use blktrace to record the requests to and from the drive.
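
As a rough sketch of how such a workload could be generated while capturing it (the mount point, file name, runtime, and block size here are arbitrary choices for illustration, not taken from the original test):

blktrace -d /dev/mapper/lun14 -w 300 &
dd if=/dev/zero of=/mnt/lun14/bigfile bs=1M count=10240 oflag=direct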

To make the output of blktrace readable, we can use blkparse to extract the information that we need. For example: 

blkparse -i lun14.blktrace.* -f "%5T.%9t, %p, %C, %a, %d, %N\n" -a complete -o output.txt

This creates a file with the timestamp, process ID and name, trace action (C for complete), operation type (read/write), and number of bytes transferred for each request.

    1.946349480, 20434, kworker/1:2, C, W, 0
    1.946353564, 20434, kworker/1:2, C, W, 524288
    1.946353564, 20434, kworker/1:2, C, W, 524288
    1.946353564, 20434, kworker/1:2, C, W, 524288
    1.946353564, 20434, kworker/1:2, C, W, 524288
    1.946412750, 21, (null), C, W, 0
    1.946412750, 21, (null), C, W, 0
    1.946412750, 21, (null), C, W, 0
    1.946412750, 21, (null), C, W, 0
    1.947052418, 0, swapper/3, C, W, 524288

The -a complete option filters the trace so that we only see commands that were completed.
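
Note that the format string above does not actually capture the starting LBA. If you want it per request, blkparse also has %S (starting sector) and %n (number of blocks) specifiers that can be added to the format, along the lines of the command below; check the FORMAT DESCRIPTIONS section of your blkparse man page, since this is an assumption about your installed version.

blkparse -i lun14.blktrace.* -f "%5T.%9t, %p, %C, %a, %d, %S, %n, %N\n" -a complete -o output_lba.txt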

Interpreting the Results

Once you have each IO's timestamp, operation type, starting LBA, and transfer size, it's easy to derive the following i/o statistics (a quick awk sketch follows the list):

read/write ratio
average read size
average write size
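
Here is one way to get those numbers from the output.txt produced earlier. It is only a sketch that assumes the comma-separated field order set by the -f string above (direction in field 5, byte count in field 6):

awk -F', ' '$5 ~ /R/ { r++; rb += $6 }
            $5 ~ /W/ { w++; wb += $6 }
            END { printf "reads %d (avg %.0f bytes)  writes %d (avg %.0f bytes)\n", r, (r ? rb/r : 0), w, (w ? wb/w : 0) }' output.txt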

Also, by analyzing the LBA numbers, you can tell whether the i/o is sequential or random. Analyzing the trace data also shows the impact of metadata and the Linux I/O scheduler on overall performance.
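
The transfer-size distribution mentioned earlier (and the request growth caused by the scheduler's merging) can be sketched from the same file by counting how often each byte count appears; again this assumes the field layout from the -f string above.

awk -F', ' '{ sizes[$6]++ } END { for (s in sizes) print s, sizes[s] }' output.txt | sort -n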

So, this experiment showed an example of how to use blktrace to characterize a workload at the drive level. The measurement showed that the actual workload seen by the drive was different from what the parameters used to generate the system-level workload would suggest. Metadata overhead interrupted the sequential large-file write, while the optimizations performed by the Linux I/O scheduler increased the transfer size seen by the drive. Understanding the connection between system-level workloads and the I/O pattern that the drive experiences is essential to optimizing performance.

It's also just scratching the surface of the uses of the blktrace family. Other capabilities include counting seeks (though that's getting less interesting as SSDs take over the world), understanding merge behavior, and analyzing the overhead of various parts of the Linux IO subsystem. This is a set of tools that should be more widely known and used.

 

Ref:

http://duch.mimuw.edu.pl/~lichota/09-10/Optymalizacja-open-source/Materialy/10%20-%20Dysk/gelato_ICE06apr_blktrace_brunelle_hp.pdf

http://www.cse.unsw.edu.au/~aaronc/iosched/doc/btt.html#sec:lat-d2c