blktrace is a block layer IO tracing mechanism which provides detailed  information  about  request  queue operations up to user space.

There are three major components:

a kernel component
a utility to record the i/o trace information for the kernel to user space
utilities to analyse and view the  trace  information


Installation

On RHEL/CentOS

# yum install blktrace

================================================================
 Package     Arch    Version         Repository             Size
================================================================
Installing:
 blktrace   x86_64   1.0.5-6.el7     sl                    132 k

Transaction Summary
================================================================
Install  1 Package

Total download size: 132 k
Installed size: 297 k
Is this ok [y/d/N]: y

Installed:
  blktrace.x86_64 0:1.0.5-6.el7

blktrace Usage

Usage: blktrace 

-d <dev>             | --dev=<dev>
[ -r <debugfs path>  | --relay=<debugfs path> ]
[ -o <file>          | --output=<file>]
[ -D <dir>           | --output-dir=<dir>
[ -w <time>          | --stopwatch=<time>]
[ -a <action field>  | --act-mask=<action field>]
[ -A <action mask>   | --set-mask=<action mask>]
[ -b <size>          | --buffer-size]
[ -n <number>        | --num-sub-buffers=<number>]
[ -l                 | --listen]
[ -h <hostname>      | --host=<hostname>]
[ -p <port number>   | --port=<port number>]
[ -s                 | --no-sendfile]
[ -I <devs file>     | --input-devs=<devs file>]
[ -v <version>       | --version]
[ -V <version>       | --version]
    -d Use specified device. May also be given last after options
    -r Path to mounted debugfs, defaults to /sys/kernel/debug
    -o File(s) to send output to
    -D Directory to prepend to output file names
    -w Stop after defined time, in seconds
    -a Only trace specified actions. See documentation
    -A Give trace mask as a single value. See documentation
    -b Sub buffer size in KiB (default 512)
    -n Number of sub buffers (default 4)
    -l Run in network listen mode (blktrace server)
    -h Run in network client mode, connecting to the given host
    -p Network port to use (default 8462)
    -s Make the network client NOT use sendfile() to transfer data
    -I Add devices found in <devs file>
    -v Print program version info
    -V Print program version info

Use blktrace to trace disk i/o

1. Trace I/O

The command below trace disk sdp for 10 seconds, and send the trace result to file trace.blktrace.[0-11], by default, the output filename in the example below should be sdp.blktrace.[0-11]

# blktrace -d /dev/sdp -o trace -w 10
=== sdp ===
  CPU  0:                 8780 events,      412 KiB data
  CPU  1:                 3173 events,      149 KiB data
  CPU  2:                 3600 events,      169 KiB data
  CPU  3:                 1445 events,       68 KiB data
  CPU  4:                 4604 events,      216 KiB data
  CPU  5:                 3764 events,      177 KiB data
  CPU  6:                 1152 events,       54 KiB data
  CPU  7:                 1142 events,       54 KiB data
  CPU  8:                  335 events,       16 KiB data
  CPU  9:                  936 events,       44 KiB data
  CPU 10:                 1470 events,       69 KiB data
  CPU 11:                  385 events,       19 KiB data
  Total:                 30786 events (dropped 0),     1444 KiB data

2. blkparse parse the trace output file

#blkparse -i trace.blktrace.* -d sdp.bin

...
Total (trace):
 Reads Queued:       4,833,  117,316KiB     Writes Queued:           0,        0KiB
 Read Dispatches:    4,833,  117,316KiB     Write Dispatches:        0,        0KiB
 Reads Requeued:         0         Writes Requeued:         0
 Reads Completed:    4,828,  116,829KiB     Writes Completed:        0,        0KiB
 Read Merges:            0,        0KiB     Write Merges:            0,        0KiB
 IO unplugs:             0             Timer unplugs:           0

Throughput (R/W): 11,692KiB/s / 0KiB/s
Events (trace): 28,993 entries
Skips: 0 forward (0 -   0.0%)

3. Use btt to get device i/o activites statistics

# btt -i sdp.bin | more
==================== All Devices ====================

            ALL           MIN           AVG           MAX           N
--------------- ------------- ------------- ------------- -----------

Q2Q               0.000002693   0.002064896   0.224947065        4832
Q2G               0.000000289   0.000010970   0.010791768       57996
G2I               0.000000214   0.000000961   0.000033457       57996
I2D               0.000000160   0.000005564   0.005938405       57108
D2C               0.000135164   0.003013303   0.035674715        4805
Q2C               0.000168496   0.003040692   0.035678620        4828

==================== Device Overhead ====================

       DEV |       Q2G       G2I       Q2M       I2D       D2C
---------- | --------- --------- --------- --------- ---------
 (  8,240) |   4.3338%   0.3797%   0.0000%   2.1643%  98.6272%
---------- | --------- --------- --------- --------- ---------
   Overall |   4.3338%   0.3797%   0.0000%   2.1643%  98.6272%

==================== Device Merge Information ====================

       DEV |       #Q       #D   Ratio |   BLKmin   BLKavg   BLKmax    Total
---------- | -------- -------- ------- | -------- -------- -------- --------
 (  8,240) |    57996    57996     1.0 |       25       48      256  2830284

==================== Device Q2Q Seek Information ====================

       DEV |          NSEEKS            MEAN          MEDIAN | MODE           
---------- | --------------- --------------- --------------- | ---------------
 (  8,240) |            4833        712982.8               0 | 0(2296)
---------- | --------------- --------------- --------------- | ---------------
   Overall |          NSEEKS            MEAN          MEDIAN | MODE           
   Average |            4833        712982.8               0 | 0(2296)

==================== Device D2D Seek Information ====================

       DEV |          NSEEKS            MEAN          MEDIAN | MODE           
---------- | --------------- --------------- --------------- | ---------------
 (  8,240) |           57996         59415.2               0 | 0(55458)
---------- | --------------- --------------- --------------- | ---------------
   Overall |          NSEEKS            MEAN          MEDIAN | MODE           
   Average |           57996         59415.2               0 | 0(55458)

==================== Plug Information ====================

       DEV |    # Plugs # Timer Us  | % Time Q Plugged
---------- | ---------- ----------  | ----------------

       DEV |    IOs/Unp   IOs/Unp(to)
---------- | ----------   ----------
 (  8,240) |        0.0          0.0

==================== Active Requests At Q Information ====================

       DEV |  Avg Reqs @ Q
---------- | -------------
 (  8,240) |           0.0
...

Interpreting blktrace Information

Note: All time is mili seconds

Q2Q — time between requests sent to the block layer
Q2G — how
long it takes from the time a block I/O is queued to the time it gets a request allocated for it
G2I — how
long it takes from the time a request is allocated to the time it is Inserted into the device's queue
Q2M — how
long it takes from the time a block I/O is queued to the time it gets merged with an existing request
I2D — how
long it takes from the time a request is inserted into the device's queue to the time it is actually issued to the device
M2D — how
long it takes from the time a block I/O is merged with an exiting request until the request is issued to the device
D2C — service time of the request by the device
Q2C — total time spent in the block layer
for a request

I/O sequence

 Q------->G------------>I--------->M------------------->D----------------------------->C
|-Q time-|-Insert time-|
|--------- merge time ------------|-merge with other IO|
|----------------scheduler time time-------------------|---driver,adapter,storagetime--|

|----------------------- await time in iostat output ----------------------------------|

Hints and tips for I/O tuning

  • If Q2Q is much larger than Q2C, that means the application is not issuing I/O in rapid succession. hus, any performance problems you have may not be at all related to the I/O subsystem
  • D2C is very high, then the device is taking a long time to service requests. This can indicate that the device is simply overloaded (which may be due to the fact that it is a shared resource), or it could be because the workload sent down to the device is sub-optimal.
  • If Q2G is very high, it means that there are a lot of requests queued concurrently. This could indicate that the storage is unable to keep up with the I/O load.
  • await in iostat output = Q2C = Q2I + I2D + D2C
  • Q2I + I2D == scheduler time
  • The I2D time can include a lot of apparent extra time due to plug and unplug events (not shown above) which are used to improve merging of io within the schedule sort queue
  • D2C time covers driver time, adapter time, transport time, and storage service time (and back)
  • So if D2C/Q2C is approaching to 1, it  means % time spent on storage component is high.
  • high D->C times, the underlying transport structure of storage needs to be examined, such as switch counters or maintenance interfaces for storage boxes themselves.