Like a conventional filesystem, Hadoop HDFS also offers a filesystem consistency and integrity check. Conveniently enough, the command is also called fsck, and it can be used to identify corrupt files on HDFS.

HDFS filesystem checking utility

The command usage

$ hdfs fsck
Usage: hdfs fsck <path> [-list-corruptfileblocks | [-move | -delete | -openforwrite] [-files [-blocks [-locations | -racks]]]] [-includeSnapshots] [-storagepolicies] [-blockId <blk_Id>]
    <path>    start checking from this path
    -move    move corrupted files to /lost+found
    -delete    delete corrupted files
    -files    print out files being checked
    -openforwrite    print out files opened for write
    -includeSnapshots    include snapshot data if the given path indicates a snapshottable directory or there are snapshottable directories under it
    -list-corruptfileblocks    print out list of missing blocks and files they belong to
    -blocks    print out block report
    -locations    print out locations for every block
    -racks    print out network topology for data-node locations
    -storagepolicies    print out storage policy summary for the blocks
    -blockId    print out which file this blockId belongs to, locations (nodes, racks) of this block, and other diagnostics info (under replicated, corrupted or not, etc)

Simple HDFS fsck

$ hdfs fsck  /
Connecting to namenode via http://<namenode>:50070/fsck?ugi=hadoop&path=%2F
FSCK started by hadoop (auth:SIMPLE) from /192.168.0.1 for path / at Thu Dec 15 16:01:25 PST 2016
.................................................................................................
.................................................................................................
..............................................................Status: HEALTHY
 Total size:    238102023128 B
 Total dirs:    17
 Total files:    862
 Total symlinks:        0
 Total blocks (validated):    2261 (avg. block size 105308280 B)
 Minimally replicated blocks:    2261 (100.0 %)
 Over-replicated blocks:    0 (0.0 %)
 Under-replicated blocks:    0 (0.0 %)
 Mis-replicated blocks:        0 (0.0 %)
 Default replication factor:    2
 Average block replication:    1.985847
 Corrupt blocks:        0
 Missing replicas:        0 (0.0 %)
 Number of data-nodes:        2
 Number of racks:        1
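
If you only need a quick pass/fail, for example from a monitoring script, grepping the summary line is enough. fsck ends its report with a line like "The filesystem under path '/' is HEALTHY" (or CORRUPT, as in the example further down), so a minimal check looks like this:

$ hdfs fsck / 2>/dev/null | grep -q 'is HEALTHY' && echo "HDFS is healthy" || echo "HDFS has problems"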

How do I know if the HDFS filesystem has corrupt blocks?

The easiest way to determine this is to run an fsck on the filesystem. If you have set up your Hadoop environment variables, you should be able to use a path of /.

hdfs fsck /

or:

hdfs fsck hdfs://<namenode>:50070/

If the end of the fsck output looks like this, you have corrupt blocks in your HDFS.

...
/user/hadooptest/test59:  Under replicated BP-762523015-192.168.0.2-1480061879099:blk_1073741976_1152. Target Replicas is 2 but found 1 replica(s).

/user/hadooptest/test59:  Under replicated BP-762523015-192.168.0.2-1480061879099:blk_1073741977_1153. Target Replicas is 2 but found 1 replica(s).
.
/user/hadooptest/test6: CORRUPT blockpool BP-762523015-192.168.0.2-1480061879099 block blk_1073741845
...
 Total size:    238102023128 B
 Total dirs:    17
 Total files:    862
 Total symlinks:        0
 Total blocks (validated):    2261 (avg. block size 105308280 B)
  ********************************
  UNDER MIN REPL'D BLOCKS:    32 (1.415303 %)
  CORRUPT FILES:    11
  MISSING BLOCKS:    32
  MISSING SIZE:        3606911412 B
  CORRUPT BLOCKS:     32
  ********************************
 Corrupt blocks:        32
 Number of data-nodes:        2
 Number of racks:        1
FSCK ended at Thu Dec 15 22:26:27 PST 2016 in 33 milliseconds

 

Note: in the example output above, you can see that there are actually two types of errors found (a quick way to count each type is sketched after the list below):

Under replicated
CORRUPT blockpool
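
To get a quick count of each type before digging in, you can save the fsck report to a temporary file and grep it (a small sketch; /tmp/fsck.out is just a scratch file name):

$ hdfs fsck / > /tmp/fsck.out
$ grep -c 'Under replicated' /tmp/fsck.out
$ grep -c 'CORRUPT blockpool' /tmp/fsck.out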

Let's try to fix them one by one

How to fix "Under replicated" files

This type of error is not a big issue. It just means that a file which is supposed to have 2 or more replicas is currently missing one of its copies. To fix it, just ask HDFS to make the extra copies for these files.

So, get the list of under-replicated files first:

$ hdfs fsck / | grep -i "under replicated" | awk '{print $1}' |sort | uniq | sed -e 's/://g'> under_replicated.flst

Then, run the command below to fix them:

$ for f in $(cat under_replicated.flst); do echo "fixing $f"; hdfs dfs -setrep 3 "$f"; done

Example output:

...
fixing /fibrevillage/usr/bin/catman
Replication 3 set: /fibrevillage/usr/bin/catman
fixing /fibrevillage/usr/bin/cc
...
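
The loop above hard-codes a replication factor of 3; if you would rather restore the files to whatever your cluster's default is, a variation like the one below should work. This is just a sketch: dfs.replication is read from the client configuration, and the -w flag makes setrep wait until the re-replication actually completes.

$ repl=$(hdfs getconf -confKey dfs.replication)
$ while read -r f; do echo "fixing $f"; hdfs dfs -setrep -w "$repl" "$f"; done < under_replicated.flst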

How do I know which files have corrupt blocks?

The output of the fsck above will be very verbose, but it will mention which blocks are corrupt.

$ hdfs fsck / | egrep -v '^\.+$' | grep -i corrupt

Example output:

...
/user/hadooptest/test9: CORRUPT blockpool BP-762523015-192.168.0.2-1480061879099 block blk_1073741855
/user/hadooptest/test9: CORRUPT blockpool BP-762523015-192.168.0.2-1480061879099 block blk_1073741856
/user/hadooptest/test9: MISSING 3 blocks of total size 339281920 B.Status: CORRUPT
  CORRUPT FILES:    11
  CORRUPT BLOCKS:     32
 Corrupt blocks:        32
The filesystem under path '/' is CORRUPT

The above is part of the corrupted file list. Filtered this way, the output will not be a bunch of dots, and it also excludes files that merely have under-replicated blocks (which isn't necessarily an issue). The output should include something like this for all your affected files.

How to locate the corrupted blocks?

The first step is to gather information on each file's location and blocks.

Find the corrupted file list:

$ hdfs fsck / | egrep -v '^\.+$' | grep -i "corrupt blockpool"| awk '{print $1}' |sort |uniq |sed -e 's/://g' >corrupted.flst

Run the command below for each file, one by one (you could do it in a loop if the situation is the same for all of them; see the sketch after the example output):

$ hdfs fsck /user/hadooptest/test1 -locations -blocks -files
FSCK started by hadoop (auth:SIMPLE) from /192.168.0.2 for path /user/hadooptest/test1 at Thu Dec 15 23:32:30 PST 2016
/user/hadooptest/test1 339281920 bytes, 3 block(s):
/user/hadooptest/test1: CORRUPT blockpool BP-762523015-192.168.0.2-1480061879099 block blk_1073741830

/user/hadooptest/test1: CORRUPT blockpool BP-762523015-192.168.0.2-1480061879099 block blk_1073741831

/user/hadooptest/test1: CORRUPT blockpool BP-762523015-192.168.0.2-1480061879099 block blk_1073741832
 MISSING 3 blocks of total size 339281920 B
0. BP-762523015-192.168.0.2-1480061879099:blk_1073741830_1006 len=134217728 MISSING!
1. BP-762523015-192.168.0.2-1480061879099:blk_1073741831_1007 len=134217728 MISSING!
2. BP-762523015-192.168.0.2-1480061879099:blk_1073741832_1008 len=70846464 MISSING!
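
If the situation is the same for every file, you can wrap the per-file check in a loop over the list built above (a simple sketch using corrupted.flst from the previous step):

$ while read -r f; do echo "=== $f ==="; hdfs fsck "$f" -files -blocks -locations; done < corrupted.flst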

So, you can tell that the corrupted blocks are on the server 192.168.0.2.

Log on to the node and check its datanode log; you will see something like this:

ExitCodeException exitCode=1: chmod: changing permissions of ‘/disk/c5t7’: Operation not permitted

This piece of information tells you that the filesystem/block device /disk/c5t7 has an I/O error.
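
To find messages like this, grep the datanode log on that node and check the disk itself. The commands below are only a sketch: the log path assumes a default $HADOOP_HOME/logs layout, and the /disk/c5t7 mount point is taken from the example above, so adjust both for your environment.

$ grep -iE 'exitcodeexception|error|denied' $HADOOP_HOME/logs/hadoop-*-datanode-*.log | tail -20
$ mount | grep c5t7       # a disk with I/O errors is often remounted read-only (ro)
$ dmesg | grep -i 'error' | tail -20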

How to repair a corrupted file?

In the section above, we tracked down the node where the corruption is, and by looking through the logs we know where the problem lies. If the problem can be fixed at this layer and the affected data directory can be brought back online, then the error is fixed.

Note: this fix applies to the "Under replicated" error too.

However, if the underlying disk has a media problem, it can be replaced, but the data on it will be lost. The corrupted file most likely can't be recovered if there is no backup.

Remove the corrupted file from your Hadoop cluster

This command will move a corrupted file to the trash:

hdfs dfs -rm /path/to/corrupted_file

Or you can skip the trash to permanently delete it (which is probably what you want to do):

hdfs dfs -rm -skipTrash /path/to/corrupted_file
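
To clean up every file in the corrupted.flst list built earlier, you can loop over it (a sketch; review the list carefully first, since -skipTrash deletes the files permanently):

$ while read -r f; do echo "removing $f"; hdfs dfs -rm -skipTrash "$f"; done < corrupted.flst

Alternatively, hdfs fsck / -delete removes the corrupted files in one pass, and hdfs fsck / -move relocates them to /lost+found instead, as listed in the fsck usage at the top of this page.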