A disk has a block size, which is the minimum amount of data that it can read or write; it is normally 512 bytes. Disk blocks are generally transparent to users.

HDFS has a similar concept of a block, but it is a much larger unit: 128 MB by default. As in a filesystem for a single disk, files in HDFS are broken into block-sized chunks, which are stored as independent units. Unlike a single-disk filesystem, however, a file in HDFS that is smaller than a single block does not occupy a full block's worth of storage. For example, a 1 MB file uses 1 MB of disk space in HDFS, not a whole 128 MB block.
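The split-into-chunks behavior can be sketched with some simple arithmetic (this is a toy illustration, not Hadoop code; the 128 MB figure is the default mentioned above):

```python
# Toy sketch of how a file maps onto HDFS blocks.
BLOCK_SIZE = 128 * 1024 * 1024  # default HDFS block size, 128 MB

def blocks_for(file_size: int) -> list:
    """Return the sizes of the block-sized chunks a file is split into."""
    full, rest = divmod(file_size, BLOCK_SIZE)
    return [BLOCK_SIZE] * full + ([rest] if rest else [])

ONE_MB = 1024 * 1024
print(blocks_for(ONE_MB))                   # a 1 MB file is a single 1 MB chunk
print(len(blocks_for(300 * ONE_MB)))        # a 300 MB file spans 3 blocks
```

Note that the last chunk of a file keeps its actual size: the 300 MB file above occupies two full 128 MB blocks plus one 44 MB chunk, not three full blocks.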

Why is a block in HDFS so large?

HDFS blocks are large compared to disk blocks in order to minimize the cost of seeks. If a block is large enough, the time it takes to transfer the data from the disk can be significantly longer than the time to seek to the start of the block. Transferring a large file made of multiple blocks therefore operates at the disk transfer rate.
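This argument can be made concrete with a back-of-envelope calculation. The numbers below are illustrative assumptions (roughly 10 ms per seek and 100 MB/s sustained transfer, typical ballpark figures for a spinning disk), not values from the source:

```python
# Back-of-envelope sketch of the seek-cost argument.
SEEK_TIME_S = 0.010      # assumed time to position the disk head
TRANSFER_RATE = 100e6    # assumed sustained transfer rate, bytes/second

def seek_overhead(block_size: int) -> float:
    """Fraction of a block read spent seeking rather than transferring."""
    transfer_time = block_size / TRANSFER_RATE
    return SEEK_TIME_S / (SEEK_TIME_S + transfer_time)

for mb in (1, 128):
    print(f"{mb:>4} MB block: {seek_overhead(mb * 1024 * 1024):.1%} seek overhead")
```

With these figures, reading a 1 MB block wastes nearly half the time seeking, while a 128 MB block keeps seek overhead below 1%, so the read runs at close to the raw transfer rate.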

The larger the better?

No, this shouldn't be taken too far. Map tasks in MapReduce normally operate on one block at a time, so if the blocks are so large that a file yields too few tasks, your jobs will run slower than they could.
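The trade-off can be sketched numerically: the number of map tasks is roughly the number of blocks, so an oversized block size limits parallelism (a toy illustration, not Hadoop code):

```python
import math

# Toy sketch: map tasks ~= number of blocks, so oversized blocks
# mean fewer tasks than the cluster could run in parallel.
def map_tasks(file_size: int, block_size: int) -> int:
    return math.ceil(file_size / block_size)

ONE_GB = 1024 ** 3
print(map_tasks(ONE_GB, 128 * 1024 ** 2))  # 8 tasks with 128 MB blocks
print(map_tasks(ONE_GB, ONE_GB))           # only 1 task if a block is 1 GB
```

With a 1 GB file and 1 GB blocks, a cluster with many free nodes would still run a single map task, leaving the rest idle.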


