Hadoop's HDFS block allocation strategy tries to spread new blocks evenly among all the datanodes. In a large-scale cluster, nodes have different capacities, and you often need to decommission old nodes or add new nodes for more capacity. How does Hadoop balance the space usage across all datanodes?

And how can a newly added node be protected from becoming overloaded and turning into a bottleneck, given that all the new blocks would otherwise be allocated to, and read from, that datanode?

The answer is twofold: Hadoop has a built-in balancing policy to keep block placement even across nodes, and it also provides the HDFS balancer tool to rebalance space usage among the cluster's datanodes when the cluster becomes unbalanced, for example after adding new nodes or after large deletions.

Hadoop space balancing policy

Hadoop has three space-balancing parameters to play with:

Balanced space preference fraction

The configuration property is

dfs.datanode.available-space-volume-choosing-policy.balanced-space-preference-fraction

Its default value is 0.75 (75%).

This setting controls what percentage of new block allocations will be sent to volumes with more available disk space than others. This setting should be in the range 0.0 - 1.0, though in practice 0.5 - 1.0, since there should be no reason to prefer that volumes with less available disk space receive more block allocations. 
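
Note that this setting, together with the balanced-space threshold below, only takes effect when the datanode is configured to use the available-space volume choosing policy instead of the default round-robin one. A minimal hdfs-site.xml sketch (the 0.85 here is just an illustrative value, not a recommendation):

<property>
  <name>dfs.datanode.fsdataset.volume.choosing.policy</name>
  <value>org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy</value>
</property>
<property>
  <name>dfs.datanode.available-space-volume-choosing-policy.balanced-space-preference-fraction</name>
  <!-- 0.85 is an illustrative value; the default is 0.75 -->
  <value>0.85</value>
</property>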

Balanced space threshold

The configuration property is

dfs.datanode.available-space-volume-choosing-policy.balanced-space-threshold

Its default value is 10737418240 (10 GB).

This setting controls how much the DataNode volumes are allowed to differ, in bytes of free disk space, before they are considered imbalanced. If the free space of all the volumes is within this range of each other, the volumes will be considered balanced and block assignments will be done on a purely round-robin basis.
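
For example, to treat a datanode's volumes as balanced only when their free space is within 20 GB of each other (an arbitrary illustrative value), you could set in hdfs-site.xml:

<property>
  <name>dfs.datanode.available-space-volume-choosing-policy.balanced-space-threshold</name>
  <!-- 21474836480 bytes = 20 GB; illustrative value only -->
  <value>21474836480</value>
</property>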

Balance bandwidth control

The configuration property is

dfs.datanode.balance.bandwidthPerSec

This specifies the maximum amount of bandwidth each datanode can use for balancing, in bytes per second.
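
The cap can be set persistently in hdfs-site.xml, or adjusted at runtime with hdfs dfsadmin (the runtime value is not persisted across datanode restarts). The 100 MB/s figure below is only an example:

<property>
  <name>dfs.datanode.balance.bandwidthPerSec</name>
  <!-- 104857600 bytes/s = 100 MB/s; example value only -->
  <value>104857600</value>
</property>

$ hdfs dfsadmin -setBalancerBandwidth 104857600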

 

Considerations

HDFS data might not always be placed uniformly across the DataNodes. One common reason is the addition of new DataNodes to an existing cluster. While placing new blocks (data for a file is stored as a series of blocks), the NameNode considers various parameters before choosing the DataNodes to receive those blocks. Some of the considerations are:

  • Policy to keep one of the replicas of a block on the same node as the node that is writing the block.

  • Need to spread different replicas of a block across the racks so that the cluster can survive the loss of a whole rack.

  • One of the replicas is usually placed on the same rack as the node writing to the file so that cross-rack network I/O is reduced.

  • Spread HDFS data uniformly across the DataNodes in the cluster.
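
Note that the rack-related considerations above only apply if HDFS actually knows the cluster topology; without a topology script every node ends up in /default-rack (as in the sample balancer output later in this post). Rack awareness is typically enabled in core-site.xml along these lines (the script path below is hypothetical):

<property>
  <name>net.topology.script.file.name</name>
  <!-- hypothetical path to a script mapping hosts/IPs to rack IDs -->
  <value>/etc/hadoop/conf/topology.sh</value>
</property>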

HDFS Balancer

Due to these multiple competing considerations, data might not be placed uniformly across the DataNodes. HDFS provides a tool for administrators that analyzes block placement and rebalances data across the DataNodes. A brief administrator's guide for the balancer is available at HADOOP-1652.

The HDFS balancer does not run in the background; it has to be run manually.

Help entry from the command line:

Usage: hdfs balancer
    [-policy <policy>]    the balancing policy: datanode or blockpool
    [-threshold <threshold>]    Percentage of disk capacity
    [-exclude [-f <hosts-file> | <comma-separated list of hosts>]]    Excludes the specified datanodes.
    [-include [-f <hosts-file> | <comma-separated list of hosts>]]    Includes only the specified datanodes.
    [-idleiterations <idleiterations>]    Number of consecutive idle iterations (-1 for Infinite) before exit.
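
For example, to restrict a run to specific datanodes, or to keep certain nodes out of a run (the host names and file path below are made up for illustration):

$ hdfs balancer -threshold 10 -include dn1.example.com,dn2.example.com
$ hdfs balancer -threshold 10 -exclude -f /tmp/excluded-hosts.txt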

The threshold parameter is a floating-point number between 0 and 100 (12.5, for instance). Starting from the average cluster utilization, the balancer process will try to converge all datanodes' usage into the range [average - threshold, average + threshold].

The default threshold is 10%.

For example, if the cluster's current utilization is 50% and the threshold is 10, datanodes above the upper bound will start moving data to datanodes below the lower bound:

- Upper bound (average + threshold): 60%
- Lower bound (average - threshold): 40%

Here is an example output:

$ hdfs balancer -threshold 10
16/12/10 22:43:45 INFO balancer.Balancer: Using a threshold of 10.0
16/12/10 22:43:45 INFO balancer.Balancer: namenodes  = [hdfs://<namenode>:9000]
16/12/10 22:43:45 INFO balancer.Balancer: parameters = Balancer.Parameters[BalancingPolicy.Node, threshold=10.0, max idle iteration = 5, number of nodes to be excluded = 0, number of nodes to be included = 0]
Time Stamp               Iteration#  Bytes Already Moved  Bytes Left To Move  Bytes Being Moved
16/12/10 22:43:46 INFO net.NetworkTopology: Adding a new node: /default-rack/192.168.0.1:50010
16/12/10 22:43:46 INFO net.NetworkTopology: Adding a new node: /default-rack/192.168.0.2:50010
16/12/10 22:43:46 INFO net.NetworkTopology: Adding a new node: /default-rack/192.168.0.3:50010
16/12/10 22:43:46 INFO balancer.Balancer: 1 over-utilized: [192.168.0.1:50010:DISK]
16/12/10 22:43:46 INFO balancer.Balancer: 0 underutilized: []
16/12/10 22:43:46 INFO balancer.Balancer: Need to move 51.13 GB to make the cluster balanced.
16/12/10 22:43:46 INFO balancer.Balancer: Decided to move 10 GB bytes from 192.168.0.1:50010:DISK to 192.168.0.2:50010:DISK
16/12/10 22:43:46 INFO balancer.Balancer: Will move 10 GB in this iteration

You can easily see that the smaller your threshold, the more balanced your datanodes will be. With a very small threshold, however, the cluster may never reach the balanced state if other clients concurrently write and delete data in the cluster.

Here are the detailed steps of the balancer's selection phases:

First, the balancer will pick datanodes with current usage above the upper threshold and try to find blocks from these datanodes that could be copied to nodes with current usage below the lower threshold.
Second, the balancer will select over-utilized nodes to move blocks to nodes with utilization below average.
Third, the balancer will pick nodes with utilization above average to move data to under-utilized nodes.

In addition to this selection process, the balancer can also pick a proxy node if the source and the destination are not located in the same rack (i.e. a datanode storing a replica of the block and located in the same rack as the destination). Yes, the balancer is rack-aware and will generate very little rack-to-rack noise.

Iterative steps in the balancer

The balancer process is iterative. Since HDFS has lots of moving state, at each iteration the balancer tries to move at most 10 GB * (number of selected sources), and each iteration lasts no more than 20 minutes.
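
Because the balancer has to be launched manually, one common approach is to schedule it during off-peak hours, for instance from cron (the schedule, threshold, and log path below are illustrative, not prescriptive):

# Hypothetical crontab entry: run the balancer nightly at 01:00
# with a 5% threshold, appending output to a log file.
0 1 * * * /usr/bin/hdfs balancer -threshold 5 >> /var/log/hadoop/balancer.log 2>&1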

Tips for administrators

If you haven't balanced your cluster for a long time, you should start by balancing with a higher threshold (like 25), then converge toward a smaller target threshold (like 10), as in the sketch below.
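
A minimal shell sketch of that converging strategy (the threshold sequence is just an example):

# Run successive balancer passes with progressively tighter thresholds.
for t in 25 20 15 10; do
  hdfs balancer -threshold $t
done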