Checkpointing is an essential part of maintaining and persisting filesystem metadata in HDFS. It’s crucial for efficient NameNode recovery and restart, and is an important indicator of overall cluster health.

This post explains the purpose of checkpointing in HDFS, the technical details of how checkpointing works in an HDFS cluster, and some operational concerns.

Filesystem Metadata in HDFS

In HDFS, the NameNode’s primary responsibility is storing the HDFS namespace: the directory tree, file permissions, and the mapping of files to block IDs. It’s important that this metadata (and all changes to it) be safely persisted to stable storage for fault tolerance.

This filesystem metadata is stored in two different constructs: the fsimage and the edit log.

The fsimage is a file that represents a point-in-time snapshot of the filesystem’s metadata. However, while the fsimage file format is very efficient to read, it’s unsuitable for making small incremental updates like renaming a single file. Thus, rather than writing a new fsimage every time the namespace is modified, the NameNode instead records the modifying operation in the edit log for durability. This way, if the NameNode crashes, it can restore its state by first loading the fsimage then replaying all the operations (also called edits or transactions) in the edit log to catch up to the most recent state of the namesystem.

The edit log comprises a series of files, called edit log segments, that together represent all the namesystem modifications made since the creation of the fsimage.
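To make the snapshot-plus-replay idea concrete, here’s a toy sketch in Python. This is not the NameNode’s actual implementation (that lives in the NameNode’s Java internals), and the namespace model, operation names, and block IDs below are purely illustrative; it only shows how a point-in-time snapshot plus an ordered list of edits reproduces the latest state.

# Toy illustration of "fsimage + edit log replay" -- not real NameNode code.
# The namespace is modeled as a dict of path -> block IDs, the fsimage as a
# point-in-time copy of that dict, and the edit log as an ordered list of
# hypothetical transactions applied on top of the snapshot.

def replay(fsimage_snapshot, edits):
    """Rebuild the latest namespace from a snapshot plus logged edits."""
    namespace = dict(fsimage_snapshot)              # start from the fsimage
    for op, args in edits:                          # replay each edit in order
        if op == "create":
            namespace[args["path"]] = args["blocks"]
        elif op == "rename":
            namespace[args["dst"]] = namespace.pop(args["src"])
        elif op == "delete":
            namespace.pop(args["path"], None)
    return namespace

fsimage = {"/data/a.txt": ["blk_1"], "/data/b.txt": ["blk_2"]}
edits = [
    ("rename", {"src": "/data/a.txt", "dst": "/data/a.bak"}),
    ("create", {"path": "/data/c.txt", "blocks": ["blk_3"]}),
]
print(replay(fsimage, edits))
# {'/data/b.txt': ['blk_2'], '/data/a.bak': ['blk_1'], '/data/c.txt': ['blk_3']}

A checkpoint, covered below, is essentially this same replay performed ahead of time, with the result written back out as a new fsimage.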

As an aside, this pattern of using a log for incremental changes on top of another storage format is quite common for traditional filesystems. Log-structured filesystems take this to an extreme and use just a log for persisting data, but more common journaling filesystems like EXT3, EXT4, and XFS support writing changes to a journal before applying them to their final locations on disk.

Why Is Checkpointing Needed?

A typical edit ranges from tens to hundreds of bytes, but over time enough edits can accumulate to become unwieldy. A couple of problems can arise from a large edit log. In extreme cases, it can fill up all the available disk capacity on a node; more subtly, it can substantially delay NameNode startup as the NameNode reapplies every edit. This is where checkpointing comes in.

Checkpointing is a process that takes an fsimage and edit log and compacts them into a new fsimage. This way, instead of replaying a potentially unbounded edit log, the NameNode can load the final in-memory state directly from the fsimage. This is a far more efficient operation and reduces NameNode startup time.

However, creating a new fsimage is an I/O- and CPU-intensive operation, sometimes taking minutes to perform. During a checkpoint, the namesystem also needs to restrict concurrent access from other users. So, rather than pausing the active NameNode to perform a checkpoint, HDFS defers it to either the SecondaryNameNode or the Standby NameNode, depending on whether NameNode high availability is configured. The mechanics of checkpointing differ between the two setups; we’ll cover both.

How to Configure HDFS Checkpointing

Checkpointing is triggered by one of two conditions:

dfs.namenode.checkpoint.period

How much time is allowed to elapse since the last checkpoint

dfs.namenode.checkpoint.txns

How many new edit log transactions are allowed to accumulate since the last checkpoint

Every dfs.namenode.checkpoint.check.period seconds, the checkpointing node checks whether either of these conditions is met, and if so, kicks off the checkpointing process.
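These properties live in hdfs-site.xml. As a reference sketch, the snippet below shows them with their commonly documented defaults (a checkpoint every hour or every one million transactions, with the conditions polled every minute); verify the defaults against your Hadoop version before relying on them.

<!-- hdfs-site.xml: checkpoint trigger settings (values shown are the usual defaults) -->
<property>
  <name>dfs.namenode.checkpoint.period</name>
  <value>3600</value>    <!-- seconds allowed between checkpoints -->
</property>
<property>
  <name>dfs.namenode.checkpoint.txns</name>
  <value>1000000</value> <!-- or checkpoint after this many new transactions -->
</property>
<property>
  <name>dfs.namenode.checkpoint.check.period</name>
  <value>60</value>      <!-- how often, in seconds, to poll the two conditions -->
</property>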

Operational Concerns

The biggest operational concern related to checkpointing is when it fails to happen. We’ve seen scenarios where NameNodes accumulated hundreds of GBs of edit logs, and no one noticed until the disks filled up completely and crashed the NameNode. When this happens, there’s not much to do besides restart the NameNode and wait for it to replay all the edits, so monitoring checkpoint health is essential.
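One lightweight way to watch for this is to poll the NameNode’s JMX servlet, whose FSNamesystem bean includes a TransactionsSinceLastCheckpoint counter. The sketch below is a minimal example, not a production monitor; the host name, port (9870 is the Hadoop 3.x web UI default, 50070 on 2.x), and alert threshold are assumptions you’d adapt to your cluster.

# Minimal sketch: read checkpoint lag from the NameNode JMX servlet.
# Host, port, and threshold below are hypothetical placeholders.
import json
from urllib.request import urlopen

NAMENODE = "http://namenode.example.com:9870"   # hypothetical host; 50070 on Hadoop 2.x
QUERY = "/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem"
THRESHOLD = 2_000_000                            # alert threshold, in transactions

with urlopen(NAMENODE + QUERY) as resp:
    beans = json.load(resp)["beans"]

txns = beans[0].get("TransactionsSinceLastCheckpoint", 0) if beans else 0
print(f"Transactions since last checkpoint: {txns}")
if txns > THRESHOLD:
    print("WARNING: checkpoints may not be running; check the Standby/Secondary NameNode logs")

In practice you would feed this metric into whatever alerting system you already run rather than execute a script like this ad hoc.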

Checkpointing is a very I/O- and network-intensive operation and can affect client performance. This is especially true on a large cluster with millions of files and a multi-GB fsimage, since copying a new fsimage to the NameNode can eat up all available bandwidth. If needed, adjust the following properties, which control the transfer bandwidth and timeout:

dfs.image.transfer.bandwidthPerSec

dfs.image.transfer.timeout
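Both are hdfs-site.xml settings. The values in the sketch below are illustrative assumptions, not defaults: out of the box the bandwidth is typically unthrottled (0) and the timeout is specified in milliseconds, so check your Hadoop version’s defaults before tuning.

<!-- hdfs-site.xml: fsimage transfer tuning (illustrative values, not defaults) -->
<property>
  <name>dfs.image.transfer.bandwidthPerSec</name>
  <value>10485760</value> <!-- throttle image transfer to ~10 MB/s; 0 = unthrottled -->
</property>
<property>
  <name>dfs.image.transfer.timeout</name>
  <value>60000</value>    <!-- socket timeout for the image transfer, in milliseconds -->
</property>

If you do throttle the bandwidth, make sure the timeout is generous enough for a full, now slower, fsimage transfer to complete.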