There are lots of configuration files in Hadoop, where are they and what are these Hadoop configuration files?
Note different version of Hadoop, the configuration location and names are different, in this article, I'm using Hadoop 2.7.3
In Hadoop 2.7.3, the configuration files are located under $HADOOP_HOME/etc/hadoop
These files are used to configure a hadoop cluster.
All Hadoop services and clients use this file to locate the NameNode, so this file must be copied to each node that is either running a Hadoop service or is a client node. The Secondary NameNode uses this file to determine the location for storing fsimage and edits log namefs.checkpoint.dir/name locally, and the location of the NameNode namefs.namedefault.name/name. Use the core-site.xml file to isolate communication issues with the NameNode host machine
HDFS services use this file, which contains a number of important properties. These include:
HTTP addresses for the two services
Replication for DataNodes namedfs.replication/name>
DataNode block storage location namedfs.data.dir/name
NameNode metadata storage namedfs.name.dir/name
Use the hdfs-site.xml file to isolate NameNode start-up issues. Typically, NameNode start-up issues are caused when NameNode fails to load the fsimage and edits log to merge. Ensure that the values for the location properties in hdfs-site.xml are valid locations.
Use the log4j.properties file to modify the log purging intervals of the HDFS logs. This file defines logging for all the Hadoop services. It includes, information related to appenders used for logging and layout. For more details, see the log4j documentation.
hadoop-policy.xml configuration file is used to define the access control lists for various Hadoop services. Service Level Authorization is performed much before to other access control checks such as file-permission checks, access control on job queues etc
The fundamental idea of YARN is to split up the functionalities of resource management and job scheduling/monitoring into separate daemons. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is either a single job or a DAG of jobs.
The ResourceManager and the NodeManager form the data-computation framework. The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. The NodeManager is the per-machine framework agent who is responsible for containers, monitoring their resource usage (cpu, memory, disk, network) and reporting the same to the ResourceManager/Scheduler.
Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them and re-executes the failed tasks.
HttpFS is a server that provides a REST HTTP gateway supporting all HDFS File System operations (read and write). And it is inteoperable with the webhdfs REST HTTP API.
Hadoop KMS is a cryptographic key management server based on Hadoop’s KeyProvider API.
It provides a client and a server components which communicate over HTTP using a REST API.
The client is a KeyProvider implementation interacts with the KMS using the KMS HTTP REST API.
KMS and its client have built-in security and they support HTTP SPNEGO Kerberos authentication and HTTPS secure transport.
KMS is a Java web-application and it runs using a pre-configured Tomcat bundled with the Hadoop distribution.
The CapacityScheduler is designed to run Hadoop applications as a shared, multi-tenant cluster in an operator-friendly manner while maximizing the throughput and the utilization of the cluster.
The CapacityScheduler supports the following features:
- Hierarchical Queues - Hierarchy of queues is supported to ensure resources are shared among the sub-queues of an organization before other queues are allowed to use free resources, there-by providing more control and predictability.
- Capacity Guarantees - Queues are allocated a fraction of the capacity of the grid in the sense that a certain capacity of resources will be at their disposal. All applications submitted to a queue will have access to the capacity allocated to the queue. Adminstrators can configure soft limits and optional hard limits on the capacity allocated to each queue.
- Security - Each queue has strict ACLs which controls which users can submit applications to individual queues. Also, there are safe-guards to ensure that users cannot view and/or modify applications from other users. Also, per-queue and system administrator roles are supported.
- Elasticity - Free resources can be allocated to any queue beyond it's capacity. When there is demand for these resources from queues running below capacity at a future point in time, as tasks scheduled on these resources complete, they will be assigned to applications on queues running below the capacity (pre-emption is not supported). This ensures that resources are available in a predictable and elastic manner to queues, thus preventing artifical silos of resources in the cluster which helps utilization.
- Multi-tenancy - Comprehensive set of limits are provided to prevent a single application, user and queue from monopolizing resources of the queue or the cluster as a whole to ensure that the cluster isn't overwhelmed.
- Runtime Configuration - The queue definitions and properties such as capacity, ACLs can be changed, at runtime, by administrators in a secure manner to minimize disruption to users. Also, a console is provided for users and administrators to view current allocation of resources to various queues in the system. Administrators can add additional queues at runtime, but queues cannot be deleted at runtime.
- Drain applications - Administrators can stop queues at runtime to ensure that while existing applications run to completion, no new applications can be submitted. If a queue is in STOPPED state, new applications cannot be submitted to itself or any of its child queueus. Existing applications continue to completion, thus the queue can be drained gracefully. Administrators can also start the stopped queues.
- Resource-based Scheduling - Support for resource-intensive applications, where-in a application can optionally specify higher resource-requirements than the default, there-by accomodating applications with differing resource requirements. Currently, memory is the the resource requirement supported.
Following are sets of log files for each of the HDFS services. They are stored under
These files are created and written to when HDFS services are bootstrapped. Use these files to isolate launch issues with DataNode, NameNode, or Secondary NameNode services.
HDFS .out files: Log files with the .out extension are located in c:\hdp\hadoop-1.1.0-SNAPSHOT\bin. They have the following naming conventions:
Hostname is the host where the service is running. For example, on a node where the host name is host3, the file would be saved as hadoop-namenode-host3.log.
HDFS .err files:
These files contain log messages for the running Java service. If there are any errors encountered while the service is already running, the stack trace of the error is logged in the above files.
HDFS .wrapper files:
These files contain the start-up command string to start the service, and list process ID output on service start-up.
HDFS <.date> files:
Log files with the <.date> extension have the following format:
When a .log file is rotated, the current date is appended to the file name; for example: hadoop-datanode1.localdomain.com.log.2016-12-08.