YARN in Hadoop

So, what is YARN in Hadoop? Apache YARN (Yet Another Resource Negotiator) is the resource management layer of Hadoop, introduced with Hadoop 2.x. It allows various data processing engines, such as those for interactive, graph, batch, and stream processing, to run on and process data stored in HDFS (Hadoop Distributed File System).

YARN was introduced so that more kinds of workloads could make use of the data held in HDFS. In addition to managing cluster resources, YARN also handles job scheduling.

With YARN, the architecture of Hadoop 2.x provides a data processing platform that is no longer limited to MapReduce. Other purpose-built data processing systems can run on it as well, i.e., other frameworks can run on the same hardware on which Hadoop is installed.

Why is YARN in Hadoop used?

Although Hadoop 1.x was capable at data processing and computation, it had shortcomings such as batch processing delays and scalability issues, because it relied on MapReduce alone for processing big datasets. With YARN, Hadoop supports a variety of processing approaches and a wider range of applications. Hadoop YARN clusters can run stream data processing and interactive querying side by side with MapReduce batch jobs. Because the YARN framework also runs non-MapReduce applications, it overcomes these shortcomings of Hadoop 1.x.

Next, let’s discuss the Hadoop YARN architecture.

Hadoop YARN Architecture

Now, we will discuss the architecture of YARN. The Apache YARN framework consists of a Resource Manager (master daemon), a Node Manager (slave daemon) on each node, and an Application Master for each application. Application tasks run inside containers. Let’s now discuss each component of Apache Hadoop YARN one by one in detail.

Resource Manager

Resource Manager is the master daemon of YARN. It arbitrates resources among all the applications in the cluster and is responsible for the global assignment of resources such as CPU and memory. It is mainly used for job scheduling. Resource Manager has two components:

  • Scheduler: The Scheduler’s task is to distribute resources to the running applications. It deals only with scheduling, so it performs no tracking and no monitoring of applications.
  • Application Manager: The Application Manager manages the applications running in the cluster. It accepts job submissions, starts the Application Master for each application, and monitors it.
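
To make the Scheduler's role more concrete, here is a minimal sketch in Python. It is a toy model, not YARN code: the `Request` class and the `first_fit_schedule` function are hypothetical names invented for this illustration, and the policy shown (grant requests in arrival order, skip those that do not fit) is an assumption rather than one of YARN's actual scheduler policies.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """A resource request from one application (hypothetical structure)."""
    app_id: str
    memory_mb: int
    vcores: int


def first_fit_schedule(requests, total_memory_mb, total_vcores):
    """Grant requests in arrival order while spare capacity remains.

    Returns (granted, pending) lists of application IDs. Like the Scheduler
    described above, this function only hands out resources; it does not
    track or monitor what applications do with them.
    """
    free_mem, free_cores = total_memory_mb, total_vcores
    granted, pending = [], []
    for req in requests:
        if req.memory_mb <= free_mem and req.vcores <= free_cores:
            free_mem -= req.memory_mb
            free_cores -= req.vcores
            granted.append(req.app_id)
        else:
            pending.append(req.app_id)  # does not fit right now
    return granted, pending


requests = [
    Request("app_1", memory_mb=4096, vcores=2),
    Request("app_2", memory_mb=8192, vcores=4),
    Request("app_3", memory_mb=2048, vcores=1),
]
granted, pending = first_fit_schedule(requests, total_memory_mb=8192, total_vcores=4)
print(granted, pending)
```

In this example, app_2 asks for more memory than is left after app_1 is served, so it stays pending while the smaller app_3 is granted.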

Node Manager

Node Manager is the slave daemon of YARN. It has the following responsibilities:

  • It monitors the resource usage of each container and reports it to the Resource Manager.
  • It tracks the health of the node on which it is running.
  • It takes care of its node in the cluster, managing the workflow and the user jobs running on that node.
  • It keeps the Resource Manager’s view of the node up to date.
  • It destroys or kills a container when the Resource Manager orders it to do so.
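
The responsibilities above can be sketched as a small Python class. This is only an illustration under stated assumptions: `NodeManagerSketch`, its method names, and the report fields are all hypothetical, and the health rule (memory in use must not exceed node memory) is a placeholder, not how YARN actually checks node health.

```python
class NodeManagerSketch:
    """Toy stand-in for a Node Manager; not the real YARN class."""

    def __init__(self, node_id, memory_mb):
        self.node_id = node_id
        self.memory_mb = memory_mb
        self.containers = {}  # container_id -> memory in use (MB)

    def launch(self, container_id, memory_mb):
        # Start tracking a container running on this node.
        self.containers[container_id] = memory_mb

    def kill(self, container_id):
        # Invoked when the Resource Manager orders a container destroyed.
        # Returns True if the container existed, False otherwise.
        return self.containers.pop(container_id, None) is not None

    def heartbeat(self):
        # Build the status report that would be sent to the Resource Manager.
        used = sum(self.containers.values())
        return {
            "node_id": self.node_id,
            "healthy": used <= self.memory_mb,  # placeholder health rule
            "used_mb": used,
            "free_mb": self.memory_mb - used,
            "containers": sorted(self.containers),
        }


nm = NodeManagerSketch("node-1", memory_mb=8192)
nm.launch("container_01", 2048)
nm.launch("container_02", 4096)
before = nm.heartbeat()
nm.kill("container_02")
after = nm.heartbeat()
print(before["used_mb"], after["used_mb"])
```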

The third component of Apache Hadoop YARN is the Application Master.

Application Master

Every job submitted to the framework is an application, and every application has a specific Application Master associated with it. Application Master performs the following tasks:

  • It coordinates the execution of the application in the cluster and handles faults.
  • It negotiates resources from the Resource Manager.
  • It works with the Node Manager to execute and monitor the application’s component tasks.
  • At regular intervals, it sends heartbeats to the Resource Manager to confirm that it is alive and to update the record of its resource demands.
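
The heartbeat mechanism can be illustrated with a small sketch. The `LivenessMonitor` class, its method names, and the 10-second expiry are assumptions made for this example; timestamps are passed in explicitly so the behaviour is easy to follow.

```python
class LivenessMonitor:
    """Tracks the last heartbeat seen from each Application Master (toy model)."""

    def __init__(self, expiry_seconds):
        self.expiry_seconds = expiry_seconds
        self.last_seen = {}  # app_id -> timestamp of last heartbeat

    def heartbeat(self, app_id, now, asks=0):
        # A heartbeat also carries the current resource demand ("asks").
        self.last_seen[app_id] = now
        return {"app_id": app_id, "containers_requested": asks}

    def expired(self, now):
        # Applications whose master has been silent longer than the expiry.
        return sorted(
            app_id
            for app_id, seen in self.last_seen.items()
            if now - seen > self.expiry_seconds
        )


monitor = LivenessMonitor(expiry_seconds=10)
monitor.heartbeat("app_1", now=0, asks=2)
monitor.heartbeat("app_2", now=0)
monitor.heartbeat("app_1", now=8)
expired_apps = monitor.expired(now=12)
print(expired_apps)
```

Here app_1 sent a second heartbeat at time 8, so at time 12 only app_2 has been silent for longer than the expiry interval.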

Container

A container is a set of physical resources (CPU cores, RAM, disks, etc.) on a single node. Its key characteristics are listed below:

  • It grants an application the right to use a specific amount of resources (memory, CPU, etc.) on a specific host.
  • YARN containers are managed through a Container Launch Context (CLC), which describes the container’s life cycle. This record contains a map of environment variables, dependencies stored in remotely accessible storage, security tokens, payload for Node Manager services, and the command necessary to create the process.
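
The fields of the launch context listed above can be pictured as a simple record. The sketch below is a Python illustration only: `LaunchContextSketch` and its field names are hypothetical and do not reproduce YARN's actual API, and the paths and values in the usage are made up.

```python
from dataclasses import dataclass, field


@dataclass
class LaunchContextSketch:
    """Illustrative record mirroring the contents of a launch context."""
    environment: dict = field(default_factory=dict)      # environment variables
    local_resources: dict = field(default_factory=dict)  # name -> remote URI
    tokens: bytes = b""                                  # security tokens
    service_data: dict = field(default_factory=dict)     # payload for Node Manager services
    commands: list = field(default_factory=list)         # commands that start the process

    def command_line(self):
        # Render the environment and commands as one shell-like string.
        env = " ".join(f"{k}={v}" for k, v in sorted(self.environment.items()))
        cmd = " && ".join(self.commands)
        return f"{env} {cmd}".strip()


clc = LaunchContextSketch(
    environment={"JAVA_HOME": "/opt/java", "APP_MODE": "batch"},  # made-up values
    local_resources={"app.jar": "hdfs:///apps/example/app.jar"},  # made-up path
    commands=["java -jar app.jar"],
)
print(clc.command_line())
```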

How does Apache Hadoop YARN work?

YARN decouples resource management from MapReduce, so MapReduce becomes just one of several processing engines that can work on data in HDFS. This makes the Hadoop environment more suitable for applications that cannot wait for batch processing jobs to finish. With this architecture, you can process data stored in a single platform using multiple engines, such as real-time streaming, interactive SQL, and batch processing, and approach analytics in a different manner. It can be considered the basis of the next generation of the Hadoop ecosystem, helping forward-thinking organizations implement a modern data architecture.

How is an application submitted in YARN?

1. Submit the job.
2. Get an application ID.
3. Retrieve the application submission context.
  • Start the container launch.
  • Launch the Application Master.
4. Allocate resources.
  • Allocate containers.
  • Launch containers.
5. Execute the application.

Workflow of an Application in YARN

  1. The client submits the application
  2. A container is allocated for starting the Application Master
  3. The Application Master registers with the Resource Manager
  4. The Application Master asks the Resource Manager for containers
  5. The Application Master notifies the Node Manager to launch containers
  6. The application code gets executed in the containers
  7. The client contacts the Resource Manager/Application Master to monitor the status of the application
  8. The Application Master unregisters from the Resource Manager
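
The eight steps can be traced with a short simulation. This is a sketch under assumptions, not YARN code: the `run_application` function is hypothetical, tasks are placed on nodes round-robin purely for illustration, and every task is assumed to succeed. It uses the name Application Master for the per-application process that registers with the Resource Manager and requests containers.

```python
def run_application(app_name, tasks, nodes):
    """Walk through the eight workflow steps and return (event log, results)."""
    log = []
    # Step 1: the client submits the application.
    log.append(f"client submits {app_name} to ResourceManager")
    # Step 2: a container is allocated to host the Application Master.
    log.append(f"ResourceManager allocates AM container on {nodes[0]}")
    # Step 3: the Application Master registers itself.
    log.append("ApplicationMaster registers with ResourceManager")
    # Step 4: the Application Master asks for one container per task.
    log.append(f"ApplicationMaster requests {len(tasks)} containers")
    placements = [(task, nodes[i % len(nodes)]) for i, task in enumerate(tasks)]
    results = {}
    for task, node in placements:
        # Step 5: the Node Manager on the chosen node launches the container.
        log.append(f"NodeManager on {node} launches container for {task}")
        # Step 6: application code runs inside the container.
        results[task] = f"{task} done on {node}"
    # Step 7: the client checks on progress.
    log.append(f"client polls status: {len(results)}/{len(tasks)} tasks finished")
    # Step 8: the Application Master signs off.
    log.append("ApplicationMaster unregisters from ResourceManager")
    return log, results


log, results = run_application(
    "wordcount", ["map-0", "map-1", "reduce-0"], ["node-1", "node-2"]
)
for line in log:
    print(line)
```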

Features of YARN

  • High-degree compatibility: Applications built with the MapReduce framework can run easily on YARN.
  • Better cluster utilization: YARN allocates cluster resources efficiently and dynamically, which leads to better utilization than in the previous version of Hadoop.
  • Utmost scalability: As the number of nodes in the Hadoop cluster increases, the YARN Resource Manager scales with it so that user requirements continue to be met.
  • Multi-tenancy: Various engines that access data on the Hadoop cluster can work together efficiently, because YARN shares the cluster’s resources among them.

YARN vs MapReduce

In Hadoop 1.x, the batch processing framework MapReduce was tightly coupled with HDFS. The addition of YARN to these two components, which gave rise to Hadoop 2.x, changed the way Hadoop works in several respects. Let’s go through these differences.

Criteria | YARN | MapReduce
Type of processing | Real-time, batch, and interactive processing with multiple engines | Silo and batch processing with a single engine
Cluster resource optimization | Excellent due to central resource management | Average due to fixed Map and Reduce slots
Suitable for | MapReduce and non-MapReduce applications | Only MapReduce applications
Managing cluster resource | Done by YARN | Done by JobTracker
Namespace | Hadoop supports multiple namespaces | Supports only one namespace, i.e., HDFS
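
A small calculation illustrates why fixed Map and Reduce slots can leave resources idle compared with central resource management. The sketch below is a simplification: both function names are hypothetical, the slot counts are made up, and every task is assumed to need exactly one unit of resources.

```python
def slot_based_running(map_tasks, reduce_tasks, map_slots, reduce_slots):
    # Fixed slots: a map task can only use a map slot,
    # and a reduce task can only use a reduce slot.
    return min(map_tasks, map_slots) + min(reduce_tasks, reduce_slots)


def pooled_running(map_tasks, reduce_tasks, total_units):
    # Central pool (simplified): any task may use any free unit.
    return min(map_tasks + reduce_tasks, total_units)


# A node with 4 map slots and 2 reduce slots, versus the same 6 units pooled.
# The workload is map-heavy: 6 map tasks and no reduce tasks are waiting.
slots = slot_based_running(map_tasks=6, reduce_tasks=0, map_slots=4, reduce_slots=2)
pooled = pooled_running(map_tasks=6, reduce_tasks=0, total_units=6)
print(slots, pooled)
```

With fixed slots, the two reduce slots sit idle and only four tasks run at once, while the pooled model runs all six.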