June 26, 2022

Introduction to YARN in Hadoop

To address the scalability issues in MapReduce1, a new cluster management system known as YARN (Yet Another Resource Negotiator) was designed. YARN was introduced in Hadoop 2.x and is also known as MapReduce2. This post gives an introduction to YARN in Hadoop and covers the YARN architecture and application execution flow.

Problems in MapReduce1

In MapReduce1, the JobTracker handled both job scheduling and task tracking, that is, monitoring the progress of running tasks and rerunning failed ones. This overdependence on the JobTracker caused scalability issues in very large clusters.

Apache YARN

In YARN the functionality of resource management and job scheduling/monitoring is split between two separate daemons.

There is a ResourceManager to manage the resources across the cluster and there is a per-application ApplicationMaster to manage the application.

Though YARN is also known as MapReduce2, it is designed to be more generic. In YARN the per-application ApplicationMaster is a framework-specific library, so any distributed computing framework built on YARN can be executed as a YARN application. A single Hadoop cluster can therefore run MapReduce, Spark, Storm, Tez and many more such distributed frameworks, all at the same time.

Architecture of YARN in Hadoop

In YARN there are two long-running daemons, the ResourceManager and the NodeManager, that form the data-computation framework.

Then there is a per-application ApplicationMaster that is application specific.

ResourceManager in YARN- The ResourceManager is the master daemon; it arbitrates resources among all the applications in the system. It has information about the nodes and resources in the cluster and is the authority that decides how and when to provide resources to an application.

The ResourceManager has two main components: the Scheduler and the ApplicationsManager.

  • Scheduler- The Scheduler is responsible for allocating resources to the various running applications. It does not monitor or track the status of the applications.
  • ApplicationsManager- The ApplicationsManager is responsible for accepting job submissions, negotiating the first container for executing the application-specific ApplicationMaster, and providing the service for restarting the ApplicationMaster container on failure.
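
To make the Scheduler's narrow role concrete, here is a minimal sketch in plain Java. It is a toy model, not Hadoop's scheduler API: the class ToyScheduler, its methods and the first-fit placement rule are all assumptions made for illustration, and the real YARN schedulers are far more elaborate. The point it tries to show is that the Scheduler only hands out capacity and keeps no record of what the application does with it.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Toy model only: hypothetical names, not part of Hadoop. */
class ToyScheduler {
    // Free memory (in MB) per node, kept in registration order.
    private final Map<String, Integer> freeMemoryMb = new LinkedHashMap<>();

    void registerNode(String node, int memoryMb) {
        freeMemoryMb.put(node, memoryMb);
    }

    /**
     * First-fit allocation (an assumption of this sketch): returns the first
     * node with enough free memory, or empty if no node can host the container.
     * Nothing about the application's progress is tracked here.
     */
    Optional<String> allocate(int requestedMb) {
        for (Map.Entry<String, Integer> entry : freeMemoryMb.entrySet()) {
            if (entry.getValue() >= requestedMb) {
                entry.setValue(entry.getValue() - requestedMb);
                return Optional.of(entry.getKey());
            }
        }
        return Optional.empty();
    }

    int freeOn(String node) {
        return freeMemoryMb.getOrDefault(node, 0);
    }

    public static void main(String[] args) {
        ToyScheduler scheduler = new ToyScheduler();
        scheduler.registerNode("node1", 2048);
        scheduler.registerNode("node2", 4096);
        System.out.println(scheduler.allocate(1024)); // fits on node1
        System.out.println(scheduler.allocate(2048)); // node1 is too full, so node2
        System.out.println(scheduler.allocate(4096)); // no node has room
    }
}
```

Tracking what happens inside the granted containers is left to the ApplicationMaster, which is why the sketch has no notion of task status.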

NodeManager in YARN- The NodeManager daemon runs on each node in the cluster. It is responsible for containers, monitoring their resource usage (CPU, memory, disk, network) and reporting the same to the ResourceManager.
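
The reporting role of the NodeManager can be sketched the same way. The following plain Java class is a hypothetical toy, not the real NodeManager: it tracks only memory, and the names ToyNodeManager and heartbeat, as well as the report format, are invented for this example.

```java
import java.util.Map;
import java.util.TreeMap;

/** Toy model only: hypothetical names, not part of Hadoop. */
class ToyNodeManager {
    private final String nodeId;
    private final int capacityMb;
    // Memory (in MB) used by each running container, sorted by container id.
    private final Map<String, Integer> containerMemoryMb = new TreeMap<>();

    ToyNodeManager(String nodeId, int capacityMb) {
        this.nodeId = nodeId;
        this.capacityMb = capacityMb;
    }

    void launchContainer(String containerId, int memoryMb) {
        if (usedMb() + memoryMb > capacityMb) {
            throw new IllegalStateException("not enough memory on " + nodeId);
        }
        containerMemoryMb.put(containerId, memoryMb);
    }

    void completeContainer(String containerId) {
        containerMemoryMb.remove(containerId);
    }

    int usedMb() {
        int total = 0;
        for (int mb : containerMemoryMb.values()) {
            total += mb;
        }
        return total;
    }

    /** Builds the usage report this toy node would send to the ResourceManager. */
    String heartbeat() {
        return nodeId + " containers=" + containerMemoryMb.keySet()
                + " usedMb=" + usedMb() + " freeMb=" + (capacityMb - usedMb());
    }

    public static void main(String[] args) {
        ToyNodeManager nodeManager = new ToyNodeManager("node1", 2048);
        nodeManager.launchContainer("c1", 1024);
        nodeManager.launchContainer("c2", 512);
        System.out.println(nodeManager.heartbeat());
        nodeManager.completeContainer("c1");
        System.out.println(nodeManager.heartbeat());
    }
}
```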

ApplicationMaster in YARN- The ApplicationMaster is started per application. It has the responsibility of negotiating appropriate resource containers from the Scheduler, tracking their status and monitoring for progress.

YARN Application execution flow

When a client submits an application, it goes to the ResourceManager first. The ResourceManager maintains the list of all the applications running on the cluster and the cluster resources in use.

The ResourceManager has to decide which submitted application to run next. That decision is made by the Scheduler component of the ResourceManager.

The ApplicationsManager component of the ResourceManager negotiates the first container, in which the application-specific ApplicationMaster is executed.

For example, if the submitted application is a MapReduce application, an MRAppMaster is started in a container.

Based on the further requirements of the application, the ApplicationMaster itself negotiates more resource containers from the Scheduler.

Once the ResourceManager grants a container to the ApplicationMaster for running a task, the ApplicationMaster communicates with the NodeManager running on the node where the container is allocated, to launch the container and manage its resources.

The NodeManager is responsible for launching and managing containers on a node. Containers execute tasks as specified by the ApplicationMaster.
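
The sequence of steps above can be traced with a small simulation. This is a sketch in plain Java under stated assumptions, not code that talks to a real cluster: the class ToyYarnFlow, the container naming, the choice of the first node for the ApplicationMaster and the round-robin placement of task containers are all made up for illustration. It only records who talks to whom, in order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model only: hypothetical names, not part of Hadoop. */
class ToyYarnFlow {

    /** Simulates one application submission and returns the ordered event log. */
    static List<String> run(String appName, String appMaster, int taskCount, List<String> nodes) {
        List<String> log = new ArrayList<>();
        int nextContainer = 1;

        // Step 1: the client submits the application to the ResourceManager.
        log.add("client -> ResourceManager: submit " + appName);

        // Step 2: the ApplicationsManager negotiates the first container,
        // which hosts the application-specific ApplicationMaster.
        // (Using the first node is an assumption of this sketch.)
        String amContainer = "container_" + nextContainer++;
        log.add("ApplicationsManager: " + amContainer + " on " + nodes.get(0) + " for " + appMaster);

        // Steps 3 to 5: the ApplicationMaster negotiates further containers from
        // the Scheduler and asks the NodeManager on the chosen node to launch each.
        for (int task = 0; task < taskCount; task++) {
            String node = nodes.get(task % nodes.size()); // round-robin, an assumption
            String container = "container_" + nextContainer++;
            log.add(appMaster + " -> Scheduler: request container");
            log.add("Scheduler -> " + appMaster + ": granted " + container + " on " + node);
            log.add(appMaster + " -> NodeManager@" + node + ": launch " + container);
        }
        return log;
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("node1", "node2");
        for (String line : run("wordcount", "MRAppMaster", 2, nodes)) {
            System.out.println(line);
        }
    }
}
```

Calling run a second time with a different ApplicationMaster name gives a separate log for that application, which mirrors the idea that each application gets its own ApplicationMaster.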

The following image shows the flow for two applications submitted by users: one is a MapReduce application and the other is a Spark application.

Two ApplicationMasters are started, one for the MapReduce application and one for the Spark application.

[Image: YARN in Hadoop]

That's all for the topic Introduction to YARN in Hadoop. If something is missing or you have something to share about the topic please write a comment.
