June 26, 2022

Capacity Scheduler in Yarn

This post talks about Capacity Scheduler in YARN which is a pluggable scheduler provided in Hadoop framework. Capacity Scheduler improves the multi tenancy of the shared cluster by allocating a certain capacity of the overall cluster to each organization sharing the cluster.

Capacity Scheduler overview

Rather than setting independent cluster for organizational needs it makes more business sense to share clusters between organizations as that is more cost-effective rather than running large Hadoop installations independently.

With a shared cluster comes the fear; will we get the required resource when we need to run a big job or some other organization will be exhausting all the resources. That’s where Capacity Scheduler in YARN helps by guaranteeing capacity to each organization.

How Capacity Scheduler in YARN works

In a CapacityScheduler each organization gets its own queue with a portion of the cluster capacity configured for their queue.

The CapacityScheduler supports hierarchical queues which means an organization can create sub-queues with in its dedicated queue. The portion of the cluster resource allocated to the queue can further be divided among the sub-queues.

There is an added benefit that an organization can exceed its queue capacity and use more cluster resources than allotted to it, only if there is excess capacity available not being used by others. This provides elasticity for the organizations in a cost-effective manner.

Security in CapacityScheduler

In a shared cluster, security becomes very important. For Each queue there is an access control list (ACL) which controls which users can submit applications to individual queues.

It is also ensured that users cannot view and/or modify applications from other users in other queues. Also, per-queue and system administrator roles are supported.

Configuration for YARN CapacityScheduler

To configure the ResourceManager to use the CapacityScheduler, set the following property in the conf/yarn-site.xml:

<property>
  <name>yarn.resourcemanager.scheduler.class</name>      
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>

Setting up queues

Properties for setting up queues are as follows. These changes are done in the configuration file etc/hadoop/capacity-scheduler.xml. Note that the CapacityScheduler has a predefined queue called root. All queues in the system are children of the root queue.

For setting up further queues– yarn.scheduler.capacity.root.queues

You need to provides a list of comma-separated child queues.

To set up sub-queues– yarn.scheduler.capacity.<queue-path>.queues

For configuring queue capacity- yarn.scheduler.capacity.<queue-path>.capacity

Queue capacity in percentage (%). The sum of capacities for all queues, at each level, must be equal to 100.

Maximum queue capacity- yarn.scheduler.capacity.<queue-path>.maximum-capacity

Maximum queue capacity in percentage. This limits the elasticity for applications in the queue. Defaults to -1 which disables it.

As example– If there are two top level child queues sales and finance. With in sales queues there are two sub-queues apac and emea.

<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>sales, finance</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.sales.queues</name>
  <value>apac,emea</value>
</property>
If you want to give 70% of the queue capacity to sales and 30% to finance.
<property>
  <name>yarn.scheduler.capacity.root.sales.capacity</name>
  <value>70</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.finance.capacity</name>
  <value>30</value>
</property>
For the two sub-queues with in sales queue if you want to allocate 65% to apac and 35% to emea.
<property>
  <name>yarn.scheduler.capacity.root.sales.apac.capacity</name>
  <value>65</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.sales.emea.capacity</name>
  <value>35</value>
</property>

If you want to limit the elasticity for sales and want to ensure that sales queue doesn’t use more than 80% of the cluster resources, even if resources are available.

<property>
  <name>yarn.scheduler.capacity.root.sales.maximum-capacity</name>
  <value>80</value>
</property>

Reference : https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html

That's all for the topic Capacity Scheduler in Yarn. If something is missing or you have something to share about the topic please write a comment.


You may also like

No comments:

Post a Comment