Hadoop YARN Fair Scheduler advantages explained... part 1
Ram Ghadiyaram
Technologist, Thought leader, Mentor, Innovator, Speaker - Cloud | Databricks | Snow Flake | Apache/PY Spark | Big Data | Analytics | AI | ML | LLM at JPMorgan Chase & Co
What is Fair Scheduling:
Keywords: Hadoop, MapReduce, task scheduling, Yet Another Resource Negotiator (YARN), Hadoop Distributed File System (HDFS), JobTracker, TaskTracker
Fair scheduling is a method of assigning resources to applications such that all apps get, on average, an equal share of resources over time. Hadoop NextGen is capable of scheduling multiple resource types. By default, the Fair Scheduler bases scheduling fairness decisions only on memory. It can be configured to schedule with both memory and CPU, using the notion of Dominant Resource Fairness.
Problem statement: with the FIFO scheduler, jobs are queued up one behind another. Our QA team was always complaining that jobs were either not executed at all or executed very slowly, since all the YARN jobs piled up in the queue.
Configuring the Fair Scheduler for YARN jobs
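As a minimal sketch, the Fair Scheduler is switched on in yarn-site.xml; the scheduler class name is the standard one from the Hadoop documentation, while the allocation-file path below is an assumption to adjust for your cluster:

```xml
<!-- yarn-site.xml: switch the ResourceManager to the Fair Scheduler -->
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
<!-- Path to the allocation (pools) file; this location is an assumption -->
<property>
  <name>yarn.scheduler.fair.allocation.file</name>
  <value>/etc/hadoop/conf/fair-scheduler.xml</value>
</property>
```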
Advantages:
1) Meets the requirements of multi-tenant systems.
2) All jobs get an equal share of resources.
When only one job is present, it occupies the entire cluster. As other jobs arrive, each job is given an equal percentage of the cluster.
Example: each job might be given an equal number of cluster-wide YARN containers.
Each container runs one task of a job.
Divides the cluster into pools
- Typically one pool per user/module.
- Resources are divided equally among pools.
- Gives each user a fair share of the cluster.
- Within each pool, scheduling can be either fair share or FIFO (configurable).
- Some pools may have minimum shares: a minimum % of the cluster that the pool is guaranteed.
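The pool layout described above can be sketched in the allocation file (fair-scheduler.xml); the queue names, resource amounts, and policies here are hypothetical examples, not a definitive setup:

```xml
<?xml version="1.0"?>
<allocations>
  <!-- One pool (queue) per module; "etl" and "reports" are example names -->
  <queue name="etl">
    <!-- Minimum share guaranteed to this pool (example amounts) -->
    <minResources>10240 mb,10 vcores</minResources>
    <!-- Within this pool: fair-share scheduling -->
    <schedulingPolicy>fair</schedulingPolicy>
  </queue>
  <queue name="reports">
    <!-- Within this pool: FIFO -->
    <schedulingPolicy>fifo</schedulingPolicy>
  </queue>
</allocations>
```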
When a pool's minimum share is not met for a while, the scheduler takes resources away from other pools:
- By preempting jobs in those other pools
- By killing the currently running tasks of those jobs; the tasks can be restarted later, since tasks are idempotent.
Note: preemption is not allowed in the Hadoop Capacity Scheduler.
To kill, the scheduler picks the most recently started tasks.
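Preemption is opt-in for the Fair Scheduler. A sketch of the relevant knobs, with illustrative timeout values:

```xml
<!-- fair-scheduler.xml fragment; preemption also requires
     yarn.scheduler.fair.preemption=true in yarn-site.xml -->
<allocations>
  <!-- Preempt when a pool sits below its minimum share for 60 s,
       or below half its fair share for 120 s (example values) -->
  <defaultMinSharePreemptionTimeout>60</defaultMinSharePreemptionTimeout>
  <defaultFairSharePreemptionTimeout>120</defaultFairSharePreemptionTimeout>
</allocations>
```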
Limits can also be set on:
- Number of concurrent jobs per user
- Number of concurrent jobs per pool
- Number of concurrent tasks per pool
This prevents the cluster from being hogged by one user/module/job.
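These limits correspond to allocation-file settings such as maxRunningApps; the queue/user names and numbers below are hypothetical:

```xml
<allocations>
  <!-- Cap concurrent running apps for a specific pool -->
  <queue name="etl">
    <maxRunningApps>5</maxRunningApps>
  </queue>
  <!-- Cap concurrent running apps for a specific user -->
  <user name="alice">
    <maxRunningApps>3</maxRunningApps>
  </user>
  <!-- Defaults for users/queues not listed explicitly -->
  <userMaxAppsDefault>2</userMaxAppsDefault>
  <queueMaxAppsDefault>10</queueMaxAppsDefault>
</allocations>
```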
Allocations with the Hadoop YARN Fair Scheduler: an example.
With YARN fair scheduling, we were able to address the above problem by creating dedicated pools for each module.
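A hedged sketch of such a per-module pool setup (the module names and placement rules are illustrative assumptions, not our production configuration):

```xml
<allocations>
  <!-- Equal-weight pools, one per module -->
  <queue name="ingestion">
    <weight>1.0</weight>
  </queue>
  <queue name="qa">
    <weight>1.0</weight>
  </queue>
  <!-- Route apps to the requested queue; otherwise to a queue named
       after the submitting user (created if needed); else to default -->
  <queuePlacementPolicy>
    <rule name="specified"/>
    <rule name="user" create="true"/>
    <rule name="default"/>
  </queuePlacementPolicy>
</allocations>
```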
I will discuss fair scheduler configuration in detail in my next post, i.e. part 2.