Understanding YARN (Yet Another Resource Negotiator)
To understand YARN, we first need to understand the Hadoop 1 / MR1 architecture:
From the storage perspective – HDFS
- Name Node – Master Node (holds the metadata in the form of tables)
- Data Node – Slave Node (holds the actual data as blocks); block size = 128 MB by default
From the processing perspective –
In Hadoop 1, job execution was controlled by two processes:
- Master – Job Tracker
- Slave – Task Tracker
Ex: If we have a 400-node cluster, then we have 400 Task Trackers running and 1 Job Tracker running.
Role of Job Tracker: It did a lot of work in Hadoop 1 (Scheduling + Monitoring).
- Scheduling – deciding which job to execute first based on the scheduling algorithm and job priority, finding out the available resources, and providing those resources to jobs.
- Monitoring – tracking the progress of the job; if a task fails, rerun the task; if a task is slow, start it on another machine based on speculative execution. This used to be a very hectic responsibility.
With so many scheduling and monitoring activities, a single Job Tracker had a lot of work to do in Hadoop 1. This was the biggest pain point.
Role of Task Tracker: The Task Tracker tracks the tasks on its data node and informs the Job Tracker about each task.
Summary:
From this, we conclude that the cluster has one master node and many data nodes. The master node runs only one Job Tracker, and each data node runs one Task Tracker.
However, the Job Tracker has to do most of the work in terms of scheduling and monitoring. The Task Tracker just watches the mappers and reducers executing locally and reports back to the Job Tracker.
In other words, the Job Tracker does far more work than a Task Tracker.
Limitations of MR1:
- Scalability issues with large clusters. [It was observed that when the cluster size grows beyond ~4,000 data nodes (at Yahoo and Facebook), the Job Tracker becomes a bottleneck.]
- Poor resource utilization – the cluster resources were underutilized. [In MR1, there was a fixed number of map and reduce slots in a cluster, so an idle map slot could not be used for a reduce task, and vice versa.]
- Restricted to MR jobs only. [Only MapReduce jobs were supported; it was not generic.]
To solve the above problems, YARN came into the picture.
YARN:
YARN has three major components:
1. Resource Manager (master)
2. Node Manager (slave)
3. Application Master
In Hadoop V1, a major bottleneck was that the Job Tracker was doing a lot of work (Scheduling + Monitoring).
In Hadoop V2, the monitoring aspect was taken away from the Job Tracker. Now it does nothing other than scheduling, and it was renamed the Resource Manager.
Just like Task Trackers in Hadoop V1, we have Node Managers in Hadoop V2. A Task Tracker used to monitor the local map and reduce tasks; similarly, a Node Manager manages the local map and reduce tasks (now running inside containers).
[So the Resource Manager replaces the Job Tracker, and the Node Manager replaces the Task Tracker. But then who is doing the monitoring?]
YARN Execution Flow:
1. The client submits a request, which goes to the Resource Manager (the master), holding only the scheduling part. (A minimal client-side sketch of this step follows the list.)
2. The Resource Manager creates a container on one of the Node Managers (slaves). Once it creates the container, it launches the Application Master in that container, for this job only. [The Application Master will take care of end-to-end monitoring for this job (application).]
3. This Application Master now takes care of the entire life cycle of the job.
4. The Application Master negotiates with the Resource Manager for the resources required to run the job, in the form of containers. [If the Application Master doesn't mention any requirements, default resources are provided.]
5. Once the Resource Manager allocates the containers, it returns each container's ID and host name (the Node Manager on which the container was granted). The Application Master's role is then to go to those Node Managers and use the containers to run the tasks. It also has to manage failures: if any task fails, the Application Master re-executes it.
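To make step 1 concrete, here is a minimal sketch of a client submitting an application through Hadoop's YarnClient API. The application name, AM launch command, and container size are hypothetical placeholders; a real job would also ship jars, local resources, and environment settings.

```java
import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SubmitApp {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration()); // reads yarn-site.xml from the classpath
        yarnClient.start();

        // Ask the Resource Manager for a new application id.
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
        ctx.setApplicationName("demo-app"); // placeholder name

        // The command that starts our Application Master inside its container.
        // com.example.MyAppMaster is a hypothetical AM class.
        ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(
                null, null,
                Collections.singletonList("java com.example.MyAppMaster"),
                null, null, null);
        ctx.setAMContainerSpec(amContainer);

        // Resources for the AM's own container (1 GB, 1 vcore).
        ctx.setResource(Resource.newInstance(1024, 1));

        // Submit; the RM will pick a Node Manager and launch the AM there (step 2).
        ApplicationId appId = yarnClient.submitApplication(ctx);
        System.out.println("Submitted " + appId);
    }
}
```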
Understanding Each Component of YARN:
Resource Manager:
- As part of scheduling, it has to know which resources are available in the cluster, which Node Managers are alive, and so on. [It keeps track of live Node Managers and available resources.]
- It allocates the available resources to the appropriate applications and tasks.
- It also manages the Application Masters.
Ex: Let's say there are 100 different applications/Application Masters. The Resource Manager just checks whether each Application Master is running. If one stops due to some reason or failure, the Resource Manager makes sure that another Application Master is started for that job.
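The first responsibility is visible directly through the client API: the Resource Manager is the one component that can report on every Node Manager. A small sketch, assuming a reachable cluster configured via yarn-site.xml:

```java
import java.util.List;

import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClusterState {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // The RM tracks every Node Manager's heartbeat and capacity.
        List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId()
                    + " capability=" + node.getCapability()  // total memory/vcores on the node
                    + " used=" + node.getUsed());            // currently allocated resources
        }
        yarnClient.stop();
    }
}
```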
Node Manager:
- A Node Manager runs on every data node.
- It provides computational resources in the form of containers.
- Whatever runs inside a container, the Node Manager has to manage that container.
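For illustration, here is a sketch of the NMClient call an Application Master uses to hand a task to a Node Manager. The `container` is assumed to come from a Resource Manager allocation (see the AMRMClient sketch further below), and the shell command is a placeholder for a real task:

```java
import java.util.Collections;

import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.client.api.NMClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class LaunchTask {
    static void launch(Container container) throws Exception {
        NMClient nmClient = NMClient.createNMClient();
        nmClient.init(new YarnConfiguration());
        nmClient.start();

        // Describe what should run inside the granted container.
        ContainerLaunchContext taskCtx = ContainerLaunchContext.newInstance(
                null, null,
                Collections.singletonList("echo running-one-task"), // placeholder command
                null, null, null);

        // The Node Manager hosting the container starts and supervises it.
        nmClient.startContainer(container, taskCtx);
    }
}
```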
Application Master:
- Coordinates the execution of all tasks within its application.
Ex: Let's say there is an application that contains 100 tasks. It is the role of the Application Master to handle all those tasks and coordinate among them.
The Application Master also asks the Resource Manager for appropriate resource containers to run the tasks.
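The negotiation from step 4 can be sketched with the AMRMClient API. The container count, sizes, and heartbeat interval below are illustrative assumptions, not defaults:

```java
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AppMasterSkeleton {
    public static void main(String[] args) throws Exception {
        AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
        rmClient.init(new YarnConfiguration());
        rmClient.start();

        // Register with the RM; the RM only monitors the AM itself, not its tasks.
        rmClient.registerApplicationMaster("", 0, "");

        // Ask for 10 containers of 2 GB / 1 vcore each, anywhere in the cluster.
        Resource capability = Resource.newInstance(2048, 1);
        Priority priority = Priority.newInstance(0);
        for (int i = 0; i < 10; i++) {
            rmClient.addContainerRequest(
                    new ContainerRequest(capability, null, null, priority));
        }

        // Heartbeat until the RM has granted everything we asked for.
        int granted = 0;
        while (granted < 10) {
            AllocateResponse response = rmClient.allocate(0.1f);
            for (Container c : response.getAllocatedContainers()) {
                granted++;
                // Hand each container to its Node Manager to launch a task
                // (see the NMClient sketch above).
                System.out.println("Got " + c.getId() + " on " + c.getNodeId());
            }
            Thread.sleep(1000);
        }

        rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
        rmClient.stop();
    }
}
```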
Q). How does YARN overcome the limitations of MR1?
- Scalability: YARN removes the scalability problem because part of the work is delegated to the Application Masters, each of which manages the end-to-end life cycle of one application. Scheduling is done by the Resource Manager, and monitoring is done by the Application Master.
- Resource utilization: With the concept of logical containers, resource allocation is much more dynamic, and we can request any amount of CPU and memory. Cluster utilization improves because resources are no longer wasted on idle fixed slots. (A small sketch of differently sized requests follows below.)
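As a sketch of this flexibility: unlike MR1's fixed map/reduce slots, a single Application Master can size each container request differently. This assumes the `rmClient` set up in the AppMasterSkeleton sketch above; distinct priorities are used for the two container shapes, following the convention of keeping one capability per priority (MapReduce itself uses different priorities for map and reduce containers).

```java
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class MixedSizes {
    static void request(AMRMClient<ContainerRequest> rmClient) {
        // A light task: 512 MB, 1 vcore.
        rmClient.addContainerRequest(new ContainerRequest(
                Resource.newInstance(512, 1), null, null, Priority.newInstance(0)));
        // A heavy task: 8 GB, 4 vcores; no fixed slot shape constrains it.
        rmClient.addContainerRequest(new ContainerRequest(
                Resource.newInstance(8192, 4), null, null, Priority.newInstance(1)));
    }
}
```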
- Generic: No longer restricted to MapReduce jobs; we can run other kinds of jobs too. Ex: Spark, Tez, Giraph, etc.