
Apache YARN: The Resource Manager for Hadoop Ecosystem

Lashman Bala

Data Engineer

Published: February 26, 2025

Introduction

Apache YARN (Yet Another Resource Negotiator) is the cluster resource management layer in Hadoop, introduced in Hadoop 2.0. It was designed to overcome the limitations of the original MapReduce framework by separating cluster-wide resource management from per-application job scheduling and monitoring. As a result, multiple kinds of applications (not just MapReduce) can share the same Hadoop cluster efficiently.

In this blog, we will dive into YARN's architecture, its components, how it schedules jobs, and why it is crucial for modern big data processing.

1. Why YARN?

Before YARN, Hadoop relied on MapReduce v1, which had several limitations:

Single JobTracker Bottleneck – JobTracker was responsible for both resource allocation and job scheduling, leading to performance issues.
Poor Resource Utilization – Map and Reduce slots were fixed, leading to underutilization of resources.
Not Suitable for Multiple Frameworks – Hadoop could run only MapReduce jobs, limiting flexibility.

YARN solved these issues by introducing a centralized resource manager that dynamically allocates resources and supports multiple execution engines like Spark, Tez, Flink, and others alongside MapReduce.
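To make the fixed-slot problem concrete, here is a minimal sketch in Python. It is a toy model, not Hadoop code: the function names and the numbers (8 map slots, 4 reduce slots, 12 pending map tasks) are assumptions chosen for illustration, and every task is assumed to need the same amount of resources.

```python
def busy_with_fixed_slots(map_slots, reduce_slots, map_tasks, reduce_tasks):
    """MapReduce v1 style: map tasks may only use map slots, reduce tasks only reduce slots."""
    return min(map_slots, map_tasks) + min(reduce_slots, reduce_tasks)


def busy_with_shared_pool(total_slots, map_tasks, reduce_tasks):
    """YARN style (simplified): any task may use any free capacity."""
    return min(total_slots, map_tasks + reduce_tasks)


# Hypothetical node during a map-heavy phase: 12 map tasks waiting, no reduce tasks yet.
fixed = busy_with_fixed_slots(map_slots=8, reduce_slots=4, map_tasks=12, reduce_tasks=0)
shared = busy_with_shared_pool(total_slots=12, map_tasks=12, reduce_tasks=0)

print(f"fixed slots: {fixed} of 12 busy")   # the 4 reduce slots sit idle
print(f"shared pool: {shared} of 12 busy")
```

With fixed slots, a third of the node stays idle even though map tasks are waiting. A shared pool lets the waiting work use that capacity.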

2. YARN Architecture

YARN consists of the following key components:

A. ResourceManager (RM)

Master daemon that allocates cluster resources.
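Consists of:
Scheduler – Allocates resources (CPU, memory) to applications based on configured constraints such as queue capacities. It does not monitor or restart application tasks.
ApplicationsManager – Accepts job submissions, negotiates the first container for each application's ApplicationMaster, and restarts that ApplicationMaster if it fails.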

B. NodeManager (NM)

Runs on each worker node in the cluster.
Manages containers, which are resource units allocated for tasks.
Monitors resource usage (CPU, memory, disk, network) and reports to RM.

C. ApplicationMaster (AM)

Created for each application/job submitted to YARN.
Negotiates resources with RM and manages task execution.
Works with NodeManagers to launch and monitor tasks inside containers.

D. Containers

Basic execution units in YARN.
Each container is granted a specific amount of CPU and memory by the RM, sized to the request at allocation time rather than fixed in advance.
Containers run MapReduce, Spark, or any other framework's tasks.
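The relationship between a worker node and its containers can be sketched as a small data model. This is an illustration only: `Resource`, `Container`, `Node`, and their fields are hypothetical names, not YARN classes, and the node sizes are made up.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Resource:
    memory_mb: int
    vcores: int


@dataclass
class Container:
    container_id: int
    node: str
    resource: Resource


@dataclass
class Node:
    """Toy stand-in for a worker node managed by a NodeManager."""
    name: str
    capacity: Resource
    containers: List[Container] = field(default_factory=list)

    def free(self) -> Resource:
        used_mem = sum(c.resource.memory_mb for c in self.containers)
        used_cores = sum(c.resource.vcores for c in self.containers)
        return Resource(self.capacity.memory_mb - used_mem,
                        self.capacity.vcores - used_cores)

    def allocate(self, container_id: int, request: Resource) -> Optional[Container]:
        """Grant a container sized to the request, or None if it does not fit."""
        free = self.free()
        if request.memory_mb > free.memory_mb or request.vcores > free.vcores:
            return None
        container = Container(container_id, self.name, request)
        self.containers.append(container)
        return container


node = Node("worker-1", Resource(memory_mb=8192, vcores=4))
first = node.allocate(1, Resource(memory_mb=4096, vcores=2))   # fits
second = node.allocate(2, Resource(memory_mb=6144, vcores=1))  # rejected: only 4096 MB free
print(first, second, node.free())
```

Each container here carries its own size, which is the key difference from the fixed map and reduce slots of MapReduce v1.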

3. How YARN Works (Job Execution Flow)

Step 1: Job Submission

A client submits a job (e.g., a Spark job) to the ResourceManager.

Step 2: Resource Allocation

RM allocates a container and launches an ApplicationMaster (AM) for the job in it.
AM requests containers from RM based on job needs.

Step 3: Task Execution

Containers are launched on NodeManagers.
AM monitors task execution and reschedules failed tasks if necessary.

Step 4: Job Completion

Once all tasks are completed, AM notifies RM.
RM releases allocated resources.
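The four steps above can be walked through with a small simulation. This is a simplified sketch under stated assumptions, not how YARN is implemented: the `ResourceManager` class and `run_application` function are hypothetical, the RM tracks only a count of free containers, allocation is synchronous, and a retried task reuses its container (a real ApplicationMaster would usually request a new one).

```python
class ResourceManager:
    """Toy RM that tracks only the number of free containers."""

    def __init__(self, total_containers):
        self.free = total_containers

    def allocate(self, count):
        granted = min(count, self.free)
        self.free -= granted
        return granted

    def release(self, count):
        self.free += count


def run_application(rm, tasks, max_attempts=2):
    """tasks maps a task name to a callable(attempt) that returns True on success."""
    log = []
    # Step 1: the client submits the job; the RM starts an AM in its own container.
    held = rm.allocate(1)
    if held != 1:
        raise RuntimeError("no free container for the ApplicationMaster")
    try:
        log.append("AM started")
        # Step 2: the AM asks the RM for one container per task.
        granted = rm.allocate(len(tasks))
        held += granted
        if granted < len(tasks):
            raise RuntimeError("not enough free containers for all tasks")
        # Step 3: tasks run in containers; the AM retries failed tasks.
        for name, task in tasks.items():
            for attempt in range(1, max_attempts + 1):
                if task(attempt):
                    log.append(f"{name} succeeded on attempt {attempt}")
                    break
                log.append(f"{name} failed on attempt {attempt}")
            else:
                raise RuntimeError(f"{name} failed {max_attempts} times")
        # Step 4: the AM reports completion.
        log.append("all tasks done, AM notifies RM")
        return log
    finally:
        # The RM gets every held container back, on success or failure.
        rm.release(held)


rm = ResourceManager(total_containers=4)
tasks = {
    "map-0": lambda attempt: True,
    "map-1": lambda attempt: attempt > 1,  # fails once, then succeeds
}
log = run_application(rm, tasks)
for line in log:
    print(line)
```

The `finally` clause mirrors Step 4: whatever happens to the job, its containers return to the cluster.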


4. YARN Schedulers

The Scheduler in YARN determines how cluster resources are distributed among jobs. Common scheduling policies include:

A. FIFO Scheduler

First In, First Out queue-based scheduling.
Simple but inefficient as large jobs block smaller jobs.
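A short sketch shows why FIFO hurts small jobs. It assumes, for simplicity, that all jobs arrive at time 0 and that each job uses the whole cluster while it runs; the job names and runtimes are invented.

```python
def fifo_completion_times(jobs):
    """jobs: list of (name, runtime_minutes) in arrival order."""
    clock = 0
    finished = {}
    for name, runtime in jobs:
        clock += runtime          # the next job cannot start until this one ends
        finished[name] = clock
    return finished


times = fifo_completion_times([("nightly-etl", 120), ("quick-report", 2)])
print(times)
```

In this example the two-minute report finishes at minute 122, because it has to wait for the two-hour job ahead of it.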

B. Capacity Scheduler

Allocates resources based on pre-defined queues.
Each queue gets a guaranteed minimum share of resources, ensuring fair usage.
Ideal for multi-tenant clusters where different teams share the same Hadoop cluster.
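The idea of a guaranteed minimum share can be sketched as simple arithmetic. This is an illustration, not the Capacity Scheduler's code: the function name, queue names, and cluster size are assumptions. It also leaves out the fact that a queue can borrow idle capacity from other queues.

```python
def guaranteed_capacity(cluster_memory_gb, queue_percent):
    """Split cluster memory among queues according to their configured percentages."""
    total = sum(queue_percent.values())
    if total != 100:
        raise ValueError(f"queue capacities must sum to 100, got {total}")
    return {queue: cluster_memory_gb * pct / 100
            for queue, pct in queue_percent.items()}


# Hypothetical 1000 GB cluster shared by three teams.
shares = guaranteed_capacity(1000, {"etl": 50, "analytics": 30, "adhoc": 20})
print(shares)
```

Each team can count on its share even when the other queues are busy, which is what makes this scheduler a good fit for multi-tenant clusters.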

C. Fair Scheduler

Aims to give running applications an equal share of resources over time.
Small jobs don’t starve, because a large job cannot hold all cluster resources while other jobs are waiting.
Used by organizations running multiple workloads simultaneously.
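One common way to define "fair" is max-min fairness: give every application an equal share, and hand whatever a small application does not need to the others. The sketch below implements that idea for a single resource. It is an assumption-laden illustration, not the Fair Scheduler's actual algorithm, which also supports weights, minimum shares, and queues; the function and application names are hypothetical.

```python
def max_min_fair_share(capacity, demands):
    """Return a dict of allocations for one resource, using max-min fairness.

    demands maps an application name to the amount it wants.
    """
    allocation = {name: 0.0 for name in demands}
    remaining = float(capacity)
    unsatisfied = dict(demands)

    while unsatisfied and remaining > 1e-9:
        share = remaining / len(unsatisfied)
        # Applications whose demand fits within an equal share get exactly their demand.
        fits = [name for name, demand in unsatisfied.items() if demand <= share]
        if not fits:
            # Everyone left wants more than an equal share: split evenly and stop.
            for name in unsatisfied:
                allocation[name] = share
            break
        for name in fits:
            allocation[name] = float(unsatisfied[name])
            remaining -= unsatisfied[name]
            del unsatisfied[name]
    return allocation


# Hypothetical cluster with 12 containers and three applications.
result = max_min_fair_share(12, {"small": 2, "medium": 8, "large": 10})
print(result)
```

Here the small application gets the 2 containers it asked for, and the remaining 10 are split evenly between the other two, so the large application cannot crowd anyone out.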

5. Benefits of YARN

Better Resource Utilization – Dynamic allocation prevents resource wastage.
Supports Multiple Frameworks – Runs Spark, Tez, Flink, and more, alongside MapReduce.
Scalability – Supports thousands of nodes in a cluster.
Fault Tolerance – Reschedules failed tasks efficiently.
Multi-tenancy – Different teams can share the same cluster without interference.

6. YARN vs. Traditional MapReduce: A Comparison

1. Resource Management

Traditional MapReduce: Uses a fixed number of map and reduce slots, leading to resource underutilization.
YARN: Dynamically allocates resources based on demand, improving cluster efficiency.

2. Scalability

Traditional MapReduce: Limited scalability due to static resource allocation.
YARN: Scales better by allowing multiple applications to share cluster resources dynamically.

3. Flexibility

Traditional MapReduce: Supports only MapReduce jobs.
YARN: Supports multiple processing frameworks (Spark, Tez, Flink, etc.).

4. Fault Tolerance

Traditional MapReduce: JobTracker is a single point of failure.
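YARN: Splits the JobTracker's duties between the ResourceManager (cluster-wide resource management) and a per-application ApplicationMaster (job scheduling and monitoring), so the failure of one ApplicationMaster affects only its own application.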

5. Cluster Utilization

Traditional MapReduce: Fixed slots often lead to inefficient resource usage.
YARN: Resources are allocated dynamically, ensuring better utilization.

Conclusion

Apache YARN is a powerful resource management system that has transformed the Hadoop ecosystem. By separating resource management from job execution, YARN enables efficient cluster utilization, multi-tenancy, and support for various big data frameworks.

If you're working with Spark, MapReduce, or other distributed frameworks, understanding YARN is crucial for optimizing performance and resource allocation.
