Understanding Spark on YARN Architecture

Apache Spark is a powerful, in-memory data processing engine with robust and expressive development APIs. It enables data workers to efficiently execute streaming, machine learning, or SQL workloads that require fast iterative access to datasets. Hadoop YARN (Yet Another Resource Negotiator), on the other hand, is a resource management platform responsible for managing compute resources in clusters and using them for scheduling users' applications.

When Spark runs on YARN, YARN provides the resource management, while Spark handles the data processing. This allows Spark to leverage YARN's scheduling and resource-allocation features across the cluster.
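To make this concrete, here is a minimal PySpark sketch (application name and job are illustrative) that hands resource management to YARN simply by setting the master to yarn; Spark finds the cluster from the Hadoop configuration available on the client:

```python
from pyspark.sql import SparkSession

# Minimal sketch of a Spark application submitted to YARN.
# Assumes HADOOP_CONF_DIR (or YARN_CONF_DIR) points at a valid cluster
# configuration, so Spark can locate the Resource Manager.
spark = (
    SparkSession.builder
    .appName("yarn-architecture-demo")   # name shown in the YARN UI
    .master("yarn")                      # delegate resource management to YARN
    .getOrCreate()
)

# A trivial job, just so the executors have something to run.
print(spark.range(1_000_000).count())

spark.stop()  # deregisters the application and frees its containers
```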

Spark on YARN Architecture

In a Spark on YARN setup, the architecture consists of the following components:

1. Client: This is the machine where the Spark job is initiated.

2. Resource Manager (RM): The Resource Manager is the master process in YARN, responsible for resource assignment and management among all the applications.

3. Node Manager (NM): The Node Manager is the per-machine agent that is responsible for launching the applications' containers, monitoring their resource usage (CPU, memory, disk, network), and reporting it to the Resource Manager.

4. Application Master (AM): The Application Master negotiates resources from the Resource Manager and works with the Node Manager to execute and monitor the tasks. In the context of Spark on YARN, the Application Master is responsible for the execution of the Spark job.
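These components correspond directly to Spark's resource settings: the executor memory and core options size the containers the Application Master requests from the Resource Manager, while spark.yarn.am.memory sizes the Application Master's own container (in client mode; in cluster mode the driver settings apply). A minimal sketch, with illustrative rather than recommended values:

```python
from pyspark.sql import SparkSession

# Sketch: how Spark settings translate into YARN container requests.
# The numbers below are placeholders; real values depend on your cluster.
spark = (
    SparkSession.builder
    .appName("yarn-resource-demo")
    .master("yarn")
    .config("spark.executor.memory", "4g")     # memory per executor container
    .config("spark.executor.cores", "2")       # vcores per executor container
    .config("spark.executor.instances", "10")  # containers the AM will request
    .config("spark.yarn.am.memory", "1g")      # AM container size (client mode)
    .getOrCreate()
)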

Running Spark on YARN

When a Spark job is submitted from a client machine, the following steps occur:

1. The client contacts the Resource Manager to create an application, and the Resource Manager responds with an application ID.

2. The Resource Manager then allocates a container on one of the nodes and launches the Application Master in it. The Application Master registers with the Resource Manager and requests resources for the executors.

3. The Resource Manager allocates containers to the Application Master, which then communicates with the corresponding Node Managers to launch them.

4. Once the containers are launched, Spark executors start inside them and run the Spark tasks.

5. The Application Master continues to request containers from the Resource Manager until all Spark tasks can be scheduled. It also monitors task status and reports progress back to the client.

6. When the job is done, the Application Master deregisters from the Resource Manager and shuts down, freeing up the resources for other applications.
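This lifecycle can also be observed from outside the application: the Resource Manager exposes a REST endpoint (/ws/v1/cluster/apps/<application-id>) that reports an application's state as it moves from ACCEPTED through RUNNING to FINISHED. A minimal polling sketch, where the Resource Manager host and the application ID are hypothetical placeholders:

```python
import time
import requests

# Sketch: follow a YARN application's lifecycle via the Resource Manager
# REST API. "rm-host" and the application ID below are placeholders.
RM_URL = "http://rm-host:8088/ws/v1/cluster/apps/application_1700000000000_0001"

while True:
    app = requests.get(RM_URL, timeout=10).json()["app"]
    print(app["state"], app.get("finalStatus"))
    if app["state"] in ("FINISHED", "FAILED", "KILLED"):
        break
    time.sleep(5)
```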

Spark on YARN Modes

Spark on YARN can run in two modes:

1. Cluster Mode: In this mode, the Spark driver runs inside the Application Master on a cluster host. This mode is suited to production jobs, since the entire application, driver included, runs within the YARN cluster and keeps running even if the client disconnects.

2. Client Mode: In this mode, the Spark driver runs on the client machine, and only the executors run in the cluster. This mode suits interactive and debugging workloads where you want to see the application's output immediately.
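The mode is selected at submission time rather than in the application code, so the same script can run in either. A minimal sketch (the file name in the comments is hypothetical):

```python
from pyspark.sql import SparkSession

# The deploy mode is normally chosen when the job is submitted, e.g.:
#   spark-submit --master yarn --deploy-mode cluster my_job.py  # driver in the AM
#   spark-submit --master yarn --deploy-mode client  my_job.py  # driver on the client
# The application code itself is identical in both modes.
spark = SparkSession.builder.appName("deploy-mode-demo").getOrCreate()

# spark.submit.deployMode is set by spark-submit; default to "client" if absent.
print("Deploy mode:", spark.conf.get("spark.submit.deployMode", "client"))

spark.stop()
```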

In conclusion, running Spark on YARN allows us to take advantage of the resource management capabilities of YARN while benefiting from the data processing power of Spark. This combination is particularly useful in a Hadoop ecosystem, where Spark can run alongside other YARN applications.


#ApacheSpark #DistributedProcessing #DataFrame #BigDataAnalytics #DataEngineering #DataProcessing #Yarn #SparkResourceManager
