Understanding Spark on YARN Architecture

Apache Spark is a powerful, in-memory data processing engine with robust and expressive development APIs. It enables data workers to efficiently execute streaming, machine learning, or SQL workloads that require fast iterative access to datasets. Hadoop YARN (Yet Another Resource Negotiator), on the other hand, is a resource management platform responsible for managing compute resources in clusters and using them for scheduling users' applications.

When Spark runs on YARN, YARN provides the resource management, while Spark handles the data processing. This allows Spark to leverage YARN's scheduling and resource-allocation features across the cluster.
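To make this concrete, here is a minimal PySpark sketch (application name and job are illustrative) that hands resource management to YARN simply by setting the master to yarn; Spark finds the cluster from the Hadoop configuration available on the client:

```python
from pyspark.sql import SparkSession

# Minimal sketch of a Spark application submitted to YARN.
# Assumes HADOOP_CONF_DIR (or YARN_CONF_DIR) points at a valid cluster
# configuration, so Spark can locate the Resource Manager.
spark = (
    SparkSession.builder
    .appName("yarn-architecture-demo")   # name shown in the YARN UI
    .master("yarn")                      # delegate resource management to YARN
    .getOrCreate()
)

# A trivial job, just so the executors have something to run.
print(spark.range(1_000_000).count())

spark.stop()  # deregisters the application and frees its containers
```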

Spark on YARN Architecture

In a Spark on YARN setup, the architecture consists of the following components:

1. Client: This is the machine where the Spark job is initiated.

2. Resource Manager (RM): The Resource Manager is the master process in YARN, responsible for resource assignment and management among all the applications.

3. Node Manager (NM): The Node Manager is the per-machine agent that is responsible for launching the applications' containers, monitoring their resource usage (CPU, memory, disk, network), and reporting it to the Resource Manager.

4. Application Master (AM): The Application Master negotiates resources from the Resource Manager and works with the Node Manager to execute and monitor the tasks. In the context of Spark on YARN, the Application Master is responsible for the execution of the Spark job.
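These components correspond directly to Spark's resource settings: the executor memory and core options size the containers the Application Master requests from the Resource Manager, while spark.yarn.am.memory sizes the Application Master's own container (in client mode; in cluster mode the driver settings apply). A minimal sketch, with illustrative rather than recommended values:

```python
from pyspark.sql import SparkSession

# Sketch: how Spark settings translate into YARN container requests.
# The numbers below are placeholders; real values depend on your cluster.
spark = (
    SparkSession.builder
    .appName("yarn-resource-demo")
    .master("yarn")
    .config("spark.executor.memory", "4g")     # memory per executor container
    .config("spark.executor.cores", "2")       # vcores per executor container
    .config("spark.executor.instances", "10")  # containers the AM will request
    .config("spark.yarn.am.memory", "1g")      # AM container size (client mode)
    .getOrCreate()
)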

Running Spark on YARN

When a Spark job is submitted from a client machine, the following steps occur:

1. The client contacts the Resource Manager to create an application, and the Resource Manager responds with an application ID.

2. The Resource Manager then allocates a container on one of the nodes and launches the Application Master in it. The Application Master registers with the Resource Manager and requests resources for the executors.

3. The Resource Manager allocates containers to the Application Master, which then communicates with the corresponding Node Managers to launch them.

4. Once the containers are launched, Spark executors start inside them and run the Spark tasks.

5. The Application Master continues to request containers from the Resource Manager until all Spark tasks can be scheduled. It also monitors task status and reports progress back to the client.

6. When the job is done, the Application Master deregisters from the Resource Manager and shuts down, freeing up the resources for other applications.
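This lifecycle can also be observed from outside the application: the Resource Manager exposes a REST endpoint (/ws/v1/cluster/apps/<application-id>) that reports an application's state as it moves from ACCEPTED through RUNNING to FINISHED. A minimal polling sketch, where the Resource Manager host and the application ID are hypothetical placeholders:

```python
import time
import requests

# Sketch: follow a YARN application's lifecycle via the Resource Manager
# REST API. "rm-host" and the application ID below are placeholders.
RM_URL = "http://rm-host:8088/ws/v1/cluster/apps/application_1700000000000_0001"

while True:
    app = requests.get(RM_URL, timeout=10).json()["app"]
    print(app["state"], app.get("finalStatus"))
    if app["state"] in ("FINISHED", "FAILED", "KILLED"):
        break
    time.sleep(5)
```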

Spark on YARN Modes

Spark on YARN can run in two modes:

1. Cluster Mode: In this mode, the Spark driver runs inside the Application Master on a cluster host. This mode is suited to production jobs, since the entire application, driver included, runs within the YARN cluster and keeps running even if the client disconnects.

2. Client Mode: In this mode, the Spark driver runs on the client machine, and only the executors run in the cluster. This mode suits interactive and debugging workloads where you want to see the application's output immediately.
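The mode is selected at submission time rather than in the application code, so the same script can run in either. A minimal sketch (the file name in the comments is hypothetical):

```python
from pyspark.sql import SparkSession

# The deploy mode is normally chosen when the job is submitted, e.g.:
#   spark-submit --master yarn --deploy-mode cluster my_job.py  # driver in the AM
#   spark-submit --master yarn --deploy-mode client  my_job.py  # driver on the client
# The application code itself is identical in both modes.
spark = SparkSession.builder.appName("deploy-mode-demo").getOrCreate()

# spark.submit.deployMode is set by spark-submit; default to "client" if absent.
print("Deploy mode:", spark.conf.get("spark.submit.deployMode", "client"))

spark.stop()
```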

In conclusion, running Spark on YARN allows us to take advantage of the resource management capabilities of YARN while benefiting from the data processing power of Spark. This combination is particularly useful in a Hadoop ecosystem, where Spark can run alongside other YARN applications.


#ApacheSpark #DistributedProcessing #DataFrame #BigDataAnalytics #DataEngineering #DataProcessing #Yarn #SparkResourceManager
