Apache Spark on YARN Architecture

Before going through the Spark architecture, let us understand the Hadoop ecosystem.

The core components of Hadoop are:

  • HDFS - the distributed file system (storage)
  • MapReduce - the processing/computation engine
  • YARN - the resource manager

YARN is like an operating system for the cluster: it manages the resources (memory and CPU cores).

YARN has two main components (the sketch after this list shows how to see them on a running cluster):

  • Resource Manager (master)
  • Node Manager (slave/worker)
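As a small illustration, the Resource Manager exposes a REST API that lists the Node Managers it is tracking. This is only a sketch: it assumes the ResourceManager web UI is reachable at rm-host:8088 (a placeholder for your cluster), and the exact field names can vary slightly between Hadoop versions.

    # Minimal sketch: list the Node Managers registered with the Resource Manager.
    # Assumes the ResourceManager web UI is reachable at http://rm-host:8088
    # (host and port are placeholders for your cluster).
    import requests

    def list_node_managers(rm_host="rm-host", rm_port=8088):
        # Cluster nodes are exposed under /ws/v1/cluster/nodes
        url = f"http://{rm_host}:{rm_port}/ws/v1/cluster/nodes"
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        for node in resp.json()["nodes"]["node"]:
            # Field names can differ slightly across Hadoop versions.
            print(node.get("nodeHostName"), node.get("state"),
                  node.get("numContainers"), "containers")

    if __name__ == "__main__":
        list_node_managers()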

How does YARN work?


Consider that we invoke a Hadoop job from the client machine.

What will happen now?

The request goes to the Resource Manager. The Resource Manager coordinates with one of the Node Managers and creates a container on that worker node.

Consider that it is connected with Worker Node 3.

Inside this container, it starts a service called the Application Master. The Application Master acts as a local manager for the application.

The Application Master is now responsible for getting more resources for the application, so it requests them from the Resource Manager.

For example, it may request three containers (the sketch below shows the equivalent Spark-side settings):

  • 2 containers - 2 GB RAM, 1 core, on Worker 1
  • 1 container - 2 GB RAM, 1 core, on Worker 2
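Here is a minimal sketch of that ask from the Spark side (the application name is a placeholder, and it assumes a machine already configured as a YARN client). The same settings map to the spark-submit flags --executor-memory 2G, --executor-cores 1 and --num-executors 3.

    # Minimal sketch: ask YARN for 3 executor containers of 2 GB / 1 core each,
    # mirroring the example above. Assumes this machine is configured as a
    # YARN client (HADOOP_CONF_DIR set); the app name is a placeholder.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("my_app")
        .master("yarn")
        .config("spark.executor.memory", "2g")    # 2 GB per executor container
        .config("spark.executor.cores", "1")      # 1 core per executor container
        .config("spark.executor.instances", "3")  # 3 executor containers in total
        .getOrCreate()
    )
    # Note: YARN adds a small memory overhead on top of the 2 GB heap,
    # so the actual container is slightly larger.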

Why does it matter which worker node each container is placed on?

Consider a 5-node cluster and a 300 MB file in HDFS, which (with the default 128 MB block size) is stored as 3 blocks on worker 1, worker 3, and worker 4 respectively. Now suppose our Application Master is running on worker 5. When it contacts the Resource Manager for more resources, it should work on the principle of Data Locality: it should request resources on worker 1, worker 3, and worker 4, where the blocks already live.
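The arithmetic behind that example, as a purely illustrative sketch (the 128 MB block size is the HDFS default; the block-to-worker mapping is the one assumed above, not something Spark hands you in this form):

    # Illustrative sketch of the data-locality reasoning above (not a Spark API).
    import math

    BLOCK_SIZE_MB = 128   # default HDFS block size
    FILE_SIZE_MB = 300

    num_blocks = math.ceil(FILE_SIZE_MB / BLOCK_SIZE_MB)
    print(num_blocks)     # 3 blocks: 128 MB + 128 MB + 44 MB

    # Hypothetical block-to-host mapping from the example above.
    block_locations = {
        "block_0": ["worker1"],
        "block_1": ["worker3"],
        "block_2": ["worker4"],
    }

    # Data locality: prefer containers on the hosts that already hold the blocks.
    preferred_hosts = sorted({h for hosts in block_locations.values() for h in hosts})
    print(preferred_hosts)  # ['worker1', 'worker3', 'worker4']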


Consider that we got containers on worker 1 and worker 2. This is where the Node Manager comes in: it manages the containers running on its worker node.

Now the Application Master interacts with the NameNode to find out where the blocks of the file are kept in HDFS.
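You can see the same block-location information yourself with the hdfs fsck command; a small sketch (the file path is a placeholder, and it assumes the hdfs CLI is available on the machine):

    # Sketch: print the block locations of an HDFS file - the same information
    # the Application Master gets from the NameNode. Assumes the `hdfs` CLI is
    # on the PATH; the file path is a placeholder.
    import subprocess

    result = subprocess.run(
        ["hdfs", "fsck", "/data/sample.csv", "-files", "-blocks", "-locations"],
        capture_output=True, text=True, check=True,
    )
    print(result.stdout)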

Uber mode - Scenario where the job is so small that it can run in the container in which the Application Master is running. It does not need other containers.

  • In Spark, the Application Master is called the Driver.
  • Every Spark application has one Driver.

Interactive mode - where we work in a notebook or the pyspark shell

Submit the job - where we use the spark-submit command
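The same few lines of PySpark run in either style. A minimal sketch (the file name app.py and the data are just examples): paste the lines into a notebook or the pyspark shell for interactive mode, or save them as app.py and run spark-submit app.py.

    # app.py - minimal sketch. Works interactively (notebook / pyspark shell,
    # where `spark` already exists and getOrCreate() simply reuses it) or as a
    # submitted job via: spark-submit app.py
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("interactive_vs_submit_demo").getOrCreate()

    # Tiny made-up dataset, just to give the job something to do.
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    df.show()

    spark.stop()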

What happens in the case of Apache Spark?

Consider that a user executes the spark-submit command.

  • The request goes to the Resource Manager, which resides on the master node.
  • It creates a container on one of the worker nodes.
  • A driver (Application Master) service is started in that container; it manages the entire application.

If the driver crashes, the application crashes. One Spark application has exactly one driver.
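Inside the application, the SparkSession/SparkContext in our code is the handle to that single driver; a small sketch of how to inspect it:

    # Sketch: the SparkContext in user code is the handle to the one driver
    # of this application.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("driver_demo").getOrCreate()
    sc = spark.sparkContext

    print(sc.applicationId)  # e.g. application_XXXXXXXXXXXXX_NNNN on YARN
    print(sc.master)         # e.g. 'yarn' (or 'local[*]' when testing locally)
    print(sc.uiWebUrl)       # URL of this driver's Spark UI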

There are two modes in which Spark runs.

  • Client Mode - Notebook, spark-shell -> Interactive
  • Cluster Mode - spark-submit -> Production

Client Mode

If the driver runs on the gateway/edge node rather than inside the cluster, so that we can see results instantly, it is Client mode. Here the driver runs outside the cluster.

Cluster Mode

In Cluster mode, the driver runs on one of the worker nodes within the cluster. Even if the gateway node crashes, or when we log out from the gateway node, the application still runs.
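The deploy mode is fixed at submit time (spark-submit --deploy-mode client or --deploy-mode cluster). From inside a running application you can check which one you got; a small sketch:

    # Sketch: check whether this application was submitted in client or
    # cluster deploy mode (the value is set at submit time via
    # spark-submit --deploy-mode client|cluster).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("deploy_mode_check").getOrCreate()

    deploy_mode = spark.sparkContext.getConf().get("spark.submit.deployMode", "client")
    print(deploy_mode)  # 'client' or 'cluster'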

To Summarize,

When we invoke a job, the request goes to the Resource Manager. The Resource Manager creates an Application Master on one of the worker nodes, and this Application Master manages the application. The Application Master then asks for more resources, because it may need containers on several worker nodes. It coordinates with the NameNode to find out where the file's blocks are stored and, based on that, requests resources from the Resource Manager so that we work on the principle of data locality. The Resource Manager provides the containers (executors) on those nodes, and the Node Managers manage the containers running on their nodes.

In Spark, the Application Master can be considered the Driver. The Driver can run inside the cluster or outside it: when it runs outside the cluster, it is called Client mode; when it runs inside the cluster, it is called Cluster mode.


Credits - Sumit Mittal sir
