Apache Spark on YARN Architecture
Nikhil G R
Senior Data Engineer (Apache Spark Developer) @ SAP Labs India, Ex-TCS, 3x Microsoft Azure Cloud Certified, Python, PySpark, Azure Databricks, SAP BDC, Datasphere, ADLS, Azure Data Factory, MySQL, Delta Lake
Before going through the Spark architecture, let us understand the Hadoop ecosystem.
The core components of Hadoop are HDFS (distributed storage), MapReduce (distributed processing) and YARN (resource management).
YARN is like an operating system for the cluster: it manages the resources.
YARN has two main components: the Resource Manager, the cluster-level master that allocates resources, and the Node Managers, which run on each worker node and manage the containers on that node.
How does YARN work?
Suppose we invoke a Hadoop job from the client machine.
What will happen now?
The request goes to the Resource Manager. The Resource Manager coordinates with one of the Node Managers and creates a container on that worker node.
Say this container is created on Worker Node 3.
Inside this container, a service called the Application Master is started. This Application Master acts as a local manager for this one application.
The Application Master is now responsible for getting more resources for the application, so it requests additional containers from the Resource Manager.
For example, it may request three containers:
2 containers - 2 GB RAM, 1 core, on Worker 1
1 container - 2 GB RAM, 1 core, on Worker 2
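In Spark terms, a comparable request is expressed through executor settings rather than by naming individual workers; which nodes actually host the containers is YARN's decision, guided by data locality. A minimal PySpark sketch with illustrative values (the application name and the exact numbers are assumptions, not from the article):

```python
from pyspark.sql import SparkSession

# Minimal sketch, illustrative values only: ask YARN for three executor
# containers of roughly 2 GB RAM and 1 core each. Which worker nodes end up
# hosting them is decided by YARN, not by this code.
spark = (
    SparkSession.builder
    .appName("yarn-resource-request-demo")    # hypothetical application name
    .master("yarn")
    .config("spark.executor.instances", "3")  # three containers in total
    .config("spark.executor.memory", "2g")    # 2 GB per executor container
    .config("spark.executor.cores", "1")      # 1 core per executor container
    .getOrCreate()
)

spark.stop()
```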
Why does it matter which worker nodes the containers are requested on?
Consider a 5-node cluster and a 300 MB file in HDFS that is stored as 3 blocks, placed on Worker 1, Worker 3 and Worker 4 respectively. Suppose the Application Master is running on Worker 5. When it contacts the Resource Manager for more resources, it should work on the principle of data locality: it should request containers on Worker 1, Worker 3 and Worker 4, where the blocks of the file already reside.
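As a quick sanity check on those numbers, here is a tiny sketch of the block arithmetic, assuming the common HDFS default block size of 128 MB (the article does not state the block size, so that value is an assumption):

```python
import math

# Assuming the HDFS default block size of 128 MB (the cluster could be
# configured differently), a 300 MB file splits into 3 blocks.
file_size_mb = 300
block_size_mb = 128

num_blocks = math.ceil(file_size_mb / block_size_mb)
print(num_blocks)  # 3 -> blocks of 128 MB, 128 MB and 44 MB
```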
Suppose we get containers on Worker 1 and Worker 2. This is where the Node Manager comes in: each Node Manager manages the containers running on its worker node.
The Application Master also interacts with the Name Node to understand where the blocks of the file are kept in HDFS.
Uber mode - the scenario where the job is so small that it can run inside the very container in which the Application Master is running; it does not need any other containers.
Interactive mode - where we work in a notebook or the pyspark shell
Submit the job - where we use the spark-submit command (a small sketch follows this list)
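A minimal sketch of the difference, using a hypothetical file name: the same few lines can be typed into the pyspark shell (interactive mode) or saved to a file and launched with spark-submit (submit mode).

```python
# demo_job.py -- hypothetical file name; in the pyspark shell a SparkSession
# named `spark` already exists, so the builder call simply returns it.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("interactive-vs-submit-demo").getOrCreate()

df = spark.range(1, 101)                    # a tiny DataFrame with ids 1..100
print(df.selectExpr("sum(id)").first()[0])  # 5050

spark.stop()
```

In submit mode, the same file would simply be passed to the spark-submit command instead of being typed into the shell.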
What happens in the case of Apache Spark?
Consider that a user has executed the spark-submit command. The flow is similar to the one above: the Resource Manager starts an Application Master, the containers obtained from YARN run the Spark executors, and the driver coordinates the whole application.
If the driver crashes, the application crashes. One Spark application has exactly one driver.
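A small sketch of the "one driver per application" point: within a single application, repeated calls to SparkSession.builder.getOrCreate() hand back the same session instead of starting a second driver (the names below are illustrative):

```python
from pyspark.sql import SparkSession

# One Spark application has one driver. Inside that driver, getOrCreate()
# returns the already-running session rather than creating another one.
spark1 = SparkSession.builder.appName("single-driver-demo").getOrCreate()
spark2 = SparkSession.builder.getOrCreate()

print(spark1 is spark2)  # True -- both names refer to the same session

spark1.stop()
```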
There are two modes in which Spark can run.
Client Mode
If the driver runs on the gateway/edge node rather than inside the cluster (so that we can see results immediately), it is client mode. Here the driver runs outside the cluster.
Cluster Mode
In cluster mode, the driver runs on one of the worker nodes within the cluster. Even if the gateway node crashes, or we log out from the gateway node, the application will keep running.
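The application code itself is the same in both modes; the deploy mode is chosen when the job is submitted (for example via spark-submit's --deploy-mode flag). A minimal sketch that only reads the mode back from the running session (the application name is an assumption):

```python
from pyspark.sql import SparkSession

# The same script runs unchanged in client or cluster mode; the mode is picked
# at submission time, not in the code. Here we just read it back.
spark = SparkSession.builder.appName("deploy-mode-demo").getOrCreate()

# "client": the driver runs on the gateway/edge node that submitted the job.
# "cluster": YARN starts the driver inside the Application Master container.
print(spark.conf.get("spark.submit.deployMode", "client"))

spark.stop()
```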
To summarize,
When we invoke a job, the request goes to the Resource Manager. The Resource Manager creates an Application Master on one of the worker nodes, and this Application Master manages the application. The Application Master then asks for more resources, because the application may need containers on several worker nodes. It coordinates with the Name Node to learn where the file's blocks are stored and, based on that, requests containers from the Resource Manager so that the work follows the principle of data locality. The Resource Manager allocates the containers (which host the executors) on those nodes, and the Node Managers manage them.
In Spark, the Application Master can be thought of as the driver. The driver can run inside or outside the cluster: when it runs outside the cluster, it is called client mode; when it runs inside the cluster, it is called cluster mode.
Credits - Sumit Mittal sir