Apache Spark on YARN Architecture
Nikhil G R
Senior Data Engineer (Apache Spark Developer) @ SAP Labs India, Ex-TCS, 3x Microsoft Azure Cloud Certified, Python, PySpark, Azure Databricks, SAP BDC, Datasphere, ADLS, Azure Data Factory, MySQL, Delta Lake
Before going through the Spark architecture, let us understand the Hadoop ecosystem.
The core components of Hadoop are HDFS (distributed storage), MapReduce (distributed processing) and YARN (resource management).
YARN is like an operating system for the cluster: it manages the resources.
YARN has two main components: the Resource Manager, the cluster-level master that allocates resources, and the Node Managers, which run on each worker node and manage the containers on that node.
How does YARN work?
Suppose we invoke a Hadoop job from the client machine.
What will happen now?
The request goes to the Resource Manager. The Resource Manager coordinates with one of the Node Managers and creates a container on that worker node.
Say this container is created on Worker Node 3.
Inside this container, a service called the Application Master is started. This Application Master acts as a local manager for this one application.
The Application Master is now responsible for getting more resources for the application, so it requests additional containers from the Resource Manager.
For example, it may request three containers:
2 containers - 2 GB RAM, 1 core, on Worker 1
1 container - 2 GB RAM, 1 core, on Worker 2
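In Spark terms, a comparable request is expressed through executor settings rather than by naming individual workers; which nodes actually host the containers is YARN's decision, guided by data locality. A minimal PySpark sketch with illustrative values (the application name and the exact numbers are assumptions, not from the article):

```python
from pyspark.sql import SparkSession

# Minimal sketch, illustrative values only: ask YARN for three executor
# containers of roughly 2 GB RAM and 1 core each. Which worker nodes end up
# hosting them is decided by YARN, not by this code.
spark = (
    SparkSession.builder
    .appName("yarn-resource-request-demo")    # hypothetical application name
    .master("yarn")
    .config("spark.executor.instances", "3")  # three containers in total
    .config("spark.executor.memory", "2g")    # 2 GB per executor container
    .config("spark.executor.cores", "1")      # 1 core per executor container
    .getOrCreate()
)

spark.stop()
```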
Why does it matter which worker nodes the containers are requested on?
Consider a 5-node cluster and a 300 MB file in HDFS that is stored as 3 blocks, placed on Worker 1, Worker 3 and Worker 4 respectively. Suppose the Application Master is running on Worker 5. When it contacts the Resource Manager for more resources, it should work on the principle of data locality: it should request containers on Worker 1, Worker 3 and Worker 4, where the blocks of the file already reside.
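As a quick sanity check on those numbers, here is a tiny sketch of the block arithmetic, assuming the common HDFS default block size of 128 MB (the article does not state the block size, so that value is an assumption):

```python
import math

# Assuming the HDFS default block size of 128 MB (the cluster could be
# configured differently), a 300 MB file splits into 3 blocks.
file_size_mb = 300
block_size_mb = 128

num_blocks = math.ceil(file_size_mb / block_size_mb)
print(num_blocks)  # 3 -> blocks of 128 MB, 128 MB and 44 MB
```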
Suppose we get containers on Worker 1 and Worker 2. This is where the Node Manager comes in: each Node Manager manages the containers running on its worker node.
The Application Master also interacts with the Name Node to understand where the blocks of the file are kept in HDFS.
Uber mode - the scenario where the job is so small that it can run inside the very container in which the Application Master is running; it does not need any other containers.
Interactive mode - where we work in a notebook or the pyspark shell
Submit the job - where we use the spark-submit command (a small sketch follows this list)
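A minimal sketch of the difference, using a hypothetical file name: the same few lines can be typed into the pyspark shell (interactive mode) or saved to a file and launched with spark-submit (submit mode).

```python
# demo_job.py -- hypothetical file name; in the pyspark shell a SparkSession
# named `spark` already exists, so the builder call simply returns it.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("interactive-vs-submit-demo").getOrCreate()

df = spark.range(1, 101)                    # a tiny DataFrame with ids 1..100
print(df.selectExpr("sum(id)").first()[0])  # 5050

spark.stop()
```

In submit mode, the same file would simply be passed to the spark-submit command instead of being typed into the shell.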
What happens in the case of Apache Spark?
Consider that a user has executed the spark-submit command. The flow is similar to the one above: the Resource Manager starts an Application Master, the containers obtained from YARN run the Spark executors, and the driver coordinates the whole application.
If the driver crashes, the application crashes. One Spark application has exactly one driver.
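A small sketch of the "one driver per application" point: within a single application, repeated calls to SparkSession.builder.getOrCreate() hand back the same session instead of starting a second driver (the names below are illustrative):

```python
from pyspark.sql import SparkSession

# One Spark application has one driver. Inside that driver, getOrCreate()
# returns the already-running session rather than creating another one.
spark1 = SparkSession.builder.appName("single-driver-demo").getOrCreate()
spark2 = SparkSession.builder.getOrCreate()

print(spark1 is spark2)  # True -- both names refer to the same session

spark1.stop()
```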
There are two modes in which Spark can run.
Client Mode
If the driver runs on the gateway/edge node rather than inside the cluster (so that we can see results immediately), it is client mode. Here the driver runs outside the cluster.
Cluster Mode
In cluster mode, the driver runs on one of the worker nodes within the cluster. Even if the gateway node crashes, or we log out from the gateway node, the application will keep running.
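The application code itself is the same in both modes; the deploy mode is chosen when the job is submitted (for example via spark-submit's --deploy-mode flag). A minimal sketch that only reads the mode back from the running session (the application name is an assumption):

```python
from pyspark.sql import SparkSession

# The same script runs unchanged in client or cluster mode; the mode is picked
# at submission time, not in the code. Here we just read it back.
spark = SparkSession.builder.appName("deploy-mode-demo").getOrCreate()

# "client": the driver runs on the gateway/edge node that submitted the job.
# "cluster": YARN starts the driver inside the Application Master container.
print(spark.conf.get("spark.submit.deployMode", "client"))

spark.stop()
```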
To summarize,
When we invoke a job, the request goes to the Resource Manager. The Resource Manager creates an Application Master on one of the worker nodes, and this Application Master manages the application. The Application Master then asks for more resources, because the application may need containers on several worker nodes. It coordinates with the Name Node to learn where the file's blocks are stored and, based on that, requests containers from the Resource Manager so that the work follows the principle of data locality. The Resource Manager allocates the containers (which host the executors) on those nodes, and the Node Managers manage them.
In Spark, the Application Master can be thought of as the driver. The driver can run inside or outside the cluster: when it runs outside the cluster, it is called client mode; when it runs inside the cluster, it is called cluster mode.
Credits - Sumit Mittal sir