Spark Architecture: a 10,000-foot view
This page will give you a high-level understanding of Spark architecture and Spark components such as the CPU, Spark Context, and the cluster manager. At the end of the page I have added interview questions collected from various websites; I hope this page helps you understand the Spark architecture and prepare for architecture-related interview questions. Before looking directly at the Spark architecture, I suggest understanding RDDs and DAGs, the two core Spark abstractions; links are below.
RDD: https://www.dhirubhai.net/pulse/understanding-resilient-distributed-datasets-rdds-saikrishna-cheruvu-/
DAG: https://www.dhirubhai.net/pulse/understanding-directed-acyclic-graph-dag-saikrishna-cheruvu-/
What is Spark?
- A lightning-fast framework for both batch and real-time processing, written in Scala (runs on the JVM).
- For the Python interface (PySpark), the Py4J library is used to communicate with the JVM that runs the Scala/Java Spark code.
- In-memory computation, lazy evaluation, and parallel processing.
- Hadoop MapReduce performs batch processing only and lacks real-time execution; Spark supports both batch and real-time workloads.
- It can leverage Hadoop for both storage and resource management: it uses HDFS for storage, and it can run Spark applications on YARN as well.
- Spark can load data directly from disk, memory, and other data stores such as Amazon S3, Azure storage, Hadoop, Cassandra, or RDBMS databases.
Before jumping directly to the Spark architecture, let's first understand the hardware components and execution modules involved.
What is CPU?
Below is a simple diagram of the data flow to the CPU in Spark (hard drive >> RAM >> processor).
We can see the hard drive, RAM, and CPU (i5, i7, etc.); this structure exists in every system.
Hard drives are generally slow, low cost, and non-volatile.
RAM is generally fast and volatile in nature.
Normally, when we give any operation to the processor, the data it needs is sitting on the hard disk. The flow is: data is copied from the hard disk to RAM, and the CPU then performs the operation, whether read or write, on the data in RAM. The CPU cannot perform any operation on data that is still on the hard disk; the data must be in RAM.
Finding data on the hard disk by indexing it on the drive is a time-consuming process.
In Spark, the size of RAM is very important because data computations happen in memory.
What is Spark Context?
Below is a simple diagram of the Spark Context data flow in Spark (PySpark >> Py4J >> JVM >> Python).
Spark Context (its role in a distributed system):
The Spark Context is the entry point to Spark. It is defined in the org.apache.spark package and helps create RDDs, accumulators, and broadcast variables on the cluster.
Spark Context methods.
accumulator – Creates an accumulator variable of a given data type. Only the driver can access the accumulator's value.
applicationId – Returns the unique ID of the Spark application.
appName – Returns the app name that was given when creating the SparkContext.
broadcast – Creates a read-only variable broadcast to the entire cluster. You can broadcast a variable to a Spark cluster only once.
emptyRDD – Creates an empty RDD.
getPersistentRDDs – Returns all persisted RDDs.
getOrCreate – Creates a new SparkContext or returns the existing one.
hadoopFile – Returns an RDD of a Hadoop file.
master – Returns the master URL set while creating the SparkContext.
newAPIHadoopFile – Creates an RDD for a Hadoop file with the new API InputFormat.
sequenceFile – Gets an RDD for a Hadoop SequenceFile with the given key and value types.
setLogLevel – Changes the log level to DEBUG, INFO, WARN, ERROR, or FATAL.
textFile – Reads a text file from HDFS, the local file system, or any Hadoop-supported file system and returns an RDD.
union – Builds the union of two RDDs.
wholeTextFiles – Reads the text files in a folder from HDFS, the local file system, or any Hadoop-supported file system and returns an RDD of Tuple2. The first element of the tuple is the file name and the second element is the content of the file.
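As a quick illustration, here is a minimal PySpark sketch that exercises a few of these methods. The file paths are placeholders, so point them at files that actually exist in your environment.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[*]").setAppName("SparkContext Methods Demo")
sc = SparkContext.getOrCreate(conf)

print(sc.appName)          # app name given above
print(sc.applicationId)    # unique ID of this Spark application
print(sc.master)           # master URL, e.g. local[*]
sc.setLogLevel("WARN")     # change the log level

empty = sc.emptyRDD()                            # an empty RDD
words = sc.textFile("/tmp/words.txt")            # placeholder path
more_words = sc.textFile("/tmp/more_words.txt")  # placeholder path
combined = sc.union([words, more_words])         # union of two RDDs

shared = sc.broadcast({"env": "dev"})            # read-only variable sent to the whole cluster
counter = sc.accumulator(0)                      # accumulator; the driver reads its value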
Other notes about the Spark Context:
The driver program's responsibility is to create the Spark Context.
In a distributed environment, the Spark Context negotiates with the cluster manager for the required executors and memory.
To describe the image above, I collected the following note from the wiki page:
In the Python driver program, SparkContext uses py4j to launch a JVM and create a JavaSparkContext. Py4J is only used on the driver for local communication between the Python and Java SparkContext objects; large data transfers are performed through a different mechanism.
RDD transformations in Python are mapped to transformations on PythonRDD objects in Java. On remote worker machines, PythonRDD objects launch Python subprocesses and communicate with them using pipes, sending the user’s code and the data to be processed.
How do you create a Spark Context?
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local").setAppName("Spark Context Sample Sai")
sc = SparkContext(conf=conf)
sc.__module__
This returns the module that the Spark Context class comes from: 'pyspark.context'
from pyspark import context
context.__file__
This returns where the context module file (context.py/context.pyc) is located in your PySpark installation.
A few words about Py4J:
- Python interacts with the JVM using something called reflection.
- Before you set up the Python side of the gateway, your JVM must be ready, i.e. Py4J does not launch the JVM for you.
- To communicate with the Java process you need two things: a gateway server and an entry_point (a minimal sketch follows below).
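To make this concrete, here is a minimal standalone Py4J sketch (outside of Spark). It assumes a Java process with a py4j GatewayServer is already running on the default port; Py4J will not start that JVM for you.
from py4j.java_gateway import JavaGateway

# Connects to a GatewayServer that must already be running in a JVM
gateway = JavaGateway()

# Reflection-style access to arbitrary JVM classes and methods
print(gateway.jvm.java.lang.Math.random())
gateway.jvm.System.out.println("Hello from Python via Py4J")

# The entry point is whatever object the Java side registered with its GatewayServer
app = gateway.entry_point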
The Spark Context performs the operations below (see the sketch after this list):
- Getting the current status of the Spark application
- Cancelling a job
- Cancelling a stage
- Running a job synchronously
- Running a job asynchronously
- Accessing persistent RDDs
- Unpersisting an RDD
- Programmable dynamic allocation
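For example, a few of these operations sketched through the PySpark SparkContext (assuming sc is an existing SparkContext):
tracker = sc.statusTracker()                # current status of the application
print(tracker.getActiveJobsIds())           # IDs of currently running jobs
print(tracker.getActiveStageIds())          # IDs of currently running stages

sc.setJobGroup("etl-group", "Nightly ETL")  # tag subsequent jobs with a group ID
sc.cancelJobGroup("etl-group")              # cancel the jobs in that group
sc.cancelAllJobs()                          # or cancel everything that is running

cached = sc.parallelize(range(100)).cache()
cached.count()                              # materialize and persist the RDD
cached.unpersist()                          # unpersist it again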
Spark Architecture
The master node runs the driver program, which drives the Spark application. Your normal PySpark code with transformations, actions, and so on behaves as the driver program; if you are using the interactive shell, the shell acts as the driver program.
The driver program creates a Spark Context. Think of the Spark Context as a gateway to all Spark functionality. It is similar to a database connection: any command you execute in your database goes through the database connection, and likewise anything you do in Spark goes through the Spark Context.
The Spark Context works with the cluster manager to manage various jobs. The driver program and the Spark Context take care of job execution within the cluster.
A job is split into multiple tasks, which are distributed over the worker nodes. Whenever an RDD is created in the Spark Context, it can be distributed across various nodes and cached there.
Worker nodes are the slave nodes whose job is to execute the tasks. The tasks are executed on the partitioned RDDs on the worker nodes, and the results are returned to the Spark Context.
The Spark Context takes the job, breaks it into tasks, and distributes them to the worker nodes. These tasks work on the partitioned RDDs, perform operations, collect the results, and return them to the main Spark Context. If you increase the number of workers, you can divide the job into more partitions and execute them in parallel over multiple systems, which is much faster. With more workers, the available memory also increases, so you can cache more data and execute jobs faster.
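As a small illustration of tasks and partitions, the sketch below creates an RDD with an explicit number of partitions; each partition becomes a task executed by the workers (the numbers are arbitrary).
data = sc.parallelize(range(1000000), numSlices=8)  # 8 partitions -> 8 tasks per stage
print(data.getNumPartitions())                      # 8
print(data.map(lambda x: x * 2).sum())              # runs as parallel tasks on the workers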
Spark architecture interview questions and answers.
Q) What are the different cluster managers available in Apache Spark?
Answer) Standalone Mode: By default, applications submitted to a standalone-mode cluster run in FIFO order, and each application will try to use all available nodes. You can launch a standalone cluster either manually, by starting a master and workers by hand, or by using the provided launch scripts. It is also possible to run these daemons on a single machine for testing.
Apache Mesos: Apache Mesos is an open-source project to manage computer clusters, and can also run Hadoop applications. The advantages of deploying Spark with Mesos include dynamic partitioning between Spark and other frameworks as well as scalable partitioning between multiple instances of Spark.
Hadoop YARN: Apache YARN is the cluster resource manager of Hadoop 2. Spark can be run on YARN as well.
Kubernetes: Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications.
Q) What is a lazy evaluation in Spark?
Answer) When Spark operates on any dataset, it remembers the instructions. When a transformation such as map() is called on an RDD, the operation is not performed instantly. Transformations in Spark are not evaluated until you perform an action; this is known as lazy evaluation, and it helps optimize the overall data processing workflow.
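A quick sketch of this behaviour, assuming an existing SparkContext sc:
nums = sc.parallelize(range(10))
doubled = nums.map(lambda x: x * 2)            # transformation: nothing is executed yet
evens = doubled.filter(lambda x: x % 4 == 0)   # still nothing executed
print(evens.count())                           # action: the whole chain runs now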
Q) What are the various functionalities supported by Spark Core?
Answer) Spark Core is the engine for parallel and distributed processing of large data sets. The various functionalities supported by Spark Core include:
Scheduling and monitoring jobs
Memory management
Fault recovery
Task dispatching
Q) How do you convert a Spark RDD into a DataFrame?
Answer) There are 2 ways to convert a Spark RDD into a DataFrame:
Using the helper function toDF (the Scala example below uses the MapR-DB Spark connector):
import com.mapr.db.spark.sql._
val df = sc.loadFromMapRDB(<table-name>)
  .where(field("first_name") === "Peter")
  .select("_id", "first_name").toDF()
Using SparkSession.createDataFrame
You can convert an RDD[Row] to a DataFrame by calling createDataFrame on a SparkSession object:
def createDataFrame(rowRDD: RDD[Row], schema: StructType): DataFrame
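In PySpark, the same idea looks like the sketch below; the column names and sample data are made up for illustration.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("rdd-to-df").getOrCreate()
rdd = spark.sparkContext.parallelize([("Peter", 30), ("Asha", 28)])

# Option 1: the toDF helper (available once a SparkSession exists)
df1 = rdd.toDF(["first_name", "age"])

# Option 2: createDataFrame with an explicit schema
schema = StructType([
    StructField("first_name", StringType(), True),
    StructField("age", IntegerType(), True),
])
df2 = spark.createDataFrame(rdd, schema)
df2.show()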
Q) Explain the types of operations supported by RDDs.
Answer) RDDs support 2 types of operation:
Transformations: Transformations are operations that are performed on an RDD to create a new RDD containing the results (Example: map, filter, join, union)
Actions: Actions are operations that return a value after running a computation on an RDD (Example: reduce, first, count)
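For example, assuming an existing SparkContext sc:
words = sc.parallelize(["spark", "rdd", "dag", "spark"])
more = sc.parallelize(["yarn", "mesos"])

# Transformations: return new RDDs and are evaluated lazily
upper = words.map(lambda w: w.upper())
long_words = upper.filter(lambda w: len(w) > 3)
combined = long_words.union(more.map(lambda w: w.upper()))

# Actions: trigger computation and return values to the driver
print(combined.count())
print(combined.first())
print(words.map(lambda w: len(w)).reduce(lambda a, b: a + b))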
Q) What is a Lineage Graph?
Answer) A lineage graph is the graph of dependencies between an existing RDD and a new RDD, meaning that all the dependencies between RDDs are recorded in a graph rather than in the original data.
An RDD lineage graph is needed when we want to compute a new RDD or recover lost data from a lost persisted RDD. Spark does not replicate data in memory, so if any data is lost it can be rebuilt using the RDD lineage. It is also called an RDD operator graph or an RDD dependency graph.
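You can inspect an RDD's lineage with toDebugString, as in this small sketch (assuming an existing SparkContext sc):
lineage_rdd = sc.parallelize(range(100)).map(lambda x: x * 2).filter(lambda x: x > 50)
debug = lineage_rdd.toDebugString()
# toDebugString may return bytes depending on the PySpark version
print(debug.decode() if isinstance(debug, bytes) else debug)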
Q) Does Apache Spark provide checkpoints?
Answer) Yes, Apache Spark provides an API for adding and managing checkpoints. Checkpointing is the process of making streaming applications resilient to failures. It allows you to save data and metadata to a checkpointing directory. In case of a failure, Spark can recover this data and start from wherever it stopped.
There are 2 types of data for which we can use checkpointing in Spark.
Metadata Checkpointing: Metadata means the data about data. It refers to saving the metadata to fault-tolerant storage like HDFS. Metadata includes configurations, DStream operations, and incomplete batches.
Data Checkpointing: Here, we save the RDD to reliable storage because it is needed in some of the stateful transformations, where an upcoming RDD depends on the RDDs of previous batches.
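For RDD data checkpointing, a minimal sketch looks like this; the checkpoint directory path is a placeholder and should point to fault-tolerant storage such as HDFS in production.
sc.setCheckpointDir("/tmp/spark-checkpoints")
state = sc.parallelize(range(1000)).map(lambda x: (x % 10, x))
state.checkpoint()             # mark the RDD for checkpointing
state.count()                  # an action triggers the actual checkpoint
print(state.isCheckpointed())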
Q) What is Executor Memory in a Spark application?
Answer) Every Spark application has the same fixed heap size and fixed number of cores for each Spark executor. The heap size is what is referred to as the Spark executor memory, which is controlled with the spark.executor.memory property or the --executor-memory flag. Every Spark application has one executor on each worker node. The executor memory is basically a measure of how much of the worker node's memory the application will utilize.
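Both ways of setting it are sketched below, with arbitrary example values.
# On the command line:
#   spark-submit --executor-memory 4g --executor-cores 2 my_app.py
# Or programmatically through SparkConf before the context is created:
from pyspark import SparkConf, SparkContext
conf = (SparkConf()
        .setAppName("executor-memory-demo")
        .set("spark.executor.memory", "4g")
        .set("spark.executor.cores", "2"))
sc = SparkContext.getOrCreate(conf)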
Q) What is Spark Driver?
Answer) The Spark Driver is the program that runs on the master node and declares transformations and actions on RDDs of data. In simple terms, the driver in Spark creates the SparkContext, connected to a given Spark master.
The driver also delivers the RDD graphs to Master, where the standalone cluster manager runs.
Q) What file systems does Spark support?
Answer) The following three file systems are supported by Spark:
Hadoop Distributed File System (HDFS).
Local File system.
Amazon S3
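For example, the same textFile API works across these file systems; the URIs below are placeholders.
local_rdd = sc.textFile("file:///tmp/data/events.log")
hdfs_rdd = sc.textFile("hdfs://namenode:8020/data/events.log")
s3_rdd = sc.textFile("s3a://my-bucket/data/events.log")  # needs the Hadoop AWS connector and credentials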
Q) What is Spark Executor?
Answer) When SparkContext connects to a cluster manager, it acquires an Executor on nodes in the cluster. Executors are Spark processes that run computations and store the data on the worker node. The final tasks by SparkContext are transferred to executors for their execution.
Q) When running Spark applications, is it necessary to install Spark on all the nodes of YARN cluster?
Answer) Spark need not be installed on every node when running a job under YARN or Mesos, because Spark can execute on top of YARN or Mesos clusters without requiring any change to the cluster.
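For instance, submitting a PySpark application to an existing YARN cluster only requires the Spark client and the Hadoop configuration on the submitting (edge) node; the script name and resource values below are placeholders.
spark-submit --master yarn --deploy-mode cluster --num-executors 4 --executor-memory 4g my_app.py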
Q) What do you understand by SchemaRDD in Apache Spark RDD?
Answer) SchemaRDD is an RDD that consists of row objects (wrappers around the basic string or integer arrays) with schema information about the type of data in each column.
SchemaRDD was designed as an attempt to make life easier for developers in their daily routines of code debugging and unit testing on SparkSQL core module. The idea can boil down to describing the data structures inside RDD using a formal description similar to the relational database schema. On top of all basic functions provided by common RDD APIs, SchemaRDD also provides some straightforward relational query interface functions that are realized through SparkSQL.
Thanks much!
If any mistakes are found in this article, please comment below and I will edit the page.