Spark Architecture: a 10,000-foot view
This page will give you a high-level understanding of Spark architecture and Spark components such as the CPU, Spark Context, and the cluster manager. At the end of the page I have added interview questions collected from various websites; I hope this page helps you understand the Spark architecture and prepare for architecture-related interview questions. Before looking directly at the Spark architecture, I suggest understanding RDDs and DAGs, the two core Spark abstractions; links are below.
RDD: https://www.dhirubhai.net/pulse/understanding-resilient-distributed-datasets-rdds-saikrishna-cheruvu-/
DAG: https://www.dhirubhai.net/pulse/understanding-directed-acyclic-graph-dag-saikrishna-cheruvu-/
What is Spark?
- A lightning-fast framework for both batch and real-time processing, written in Scala (runs on the JVM).
- For the Python interface (PySpark), the Py4J library is used to communicate with the JVM that runs the Scala/Java Spark code.
- In-memory computation, lazy evaluation, and parallel processing.
- Hadoop MapReduce performs batch processing only and lacks real-time execution; Spark supports both batch and real-time workloads.
- It can leverage Hadoop for both storage and resource management: it uses HDFS for storage, and it can run Spark applications on YARN as well.
- Spark can load data directly from disk, memory, and other data stores such as Amazon S3, Azure storage, Hadoop, Cassandra, or RDBMS databases.
Before jumping directly to the Spark architecture, let's first understand the hardware components and execution modules involved.
What is CPU?
Below is a simple diagram of the data flow to the CPU in Spark (hard drive >> RAM >> processor).
We can see the hard drive, RAM, and CPU (i5, i7, etc.); this structure exists in every system.
Hard drives are generally slow, low cost, and non-volatile.
RAM is generally fast and volatile in nature.
Normally, when we give any operation to the processor, the data it needs is sitting on the hard disk. The flow is: data is copied from the hard disk to RAM, and the CPU then performs the operation, whether read or write, on the data in RAM. The CPU cannot perform any operation on data that is still on the hard disk; the data must be in RAM.
Finding data on the hard disk by indexing it on the drive is a time-consuming process.
In Spark, the size of RAM is very important because data computations happen in memory.
What is Spark Context?
Below is a simple diagram of the Spark Context data flow in Spark (PySpark >> Py4J >> JVM >> Python).
Spark Context (its role in a distributed system):
The Spark Context is the entry point to Spark. It is defined in the org.apache.spark package and helps create RDDs, accumulators, and broadcast variables on the cluster.
Spark Context methods.
accumulator – Creates an accumulator variable of a given data type. Only the driver can access the accumulator's value.
applicationId – Returns the unique ID of the Spark application.
appName – Returns the app name that was given when creating the SparkContext.
broadcast – Creates a read-only variable broadcast to the entire cluster. You can broadcast a variable to a Spark cluster only once.
emptyRDD – Creates an empty RDD.
getPersistentRDDs – Returns all persisted RDDs.
getOrCreate – Creates a new SparkContext or returns the existing one.
hadoopFile – Returns an RDD of a Hadoop file.
master – Returns the master URL set while creating the SparkContext.
newAPIHadoopFile – Creates an RDD for a Hadoop file with the new API InputFormat.
sequenceFile – Gets an RDD for a Hadoop SequenceFile with the given key and value types.
setLogLevel – Changes the log level to DEBUG, INFO, WARN, ERROR, or FATAL.
textFile – Reads a text file from HDFS, the local file system, or any Hadoop-supported file system and returns an RDD.
union – Builds the union of two RDDs.
wholeTextFiles – Reads the text files in a folder from HDFS, the local file system, or any Hadoop-supported file system and returns an RDD of Tuple2. The first element of the tuple is the file name and the second element is the content of the file.
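As a quick illustration, here is a minimal PySpark sketch that exercises a few of these methods. The file paths are placeholders, so point them at files that actually exist in your environment.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[*]").setAppName("SparkContext Methods Demo")
sc = SparkContext.getOrCreate(conf)

print(sc.appName)          # app name given above
print(sc.applicationId)    # unique ID of this Spark application
print(sc.master)           # master URL, e.g. local[*]
sc.setLogLevel("WARN")     # change the log level

empty = sc.emptyRDD()                            # an empty RDD
words = sc.textFile("/tmp/words.txt")            # placeholder path
more_words = sc.textFile("/tmp/more_words.txt")  # placeholder path
combined = sc.union([words, more_words])         # union of two RDDs

shared = sc.broadcast({"env": "dev"})            # read-only variable sent to the whole cluster
counter = sc.accumulator(0)                      # accumulator; the driver reads its value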
Other notes about the Spark Context:
The driver program's responsibility is to create the Spark Context.
In a distributed environment, the Spark Context negotiates with the cluster manager for the required executors and memory.
To describe the image above, I collected the following note from the wiki page:
In the Python driver program, SparkContext uses py4j to launch a JVM and create a JavaSparkContext. Py4J is only used on the driver for local communication between the Python and Java SparkContext objects; large data transfers are performed through a different mechanism.
RDD transformations in Python are mapped to transformations on PythonRDD objects in Java. On remote worker machines, PythonRDD objects launch Python subprocesses and communicate with them using pipes, sending the user’s code and the data to be processed.
How do you create a Spark Context?
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local").setAppName("Spark Context Sample Sai")
sc = SparkContext(conf=conf)
sc.__module__
This returns the module that the Spark Context class comes from: 'pyspark.context'
from pyspark import context
context.__file__
This returns where the context module file (context.py/context.pyc) is located in your PySpark installation.
A few words about Py4J:
- Python interacts with the JVM using something called reflection.
- Before you set up the Python side of the gateway, your JVM must be ready, i.e. Py4J does not launch the JVM for you.
- To communicate with the Java process you need two things: a gateway server and an entry_point (a minimal sketch follows below).
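To make this concrete, here is a minimal standalone Py4J sketch (outside of Spark). It assumes a Java process with a py4j GatewayServer is already running on the default port; Py4J will not start that JVM for you.
from py4j.java_gateway import JavaGateway

# Connects to a GatewayServer that must already be running in a JVM
gateway = JavaGateway()

# Reflection-style access to arbitrary JVM classes and methods
print(gateway.jvm.java.lang.Math.random())
gateway.jvm.System.out.println("Hello from Python via Py4J")

# The entry point is whatever object the Java side registered with its GatewayServer
app = gateway.entry_point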
The Spark Context performs the operations below (see the sketch after this list):
- Getting the current status of the Spark application
- Cancelling a job
- Cancelling a stage
- Running a job synchronously
- Running a job asynchronously
- Accessing persistent RDDs
- Unpersisting an RDD
- Programmable dynamic allocation
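For example, a few of these operations sketched through the PySpark SparkContext (assuming sc is an existing SparkContext):
tracker = sc.statusTracker()                # current status of the application
print(tracker.getActiveJobsIds())           # IDs of currently running jobs
print(tracker.getActiveStageIds())          # IDs of currently running stages

sc.setJobGroup("etl-group", "Nightly ETL")  # tag subsequent jobs with a group ID
sc.cancelJobGroup("etl-group")              # cancel the jobs in that group
sc.cancelAllJobs()                          # or cancel everything that is running

cached = sc.parallelize(range(100)).cache()
cached.count()                              # materialize and persist the RDD
cached.unpersist()                          # unpersist it again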
Spark Architecture
The master node runs the driver program, which drives the Spark application. Your normal PySpark code with transformations, actions, and so on behaves as the driver program; if you are using the interactive shell, the shell acts as the driver program.
The driver program creates a Spark Context. Think of the Spark Context as a gateway to all Spark functionality. It is similar to a database connection: any command you execute in your database goes through the database connection, and likewise anything you do in Spark goes through the Spark Context.
The Spark Context works with the cluster manager to manage various jobs. The driver program and the Spark Context take care of job execution within the cluster.
A job is split into multiple tasks, which are distributed over the worker nodes. Whenever an RDD is created in the Spark Context, it can be distributed across various nodes and cached there.
Worker nodes are the slave nodes whose job is to execute the tasks. The tasks are executed on the partitioned RDDs on the worker nodes, and the results are returned to the Spark Context.
The Spark Context takes the job, breaks it into tasks, and distributes them to the worker nodes. These tasks work on the partitioned RDDs, perform operations, collect the results, and return them to the main Spark Context. If you increase the number of workers, you can divide the job into more partitions and execute them in parallel over multiple systems, which is much faster. With more workers, the available memory also increases, so you can cache more data and execute jobs faster.
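As a small illustration of tasks and partitions, the sketch below creates an RDD with an explicit number of partitions; each partition becomes a task executed by the workers (the numbers are arbitrary).
data = sc.parallelize(range(1000000), numSlices=8)  # 8 partitions -> 8 tasks per stage
print(data.getNumPartitions())                      # 8
print(data.map(lambda x: x * 2).sum())              # runs as parallel tasks on the workers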
Spark architecture interview questions and answers.
Q) What are the different cluster managers available in Apache Spark?
Answer) Standalone Mode: By default, applications submitted to a standalone-mode cluster run in FIFO order, and each application will try to use all available nodes. You can launch a standalone cluster either manually, by starting a master and workers by hand, or by using the provided launch scripts. It is also possible to run these daemons on a single machine for testing.
Apache Mesos: Apache Mesos is an open-source project to manage computer clusters, and can also run Hadoop applications. The advantages of deploying Spark with Mesos include dynamic partitioning between Spark and other frameworks as well as scalable partitioning between multiple instances of Spark.
Hadoop YARN: Apache YARN is the cluster resource manager of Hadoop 2. Spark can be run on YARN as well.
Kubernetes: Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications.
Q) What is a lazy evaluation in Spark?
Answer) When Spark operates on any dataset, it remembers the instructions. When a transformation such as map() is called on an RDD, the operation is not performed instantly. Transformations in Spark are not evaluated until you perform an action; this is known as lazy evaluation, and it helps optimize the overall data processing workflow.
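A quick sketch of this behaviour, assuming an existing SparkContext sc:
nums = sc.parallelize(range(10))
doubled = nums.map(lambda x: x * 2)            # transformation: nothing is executed yet
evens = doubled.filter(lambda x: x % 4 == 0)   # still nothing executed
print(evens.count())                           # action: the whole chain runs now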
Q) What are the various functionalities supported by Spark Core?
Answer) Spark Core is the engine for parallel and distributed processing of large data sets. The various functionalities supported by Spark Core include:
Scheduling and monitoring jobs
Memory management
Fault recovery
Task dispatching
Q) How do you convert a Spark RDD into a DataFrame?
Answer) There are 2 ways to convert a Spark RDD into a DataFrame:
Using the helper function toDF (the Scala example below uses the MapR-DB Spark connector):
import com.mapr.db.spark.sql._
val df = sc.loadFromMapRDB(<table-name>)
  .where(field("first_name") === "Peter")
  .select("_id", "first_name").toDF()
Using SparkSession.createDataFrame
You can convert an RDD[Row] to a DataFrame by calling createDataFrame on a SparkSession object:
def createDataFrame(rowRDD: RDD[Row], schema: StructType): DataFrame
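In PySpark, the same idea looks like the sketch below; the column names and sample data are made up for illustration.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("rdd-to-df").getOrCreate()
rdd = spark.sparkContext.parallelize([("Peter", 30), ("Asha", 28)])

# Option 1: the toDF helper (available once a SparkSession exists)
df1 = rdd.toDF(["first_name", "age"])

# Option 2: createDataFrame with an explicit schema
schema = StructType([
    StructField("first_name", StringType(), True),
    StructField("age", IntegerType(), True),
])
df2 = spark.createDataFrame(rdd, schema)
df2.show()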
Q) Explain the types of operations supported by RDDs.
Answer) RDDs support 2 types of operation:
Transformations: Transformations are operations that are performed on an RDD to create a new RDD containing the results (Example: map, filter, join, union)
Actions: Actions are operations that return a value after running a computation on an RDD (Example: reduce, first, count)
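For example, assuming an existing SparkContext sc:
words = sc.parallelize(["spark", "rdd", "dag", "spark"])
more = sc.parallelize(["yarn", "mesos"])

# Transformations: return new RDDs and are evaluated lazily
upper = words.map(lambda w: w.upper())
long_words = upper.filter(lambda w: len(w) > 3)
combined = long_words.union(more.map(lambda w: w.upper()))

# Actions: trigger computation and return values to the driver
print(combined.count())
print(combined.first())
print(words.map(lambda w: len(w)).reduce(lambda a, b: a + b))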
Q) What is a Lineage Graph?
Answer) A lineage graph is the graph of dependencies between an existing RDD and a new RDD, meaning that all the dependencies between RDDs are recorded in a graph rather than in the original data.
An RDD lineage graph is needed when we want to compute a new RDD or recover lost data from a lost persisted RDD. Spark does not replicate data in memory, so if any data is lost it can be rebuilt using the RDD lineage. It is also called an RDD operator graph or an RDD dependency graph.
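You can inspect an RDD's lineage with toDebugString, as in this small sketch (assuming an existing SparkContext sc):
lineage_rdd = sc.parallelize(range(100)).map(lambda x: x * 2).filter(lambda x: x > 50)
debug = lineage_rdd.toDebugString()
# toDebugString may return bytes depending on the PySpark version
print(debug.decode() if isinstance(debug, bytes) else debug)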
Q) Does Apache Spark provide checkpoints?
Answer) Yes, Apache Spark provides an API for adding and managing checkpoints. Checkpointing is the process of making streaming applications resilient to failures. It allows you to save data and metadata to a checkpointing directory. In case of a failure, Spark can recover this data and start from wherever it stopped.
There are 2 types of data for which we can use checkpointing in Spark.
Metadata Checkpointing: Metadata means the data about data. It refers to saving the metadata to fault-tolerant storage like HDFS. Metadata includes configurations, DStream operations, and incomplete batches.
Data Checkpointing: Here, we save the RDD to reliable storage because it is needed in some of the stateful transformations, where an upcoming RDD depends on the RDDs of previous batches.
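For RDD data checkpointing, a minimal sketch looks like this; the checkpoint directory path is a placeholder and should point to fault-tolerant storage such as HDFS in production.
sc.setCheckpointDir("/tmp/spark-checkpoints")
state = sc.parallelize(range(1000)).map(lambda x: (x % 10, x))
state.checkpoint()             # mark the RDD for checkpointing
state.count()                  # an action triggers the actual checkpoint
print(state.isCheckpointed())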
Q) What is Executor Memory in a Spark application?
Answer) Every Spark application has the same fixed heap size and fixed number of cores for each Spark executor. The heap size is what is referred to as the Spark executor memory, which is controlled with the spark.executor.memory property or the --executor-memory flag. Every Spark application has one executor on each worker node. The executor memory is basically a measure of how much of the worker node's memory the application will utilize.
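Both ways of setting it are sketched below, with arbitrary example values.
# On the command line:
#   spark-submit --executor-memory 4g --executor-cores 2 my_app.py
# Or programmatically through SparkConf before the context is created:
from pyspark import SparkConf, SparkContext
conf = (SparkConf()
        .setAppName("executor-memory-demo")
        .set("spark.executor.memory", "4g")
        .set("spark.executor.cores", "2"))
sc = SparkContext.getOrCreate(conf)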
Q) What is Spark Driver?
Answer) The Spark Driver is the program that runs on the master node and declares transformations and actions on RDDs of data. In simple terms, the driver in Spark creates the SparkContext, connected to a given Spark master.
The driver also delivers the RDD graphs to Master, where the standalone cluster manager runs.
Q) What file systems does Spark support?
Answer) The following three file systems are supported by Spark:
Hadoop Distributed File System (HDFS).
Local File system.
Amazon S3
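For example, the same textFile API works across these file systems; the URIs below are placeholders.
local_rdd = sc.textFile("file:///tmp/data/events.log")
hdfs_rdd = sc.textFile("hdfs://namenode:8020/data/events.log")
s3_rdd = sc.textFile("s3a://my-bucket/data/events.log")  # needs the Hadoop AWS connector and credentials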
Q) What is Spark Executor?
Answer) When SparkContext connects to a cluster manager, it acquires an Executor on nodes in the cluster. Executors are Spark processes that run computations and store the data on the worker node. The final tasks by SparkContext are transferred to executors for their execution.
Q) When running Spark applications, is it necessary to install Spark on all the nodes of YARN cluster?
Answer) Spark need not be installed on every node when running a job under YARN or Mesos, because Spark can execute on top of YARN or Mesos clusters without requiring any change to the cluster.
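For instance, submitting a PySpark application to an existing YARN cluster only requires the Spark client and the Hadoop configuration on the submitting (edge) node; the script name and resource values below are placeholders.
spark-submit --master yarn --deploy-mode cluster --num-executors 4 --executor-memory 4g my_app.py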
Q) What do you understand by SchemaRDD in Apache Spark RDD?
Answer) SchemaRDD is an RDD that consists of row objects (wrappers around the basic string or integer arrays) with schema information about the type of data in each column.
SchemaRDD was designed as an attempt to make life easier for developers in their daily routines of code debugging and unit testing on SparkSQL core module. The idea can boil down to describing the data structures inside RDD using a formal description similar to the relational database schema. On top of all basic functions provided by common RDD APIs, SchemaRDD also provides some straightforward relational query interface functions that are realized through SparkSQL.
Thanks much!
If any mistakes are found in this article, please comment below and I will edit the page.