Introduction to Apache Spark

Apache Spark is a distributed computing framework.

Before going into Apache Spark, let us understand the challenges with MapReduce and why Apache Spark came into existence.

Challenges of MapReduce

  • It is very hard to write code in MapReduce.
  • It supports only batch processing. To use streaming, we need other integrations.
  • There is a lot of disk I/O, which leads to performance degradation.
  • We have to learn a lot of other frameworks such as Hive, Pig, Sqoop, etc.
  • It consists of only Map and Reduce operations.
  • There is no interactive mode.

What is Apache Spark?

Apache Spark is a plug-and-play compute engine. Spark does not come with its own storage or resource manager.

  • We can plug in the storage of our choice, such as HDFS, Amazon S3, ADLS Gen2, Google Cloud Storage, local storage, etc.
  • We can plug in the resource manager of our choice, such as YARN, Mesos, Kubernetes, etc. (see the sketch after this list).
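A minimal sketch of this pluggability in PySpark: the master URL, app name, file paths, and bucket names below are placeholders, and reading from S3 would additionally require the hadoop-aws connector on the classpath.

```python
from pyspark.sql import SparkSession

# Pick a resource manager at session-build time; "yarn" could be swapped for
# "local[*]", a Mesos URL, or a k8s:// master URL depending on the deployment.
spark = (SparkSession.builder
         .appName("pluggable-demo")          # hypothetical app name
         .master("yarn")
         .getOrCreate())

# The same read API works against different storage backends; the paths and
# bucket names below are placeholders for illustration.
df_hdfs  = spark.read.text("hdfs:///data/sample.txt")     # HDFS
df_s3    = spark.read.text("s3a://my-bucket/sample.txt")  # Amazon S3 (needs hadoop-aws)
df_local = spark.read.text("file:///tmp/sample.txt")      # local file system
```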

Many people think that Spark is an alternative to Hadoop, but that is wrong. Apache Spark is an alternative to MapReduce within the Hadoop ecosystem.

It is a General Purpose, In-Memory Compute Engine.

General Purpose

  • There is no need to write separate code for batch processing and streaming. We can use SQL style, DataFrame style, etc., and do querying, streaming, and cleaning without other tools (a small sketch follows below).
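As a rough illustration, the sketch below applies the same DataFrame-style aggregation to a static file and to Spark's built-in "rate" streaming test source; the CSV path and the event_type column are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("general-purpose-demo").getOrCreate()

# Batch: read a static file (placeholder path and column name) and aggregate.
batch_df = spark.read.option("header", True).csv("/tmp/events.csv")
batch_df.groupBy("event_type").count().show()

# Streaming: the built-in "rate" source emits rows continuously; the same
# DataFrame-style groupBy/count is reused, only the source and sink differ.
stream_df = spark.readStream.format("rate").option("rowsPerSecond", 5).load()
query = (stream_df
         .groupBy((col("value") % 10).alias("bucket")).count()
         .writeStream
         .outputMode("complete")
         .format("console")
         .start())
# query.awaitTermination()  # block here if running as a standalone script
```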

In-Memory

  • In MapReduce, if we chain 10 MapReduce jobs, there will be about 20 disk I/Os: MapReduce1 reads its input from HDFS and writes its output to HDFS, MapReduce2 reads MapReduce1's output from HDFS, processes it, writes back to HDFS, and so on.
  • In Spark, it could be as few as 2 disk I/Os. The intermediate computation happens in memory, and only the final result is written to disk (see the sketch after this list).
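A minimal PySpark sketch of that idea, assuming a log file on HDFS (the paths and log format are placeholders): the file is read once, the intermediate RDDs between the chained transformations are not written back to HDFS, and only the final counts are saved.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("in-memory-demo").getOrCreate()
sc = spark.sparkContext

# One read from distributed storage (placeholder path)...
lines = sc.textFile("hdfs:///data/access.log")

# ...then several chained transformations; the intermediate results between
# these steps stay in memory rather than being materialised to HDFS.
errors = lines.filter(lambda line: "ERROR" in line)
pairs  = errors.map(lambda line: (line.split(" ")[0], 1))  # placeholder parsing
counts = pairs.reduceByKey(lambda a, b: a + b)

# ...and one write at the end (placeholder output path).
counts.saveAsTextFile("hdfs:///output/error_counts")
```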

Compute Engine

  • It is mainly used for the computation or processing of data that is distributed across a cluster.

Apache Spark vs Databricks

Databricks is a company, and its product is also called Databricks.

It is Spark internally, but it has extra features such as:

  • Provides Apache Spark on the cloud - AWS/GCP/Azure.
  • Provides an optimized Spark environment.
  • Provides cluster management.
  • Supports the Delta Lake architecture.
  • Allows collaboration on notebooks.
  • Provides built-in security.

Apache Spark mainly provides two types of APIs:

  • Spark Core APIs - we work at the RDD level.
  • Higher-Level APIs - we can write code in DataFrame, Spark SQL, or Spark table style. These also support Structured Streaming, MLlib, and GraphX.

RDDs

  • RDD stands for Resilient Distributed Dataset.
  • It is the basic unit that holds data in Apache Spark.
  • An RDD is resilient to failures: a lost RDD can be quickly regenerated from its parent RDD.
  • RDDs are immutable. We cannot change an existing RDD; a transformation always ends up creating a new RDD (see the sketch after this list).
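A minimal sketch of these properties in PySpark (the sample numbers and app name are made up): the map() transformation leaves the original RDD untouched and returns a new one.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

rdd1 = sc.parallelize([1, 2, 3, 4, 5])   # base RDD from a small in-memory list

# Transformations never modify rdd1; they always return a new RDD.
rdd2 = rdd1.map(lambda x: x * 10)

print(rdd1.collect())   # [1, 2, 3, 4, 5]  - the original is unchanged
print(rdd2.collect())   # [10, 20, 30, 40, 50]
```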

Drawbacks of RDD

  • It does not carry any schema. It is just raw data distributed across various partitions.
  • It is not persistent, i.e., it is only visible within a session. If we close the session, we can no longer access it. It is temporary.

DataFrames

  • A distributed collection of data grouped into named columns.
  • A DataFrame is equivalent to a relational table in Spark SQL and can be created using various functions in SparkSession (a small sketch follows below).
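A minimal sketch of creating and querying a DataFrame, using made-up employee rows and column names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

# A DataFrame is a distributed collection of rows grouped into named columns.
df = spark.createDataFrame(
    [(1, "Alice", 3000), (2, "Bob", 4000)],
    ["id", "name", "salary"],
)

df.printSchema()                     # the schema travels with the data
df.filter(df.salary > 3500).show()   # relational-style operations on columns
```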

Spark SQL

  • Spark SQL is a Spark module for structured data processing.
  • Spark SQL provides Spark with more information about the structure of both the data and the computation being performed (see the sketch after this list).
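A small sketch, reusing the same made-up employee data: registering a DataFrame as a temporary view makes it queryable with plain SQL.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

df = spark.createDataFrame(
    [(1, "Alice", 3000), (2, "Bob", 4000)],
    ["id", "name", "salary"],
)

# A temporary view exposes the DataFrame to SQL queries within this session.
df.createOrReplaceTempView("employees")
spark.sql("SELECT name, salary FROM employees WHERE salary > 3500").show()
```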

Spark Table

  • A Spark table is persistent.
  • It is accessible across other sessions (a small sketch follows below).
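A minimal sketch of writing a persistent table, assuming a metastore (e.g. Hive) is configured for the session; the table and app names are made up.

```python
from pyspark.sql import SparkSession

# enableHiveSupport() assumes a Hive metastore is configured; with only the
# default in-memory catalog, the table metadata would not outlive the process.
spark = (SparkSession.builder
         .appName("spark-table-demo")
         .enableHiveSupport()
         .getOrCreate())

df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])

# saveAsTable writes the data and registers it in the catalog, so a new
# session pointing at the same metastore can still query it later.
df.write.mode("overwrite").saveAsTable("people")

spark.sql("SELECT * FROM people").show()
```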

Transformations, Actions and Utility Functions

Transformations - These are operations applied on an RDD to create a new RDD.

There are two types of Transformations:

  • Narrow Transformations - These are transformations where all the data needed to compute an output partition lives on a single input partition, so no shuffling or movement of data takes place, e.g. map(), filter().
  • Wide Transformations - These are transformations where the data needed to compute an output partition lives on multiple input partitions, so data is shuffled (moved) across partitions, e.g. groupByKey(), reduceByKey(). Both kinds appear in the sketch after this list.
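A small PySpark sketch contrasting the two (the word list is made up): map() and filter() run partition-locally, while reduceByKey() needs a shuffle to bring equal keys together.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("transformations-demo").getOrCreate()
sc = spark.sparkContext

words = sc.parallelize(["spark", "hadoop", "spark", "hive", "spark"])

# Narrow transformations: each output partition depends on exactly one input
# partition, so no shuffle is required.
pairs      = words.map(lambda w: (w, 1))
long_words = words.filter(lambda w: len(w) > 4)

# Wide transformation: rows with the same key may sit on different partitions,
# so a shuffle moves them together before the values are reduced.
counts = pairs.reduceByKey(lambda a, b: a + b)

print(counts.collect())   # e.g. [('spark', 3), ('hadoop', 1), ('hive', 1)]
```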

Actions - These are operations applied on an RDD that instruct Apache Spark to perform the computation and send the result back to the driver.

e.g. reduce(), count()

Utility Functions - These are built-in functions available for supporting operations; a combined sketch of actions and utility functions follows below.

e.g. cache(), printSchema()
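A short sketch of the action and utility examples named above, on made-up data:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("actions-demo").getOrCreate()
sc = spark.sparkContext

nums = sc.parallelize([1, 2, 3, 4, 5])

# Actions trigger the computation and return a result to the driver.
total = nums.reduce(lambda a, b: a + b)   # 15
n     = nums.count()                      # 5
print(total, n)

# Utility functions: cache() keeps the RDD in memory for reuse across actions,
# and printSchema() displays a DataFrame's structure.
nums.cache()
df = spark.createDataFrame([(1, "a")], ["id", "letter"])
df.printSchema()
```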

Why are Transformations Lazy?

Consider a file file1 in HDFS that has more than 10 billion records, and suppose we perform the following operations:

  • rdd1 = load file1 from hdfs
  • print first line from the above rdd1

If transformations were not lazy, all 10+ billion records would have been loaded into memory just to display a single record. Because they are lazy, Spark waits for an action and then reads only what it needs, as in the sketch below.
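A minimal PySpark version of that scenario (the HDFS path is a placeholder): textFile() is a lazy transformation, and only the first() action triggers reading, which touches just enough data to return one record.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()
sc = spark.sparkContext

# A transformation: nothing is read yet, Spark only records the lineage.
rdd1 = sc.textFile("hdfs:///data/file1")   # placeholder path

# An action: Spark now reads only enough of the file to return one record,
# instead of materialising billions of records in memory.
print(rdd1.first())
```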

SparkSession is the entry point for any Spark program. Before Spark 2, we had:

  • SparkContext
  • HiveContext
  • SQLContext

But from Spark 2 onward, these have been bundled under one umbrella as SparkSession.
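A minimal sketch of creating the entry point (the app name and local master are placeholders); the older contexts are still reachable from the session when needed.

```python
from pyspark.sql import SparkSession

# Since Spark 2, SparkSession is the single entry point.
spark = (SparkSession.builder
         .appName("entry-point-demo")
         .master("local[*]")       # placeholder: run locally using all cores
         .getOrCreate())

sc = spark.sparkContext            # the underlying SparkContext, if needed
print(spark.version)
```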


Credit: Sumit Mittal sir
