Introduction to Apache Spark

Apache Spark is a distributed computing framework.

Before going into Apache Spark, let us understand the challenges with MapReduce and why Apache Spark came into existence.

Challenges of MapReduce

  • It is very hard to write code in MapReduce.
  • It supports only batch processing. To use streaming, we need other integrations.
  • There is a lot of disk I/O, which leads to performance degradation.
  • We have to learn a lot of other frameworks such as Hive, Pig, Sqoop, etc.
  • It consists of only Map and Reduce operations.
  • There is no interactive mode.

What is Apache Spark?

Apache Spark is a plug-and-play compute engine. Spark does not come with its own storage or resource manager.

  • We can plug in the storage of our choice, such as HDFS, Amazon S3, ADLS Gen2, Google Cloud Storage, local storage, etc.
  • We can plug in the resource manager of our choice, such as YARN, Mesos, Kubernetes, etc. (see the sketch after this list).
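A minimal sketch of this pluggability in PySpark: the master URL, app name, file paths, and bucket names below are placeholders, and reading from S3 would additionally require the hadoop-aws connector on the classpath.

```python
from pyspark.sql import SparkSession

# Pick a resource manager at session-build time; "yarn" could be swapped for
# "local[*]", a Mesos URL, or a k8s:// master URL depending on the deployment.
spark = (SparkSession.builder
         .appName("pluggable-demo")          # hypothetical app name
         .master("yarn")
         .getOrCreate())

# The same read API works against different storage backends; the paths and
# bucket names below are placeholders for illustration.
df_hdfs  = spark.read.text("hdfs:///data/sample.txt")     # HDFS
df_s3    = spark.read.text("s3a://my-bucket/sample.txt")  # Amazon S3 (needs hadoop-aws)
df_local = spark.read.text("file:///tmp/sample.txt")      # local file system
```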

Many people think that Spark is an alternative to Hadoop, but that is wrong. Apache Spark is an alternative to MapReduce within the Hadoop ecosystem.

It is a General Purpose, In-Memory Compute Engine.

General Purpose

  • There is no need to write separate code for batch processing and streaming. We can use SQL style, DataFrame style, etc., and do querying, streaming, and cleaning without other tools (a small sketch follows below).
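As a rough illustration, the sketch below applies the same DataFrame-style aggregation to a static file and to Spark's built-in "rate" streaming test source; the CSV path and the event_type column are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("general-purpose-demo").getOrCreate()

# Batch: read a static file (placeholder path and column name) and aggregate.
batch_df = spark.read.option("header", True).csv("/tmp/events.csv")
batch_df.groupBy("event_type").count().show()

# Streaming: the built-in "rate" source emits rows continuously; the same
# DataFrame-style groupBy/count is reused, only the source and sink differ.
stream_df = spark.readStream.format("rate").option("rowsPerSecond", 5).load()
query = (stream_df
         .groupBy((col("value") % 10).alias("bucket")).count()
         .writeStream
         .outputMode("complete")
         .format("console")
         .start())
# query.awaitTermination()  # block here if running as a standalone script
```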

In-Memory

  • In MapReduce, if we chain 10 MapReduce jobs, there will be about 20 disk I/Os: MapReduce1 reads its input from HDFS and writes its output to HDFS, MapReduce2 reads MapReduce1's output from HDFS, processes it, writes back to HDFS, and so on.
  • In Spark, it could be as few as 2 disk I/Os. The intermediate computation happens in memory, and only the final result is written to disk (see the sketch after this list).
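A minimal PySpark sketch of that idea, assuming a log file on HDFS (the paths and log format are placeholders): the file is read once, the intermediate RDDs between the chained transformations are not written back to HDFS, and only the final counts are saved.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("in-memory-demo").getOrCreate()
sc = spark.sparkContext

# One read from distributed storage (placeholder path)...
lines = sc.textFile("hdfs:///data/access.log")

# ...then several chained transformations; the intermediate results between
# these steps stay in memory rather than being materialised to HDFS.
errors = lines.filter(lambda line: "ERROR" in line)
pairs  = errors.map(lambda line: (line.split(" ")[0], 1))  # placeholder parsing
counts = pairs.reduceByKey(lambda a, b: a + b)

# ...and one write at the end (placeholder output path).
counts.saveAsTextFile("hdfs:///output/error_counts")
```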

Compute Engine

  • It is mainly used for the computation or processing of data that is distributed across a cluster.

Apache Spark vs Databricks

Databricks is a company, and its product is also called Databricks.

It is Spark internally, but it has extra features such as:

  • Provides Apache Spark on the cloud - AWS/GCP/Azure.
  • Provides an optimized Spark environment.
  • Provides cluster management.
  • Supports the Delta Lake architecture.
  • Allows collaboration on notebooks.
  • Provides built-in security.

Apache Spark mainly provides two types of APIs:

  • Spark Core APIs - we work at the RDD level.
  • Higher-Level APIs - we can write code in DataFrame, Spark SQL, or Spark table style. These also support Structured Streaming, MLlib, and GraphX.

RDDs

  • RDD stands for Resilient Distributed Dataset.
  • It is the basic unit that holds data in Apache Spark.
  • An RDD is resilient to failures: a lost RDD can be quickly regenerated from its parent RDD.
  • RDDs are immutable. We cannot change an existing RDD; a transformation always ends up creating a new RDD (see the sketch after this list).
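A minimal sketch of these properties in PySpark (the sample numbers and app name are made up): the map() transformation leaves the original RDD untouched and returns a new one.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

rdd1 = sc.parallelize([1, 2, 3, 4, 5])   # base RDD from a small in-memory list

# Transformations never modify rdd1; they always return a new RDD.
rdd2 = rdd1.map(lambda x: x * 10)

print(rdd1.collect())   # [1, 2, 3, 4, 5]  - the original is unchanged
print(rdd2.collect())   # [10, 20, 30, 40, 50]
```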

Drawbacks of RDD

  • It does not carry any schema. It is just raw data distributed across various partitions.
  • It is not persistent, i.e., it is only visible within a session. If we close the session, we can no longer access it. It is temporary.

DataFrames

  • A distributed collection of data grouped into named columns.
  • A DataFrame is equivalent to a relational table in Spark SQL and can be created using various functions in SparkSession (a small sketch follows below).
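A minimal sketch of creating and querying a DataFrame, using made-up employee rows and column names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

# A DataFrame is a distributed collection of rows grouped into named columns.
df = spark.createDataFrame(
    [(1, "Alice", 3000), (2, "Bob", 4000)],
    ["id", "name", "salary"],
)

df.printSchema()                     # the schema travels with the data
df.filter(df.salary > 3500).show()   # relational-style operations on columns
```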

Spark SQL

  • Spark SQL is a Spark module for structured data processing.
  • Spark SQL provides Spark with more information about the structure of both the data and the computation being performed (see the sketch after this list).
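A small sketch, reusing the same made-up employee data: registering a DataFrame as a temporary view makes it queryable with plain SQL.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

df = spark.createDataFrame(
    [(1, "Alice", 3000), (2, "Bob", 4000)],
    ["id", "name", "salary"],
)

# A temporary view exposes the DataFrame to SQL queries within this session.
df.createOrReplaceTempView("employees")
spark.sql("SELECT name, salary FROM employees WHERE salary > 3500").show()
```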

Spark Table

  • A Spark table is persistent.
  • It is accessible across other sessions (a small sketch follows below).
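A minimal sketch of writing a persistent table, assuming a metastore (e.g. Hive) is configured for the session; the table and app names are made up.

```python
from pyspark.sql import SparkSession

# enableHiveSupport() assumes a Hive metastore is configured; with only the
# default in-memory catalog, the table metadata would not outlive the process.
spark = (SparkSession.builder
         .appName("spark-table-demo")
         .enableHiveSupport()
         .getOrCreate())

df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])

# saveAsTable writes the data and registers it in the catalog, so a new
# session pointing at the same metastore can still query it later.
df.write.mode("overwrite").saveAsTable("people")

spark.sql("SELECT * FROM people").show()
```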

Transformations, Actions and Utility Functions

Transformations - These are operations applied on an RDD to create a new RDD.

There are two types of Transformations:

  • Narrow Transformations - These are transformations where all the data needed to compute an output partition lives on a single input partition, so no shuffling or movement of data takes place, e.g. map(), filter().
  • Wide Transformations - These are transformations where the data needed to compute an output partition lives on multiple input partitions, so data is shuffled (moved) across partitions, e.g. groupByKey(), reduceByKey(). Both kinds appear in the sketch after this list.
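A small PySpark sketch contrasting the two (the word list is made up): map() and filter() run partition-locally, while reduceByKey() needs a shuffle to bring equal keys together.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("transformations-demo").getOrCreate()
sc = spark.sparkContext

words = sc.parallelize(["spark", "hadoop", "spark", "hive", "spark"])

# Narrow transformations: each output partition depends on exactly one input
# partition, so no shuffle is required.
pairs      = words.map(lambda w: (w, 1))
long_words = words.filter(lambda w: len(w) > 4)

# Wide transformation: rows with the same key may sit on different partitions,
# so a shuffle moves them together before the values are reduced.
counts = pairs.reduceByKey(lambda a, b: a + b)

print(counts.collect())   # e.g. [('spark', 3), ('hadoop', 1), ('hive', 1)]
```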

Actions - These are operations applied on an RDD that instruct Apache Spark to perform the computation and send the result back to the driver.

e.g. reduce(), count()

Utility Functions - These are built-in functions available for supporting operations; a combined sketch of actions and utility functions follows below.

e.g. cache(), printSchema()
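A short sketch of the action and utility examples named above, on made-up data:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("actions-demo").getOrCreate()
sc = spark.sparkContext

nums = sc.parallelize([1, 2, 3, 4, 5])

# Actions trigger the computation and return a result to the driver.
total = nums.reduce(lambda a, b: a + b)   # 15
n     = nums.count()                      # 5
print(total, n)

# Utility functions: cache() keeps the RDD in memory for reuse across actions,
# and printSchema() displays a DataFrame's structure.
nums.cache()
df = spark.createDataFrame([(1, "a")], ["id", "letter"])
df.printSchema()
```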

Why are Transformations Lazy?

Consider a file file1 in HDFS that has more than 10 billion records, and suppose we perform the following operations:

  • rdd1 = load file1 from hdfs
  • print first line from the above rdd1

If transformations were not lazy, all 10+ billion records would have been loaded into memory just to display a single record. Because they are lazy, Spark waits for an action and then reads only what it needs, as in the sketch below.
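A minimal PySpark version of that scenario (the HDFS path is a placeholder): textFile() is a lazy transformation, and only the first() action triggers reading, which touches just enough data to return one record.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()
sc = spark.sparkContext

# A transformation: nothing is read yet, Spark only records the lineage.
rdd1 = sc.textFile("hdfs:///data/file1")   # placeholder path

# An action: Spark now reads only enough of the file to return one record,
# instead of materialising billions of records in memory.
print(rdd1.first())
```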

SparkSession is the entry point for any Spark program. Before Spark 2, we had:

  • SparkContext
  • HiveContext
  • SQLContext

But from Spark 2 onward, these have been bundled under one umbrella as SparkSession.
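A minimal sketch of creating the entry point (the app name and local master are placeholders); the older contexts are still reachable from the session when needed.

```python
from pyspark.sql import SparkSession

# Since Spark 2, SparkSession is the single entry point.
spark = (SparkSession.builder
         .appName("entry-point-demo")
         .master("local[*]")       # placeholder: run locally using all cores
         .getOrCreate())

sc = spark.sparkContext            # the underlying SparkContext, if needed
print(spark.version)
```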


Credit: Sumit Mittal sir
