Start your Journey with Apache Spark - Part 1
Neeraj Bhadani
Principal Data Engineer | Machine Learning Scientist | Mentor | Blogger
Let’s begin our journey with Apache Spark. In this blog series, I will give an overview of Apache Spark. Apache Spark is a general-purpose, in-memory computing engine. Spark can be used with Hadoop, YARN and other Big Data components to harness its power and improve the performance of your applications. It provides various high-level APIs in Scala, Java, Python, R, and SQL.
Apache Spark works in a master-slave architecture, where the master is called the "driver" and the slaves are called "workers". The sc object (an instance of the SparkContext class) is the starting point of your Spark application and runs inside the driver.
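Note: In the pyspark shell, sc is already created for you. If you are running a standalone script, you can create it yourself. Here is a minimal sketch, assuming a local run and an example application name of my own choosing:
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName("MyFirstApp").setMaster("local[*]")  # "MyFirstApp" and local mode are just for illustration
sc = SparkContext(conf=conf)  # sc is now the entry point for the examples below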
Spark Architecture
Let’s try to understand the architecture of Apache Spark:
- Apache Spark: Sometimes also called Spark Core. The abstraction provided by Spark Core is the RDD (Resilient Distributed Dataset), a collection of data distributed across the different nodes of the cluster so that it can be processed in parallel.
- Spark SQL: The abstraction provided by this library is the DataFrame, which is a relational representation of the data. It provides various functions with SQL-like capabilities, and we can also write SQL-like queries for our data analysis (a tiny sketch follows this list).
- Spark Streaming: The abstraction provided by this library is the DStream (Discretized Stream). It lets you process/transform the data in near real time, as soon as it is received.
- MLlib: This is the machine learning library, which provides various commonly used algorithms such as collaborative filtering, classification, clustering and regression.
- GraphX: This library helps us process graphs and solve problems that can be modelled with graph theory, such as PageRank, connected components etc.
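To give a quick flavour of Spark SQL, here is a minimal sketch. It assumes a SparkSession object named spark is available (recent pyspark shells create one for you), and the data is made up purely for illustration:
df = spark.createDataFrame([("Books", 12), ("DVD", 5), ("CD", 8)], ["item", "qty"])  # example data
df.filter(df.qty > 6).show()  # show only the rows where qty is greater than 6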
Let’s dig a little deeper into Apache Spark, or Spark Core, now.
The abstraction provided by Spark Core is the RDD (Resilient Distributed Dataset). An RDD is a collection of data which is distributed across the different nodes of the cluster so that it can be processed in parallel.
Note: I will use the Python API of Spark, also called PySpark, for writing the code. However, the concepts remain the same across all the APIs.
Let’s create an RDD
- Create an RDD based on a Python collection.
keywords = ['Books', 'DVD', 'CD', 'PenDrive']
key_rdd = sc.parallelize(keywords)
In the above code, "keywords" is a Python collection (a list), and we are creating an RDD from it using the "parallelize" method of the SparkContext class.
- Create an RDD based on a file
file_rdd = sc.textFile("Path_to_File")
There are two types of operations which can be performed on an RDD:
- Transformations: These operations work in a lazy fashion. When you apply a transformation on an RDD, it is not evaluated immediately; it is only recorded in the DAG (Directed Acyclic Graph) and evaluated later, when an action is executed on top of it. Some common transformations are map, filter, flatMap, groupByKey, reduceByKey etc.
- Actions: These operations are executed immediately and trigger the computation. Some common actions are first, take, count, collect etc.
Tip: RDDs are immutable in nature, so you cannot change an RDD once it is created. However, you can transform one RDD into another using the various transformations.
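Here is a quick illustration of that laziness, assuming sc is available as in the examples above:
nums = sc.parallelize([1, 2, 3, 4, 5])
doubled = nums.map(lambda x: x * 2)  # transformation: nothing runs yet, only recorded in the DAG
doubled.count()                      # action: triggers the actual computation, output : 5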
Anatomy of a Spark Job
- Application: When we submit Spark code to the cluster, it creates a Spark application.
- Job: The job is the top-level unit of execution for any Spark application. A job corresponds to an action in the Spark application.
- Stage: Jobs are divided into stages. Since transformations are lazy, they are not executed until an action is called; an action may involve one or many transformations, and the wide transformations (those that correspond to a shuffle dependency) define where a job is broken down into stages (see the short lineage example after this list).
- Task: Stages are further divided into tasks. The task is the most granular unit of execution in a Spark application; each task represents a local computation on a particular node of the Spark cluster.
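Here is a small sketch of one way to peek at this breakdown: the toDebugString method on an RDD prints its lineage, and the indentation gives a rough idea of where a shuffle (and hence a new stage) occurs. The data below is made up just for illustration.
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])  # example key-value data
counts = pairs.reduceByKey(lambda x, y: x + y)          # wide transformation, introduces a shuffle
print(counts.toDebugString())                           # prints the RDD lineage, with the stage boundary visible in the indentation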
Now that we have a fair understanding of what Spark is, its architecture, RDDs and the anatomy of a Spark application, let’s get our hands dirty with some hands-on exercises.
There are various ways to execute your Spark code, such as the shell (spark-shell or pyspark), Jupyter notebooks, Zeppelin notebooks, spark-submit etc.
Let’s create an RDD and understand some basic transformations.
- Create an RDD from a collection.
num = [1, 2, 3, 4, 5]
num_rdd = sc.parallelize(num)
Here num_rdd is an RDD based on a Python collection (list).
Transformations
As we know, transformations are lazy in nature and are not executed until an action is executed on top of them. Let’s go through the various available transformations.
- map: This will map each input element to an output element based on the function passed to map.
We already have “num_rdd” created. Let’s try to double each number in RDD.
double_rdd = num_rdd.map(lambda x : x * 2)
Note: The expression specified inside map is a function without a name, called a lambda function or anonymous function.
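If we collect the result (collect is covered in the Actions section below), we should see:
double_rdd.collect() # output : [2, 4, 6, 8, 10]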
- filter: To filter the data based on a certain condition. Let’s try to find the even numbers in num_rdd.
even = num_rdd.filter(lambda x : x % 2 == 0)
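Collecting the result gives:
even.collect() # output : [2, 4]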
- flatMap: This function is very similar to map, but can return multiple elements for each input in the given RDD.
flat_rdd = num_rdd.flatMap(lambda x : range(1,x))
This will generate a range for each element in the input RDD (num_rdd) and flatten all of them into a single RDD of elements.
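For num_rdd = [1, 2, 3, 4, 5], collecting the result gives:
flat_rdd.collect() # output : [1, 1, 2, 1, 2, 3, 1, 2, 3, 4]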
- distinct: This will return the distinct elements of an RDD.
rdd1 = sc.parallelize([10, 11, 10, 11, 12, 11])
dist_rdd = rdd1.distinct()
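Collecting the result gives (the order may vary across runs):
dist_rdd.collect() # output : [10, 11, 12] (order may vary)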
The above transformations operate on single-valued RDDs, where each element is a single value. Let’s now discuss key-value pair RDDs, where each element of the RDD is a (key, value) pair.
- reduceByKey: This function reduces the key-value pairs based on the keys, using the function given to reduceByKey. Let’s try to understand this with an example.
pairs = [("a", 5), ("b", 7), ("c", 2), ("a", 3), ("b", 1), ("c", 4)]
pair_rdd = sc.parallelize(pairs)
pair_rdd is now a key-value pair RDD.
output = pair_rdd.reduceByKey(lambda x, y : x + y)
The output RDD will contain the pairs:
[("a", 8), ("b", 8), ("c", 6)]
Let’s try to understand the contents of the output RDD. We can think of the reduceByKey operation as 2 steps.
1. It will collect all the values for a given key. So the intermediate output will be as follows:
("a", <5, 3>)
("b", <7, 1>)
("c", <2, 4>)
2. Now that we have all the values corresponding to a particular key, the values collection is reduced or aggregated using the function given to reduceByKey. In our case it is a sum function, so we get the sum of all the values for a given key. Hence the output is:
[("a", 8), ("b", 8), ("c", 6)]
- groupByKey: This is another ByKey function that operates on a (key, value) pair RDD, but it only groups the values based on the keys. In other words, it performs only the first step of reduceByKey.
grp_out = pair_rdd.groupByKey()
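The grouped values come back as an iterable for each key; converting them to lists makes the grouping easier to see:
grp_out.mapValues(list).collect() # e.g. [('a', [5, 3]), ('b', [7, 1]), ('c', [2, 4])] - order may vary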
- sortByKey: This function sorts a (key, value) pair RDD by its keys. By default, sorting is done in ascending order.
pairs = [("a", 5), ("d", 7), ("c", 2), ("b", 3)]
raw_rdd = sc.parallelize(pairs)
sortkey_rdd = raw_rdd.sortByKey()
This will sort the pairs based on the keys, so the output will be:
[("a", 5), ("b", 3), ("c", 2), ("d", 7)]
Note: For sorting in descending order, pass "ascending=False".
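For example:
raw_rdd.sortByKey(ascending=False).collect() # output : [('d', 7), ('c', 2), ('b', 3), ('a', 5)]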
- sortBy: sortBy is a more generalized function for sorting.
num = [("a", 5, 10), ("d", 7, 12), ("c", 2, 11), ("b", 3, 9)]
raw_rdd = sc.parallelize(num)
Now we have an RDD of tuples, where each tuple has 3 elements. Let’s try to sort it based on the 3rd element of each tuple.
sort_out = raw_rdd.sortBy(lambda x : x[2])
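Collecting the result gives:
sort_out.collect() # output : [('b', 3, 9), ('a', 5, 10), ('c', 2, 11), ('d', 7, 12)]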
Note: Again, for sorting in descending order, pass "ascending=False".
There are various other transformations which you can find here.
https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations
Actions
Actions are operations on RDDs which are executed immediately. Actions are fairly simple compared to transformations: a transformation returns another RDD, whereas an action returns a result to the driver, taking you out of the Spark world.
- count: This will count the number of elements in the given RDD.
num = sc.parallelize([1,2,3,4,2])
num.count() # output : 5
- first: This will return the first element of the given RDD.
num.first() # output : 1
- collect: This will return all the elements of the given RDD.
num.collect() # output : [1,2,3,4,2]
Note: We should not use the collect operation while working with large datasets, because it returns all the data, distributed across the different workers of your cluster, to the driver. All of that data travels across the network from the workers to the driver, and the driver needs to hold it all, which will hamper the performance of your application.
- take: This will return the specified number of elements.
num.take(3) # output : [1, 2, 3]
There are various other actions which you can find here.
https://spark.apache.org/docs/latest/rdd-programming-guide.html#actions