Start your Journey with Apache Spark - Part 1
Neeraj Bhadani
Principal Data Engineer | Machine Learning Scientist | Mentor | Blogger
Let’s begin our journey with Apache Spark. In this blog series, I will give an overview of Apache Spark. Apache Spark is a general-purpose, in-memory computing engine. Spark can be used with Hadoop, YARN and other Big Data components to harness its power and improve the performance of your applications. It provides various high-level APIs in Scala, Java, Python, R, and SQL.
Apache Spark works in a master-slave architecture, where the master is called the "driver" and the slaves are called "workers". The sc object (an instance of the SparkContext class) is the starting point of your Spark application and runs inside the driver.
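Note: In the pyspark shell, sc is already created for you. If you are running a standalone script, you can create it yourself. Here is a minimal sketch, assuming a local run and an example application name of my own choosing:
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName("MyFirstApp").setMaster("local[*]")  # "MyFirstApp" and local mode are just for illustration
sc = SparkContext(conf=conf)  # sc is now the entry point for the examples below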
Spark Architecture
Let’s try to understand the architecture of Apache Spark:
- Apache Spark: Sometimes also called Spark Core. The abstraction provided by Spark Core is the RDD (Resilient Distributed Dataset), a collection of data distributed across the different nodes of the cluster so that it can be processed in parallel.
- Spark SQL: The abstraction provided by this library is the DataFrame, which is a relational representation of the data. It provides various functions with SQL-like capabilities, and we can also write SQL-like queries for our data analysis (a tiny sketch follows this list).
- Spark Streaming: The abstraction provided by this library is the DStream (Discretized Stream). It lets you process/transform the data in near real time, as soon as it is received.
- MLlib: This is the machine learning library, which provides various commonly used algorithms such as collaborative filtering, classification, clustering and regression.
- GraphX: This library helps us process graphs and solve problems that can be modelled with graph theory, such as PageRank, connected components etc.
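To give a quick flavour of Spark SQL, here is a minimal sketch. It assumes a SparkSession object named spark is available (recent pyspark shells create one for you), and the data is made up purely for illustration:
df = spark.createDataFrame([("Books", 12), ("DVD", 5), ("CD", 8)], ["item", "qty"])  # example data
df.filter(df.qty > 6).show()  # show only the rows where qty is greater than 6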
Let’s dig a little deeper into Apache Spark, or Spark Core, now.
The abstraction provided by Spark Core is the RDD (Resilient Distributed Dataset). An RDD is a collection of data which is distributed across the different nodes of the cluster so that it can be processed in parallel.
Note: I will use the Python API of Spark, also called PySpark, for writing the code. However, the concepts remain the same across all the APIs.
Let’s create an RDD
- Create an RDD based on a Python collection.
keywords = ['Books', 'DVD', 'CD', 'PenDrive']
key_rdd = sc.parallelize(keywords)
In the above code, "keywords" is a Python collection (a list), and we are creating an RDD from it using the "parallelize" method of the SparkContext class.
- Create an RDD based on a file
file_rdd = sc.textFile("Path_to_File")
There are two types of operations which can be performed on an RDD:
- Transformations: These operations work in a lazy fashion. When you apply a transformation on an RDD, it is not evaluated immediately; it is only recorded in the DAG (Directed Acyclic Graph) and evaluated later, when an action is executed on top of it. Some common transformations are map, filter, flatMap, groupByKey, reduceByKey etc.
- Actions: These operations are executed immediately and trigger the computation. Some common actions are first, take, count, collect etc.
Tip: RDDs are immutable in nature, so you cannot change an RDD once it is created. However, you can transform one RDD into another using the various transformations.
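Here is a quick illustration of that laziness, assuming sc is available as in the examples above:
nums = sc.parallelize([1, 2, 3, 4, 5])
doubled = nums.map(lambda x: x * 2)  # transformation: nothing runs yet, only recorded in the DAG
doubled.count()                      # action: triggers the actual computation, output : 5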
Anatomy of a Spark Job
- Application: When we submit Spark code to the cluster, it creates a Spark application.
- Job: The job is the top-level unit of execution for any Spark application. A job corresponds to an action in the Spark application.
- Stage: Jobs are divided into stages. Since transformations are lazy, they are not executed until an action is called; an action may involve one or many transformations, and the wide transformations (those that correspond to a shuffle dependency) define where a job is broken down into stages (see the short lineage example after this list).
- Task: Stages are further divided into tasks. The task is the most granular unit of execution in a Spark application; each task represents a local computation on a particular node of the Spark cluster.
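Here is a small sketch of one way to peek at this breakdown: the toDebugString method on an RDD prints its lineage, and the indentation gives a rough idea of where a shuffle (and hence a new stage) occurs. The data below is made up just for illustration.
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])  # example key-value data
counts = pairs.reduceByKey(lambda x, y: x + y)          # wide transformation, introduces a shuffle
print(counts.toDebugString())                           # prints the RDD lineage, with the stage boundary visible in the indentation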
Now that we have a fair understanding of what Spark is, its architecture, RDDs and the anatomy of a Spark application, let’s get our hands dirty with some hands-on exercises.
There are various ways to execute your Spark code, such as the shell (spark-shell or pyspark), Jupyter notebooks, Zeppelin notebooks, spark-submit etc.
Let’s create an RDD and understand some basic transformations.
- Create an RDD from a collection.
num = [1, 2, 3, 4, 5]
num_rdd = sc.parallelize(num)
Here num_rdd is an RDD based on a Python collection (list).
Transformations
As we know, transformations are lazy in nature and are not executed until an action is executed on top of them. Let’s go through the various available transformations.
- map: This will map each input element to an output element based on the function passed to map.
We already have “num_rdd” created. Let’s try to double each number in RDD.
double_rdd = num_rdd.map(lambda x : x * 2)
Note: The expression specified inside map is a function without a name, called a lambda function or anonymous function.
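If we collect the result (collect is covered in the Actions section below), we should see:
double_rdd.collect() # output : [2, 4, 6, 8, 10]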
- filter: To filter the data based on a certain condition. Let’s try to find the even numbers in num_rdd.
even = num_rdd.filter(lambda x : x % 2 == 0)
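Collecting the result gives:
even.collect() # output : [2, 4]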
- flatMap: This function is very similar to map, but can return multiple elements for each input in the given RDD.
flat_rdd = num_rdd.flatMap(lambda x : range(1,x))
This will generate a range for each element in the input RDD (num_rdd) and flatten all of them into a single RDD of elements.
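For num_rdd = [1, 2, 3, 4, 5], collecting the result gives:
flat_rdd.collect() # output : [1, 1, 2, 1, 2, 3, 1, 2, 3, 4]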
- distinct: This will return the distinct elements of an RDD.
rdd1 = sc.parallelize([10, 11, 10, 11, 12, 11])
dist_rdd = rdd1.distinct()
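Collecting the result gives (the order may vary across runs):
dist_rdd.collect() # output : [10, 11, 12] (order may vary)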
The above transformations operate on single-valued RDDs, where each element is a single value. Let’s now discuss key-value pair RDDs, where each element of the RDD is a (key, value) pair.
- reduceByKey: This function reduces the key-value pairs based on the keys, using the function given to reduceByKey. Let’s try to understand this with an example.
pairs = [("a", 5), ("b", 7), ("c", 2), ("a", 3), ("b", 1), ("c", 4)]
pair_rdd = sc.parallelize(pairs)
pair_rdd is now a key-value pair RDD.
output = pair_rdd.reduceByKey(lambda x, y : x + y)
The output RDD will contain the pairs:
[("a", 8), ("b", 8), ("c", 6)]
Let’s try to understand the contents of the output RDD. We can think of the reduceByKey operation as 2 steps.
1. It will collect all the values for a given key. So the intermediate output will be as follows:
("a", <5, 3>)
("b", <7, 1>)
("c", <2, 4>)
2. Now that we have all the values corresponding to a particular key, the values collection is reduced or aggregated using the function given to reduceByKey. In our case it is a sum function, so we get the sum of all the values for a given key. Hence the output is:
[("a", 8), ("b", 8), ("c", 6)]
- groupByKey: This is another ByKey function that operates on a (key, value) pair RDD, but it only groups the values based on the keys. In other words, it performs only the first step of reduceByKey.
grp_out = pair_rdd.groupByKey()
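The grouped values come back as an iterable for each key; converting them to lists makes the grouping easier to see:
grp_out.mapValues(list).collect() # e.g. [('a', [5, 3]), ('b', [7, 1]), ('c', [2, 4])] - order may vary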
- sortByKey: This function sorts a (key, value) pair RDD by its keys. By default, sorting is done in ascending order.
pairs = [("a", 5), ("d", 7), ("c", 2), ("b", 3)]
raw_rdd = sc.parallelize(pairs)
sortkey_rdd = raw_rdd.sortByKey()
This will sort the pairs based on the keys, so the output will be:
[("a", 5), ("b", 3), ("c", 2), ("d", 7)]
Note: For sorting in descending order, pass "ascending=False".
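For example:
raw_rdd.sortByKey(ascending=False).collect() # output : [('d', 7), ('c', 2), ('b', 3), ('a', 5)]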
- sortBy: sortBy is a more generalized function for sorting.
num = [("a", 5, 10), ("d", 7, 12), ("c", 2, 11), ("b", 3, 9)]
raw_rdd = sc.parallelize(num)
Now we have an RDD of tuples, where each tuple has 3 elements. Let’s try to sort it based on the 3rd element of each tuple.
sort_out = raw_rdd.sortBy(lambda x : x[2])
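Collecting the result gives:
sort_out.collect() # output : [('b', 3, 9), ('a', 5, 10), ('c', 2, 11), ('d', 7, 12)]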
Note: Again, for sorting in descending order, pass "ascending=False".
There are various other transformations which you can find here.
https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations
Actions
Actions are operations on RDDs which are executed immediately. Actions are fairly simple compared to transformations: a transformation returns another RDD, whereas an action returns a result to the driver, taking you out of the Spark world.
- count: This will count the number of elements in the given RDD.
num = sc.parallelize([1,2,3,4,2])
num.count() # output : 5
- first: This will return the first element of the given RDD.
num.first() # output : 1
- collect: This will return all the elements of the given RDD.
num.collect() # output : [1,2,3,4,2]
Note: We should not use the collect operation while working with large datasets, because it returns all the data, distributed across the different workers of your cluster, to the driver. All of that data travels across the network from the workers to the driver, and the driver needs to hold it all, which will hamper the performance of your application.
- take: This will return the specified number of elements.
num.take(3) # output : [1, 2, 3]
There are various other actions which you can find here.
https://spark.apache.org/docs/latest/rdd-programming-guide.html#actions