Anatomy of Apache Spark's RDD
Deepak Rajak
Data Engineering / Advanced Analytics Technical Delivery Lead at Exusia, Inc.
Every Spark application consists of a driver program that runs the user’s main function and executes various parallel operations on a cluster. The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel.
Resilient Distributed Datasets (RDDs) are the fundamental object in Apache Spark. RDDs are immutable collections representing datasets, with built-in reliability and failure recovery. Because they are immutable, any operation such as a transformation produces a new RDD. Each RDD also stores its lineage, which is used to recover from failures.
The following is a simple example of the RDD lineage:
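Here is a minimal sketch in Scala (assuming a SparkContext named sc is available, e.g. in spark-shell): each transformation produces a new RDD, and toDebugString prints the resulting lineage.

// Build a small lineage: parallelize -> filter -> map
val numbers = sc.parallelize(1 to 10)     // ParallelCollectionRDD
val evens   = numbers.filter(_ % 2 == 0)  // MapPartitionsRDD
val doubled = evens.map(_ * 2)            // MapPartitionsRDD

// Print the lineage (dependency) graph of the final RDD
println(doubled.toDebugString)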
RDD Fundamentals
Resilient, i.e. fault-tolerant: with the help of the RDD lineage graph, Spark can recompute missing or damaged partitions caused by node failures.
Distributed, since the data resides on multiple nodes.
Dataset represents the records of the data you work with. A dataset can be loaded from an external source such as a JSON file, CSV file, text file, or a database via JDBC, with no specific data structure required. Every dataset in an RDD is logically partitioned across many servers so that it can be computed on different nodes of the cluster, and RDDs are fault-tolerant, i.e. they can recover themselves in the case of failure.
Features of RDD
In-memory computation
The data inside an RDD can be kept in memory for as long as you want. Keeping the data in memory improves performance by an order of magnitude.
Lazy Evaluation
The data inside RDDs is not evaluated eagerly. Transformations are computed only after an action is triggered, which limits how much work Spark actually has to do.
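A small sketch of this laziness (sc assumed; the file path is a placeholder):

val lines   = sc.textFile("data.txt")  // nothing is read yet
val lengths = lines.map(_.length)      // still nothing computed

// Only this action triggers reading the file and running the map
val total = lengths.reduce(_ + _)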
Fault Tolerance
Upon the failure of a worker node, the lineage of operations lets us recompute the lost partitions of an RDD from the original data. Thus, we can easily recover the lost data.
Immutability
RDDs are immutable in nature: once we create an RDD we cannot modify it, and any transformation we perform creates a new RDD. We achieve consistency through immutability.
Persistence
We can keep frequently used RDDs in memory and retrieve them directly without going to disk, which speeds up execution. This lets us perform multiple operations on the same data, by storing it explicitly in memory with the persist() or cache() functions.
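A sketch of explicit persistence (sc assumed, placeholder path): cache() keeps the RDD in memory after the first action, so the second action reuses it instead of recomputing.

import org.apache.spark.storage.StorageLevel

val words = sc.textFile("words.txt").flatMap(_.split(" "))
words.cache()  // shorthand for persist(StorageLevel.MEMORY_ONLY)

println(words.count())             // first action: computes and caches
println(words.distinct().count())  // reuses the cached data

// persist() also accepts other storage levels, e.g.
// words.persist(StorageLevel.MEMORY_AND_DISK)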
Partition
An RDD partitions its records logically and distributes the data across the various nodes of the cluster. The divisions are logical, for processing only; internally the data has no physical division. This is what provides parallelism.
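A quick sketch (sc assumed):

// Request 4 logical partitions explicitly
val rdd = sc.parallelize(1 to 100, numSlices = 4)
println(rdd.getNumPartitions)  // 4

// Each partition is processed by a separate task, giving parallelism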
Parallel
RDDs process data in parallel across the cluster.
Location-Stickiness
RDDs can define placement preferences for computing their partitions. A placement preference is information about the location of an RDD's data. The DAGScheduler places tasks as close to the data as possible, which speeds up computation.
Coarse-grained Operation
We apply coarse-grained transformations to an RDD. Coarse-grained means the operation applies to the whole dataset, not to an individual element of it.
Typed
We can have RDDs of various types, e.g. RDD[Int], RDD[Long], RDD[String].
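For example, in Scala the element type is part of the RDD's type (sc assumed):

import org.apache.spark.rdd.RDD

val ints: RDD[Int]       = sc.parallelize(Seq(1, 2, 3))
val longs: RDD[Long]     = sc.parallelize(Seq(1L, 2L, 3L))
val strings: RDD[String] = sc.parallelize(Seq("a", "b", "c"))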
No limitation
We can have any number of RDDs; there is no hard limit. The practical limit depends on the size of disk and memory.
Types of RDDs
Every transformation on an RDD produces a new RDD, and internally Spark implements several concrete RDD types.
Every RDD knows the following five things about itself, and that is the core concept behind it (see the sketch after the list):
- List of partitions
- List of dependencies
- Function to compute a partition from its parents
- Preferred locations (data locality)
- Partitioning information (for key-value pairs)
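These five properties correspond to members of Spark's abstract RDD class. A simplified paraphrase of that contract (not the real source, which hides some of these as protected methods) looks like this:

import org.apache.spark.{Dependency, Partition, Partitioner, TaskContext}

// Simplified sketch of org.apache.spark.rdd.RDD's core contract
abstract class SketchRDD[T] {
  def getPartitions: Array[Partition]      // 1. list of partitions
  def getDependencies: Seq[Dependency[_]]  // 2. list of dependencies on parents
  def compute(split: Partition, context: TaskContext): Iterator[T] // 3. compute a partition from its parents
  def getPreferredLocations(split: Partition): Seq[String] = Nil   // 4. preferred locations (data locality)
  val partitioner: Option[Partitioner] = None                      // 5. partitioner for key-value RDDs
}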
Let's look at some RDD types:
HadoopRDD: an RDD that reads data stored in Hadoop (e.g. HDFS) using Hadoop's InputFormat APIs; sc.textFile() is built on top of it.
FilterRDD: the RDD produced by applying a filter() transformation to a parent RDD.
JoinedRDD: the RDD produced by joining two key-value RDDs.
Now that we have understood RDDs, let's try to answer some of the most common questions.
In how many ways can an RDD be created?
Basically, an RDD can be created in multiple ways, but some of the common ways are:
- Parallelising a collection
- From another RDD
- From an External Data Source (HDFS / S3 / Azure Blob)
- From a DataFrame / Dataset (df.rdd / ds.rdd)
Below is the example:
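A sketch of all four ways (assuming a SparkSession named spark and its SparkContext sc; the paths are placeholders):

// 1. Parallelising a collection
val fromCollection = sc.parallelize(Seq(1, 2, 3, 4, 5))

// 2. From another RDD (a transformation always returns a new RDD)
val fromRdd = fromCollection.map(_ * 10)

// 3. From an external data source (HDFS / S3 / Azure Blob)
val fromFile = sc.textFile("hdfs:///data/input.txt")

// 4. From a DataFrame / Dataset
val df = spark.range(5).toDF("id")
val fromDf = df.rdd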
How can we create an empty RDD?
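Two common ways, sketched below (sc assumed):

// Empty RDD with no partitions
val empty1 = sc.emptyRDD[String]

// Empty RDD with a specific number of partitions
val empty2 = sc.parallelize(Seq.empty[String], numSlices = 3)

println(empty1.isEmpty())  // true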
What is the difference between textFile and wholeTextFiles?
textFile() – reads single or multiple text/CSV files and returns a single Spark RDD[String], where each element is one line of input.
wholeTextFiles() – reads single or multiple files and returns a single RDD[Tuple2[String, String]], where the first value (_1) in each tuple is a file name and the second value (_2) is the content of that file.
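For example (sc assumed, placeholder paths):

// One element per line, across all matched files
val lines = sc.textFile("/data/logs/*.txt")  // RDD[String]

// One element per file: (fileName, fileContent)
val files = sc.wholeTextFiles("/data/logs/") // RDD[(String, String)]
files.take(1).foreach { case (name, content) =>
  println(s"$name has ${content.length} characters")
}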
What is the RDD operator graph, and how is it different from the RDD lineage graph?
The RDD operator graph is the same thing as the lineage graph: it records how an RDD was derived, and it is the mechanism by which Spark achieves fault tolerance. It is also called the RDD dependency graph.
How can we read multiple files and create a single RDD?
We can create an RDD from multiple files in the same or different directories.
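For example, textFile() accepts comma-separated paths, wildcards, and whole directories (sc assumed, placeholder paths):

// Comma-separated list of files from different directories
val rdd1 = sc.textFile("/data/a/file1.txt,/data/b/file2.txt")

// Wildcards match multiple files in one directory
val rdd2 = sc.textFile("/data/logs/2021-*.csv")

// A whole directory at once
val rdd3 = sc.textFile("/data/logs/")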
Conclusion:
So in this short article, we have covered the whereabouts of RDDs: what exactly an RDD is, what kinds of RDDs exist, and answers to some of the most basic questions around them. Though 99.9% of the time you won't need to operate on RDDs directly, we need to understand that under the hood of a DataFrame or Dataset, the basic abstraction is the RDD. As Spark developers our focus will almost always be on the DataFrame, because it allows Catalyst and Tungsten to optimise our code and provide lightning-fast performance.
Thanks for reading, and I hope you have found it informative. Please feel free to ask any questions; I will be happy to answer them. If you liked the contents, please click the "like" button.