Anatomy of Apache Spark's RDD

Every Spark application consists of a driver program that runs the user’s main function and executes various parallel operations on a cluster. The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. 

Resilient Distributed Datasets (RDDs) are the fundamental abstraction in Apache Spark. RDDs are immutable collections representing datasets, with built-in reliability and failure recovery. Because they are immutable, every transformation produces a new RDD rather than modifying an existing one. Each RDD also stores its lineage, which is used to recover from failures.

The following is a simple example of the RDD lineage:

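As a minimal sketch (assuming a SparkContext sc is in scope, as in spark-shell), a short chain of transformations builds a lineage that can be printed with toDebugString:

// Each transformation produces a new RDD that remembers its parent.
val numbers = sc.parallelize(1 to 100)    // base RDD from a local collection
val evens   = numbers.filter(_ % 2 == 0)  // new RDD depending on `numbers`
val doubled = evens.map(_ * 2)            // new RDD depending on `evens`

// toDebugString prints the lineage (chain of parent RDDs) Spark keeps for recovery.
println(doubled.toDebugString)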


RDD Fundamentals

Resilient: fault-tolerant with the help of the RDD lineage graph, and therefore able to recompute missing or damaged partitions caused by node failures.

Distributed: the data resides on multiple nodes.

Dataset: represents the records of the data you work with. The dataset can be loaded externally from a JSON file, CSV file, text file, or a database via JDBC, with no specific structure required. Each dataset in an RDD is logically partitioned across many servers so that it can be computed on different nodes of the cluster.

Features of RDD

In-memory computation

The data inside an RDD can be kept in memory for as long as you want. Keeping data in memory improves performance by an order of magnitude.

Lazy Evaluation

The data inside RDDs is not evaluated immediately. Transformations are computed only after an action is triggered, which limits how much work Spark has to do.
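As a minimal sketch (assuming a SparkContext sc and a hypothetical file data.txt), nothing is computed until the action at the end:

val lines    = sc.textFile("data.txt")            // no data is read yet
val filtered = lines.filter(_.contains("ERROR"))  // still nothing is computed
val count    = filtered.count()                   // action: triggers the read and the filter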

Fault Tolerance

Upon the failure of a worker node, the lost partitions of an RDD can be recomputed from the original data using the lineage of operations. Thus, lost data can be recovered easily.

Immutability

RDDs are immutable: once an RDD is created, it cannot be modified. Any transformation produces a new RDD. Immutability is how we achieve consistency.

Persistence

Frequently used RDDs can be stored in memory and retrieved directly from memory without going to disk, which speeds up execution. This lets us perform multiple operations on the same data; the data is kept in memory explicitly by calling the persist() or cache() functions.
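As a minimal sketch (assuming a SparkContext sc and a hypothetical file words.txt):

import org.apache.spark.storage.StorageLevel

val words = sc.textFile("words.txt").flatMap(_.split("\\s+"))

words.cache()                                   // shorthand for persist(StorageLevel.MEMORY_ONLY)
// words.persist(StorageLevel.MEMORY_AND_DISK) // alternative level: spill to disk if memory is tight

val totalWords    = words.count()            // first action materialises and caches the RDD
val distinctWords = words.distinct().count() // reuses the cached data instead of re-reading the file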

Partition

An RDD partitions its records logically and distributes the data across the nodes of the cluster. The logical divisions exist only for processing; the underlying data itself has no such division. This is what provides parallelism.
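As a minimal sketch (assuming a SparkContext sc), the number of partitions can be inspected and changed:

val data = sc.parallelize(1 to 1000, 8)  // request 8 logical partitions
println(data.getNumPartitions)           // 8

val fewer = data.coalesce(2)     // reduce the number of partitions without a full shuffle
val more  = data.repartition(16) // reshuffle the data into 16 partitions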

Parallel 

RDDs process the data in parallel across the cluster.

Location-Stickiness

RDDs can define placement preferences for computing their partitions. A placement preference is information about where an RDD's data is located. The DAGScheduler places tasks so that each task runs as close to its data as possible, which speeds up computation.

Coarse-grained Operation

We apply coarse-grained transformations to an RDD. Coarse-grained means the operation applies to the whole dataset, not to an individual element of the dataset.

Typed

We can have RDDs of various types, for example RDD[Int], RDD[Long], RDD[String].
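For example (a minimal sketch assuming a SparkContext sc):

import org.apache.spark.rdd.RDD

val ints: RDD[Int]       = sc.parallelize(Seq(1, 2, 3))
val longs: RDD[Long]     = sc.parallelize(Seq(1L, 2L, 3L))
val strings: RDD[String] = sc.parallelize(Seq("a", "b", "c"))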

No limitation

We can have any number of RDDs; there is no hard limit. The practical limit depends on the size of available memory and disk.

Types of RDDs

Every transformation produces a new RDD, and each of these is a concrete subclass of the RDD abstraction. A few of these types are discussed below.


Every RDD knows the following five things about itself, and this is the core concept behind the abstraction (a simplified sketch follows the list):

  1. A list of partitions
  2. A list of dependencies on parent RDDs
  3. A function to compute a partition from its parents
  4. Preferred locations for computing each partition (optional)
  5. Partitioning information for key-value pairs (optional)
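Here is a simplified sketch of how these five pieces of information appear in Spark's RDD abstraction. The method names follow the Spark source, but SketchRDD is a hypothetical class with simplified signatures, shown only for illustration:

import org.apache.spark.{Dependency, Partition, Partitioner, TaskContext}

abstract class SketchRDD[T] {
  def getPartitions: Array[Partition]                               // 1. list of partitions
  def getDependencies: Seq[Dependency[_]]                           // 2. list of dependencies on parent RDDs
  def compute(split: Partition, context: TaskContext): Iterator[T]  // 3. compute a partition from its parents
  def getPreferredLocations(split: Partition): Seq[String]          // 4. preferred locations (data locality)
  val partitioner: Option[Partitioner]                              // 5. partitioning info for key-value RDDs
}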


Let's look at some RDD types and how each one answers the five questions above:

HadoopRDD:

Reads data from Hadoop storage such as HDFS. Its partitions correspond to the input blocks, it has no parent dependencies, its compute function reads the records of a block, and its preferred locations are the nodes that hold that block.

FilterRDD:

The RDD produced by filter(). It has the same partitions as its parent, a one-to-one (narrow) dependency on that parent, and its compute function applies the filter predicate to the parent's partition. (In recent Spark versions this appears as a MapPartitionsRDD.)

JoinedRDD:

The RDD produced by joining two key-value RDDs. It has one partition per reduce task, a shuffle (wide) dependency on each parent, and a partitioner (typically a HashPartitioner), so inputs that are already co-partitioned can avoid a shuffle. (In recent Spark versions a join is built from a CoGroupedRDD.)
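As a minimal sketch (assuming a SparkContext sc and a hypothetical input file), the concrete RDD types behind common operations can be inspected at runtime:

val lines  = sc.textFile("input.txt")           // HadoopRDD wrapped in a MapPartitionsRDD
val errors = lines.filter(_.contains("ERROR"))  // MapPartitionsRDD (the filter case above)
val pairs  = errors.map(line => (line, 1))
val counts = pairs.reduceByKey(_ + _)           // ShuffledRDD (a wide dependency, as in a join)

println(errors.getClass.getSimpleName)          // concrete class of the filtered RDD
println(counts.toDebugString)                   // full chain of RDD types and their dependencies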

Now that we have understood RDDs, let's try to answer some of the most common questions.

In how many ways can an RDD be created?

An RDD can be created in multiple ways; some of the common ones are:

  1. Parallelising a collection
  2. From another RDD
  3. From an external data source (HDFS / S3 / Azure Blob)
  4. From a DataFrame / Dataset (df.rdd / ds.rdd)

Below is an example:

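A minimal sketch of the four options above (assuming a SparkSession spark, its SparkContext sc, and a hypothetical HDFS path):

val fromCollection = sc.parallelize(Seq(1, 2, 3, 4, 5))     // 1. parallelising a collection
val fromAnotherRdd = fromCollection.map(_ * 10)             // 2. from another RDD
val fromExternal   = sc.textFile("hdfs:///data/input.txt")  // 3. from an external data source
val df             = spark.range(5).toDF("id")
val fromDataFrame  = df.rdd                                 // 4. from a DataFrame / Dataset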

How can we create an empty RDD?

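A minimal sketch (assuming a SparkContext sc):

val empty               = sc.emptyRDD[String]               // empty RDD with no partitions
val emptyWithPartitions = sc.parallelize(Seq.empty[Int], 4) // empty RDD with 4 partitions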


What is the difference between textFile and wholeTextFiles?

textFile() – Reads one or more text/CSV files and returns a single RDD[String], where each element is one line of the input.

wholeTextFiles() – Reads one or more files and returns a single RDD[(String, String)], where the first value (_1) of each tuple is the file name and the second value (_2) is the content of that file.

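A minimal sketch comparing the two APIs (assuming a SparkContext sc and a hypothetical log directory):

import org.apache.spark.rdd.RDD

val lines: RDD[String] =
  sc.textFile("/data/logs/*.txt")       // one element per line, across all matched files

val files: RDD[(String, String)] =
  sc.wholeTextFiles("/data/logs/")      // one element per file: (file path, full file content)

lines.take(3).foreach(println)
files.map { case (path, content) => (path, content.length) }.take(3).foreach(println)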


What is the RDD operator graph, and how is it different from the RDD lineage graph?

The RDD operator graph is the same thing as the lineage graph: it records the lineage of an RDD and is how Spark achieves fault tolerance. It is also called the RDD dependency graph.

How can we read multiple files and create a single RDD?

We can create an RDD from multiple files in the same or different directories.

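A minimal sketch (assuming a SparkContext sc and hypothetical paths):

val fromList     = sc.textFile("/data/a.txt,/data/b.txt,/other/c.txt") // comma-separated paths
val fromWildcard = sc.textFile("/data/2021-*/events/*.log")            // glob pattern across directories
val combined     = fromList.union(fromWildcard)                        // union into a single RDD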

Conclusion:

In this short article, we have covered the essentials of RDDs: what exactly an RDD is, what kinds of RDDs exist, and some of the most common questions around them. Although 99.9% of the time you won't need to operate on RDDs directly, it is important to understand that under the hood of a DataFrame or Dataset, the basic abstraction is the RDD. As Spark developers our focus will usually stay on DataFrames, because they allow Catalyst and Tungsten to optimise our code and provide lightning-fast performance.

Thanks for reading, and I hope you found it informative. Please feel free to ask any questions; I will be happy to answer them. If you liked the content, please click the "like" button.
