Anatomy of Apache Spark's RDD

Every Spark application consists of a driver program that runs the user’s main function and executes various parallel operations on a cluster. The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. 

Resilient Distributed Datasets (RDDs) are the fundamental abstraction in Apache Spark. RDDs are immutable collections representing datasets, with built-in reliability and failure recovery. Because they are immutable, every transformation produces a new RDD rather than modifying an existing one. Each RDD also stores its lineage, which is used to recover from failures.

The following is a simple example of the RDD lineage:

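As a minimal sketch (assuming a SparkContext sc is in scope, as in spark-shell), a short chain of transformations builds a lineage that can be printed with toDebugString:

// Each transformation produces a new RDD that remembers its parent.
val numbers = sc.parallelize(1 to 100)    // base RDD from a local collection
val evens   = numbers.filter(_ % 2 == 0)  // new RDD depending on `numbers`
val doubled = evens.map(_ * 2)            // new RDD depending on `evens`

// toDebugString prints the lineage (chain of parent RDDs) Spark keeps for recovery.
println(doubled.toDebugString)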


RDD Fundamentals

Resilient: fault-tolerant with the help of the RDD lineage graph, and therefore able to recompute missing or damaged partitions caused by node failures.

Distributed: the data resides on multiple nodes.

Dataset: represents the records of the data you work with. The dataset can be loaded externally from a JSON file, CSV file, text file, or a database via JDBC, with no specific structure required. Each dataset in an RDD is logically partitioned across many servers so that it can be computed on different nodes of the cluster.

Features of RDD

In-memory computation

The data inside an RDD can be kept in memory for as long as you want. Keeping data in memory improves performance by an order of magnitude.

Lazy Evaluation

The data inside RDDs is not evaluated immediately. Transformations are computed only after an action is triggered, which limits how much work Spark has to do.
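As a minimal sketch (assuming a SparkContext sc and a hypothetical file data.txt), nothing is computed until the action at the end:

val lines    = sc.textFile("data.txt")            // no data is read yet
val filtered = lines.filter(_.contains("ERROR"))  // still nothing is computed
val count    = filtered.count()                   // action: triggers the read and the filter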

Fault Tolerance

Upon the failure of a worker node, the lost partitions of an RDD can be recomputed from the original data using the lineage of operations. Thus, lost data can be recovered easily.

Immutability

RDDs are immutable: once an RDD is created, it cannot be modified. Any transformation produces a new RDD. Immutability is how we achieve consistency.

Persistence

Frequently used RDDs can be stored in memory and retrieved directly from memory without going to disk, which speeds up execution. This lets us perform multiple operations on the same data; the data is kept in memory explicitly by calling the persist() or cache() functions.
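As a minimal sketch (assuming a SparkContext sc and a hypothetical file words.txt):

import org.apache.spark.storage.StorageLevel

val words = sc.textFile("words.txt").flatMap(_.split("\\s+"))

words.cache()                                   // shorthand for persist(StorageLevel.MEMORY_ONLY)
// words.persist(StorageLevel.MEMORY_AND_DISK) // alternative level: spill to disk if memory is tight

val totalWords    = words.count()            // first action materialises and caches the RDD
val distinctWords = words.distinct().count() // reuses the cached data instead of re-reading the file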

Partition

An RDD partitions its records logically and distributes the data across the nodes of the cluster. The logical divisions exist only for processing; the underlying data itself has no such division. This is what provides parallelism.
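As a minimal sketch (assuming a SparkContext sc), the number of partitions can be inspected and changed:

val data = sc.parallelize(1 to 1000, 8)  // request 8 logical partitions
println(data.getNumPartitions)           // 8

val fewer = data.coalesce(2)     // reduce the number of partitions without a full shuffle
val more  = data.repartition(16) // reshuffle the data into 16 partitions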

Parallel 

RDDs process the data in parallel across the cluster.

Location-Stickiness

RDDs can define placement preferences for computing their partitions. A placement preference is information about where an RDD's data is located. The DAGScheduler places tasks so that each task runs as close to its data as possible, which speeds up computation.

Coarse-grained Operation

We apply coarse-grained transformations to an RDD. Coarse-grained means the operation applies to the whole dataset, not to an individual element of the dataset.

Typed

We can have RDDs of various types, for example RDD[Int], RDD[Long], RDD[String].
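For example (a minimal sketch assuming a SparkContext sc):

import org.apache.spark.rdd.RDD

val ints: RDD[Int]       = sc.parallelize(Seq(1, 2, 3))
val longs: RDD[Long]     = sc.parallelize(Seq(1L, 2L, 3L))
val strings: RDD[String] = sc.parallelize(Seq("a", "b", "c"))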

No limitation

We can have any number of RDDs; there is no hard limit. The practical limit depends on the size of available memory and disk.

Types of RDDs

Every transformation produces a new RDD, and each of these is a concrete subclass of the RDD abstraction. A few of these types are discussed below.


Every RDD knows the following five things about itself, and this is the core concept behind the abstraction (a simplified sketch follows the list):

  1. A list of partitions
  2. A list of dependencies on parent RDDs
  3. A function to compute a partition from its parents
  4. Preferred locations for computing each partition (optional)
  5. Partitioning information for key-value pairs (optional)
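Here is a simplified sketch of how these five pieces of information appear in Spark's RDD abstraction. The method names follow the Spark source, but SketchRDD is a hypothetical class with simplified signatures, shown only for illustration:

import org.apache.spark.{Dependency, Partition, Partitioner, TaskContext}

abstract class SketchRDD[T] {
  def getPartitions: Array[Partition]                               // 1. list of partitions
  def getDependencies: Seq[Dependency[_]]                           // 2. list of dependencies on parent RDDs
  def compute(split: Partition, context: TaskContext): Iterator[T]  // 3. compute a partition from its parents
  def getPreferredLocations(split: Partition): Seq[String]          // 4. preferred locations (data locality)
  val partitioner: Option[Partitioner]                              // 5. partitioning info for key-value RDDs
}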


Let's look at some RDD types and how each one answers the five questions above:

HadoopRDD:

Reads data from Hadoop storage such as HDFS. Its partitions correspond to the input blocks, it has no parent dependencies, its compute function reads the records of a block, and its preferred locations are the nodes that hold that block.

FilterRDD:

The RDD produced by filter(). It has the same partitions as its parent, a one-to-one (narrow) dependency on that parent, and its compute function applies the filter predicate to the parent's partition. (In recent Spark versions this appears as a MapPartitionsRDD.)

JoinedRDD:

The RDD produced by joining two key-value RDDs. It has one partition per reduce task, a shuffle (wide) dependency on each parent, and a partitioner (typically a HashPartitioner), so inputs that are already co-partitioned can avoid a shuffle. (In recent Spark versions a join is built from a CoGroupedRDD.)
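As a minimal sketch (assuming a SparkContext sc and a hypothetical input file), the concrete RDD types behind common operations can be inspected at runtime:

val lines  = sc.textFile("input.txt")           // HadoopRDD wrapped in a MapPartitionsRDD
val errors = lines.filter(_.contains("ERROR"))  // MapPartitionsRDD (the filter case above)
val pairs  = errors.map(line => (line, 1))
val counts = pairs.reduceByKey(_ + _)           // ShuffledRDD (a wide dependency, as in a join)

println(errors.getClass.getSimpleName)          // concrete class of the filtered RDD
println(counts.toDebugString)                   // full chain of RDD types and their dependencies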

Now that we have understood RDDs, let's try to answer some of the most common questions.

In how many ways can an RDD be created?

An RDD can be created in multiple ways; some of the common ones are:

  1. Parallelising a collection
  2. From another RDD
  3. From an external data source (HDFS / S3 / Azure Blob)
  4. From a DataFrame / Dataset (df.rdd / ds.rdd)

Below is an example:

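A minimal sketch of the four options above (assuming a SparkSession spark, its SparkContext sc, and a hypothetical HDFS path):

val fromCollection = sc.parallelize(Seq(1, 2, 3, 4, 5))     // 1. parallelising a collection
val fromAnotherRdd = fromCollection.map(_ * 10)             // 2. from another RDD
val fromExternal   = sc.textFile("hdfs:///data/input.txt")  // 3. from an external data source
val df             = spark.range(5).toDF("id")
val fromDataFrame  = df.rdd                                 // 4. from a DataFrame / Dataset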

How can we create an empty RDD?

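A minimal sketch (assuming a SparkContext sc):

val empty               = sc.emptyRDD[String]               // empty RDD with no partitions
val emptyWithPartitions = sc.parallelize(Seq.empty[Int], 4) // empty RDD with 4 partitions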


What is the difference between textFile and wholeTextFiles?

textFile() – Reads one or more text/CSV files and returns a single RDD[String], where each element is one line of the input.

wholeTextFiles() – Reads one or more files and returns a single RDD[(String, String)], where the first value (_1) of each tuple is the file name and the second value (_2) is the content of that file.

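A minimal sketch comparing the two APIs (assuming a SparkContext sc and a hypothetical log directory):

import org.apache.spark.rdd.RDD

val lines: RDD[String] =
  sc.textFile("/data/logs/*.txt")       // one element per line, across all matched files

val files: RDD[(String, String)] =
  sc.wholeTextFiles("/data/logs/")      // one element per file: (file path, full file content)

lines.take(3).foreach(println)
files.map { case (path, content) => (path, content.length) }.take(3).foreach(println)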


What is the RDD operator graph, and how is it different from the RDD lineage graph?

The RDD operator graph is the same thing as the lineage graph: it records the lineage of an RDD and is how Spark achieves fault tolerance. It is also called the RDD dependency graph.

How can we read multiple files and create a single RDD?

We can create an RDD from multiple files in the same or different directories.

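A minimal sketch (assuming a SparkContext sc and hypothetical paths):

val fromList     = sc.textFile("/data/a.txt,/data/b.txt,/other/c.txt") // comma-separated paths
val fromWildcard = sc.textFile("/data/2021-*/events/*.log")            // glob pattern across directories
val combined     = fromList.union(fromWildcard)                        // union into a single RDD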

Conclusion:

In this short article, we have covered the essentials of RDDs: what exactly an RDD is, what kinds of RDDs exist, and some of the most common questions around them. Although 99.9% of the time you won't need to operate on RDDs directly, it is important to understand that under the hood of a DataFrame or Dataset, the basic abstraction is the RDD. As Spark developers our focus will usually stay on DataFrames, because they allow Catalyst and Tungsten to optimise our code and provide lightning-fast performance.

Thanks for reading, and I hope you found it informative. Please feel free to ask any questions; I will be happy to answer them. If you liked the content, please click the "like" button.
