
#25: Transformation and Action in Apache Spark

Mohammad Azzam

Immediate Joiner Snaplogic Developer| Python | SQL | Spark | PySpark | Databricks | SnapLogic | ADF | Glue | Redshift | S3 | AWS Certified x2 | Databricks Certified Data Engineer Associate | SnapLogic Certified

Published: April 2, 2024

In Apache Spark, there are two types of operations that can be applied to RDDs (Resilient Distributed Datasets): transformations and actions. Here's a breakdown of each:

Transformations:

Transformations create a new RDD from an existing RDD.
Transformations are lazy, meaning they do not compute their results immediately. Instead, they extend a lineage graph (a DAG) that records the sequence of transformations applied to the base dataset.
Spark keeps track of these transformations and only computes them when an action is called.
This lazy evaluation allows Spark to optimize the execution plan.
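
The lazy behaviour described above can be imitated in a few lines of plain Python. This is only a toy model under stated assumptions: `LazyDataset` is a hypothetical class written for illustration, not part of Spark, and it runs on a local list rather than a cluster. It shows that calling a transformation only records a step, and that the recorded steps run when an action is called.

```python
from typing import Callable, Iterable, List, Tuple


class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, source: List, lineage: Tuple = ()):
        self._source = source
        self._lineage = lineage  # recorded steps; nothing is computed here

    # Transformations: return a new dataset with one more recorded step.
    def map(self, fn: Callable) -> "LazyDataset":
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, pred: Callable) -> "LazyDataset":
        return LazyDataset(self._source, self._lineage + (("filter", pred),))

    # Action: replay the recorded lineage and return a concrete result.
    def collect(self) -> List:
        data: Iterable = self._source
        for kind, fn in self._lineage:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)


calls = []  # records every time the mapped function actually runs


def double(x):
    calls.append(x)
    return x * 2


base = LazyDataset([1, 2, 3, 4])
pipeline = base.map(double).filter(lambda x: x > 4)

ran_before_action = len(calls)    # still 0: only the lineage was recorded
result = pipeline.collect()       # the action triggers the computation
print(ran_before_action, result)  # 0 [6, 8]
```

Note that `base` is left unchanged: each transformation returns a new dataset, which mirrors how a transformation creates a new RDD instead of modifying the existing one.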

Common transformations include map(), filter(), flatMap(), reduceByKey(), join(), groupByKey(), etc. These transformations typically perform data processing tasks like filtering, mapping, aggregating, joining, and sorting data.
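
To make the meaning of a few of these transformations concrete, here is a rough local analogue in plain Python. It is a sketch, not Spark code: the input `lines` is made-up sample data, everything runs on ordinary lists in one process, and the comments name the Spark transformation each step is meant to mimic.

```python
from collections import defaultdict
from functools import reduce
from itertools import chain

lines = ["a b a", "b c"]  # made-up sample input

# flatMap(): each input element may produce several output elements
words = list(chain.from_iterable(line.split() for line in lines))

# map(): exactly one output element per input element
pairs = [(word, 1) for word in words]

# filter(): keep only the elements that satisfy a predicate
kept = [(word, n) for word, n in pairs if word != "c"]

# groupByKey(): gather all values that share a key
grouped = defaultdict(list)
for key, value in kept:
    grouped[key].append(value)

# reduceByKey(): combine the values of each key with a binary function
counts = {key: reduce(lambda x, y: x + y, values)
          for key, values in grouped.items()}

print(words)   # ['a', 'b', 'a', 'b', 'c']
print(counts)  # {'a': 2, 'b': 2}
```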

Actions:

Actions, on the other hand, trigger the execution of the transformations and produce some output.
When an action is called on an RDD, Spark evaluates the lineage graph and computes the result, which might involve executing the transformations on the distributed dataset across the cluster.
Actions are the operations that initiate the actual computation and return the results to the driver program or write data to external storage.
Examples of actions include collect(), count(), reduce(), take(), saveAsTextFile(), foreach(), etc.
These actions perform tasks such as collecting all elements to the driver, counting the elements in an RDD, reducing the elements to a single result, taking the first n elements, saving an RDD to external storage, or executing a function on each element of the RDD.
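
As a rough guide to what several of these actions return, the snippet below mirrors them on a plain Python list. It is a local analogue only, with made-up sample data: nothing here calls Spark, and in a real job the elements would be spread across the cluster rather than held in one list.

```python
from functools import reduce
from itertools import islice

data = [5, 3, 8, 1]  # stands in for the elements of an RDD

collected = list(data)                    # collect(): every element, as a list
count = len(data)                         # count(): number of elements
total = reduce(lambda a, b: a + b, data)  # reduce(): fold to a single value
first_two = list(islice(data, 2))         # take(2): the first 2 elements

# foreach(): run a function on each element; it returns nothing
for x in data:
    print("element:", x)

print(collected, count, total, first_two)  # [5, 3, 8, 1] 4 17 [5, 3]
```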

In summary, transformations are used to build a directed acyclic graph (DAG) of computation, describing how data is transformed from one RDD to another, while actions execute the computations and produce final results or write data to external storage.

