#24: 10 Majorly Used Transformations in RDDs (Resilient Distributed Datasets)

Here are 10 commonly used transformations on RDDs (Resilient Distributed Datasets) in Apache Spark:

map(func):

  • Applies a function to each element of the RDD.
  • Example: rdd.map(lambda x: x * 2)

filter(func):

  • Filters elements based on a predicate function.
  • Example: rdd.filter(lambda x: x % 2 == 0)

flatMap(func):

  • Similar to map, but each input item can be mapped to 0 or more output items.
  • Example: rdd.flatMap(lambda x: (x, x*2))

reduceByKey(func):

  • Combines values with the same key using a specified reduce function.
  • Example: rdd.reduceByKey(lambda x, y: x + y)

groupByKey():

  • Groups the values for each key in the RDD into a single sequence.
  • Example: rdd.groupByKey()

sortByKey():

  • Sorts a key-value RDD by key, ascending by default.
  • Example: rdd.sortByKey()

join(otherRDD):

  • Performs an inner join between two RDDs based on their keys.
  • Example: rdd1.join(rdd2)

distinct():

  • Returns a new RDD containing distinct elements from the original RDD.
  • Example: rdd.distinct()

mapPartitions(func):

  • Similar to map, but the function is applied once per partition: it receives an iterator over the partition's elements and must return an iterable.
  • Example: rdd.mapPartitions(lambda partition: [x*2 for x in partition])

cogroup(otherRDD):

  • For each key present in either RDD, groups that key's values from both RDDs into a pair of iterables (a full outer grouping).
  • Example: rdd.cogroup(otherRDD)

These transformations are commonly used in Spark applications for various data processing tasks, such as filtering, mapping, aggregating, joining, and sorting data distributed across a cluster.
