What is an RDD?
An RDD is a fundamental data structure of Apache Spark. It represents an immutable distributed collection of objects that can be processed in parallel across a cluster.
Key Characteristics of RDDs:
- Immutable: once created, an RDD cannot be modified; transformations always produce new RDDs.
- Distributed: the data is partitioned and spread across the nodes of the cluster.
- Lazily evaluated: transformations are only recorded; nothing executes until an action is called.
- Fault-tolerant: lost partitions can be recomputed from the RDD's lineage of transformations.
- Cacheable: RDDs can be persisted in memory for fast iterative and interactive workloads.
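Lazy evaluation is worth a closer look: Spark only records transformations as a plan and runs nothing until an action asks for a result. As a rough plain-Python analogy (this is an illustration, not Spark itself), generator expressions behave the same way:

```python
# Plain-Python analogy for Spark's lazy evaluation (illustration only).
calls = []

def square(x):
    calls.append(x)  # record that real work happened
    return x * x

data = [1, 2, 3]

# Like a transformation: this builds a pipeline but computes nothing yet.
pipeline = (square(x) for x in data)
print(calls)  # [] -- no work has been done so far

# Like an action: consuming the pipeline triggers the actual computation.
result = list(pipeline)
print(result)  # [1, 4, 9]
print(calls)   # [1, 2, 3]
```

The same principle lets Spark optimize a whole chain of transformations at once instead of materializing each intermediate result.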
Creating RDDs:
You can create RDDs in two main ways:
- By parallelizing an existing collection in your driver program with sc.parallelize(...).
- By referencing an external dataset (e.g., a text file on HDFS, S3, or the local filesystem) with sc.textFile(...).
Example:
Here’s a simple example of creating and using an RDD in PySpark:
# Import necessary libraries
from pyspark import SparkContext, SparkConf
# Initialize Spark Context
conf = SparkConf().setAppName("Simple RDD Example")
sc = SparkContext.getOrCreate(conf=conf)
# Create an RDD by parallelizing a collection
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)
# Perform a transformation (map)
squared_rdd = rdd.map(lambda x: x * x)
# Perform an action (collect)
result = squared_rdd.collect()
# Print the result
print(result) # Output: [1, 4, 9, 16, 25]
Common RDD Operations:
RDD operations fall into two categories:
- Transformations (lazy, return a new RDD): map, filter, flatMap, distinct, union.
- Actions (trigger computation, return a result to the driver): collect, count, first, take, reduce.
Example of Transformations and Actions:
# Transformation: filter even numbers
even_rdd = rdd.filter(lambda x: x % 2 == 0)
# Action: collect the result
even_numbers = even_rdd.collect()
print(even_numbers) # Output: [2, 4]
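Actions like rdd.reduce(lambda a, b: a + b) fold the dataset pairwise down to a single value. Since the lambdas passed to RDD operations are ordinary Python functions, the same computations can be sketched in plain Python (an analogy for the semantics, not Spark itself):

```python
from functools import reduce

data = [1, 2, 3, 4, 5]

# Equivalent of rdd.reduce(lambda a, b: a + b):
# fold the elements pairwise into a single value.
total = reduce(lambda a, b: a + b, data)
print(total)  # 15

# Equivalent of rdd.filter(lambda x: x % 2 == 0).count():
# keep matching elements, then count them.
even_count = len([x for x in data if x % 2 == 0])
print(even_count)  # 2
```

One caveat when moving from this sketch to real Spark: the function given to rdd.reduce should be commutative and associative, because Spark combines partial results from partitions in no guaranteed order.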
RDDs are the core abstraction in Spark, allowing for distributed data processing and fault tolerance. They provide a powerful way to perform parallel computations on large datasets.
Data engineer trainee @ITI