Partitioning and Bucketing in Apache Spark
Arabinda Mohapatra
LWD - 17th Jan 2025 || Data Engineer @ Wells Fargo || PySpark, Alteryx, AWS, Stored Procedures, Hadoop, Python, SQL, Airflow, Kafka, Iceberg, Delta Lake, Hive, BFSI, Telecom
Partitioning in Spark
Partitioning splits a dataset into separate directories on disk based on the values of one or more columns, so queries that filter on those columns can skip the irrelevant files entirely.

Implementation:
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("PartitioningExample").getOrCreate()

# Load a dataset
df = spark.read.format("csv").option("header", "true").load("path/to/dataset")

# Partition the dataset by the "date" column
df.write.partitionBy("date").format("parquet").save("path/to/partitioned/dataset")
In this example, the dataset is partitioned by the "date" column and saved as Parquet; each distinct date value becomes its own subdirectory (e.g. date=2025-01-01/) under the output path.
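To see the benefit, the partitioned output can be read back and filtered on the partition column; a minimal sketch, in which the date value 2025-01-01 is an illustrative assumption:

# Read the partitioned dataset back
partitioned_df = spark.read.format("parquet").load("path/to/partitioned/dataset")

# Filtering on the partition column prunes directories: only files under
# date=2025-01-01/ are scanned, not the full dataset
result = partitioned_df.filter(partitioned_df["date"] == "2025-01-01")

# The physical plan should show a PartitionFilters entry on "date"
result.explain()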
Bucketing in Spark
Bucketing distributes rows across a fixed number of files (buckets) by hashing one or more columns. Unlike partitioning, the file count is fixed up front, which makes bucketing a better fit for high-cardinality keys used in joins and aggregations.

Implementation:
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("BucketingExample").getOrCreate()

# Load a dataset
df = spark.read.format("csv").option("header", "true").load("path/to/dataset")

# Bucket the dataset by the "id" column into 10 buckets.
# Note: bucketBy is only supported with saveAsTable, not save(),
# so the output is registered as a table backed by the target path.
df.write.bucketBy(10, "id").sortBy("id").format("parquet") \
    .option("path", "path/to/bucketed/dataset") \
    .saveAsTable("bucketed_dataset")
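The payoff comes at join time: two tables bucketed identically on the join key can be joined without a shuffle. A minimal sketch, assuming a hypothetical second table bucketed_orders written with the same bucketBy(10, "id") spec:

# Read the bucketed tables back from the metastore
customers = spark.table("bucketed_dataset")

# Hypothetical second table, written with the same bucket spec
orders = spark.table("bucketed_orders")

# With matching bucket counts on the join key, Spark can run a
# sort-merge join without shuffling either side
joined = customers.join(orders, "id")

# Verify: the plan should show no Exchange operator above either scan
joined.explain()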
When to Use Partitioning and Bucketing
Use partitioning for low-cardinality columns that appear frequently in filters (for example, a date or region column): each distinct value becomes a directory, and Spark prunes the ones a query does not touch. Use bucketing for high-cardinality columns used as join or aggregation keys (for example, a customer id), where partitioning would create a huge number of tiny files but a fixed bucket count keeps file sizes manageable. The two can also be combined in a single write, as sketched below.
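A minimal sketch of combining both techniques, reusing the df loaded above; the table name partitioned_and_bucketed and the choice of 10 buckets are illustrative assumptions:

# Combine both techniques in one write: coarse directory pruning by
# "date", even hash distribution by "id" within each partition
(df.write
    .partitionBy("date")   # low-cardinality filter column
    .bucketBy(10, "id")    # high-cardinality join key
    .sortBy("id")
    .format("parquet")
    .saveAsTable("partitioned_and_bucketed"))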