Partitioning and Bucketing in Apache Spark

  • Partitioning and bucketing are two powerful techniques in Apache Spark that help optimize data processing and query performance. Here’s a detailed look at both methods and when to use them.

Partitioning in Spark

  • Partitioning splits data into separate folders on disk based on one or more columns. This enables efficient parallelism and partition pruning, which optimizes queries by skipping unnecessary data.

Implementation:

  • Partitioning is done using the .partitionBy() method of the DataFrameWriter class. You specify the columns to partition by, and Spark saves each partition value in a separate folder on disk. The number of files written inside each folder depends on how many in-memory partitions of the DataFrame hold data for that value, so it is common to repartition on the partition column before writing to control file counts.

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("PartitioningExample").getOrCreate()

# Load a dataset
df = spark.read.format("csv").option("header", "true").load("path/to/dataset")

# Partition the dataset by the "date" column
df.write.partitionBy("date").format("parquet").save("path/to/partitioned/dataset")


In this example, the dataset is partitioned by the “date” column and saved as Parquet files, with one folder per distinct date value.
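To see partition pruning in action, you can read the partitioned output back and filter on the partition column; Spark then skips the folders that do not match. A minimal sketch (the specific date value is illustrative):

# Read the partitioned dataset back
partitioned_df = spark.read.format("parquet").load("path/to/partitioned/dataset")

# Filtering on the partition column lets Spark skip non-matching folders
jan_df = partitioned_df.filter(partitioned_df["date"] == "2024-01-01")  # illustrative value

# The physical plan lists the filter under PartitionFilters when pruning applies
jan_df.explain()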

Bucketing in Spark

  • Bucketing assigns rows to specific buckets and collocates them on disk, which is useful for wide transformations like joins and aggregations. Bucketing reduces the need for shuffling data across partitions.

Implementation:

  • Bucketing is done using the .bucketBy() method of the DataFrameWriter class. You specify the number of buckets and the column to bucket by; each row’s bucket is determined by a hash of the bucket column modulo the number of buckets.

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("BucketingExample").getOrCreate()

# Load a dataset
df = spark.read.format("csv").option("header", "true").load("path/to/dataset")

A common rule of thumb for choosing the number of buckets:

  • Number of buckets = total dataset size / default block size
  • Default block size = 128 MB
  • Total dataset size ≈ number of records × number of columns × average size per value (in bytes)

# Bucket the dataset by the "id" column into 10 buckets
# Note: bucketBy() requires saveAsTable(); a plain save() is not supported for bucketed output.
# The table name here is illustrative.
df.write.bucketBy(10, "id").sortBy("id").format("parquet") \
    .option("path", "path/to/bucketed/dataset") \
    .saveAsTable("bucketed_dataset")
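As a rough worked example of the rule of thumb above (the numbers are illustrative, not from a real dataset):

total_records = 100_000_000        # assumed number of records
num_columns = 10                   # assumed number of columns
bytes_per_value = 8                # assumed average size of a value
total_size_bytes = total_records * num_columns * bytes_per_value   # ~8 GB
block_size_bytes = 128 * 1024 * 1024                               # 128 MB default block size
num_buckets = total_size_bytes // block_size_bytes
print(num_buckets)  # 59, so around 60 buckets is a reasonable starting point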

When to Use Partitioning and Bucketing

  • Partitioning: Use partitioning when you frequently filter on a column with low cardinality. This helps in skipping unnecessary data and speeds up query performance.
  • Bucketing: Use bucketing for complex operations like joins, groupBys, and windowing on columns with high cardinality. Bucketing helps in reducing shuffling and sorting costs, as the sketch below shows.
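For example, when two tables are bucketed on the join key with the same number of buckets, Spark can join them bucket by bucket without shuffling either side. A minimal sketch, assuming both tables were written with bucketBy(10, "id") as above; the table names bucketed_orders and bucketed_customers are illustrative:

# Read two tables that were both bucketed into 10 buckets on "id" (hypothetical table names)
orders = spark.table("bucketed_orders")
customers = spark.table("bucketed_customers")

# Both sides are bucketed on "id" with the same bucket count,
# so Spark can perform the join without an Exchange (shuffle) on either side
joined = orders.join(customers, "id")

# Check the physical plan: the bucketed sides should show no shuffle
joined.explain()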
