Partitioning and Bucketing in Apache Spark

  • Partitioning and bucketing are two powerful techniques in Apache Spark that help optimize data processing and query performance. Here’s a detailed look at both methods and when to use them.

Partitioning in Spark

  • Partitioning splits data into separate folders on disk based on one or multiple columns. This enables efficient parallelism and partition pruning, which optimizes queries by skipping unnecessary data.

Implementation:

  • Partitioning is done using the .partitionBy() method of the DataFrameWriter class. You specify the columns to partition by, and Spark writes each partition value to a separate folder on disk. The number of files inside each folder is determined by how many in-memory partitions of the DataFrame contain data for that value at write time, so repartitioning on the partition columns before writing is a common way to avoid producing many small files.

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("PartitioningExample").getOrCreate()

# Load a dataset
df = spark.read.format("csv").option("header", "true").load("path/to/dataset")

# Partition the dataset by the "date" column
df.write.partitionBy("date").format("parquet").save("path/to/partitioned/dataset")


In this example, the dataset is partitioned by the “date” column and written as Parquet, with one subfolder per distinct date value (for example, date=2024-01-01/).
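
To see partition pruning in action, the partitioned dataset can be read back with a filter on the partition column; the specific date value below is only illustrative. A minimal sketch:

# Read the partitioned dataset back
partitioned_df = spark.read.format("parquet").load("path/to/partitioned/dataset")

# Filtering on the partition column lets Spark skip the folders for all other dates
filtered_df = partitioned_df.filter(partitioned_df["date"] == "2024-01-01")

# The physical plan lists the pruned PartitionFilters
filtered_df.explain()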

Bucketing in Spark

  • Bucketing assigns each row to a bucket based on a hash of the bucket column and collocates rows from the same bucket on disk, which is useful for wide transformations like joins and aggregations. Bucketing reduces the need to shuffle data across partitions.

Implementation:

  • Bucketing is done using the .bucketBy() method of the DataFrameWriter class. You specify the number of buckets and the column to bucket by; the bucket for each row is computed by applying a hash function to the bucket column. Because the bucket layout is recorded in the metastore, bucketed data must be written with saveAsTable() rather than save().

A rule of thumb for choosing the number of buckets:

  • Number of buckets ≈ total dataset size / target file size (128 MB, the default HDFS block size, is a common target)
  • Total dataset size ≈ number of records × number of columns × average column size in bytes (based on each column’s data type)

For example, 100 million records × 10 columns × 8 bytes ≈ 8 GB, which at 128 MB per bucket suggests roughly 64 buckets.

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("BucketingExample").getOrCreate()

# Load a dataset
df = spark.read.format("csv").option("header", "true").load("path/to/dataset")

# Bucket the dataset by the "id" column into 10 buckets.
# Bucketed writes must go through saveAsTable(), since the bucket metadata is
# stored in the metastore; .save() does not support bucketBy.
(df.write
   .bucketBy(10, "id")
   .sortBy("id")
   .format("parquet")
   .option("path", "path/to/bucketed/dataset")
   .saveAsTable("bucketed_dataset"))
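
Where bucketing really pays off is a join between two tables that are bucketed on the join key with the same number of buckets: Spark can then join them without shuffling either side. A minimal sketch, assuming a second, hypothetical table other_bucketed_dataset written the same way (bucketed by "id" into 10 buckets):

# Read both bucketed tables from the metastore so Spark sees the bucket metadata
left = spark.table("bucketed_dataset")
right = spark.table("other_bucketed_dataset")

# Join on the bucket column; with matching bucket counts, Spark can skip the
# shuffle (no Exchange on the bucketed sides of the physical plan)
joined = left.join(right, "id")
joined.explain()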

When to Use Partitioning and Bucketing

  • Partitioning: Use partitioning when you frequently filter on a column with low cardinality, such as a date or region column. Partition pruning then skips the data that does not match the filter and speeds up queries.
  • Bucketing: Use bucketing for wide operations such as joins, groupBys, and windowing on columns with high cardinality. Bucketing reduces shuffling and sorting costs. The two techniques can also be combined, as shown in the sketch after this list.
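
A minimal sketch of that combination, assuming the dataset from the examples above contains both the "date" and the "id" column:

# Partition by the low-cardinality "date" column and bucket each partition
# by the high-cardinality "id" column
(df.write
   .partitionBy("date")
   .bucketBy(10, "id")
   .sortBy("id")
   .format("parquet")
   .saveAsTable("partitioned_and_bucketed_dataset"))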
