Partition By vs Bucket By: Which One Should You Choose?

Big data refers to the ever-increasing volumes of data being generated today. As the amount of data grows, so does the need to store and process it efficiently. In this blog, we will discuss two important techniques, partitioning and bucketing, which help address these challenges when working with big data in Apache Spark.

Partitioning in Spark

Partitioning is a way to split data into separate folders on disk based on one or multiple columns. This enables efficient parallelism and partition pruning in Spark. Partition pruning is a technique used to optimize queries by skipping reading parts of the data that are not required.

In Spark, partitioning is implemented by the .partitionBy() method of the DataFrameWriter class. To partition a dataset, you provide the method with one or more columns to partition by. The dataset is then written to disk split by the partitioning columns, with each partition value saved into a separate folder. Each folder can contain multiple files; the number of files per folder depends on how many in-memory partitions (tasks) hold data for that value at write time, which you can influence by repartitioning the DataFrame before writing (or, if a shuffle precedes the write, via spark.sql.shuffle.partitions).

Here is an example of how to partition a dataset in Spark:

from pyspark.sql import SparkSession


# Create a SparkSession
spark = SparkSession.builder.appName("PartitioningExample").getOrCreate()


# Load a dataset
df = spark.read.format("csv").option("header", "true").load("path/to/dataset")


# Partition the dataset by the "date" column
df.write.partitionBy("date").format("parquet").save("path/to/partitioned/dataset")        

In the above example, we loaded a dataset and partitioned it by the "date" column using the .partitionBy() method. The resulting dataset is written as Parquet files under the specified directory, with one subfolder per distinct date value (for example, date=2023-01-01).
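To see the benefit of this layout, we can read the partitioned dataset back and filter on the partition column. The sketch below assumes the dataset written above; the path and the date value are placeholders. Thanks to partition pruning, Spark only scans the folder matching the filter, which you can verify with .explain(), where the file scan lists the partition filters being applied.

# Read the partitioned dataset back from disk
partitioned_df = spark.read.format("parquet").load("path/to/partitioned/dataset")


# Filter on the partition column: only the matching date=... folder is read
result = partitioned_df.filter(partitioned_df["date"] == "2023-01-01")


# Inspect the physical plan; the scan shows the partition filters being applied
result.explain()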

Bucketing in Spark

Bucketing is a way to assign rows of a dataset to specific buckets and collocate them on disk. This enables efficient wide transformations in Spark, because rows that share a bucket key are already collocated, so Spark can avoid shuffling them across the cluster. Wide transformations are operations that require shuffling data across partitions, which can be a costly operation.

In Spark, bucketing is implemented by the .bucketBy() method of the DataFrameWriter class. To bucket a dataset, you need to provide the method with the number of buckets you want to create and the column to bucket by. The bucket number for a given row is assigned by computing a hash of the bucket column and taking that hash modulo the number of buckets.
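Conceptually, the bucket assignment works like the small sketch below. This is only an illustration of the hash-then-modulo idea: Spark uses its own internal hash function for bucketing, not Python's built-in hash, so the actual bucket numbers it produces will differ.

# Illustrative only: Spark's real bucketing hash function is different
num_buckets = 10

def assign_bucket(key, num_buckets):
    # Hash the bucketing column value, then take the hash modulo the bucket count
    return hash(key) % num_buckets

for key in [101, 205, 999]:
    print(key, "-> bucket", assign_bucket(key, num_buckets))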

Here is an example of how to bucket a dataset in Spark:

from pyspark.sql import SparkSession


# Create a SparkSession
spark = SparkSession.builder.appName("BucketingExample").getOrCreate()


# Load a dataset
df = spark.read.format("csv").option("header", "true").load("path/to/dataset")


# Bucket the dataset by the "id" column into 10 buckets and save it as a table
df.write.bucketBy(10, "id").sortBy("id").format("parquet").saveAsTable("bucketed_dataset")


In the above example, we loaded a dataset and bucketed it by the "id" column into 10 buckets using the .bucketBy() method, sorting the rows within each bucket by "id" with .sortBy(). Note that the bucketed output is saved with .saveAsTable() rather than .save(): the bucketing metadata must be stored in a metastore, so writing a bucketed dataset directly to a file path is not supported and raises an error.


When to use partitioning and bucketing?

If you will often filter on a given column and it has low cardinality, partition on that column. If you will be performing wide operations such as joins, groupBys, and windowing on a column with high cardinality, consider bucketing on that column.

However, bucketing is complicated and requires careful consideration of nuances and caveats. For example, for a bucketed join to avoid a shuffle, both datasets typically need to be bucketed on the join key with the same number of buckets. Additionally, bucketing can only be used when the data is saved as a table, because the bucket metadata needs to be stored somewhere, usually in the Hive metastore.
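As a concrete illustration, here is a hedged sketch of a bucketed join; the table names, column names, and paths are placeholders. Both DataFrames are written as bucketed tables on the join key with the same number of buckets, and the join is then performed on the catalog tables. With both sides bucketed this way, the physical plan (shown by .explain()) should perform a sort-merge join without an Exchange (shuffle) step.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("BucketedJoinExample").getOrCreate()


# Load two datasets that share a join key
orders = spark.read.format("csv").option("header", "true").load("path/to/orders")
customers = spark.read.format("csv").option("header", "true").load("path/to/customers")


# Write both as bucketed tables with the same bucket column and bucket count
orders.write.bucketBy(10, "customer_id").sortBy("customer_id").format("parquet").saveAsTable("orders_bucketed")
customers.write.bucketBy(10, "customer_id").sortBy("customer_id").format("parquet").saveAsTable("customers_bucketed")


# Join the bucketed tables; the plan should not contain a shuffle on either side
joined = spark.table("orders_bucketed").join(spark.table("customers_bucketed"), "customer_id")
joined.explain()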

Conclusion

When working with big data in Spark, it is important to consider how the data is stored both on disk and in memory. Partitioning and bucketing are two procedures that can help optimize the storage and processing of large datasets. Partitioning enables efficient parallelism and partition pruning, while bucketing enables efficient wide transformations. However, bucketing is a complicated procedure that requires careful consideration of the nuances and caveats involved.

In summary, partitioning and bucketing are important tools to have in your big data arsenal when working with Spark. By using these techniques, you can optimize the storage and processing of large datasets, making your data processing pipelines faster and more efficient.

