Optimize Google BigQuery


I love BigQuery, and as someone who works in data engineering, I have found it to be one of the best products offered by the Google Cloud Platform. One feature I frequently use is the MERGE statement, which enables UPSERT (update + insert) operations on BigQuery native tables.

Having spent over a year using BigQuery at work, I've discovered several effective ways to optimize the MERGE statement. In this article, I'll share each of these strategies and explain why they have proven helpful in my experience.



1. What is MERGE in BigQuery?

Let's first understand what MERGE is in the context of Google BigQuery, and what better way to understand it than with an example?

Below is an example of a BigQuery MERGE statement, also known as an upsert operation. In this example, we have two tables: source_table and target_table. We want to merge data from source_table into target_table based on a common key column (id).


Assume the schema for both tables is as follows:

source_table & target_table:

- id INTEGER

- name STRING

- age INTEGER

- programming STRING

- created DATE

- updated DATE


The goal is to update existing rows in target_table if the `id` already exists or insert new rows if the `id` doesn't exist in target_table.


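The statement itself was shown as an image in the published article. Here is a minimal sketch of what it likely looked like, reconstructed from the schema above and the explanation below (the dataset name `my_dataset` is an assumption):

```sql
-- Upsert rows from source_table into target_table, matching on id.
MERGE `my_dataset.target_table` AS target
USING `my_dataset.source_table` AS source
ON target.id = source.id
WHEN MATCHED THEN
  -- A match means the row already exists: refresh its mutable columns.
  UPDATE SET
    name        = source.name,
    age         = source.age,
    programming = source.programming,
    updated     = source.updated
WHEN NOT MATCHED THEN
  -- No match means the row is new: insert it as-is.
  INSERT (id, name, age, programming, created, updated)
  VALUES (source.id, source.name, source.age,
          source.programming, source.created, source.updated);
```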

Explanation:


  1. The `MERGE` keyword indicates the start of the merge operation.
  2. We provide aliases (`target` and `source`) for the `target_table` and `source_table`, respectively, to use them in the `WHEN` clauses.
  3. `ON target.id = source.id` specifies the condition for matching rows in both tables based on the `id` column.
  4. `WHEN MATCHED THEN` indicates what to do when a match is found. In this case, we update the existing row in the `target_table` with data from the `source_table`.
  5. `WHEN NOT MATCHED THEN` specifies what to do when there is no match. Here, we insert a new row into the `target_table` using data from the `source_table`.


By running this MERGE statement, you can efficiently synchronize data between the `source_table` and `target_table`, ensuring that any updates or new records are handled correctly. This is very beneficial when you run your pipeline daily, or even stream your data using Apache Spark or another compute engine, and want to keep only the latest version of each record in your final BigQuery native table.



2. Ways to optimize MERGE

Now let's try to understand the different ways to optimize MERGE in BigQuery. Before optimizing anything, we should ask ourselves why the optimization is needed:

  • Understanding how to optimize the MERGE process helps you get better outcomes from BigQuery.
  • It can drastically reduce the execution time of your big data pipeline.
  • Every job running on BigQuery comes with a cost, and we should always work to reduce that cost.

Enough of introduction, theory & philosophy :) Let's jump to the real thing, i.e. the engineering of the optimization! Let's note down the points, and then we will try to understand each optimization in detail.


  1. The first thing you should do is apply clustering: if possible, cluster the tables on the merge keys. Clustering reorders the data physically on disk, which can improve query performance.
  2. Optimize the BigQuery MERGE through partition pruning: BigQuery prunes partitions based on static predicates rather than dynamic predicates.
  3. If you are sure that your incoming data can only have updates in a few columns, then specify only those columns in your merge statement; don't mention all columns (a short sketch follows this list).
  4. Avoid unnecessary clustering on multiple columns.
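To illustrate point three: if you know that only a couple of columns can actually change in the incoming data, list just those in the UPDATE clause. A minimal sketch, reusing the tables from the example above and assuming only `age` and `updated` can change:

```sql
-- Only the columns that can actually change are listed in UPDATE SET;
-- untouched columns (name, programming, created) are left alone.
MERGE `my_dataset.target_table` AS target
USING `my_dataset.source_table` AS source
ON target.id = source.id
WHEN MATCHED THEN
  UPDATE SET age = source.age, updated = source.updated
WHEN NOT MATCHED THEN
  INSERT (id, name, age, programming, created, updated)
  VALUES (source.id, source.name, source.age,
          source.programming, source.created, source.updated);
```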




Let's start with the most interesting of all these optimizations, i.e. partition pruning.


If you observed the MERGE example above, we have one column named `created`. Previously, our table was not partitioned on any column. Now suppose that, instead of a simple BigQuery native table, we create a partitioned table by partitioning on the `created` column. Partitioning will significantly speed up merge operations, as it reduces the amount of data scanned during the process.


Let's see an example.

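The original image is unavailable, so here is a minimal sketch of creating the partitioned table, assuming it was built from the existing `target_table` (the table name matches the one used later in this article):

```sql
-- Same schema and data as target_table, but partitioned by the
-- `created` DATE column.
CREATE TABLE `my_dataset.target_table_partitioned`
PARTITION BY created
AS
SELECT * FROM `my_dataset.target_table`;
```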


But you might say: "Ankur, I know about partitioning, and I know it can reduce scanning, but what is this pruning?"


Let's try to understand Pruning by applying it directly.


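Again, the original image is unavailable; below is a minimal sketch of the pruned version, assuming the partition filter is applied on the `created` column (the variable names `date_min` and `date_max` follow the article's description):

```sql
-- Precompute the date range of the incoming data once, so the MERGE
-- receives constants instead of a dynamic subquery.
DECLARE date_min DATE;
DECLARE date_max DATE;

SET (date_min, date_max) = (
  SELECT AS STRUCT MIN(created), MAX(created)
  FROM `my_dataset.source_table`
);

-- The static range on the partition column lets BigQuery prune
-- partitions instead of scanning the whole target table.
MERGE `my_dataset.target_table_partitioned` AS target
USING `my_dataset.source_table` AS source
ON target.id = source.id
   AND target.created BETWEEN date_min AND date_max
WHEN MATCHED THEN
  UPDATE SET
    name        = source.name,
    age         = source.age,
    programming = source.programming,
    updated     = source.updated
WHEN NOT MATCHED THEN
  INSERT (id, name, age, programming, created, updated)
  VALUES (source.id, source.name, source.age,
          source.programming, source.created, source.updated);
```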


If you observe this MERGE statement carefully and compare it with the previous MERGE statement, there are basically two differences.

  1. We have precomputed date_min and date_max in a separate query. Observe the DECLARE statement and SET statement carefully.
  2. We have provided date_min and date_max as static values in the MERGE SQL ON clause.

This is a huge optimization, due to the fact that BigQuery prunes partitions based on static predicates rather than dynamic predicates. You will easily observe that the bytes scanned, bytes shuffled, slot time consumed, and elapsed time are all much lower than for the previous MERGE statement.

This concept applies not only to MERGE operations but also to other operations such as joins and filters.


I hope it is now clear how partition pruning works in a BigQuery MERGE statement and how it can optimize your query.




Now let's try to understand how clustering helps during MERGE.


Clustering in BigQuery helps improve the efficiency and performance of the MERGE operation by physically organizing data in a more ordered manner on disk. Clustering works in conjunction with partitioning and involves the rearrangement of data within partitions based on the values of one or more chosen columns, known as clustering columns.


When data is clustered on specific columns, it benefits queries that involve filtering, joining, and sorting based on those columns. In the context of the MERGE statement, clustering can speed up the matching process, reducing the amount of data that needs to be scanned during the merge operation.


Let's go through an example to illustrate the concept of clustering in the context of a MERGE operation:


Now we know that we have the following tables:

  1. Target table - target_table_partitioned (partitioned on the created field)
  2. Source table - source_table


Now, let's create a clustered version of `target_table_partitioned`, clustering it on the `id` column:

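The image is missing here as well; a minimal sketch of what the statement likely looked like:

```sql
-- Keep the same partitioning on `created`, and additionally cluster
-- by `id`, the column used in the MERGE join condition.
CREATE TABLE `my_dataset.target_table_partitioned_clustered`
PARTITION BY created
CLUSTER BY id
AS
SELECT * FROM `my_dataset.target_table_partitioned`;
```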


Now we will apply the same MERGE statement with the same partition pruning, but this time the target table will be `target_table_partitioned_clustered` instead of `target_table_partitioned`. Now, let's try to understand how this clustering on top of partition pruning will do wonders.


With clustering applied, the `MERGE` operation can benefit from the improved data organization on disk. During the matching process, data with the same `id` values are more likely to be located together, reducing the amount of data scanned and improving the overall performance of the merge operation. This is especially advantageous when the `id` column is used as a frequent filter for the `MERGE` condition or when the `id` values have a non-uniform distribution across the partitions.




This article has gone long, and I think points three and four are self-explanatory. I would just love it if one of my readers actually tried to explain these two points in the comments section.


Feel free to subscribe to my YouTube channel, The Big Data Show. I might upload a more detailed discussion of the above concepts in the coming days.

More so, thank you for the most precious gift to me as a writer, i.e. your time.

Anirudh Analiparambil

AVP - Data Engineering & Analytics at JPMorgan Chase & Co. | 2x Microsoft Azure Certified | 1x Google Cloud Certified | Ex - Citrix

1y

Great share

Ramaiah G

Solutions Architect | Data & AI | at Takeda

1y

BigQuery is an ideal solution if your business deals with spiky workloads or rapid queries a few times a day. It is much simpler to use than Redshift. Probably, any business considers the price; Redshift is a little easier to forecast pricing for due to its on-demand, by-the-hour nature. BigQuery's pricing model is complicated, as pricing is based on analysis type, storage, data ingestion and extraction, so you eventually have to follow a cost-optimization mechanism at the storage, compute, and extraction/ingestion level.

Murali Krishna

Director, Data Solutions and Products, Data and AI Platform @VISA

1y

4+ years ago, I had to defend why I am a fan of BigQuery. I'd recommend going into the internals of it; it's very detailed on how it achieves performance at scale. Further, I liked the partitioned tables. They helped us put together a proper data warehouse for time-based analytical use cases.

Abhimanyu Mitra

Data Engineer & Architect | Creating Advanced Data Lakes & Platforms for Better Insights | Driving Business Success with Innovative Solutions

1y

This will work with all platforms supporting delta.
