Optimize Google BigQuery
As someone who works in data engineering, I love BigQuery and think it is one of the best products offered by the Google Cloud Platform. One feature I frequently use is the MERGE statement, which allows for UPSERT (Update + Insert) on BigQuery native tables.
Having spent over a year using BigQuery at work, I've discovered several effective ways to optimize the MERGE statement. In this article, I'll share each of these strategies and explain why they have proven helpful in my experience.
1. What is MERGE in BigQuery?
Let's first understand what MERGE means in the context of Google BigQuery, and what better way to understand it than with an example?
Below is an example of a BigQuery MERGE statement, also known as an upsert operation. In this example, we have two tables: source_table and target_table. We want to merge data from source_table into target_table based on a common key column (id).
Assume the schema for both tables is as follows:
source & target table:
- id INTEGER
- name STRING
- age INTEGER
- programming STRING
- created DATE
- updated DATE
The goal is to update existing rows in target_table if the `id` already exists or insert new rows if the `id` doesn't exist in target_table.
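Here is a minimal sketch of what such a MERGE could look like for the schema above (the `my_dataset` dataset name is a placeholder):

```sql
MERGE `my_dataset.target_table` T
USING `my_dataset.source_table` S
ON T.id = S.id
WHEN MATCHED THEN
  -- The id already exists: refresh the mutable columns.
  UPDATE SET
    name = S.name,
    age = S.age,
    programming = S.programming,
    updated = S.updated
WHEN NOT MATCHED THEN
  -- The id is new: insert the full row.
  INSERT (id, name, age, programming, created, updated)
  VALUES (S.id, S.name, S.age, S.programming, S.created, S.updated);
```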
Explanation:
By running this MERGE statement, you can efficiently synchronize data between the `source_table` and `target_table`, ensuring that any updates or new records are handled correctly. It is very beneficial when you run your pipeline daily, or even stream your data using Apache Spark or another compute engine, and want to keep only the latest version of each record in your final BigQuery native table.
2. Ways to optimize MERGE
Now let's try to understand the different ways to optimize MERGE in BigQuery. Before optimizing anything, we should first ask ourselves whether the optimization is needed at all.
Enough of introduction, theory & philosophy :) Let's jump to the real thing, i.e. the engineering of the optimization! Let's note down the points and then try to understand each optimization in detail.
Let's start with the most interesting of all these optimizations, i.e. Partition Pruning.
If you observed the MERGE example above, we have one column named `created`, and our table was not partitioned on any column. Now suppose that, instead of a simple BigQuery native table, we create a partitioned table on the `created` column. Partitioning will significantly speed up merge operations because it reduces the amount of data scanned during the process.
Let's see an example.
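The original snippet was shared as an image; here is a sketch of what the partitioned setup could look like (the table name is an assumption):

```sql
-- Same data as target_table, but now date-partitioned on `created`.
CREATE TABLE `my_dataset.target_table_partitioned`
PARTITION BY created
AS SELECT * FROM `my_dataset.target_table`;
```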
But you might say: "Ankur, I know about partitioning and I know it can reduce scanning, but what is this Pruning?"
Let's try to understand Pruning by applying it directly.
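As a sketch, assuming incoming records always fall within a recent date window, we can add a constant date predicate to the join condition so BigQuery can prune old target partitions up front:

```sql
MERGE `my_dataset.target_table_partitioned` T
USING `my_dataset.source_table` S
ON T.id = S.id
   -- Static predicate: a constant date range lets BigQuery prune target
   -- partitions before the join, unlike a dynamic predicate such as
   -- T.created = S.created. In a real pipeline, the orchestrator would
   -- inject this literal for each run.
   AND T.created >= '2024-01-01'
WHEN MATCHED THEN
  UPDATE SET
    name = S.name,
    age = S.age,
    programming = S.programming,
    updated = S.updated
WHEN NOT MATCHED THEN
  INSERT (id, name, age, programming, created, updated)
  VALUES (S.id, S.name, S.age, S.programming, S.created, S.updated);
```

One caveat with this pattern: it assumes no source row matches a target row older than the cutoff, otherwise that row would be treated as "not matched" and inserted again.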
If you observe this MERGE statement carefully and compare it with the previous one, there are basically two differences: the target is now the partitioned table, and the join condition carries a static date predicate.
This is a huge optimization, because BigQuery prunes partitions based on static predicates rather than dynamic ones. You will easily observe that the bytes scanned, bytes shuffled, slot time consumed, and elapsed time are all much lower than for the previous MERGE statement.
This concept applies not only to MERGE operations but also to other operations such as joins and filters.
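For instance, the same pruning applies to a plain query (the date literal here is illustrative):

```sql
-- Only the single 2024-01-15 partition is scanned, not the whole table.
SELECT id, name
FROM `my_dataset.target_table_partitioned`
WHERE created = DATE '2024-01-15';
```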
I hope it is now clear how Partition Pruning works in a BigQuery MERGE statement and how it can optimize your query.
Now let's try to understand how clustering helps during MERGE.
Clustering in BigQuery helps improve the efficiency and performance of the MERGE operation by physically organizing data in a more ordered manner on disk. Clustering works in conjunction with partitioning and involves the rearrangement of data within partitions based on the values of one or more chosen columns, known as clustering columns.
When data is clustered on specific columns, it benefits queries that involve filtering, joining, and sorting based on those columns. In the context of the MERGE statement, clustering can speed up the matching process, reducing the amount of data that needs to be scanned during the merge operation.
Let's go through an example to illustrate the concept of clustering in the context of a MERGE operation:
Now we know that we have the following tables: `source_table`, `target_table`, and the partitioned `target_table_partitioned`.
Now, let's create a clustered version of the `target_table`, clustering it on the `id` column:
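The original DDL was shared as an image; here is a sketch of what it could look like:

```sql
-- Partitioned on `created` and additionally clustered on `id`.
CREATE TABLE `my_dataset.target_table_partitioned_clustered`
PARTITION BY created
CLUSTER BY id
AS SELECT * FROM `my_dataset.target_table_partitioned`;
```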
Now we will apply the same MERGE statement, with the same Partition Pruning, but this time the target table will be target_table_partitioned_clustered instead of target_table_partitioned. Let's try to understand how clustering on top of partition pruning does wonders.
With clustering applied, the `MERGE` operation can benefit from the improved data organization on disk. During the matching process, data with the same `id` values are more likely to be located together, reducing the amount of data scanned and improving the overall performance of the merge operation. This is especially advantageous when the `id` column is used as a frequent filter for the `MERGE` condition or when the `id` values have a non-uniform distribution across the partitions.
This article has already gone long, and I think points three and four are self-explanatory. I would love it if one of my readers tried to explain those two points in the comments section.
Feel free to subscribe to my YouTube channel, The Big Data Show. I might upload a more detailed discussion of the above concepts in the coming days.
More than anything, thank you for the most precious gift you can give a writer, i.e. your time.