BigQuery - Optimization V1

Coming from a SQL background, when I started working with Google BigQuery it was really amazing to see the performance. The serverless architecture of BigQuery processes TBs of data in just a couple of seconds.

Though BigQuery uses syntax that is very similar to SQL, it actually takes a significantly different approach to processing large amounts of data. As billing depends on the size of the data scanned, we need to minimize the data each query scans. Below are some tips we can use while querying our datasets.

The topic is covered within the scope of BigQuery's Standard SQL, working in the web UI.

1. Specify column names in the 'select' clause

Rather than using 'select * from ...' to read the data, it is always advisable to specify the column list, e.g. 'select col1, col2, col3, ... from ...'. We can see the difference in the example below; the table accessed is the same in both queries.

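As a minimal sketch of the difference, assuming the public bigquery-public-data.samples.natality table as the example:

    -- Scans (and bills) every column of the table:
    SELECT *
    FROM `bigquery-public-data.samples.natality`;

    -- Scans (and bills) only the three named columns:
    SELECT year, state, weight_pounds
    FROM `bigquery-public-data.samples.natality`;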

2. Using 'Sub Queries' or 'Intermediary Tables' properly

Sub queries can be written with the help of the "with" clause. Using the "with" clause to define sub queries at the beginning of the select statement helps when the same sub queries are used many times inside the actual select statement.

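A sketch of the pattern, again assuming the natality sample table; the "with"-defined sub query is written once and referenced twice in the main statement:

    WITH state_births AS (
      SELECT state, COUNT(*) AS births
      FROM `bigquery-public-data.samples.natality`
      WHERE year = 2005
      GROUP BY state
    )
    SELECT state, births
    FROM state_births
    WHERE births > (SELECT AVG(births) FROM state_births);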

The above will help when the sub query content generally satisfies the conditions below:

  • It is only relevant to the specific query or select statement.
  • It scans a small number of tables without complex calculations.
  • It does not use aggregations or filter conditions that shrink the result to far fewer rows than the tables or datasets it reads.
  • It is not run multiple times a day to scan a significant amount of data.

But when we need a particular sub query that either reads from multiple underlying tables and views and returns only a few records, or is called many times a day and scans a lot of data, then it is better to use an intermediary table to store the result of the sub query. We can refresh the intermediary table on a schedule according to our usage. The query below, for example, scans around 5 GB of data every time it runs. We can create an intermediary table in our dataset with an expiration time, so the table is deleted when no longer in use.

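A sketch of such an intermediary table; the name my_dataset.state_births_snapshot and the 30-day expiration are illustrative choices, not the original query:

    CREATE OR REPLACE TABLE my_dataset.state_births_snapshot
    OPTIONS (
      expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
    ) AS
    SELECT state, year, COUNT(*) AS births
    FROM `bigquery-public-data.samples.natality`
    GROUP BY state, year;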

And when we scan its result instead, the reduction in data processed is noteworthy, as below.

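Reading the small intermediary table instead of re-running the large scan might look like this (same illustrative names):

    SELECT state, births
    FROM my_dataset.state_births_snapshot
    WHERE year = 2005;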

3. Using "Shard Table" and "Partitioned Table"

Shard tables are simply grouped tables whose names start with the same prefix; we generally keep all the tables with dates as suffixes. In general these tables are used to keep daily snapshots, so that all the tables can be grouped together as one single entity. By using wildcard selection we can select the data from all the underlying tables.

The advantage of shard tables over (date-)partitioned tables is that when we need to select only a particular day's, month's, or year's data, we don't need a filter condition; wildcard selection alone does it. (The shard table below stores a snapshot of a view, with a date suffix added every day.)

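A sketch of wildcard selection, assuming daily snapshot tables named like my_dataset.snapshot_20240101:

    -- One month of snapshots, selected purely by the name pattern:
    SELECT *
    FROM `my_dataset.snapshot_202401*`;

    -- A narrower range of shards via the _TABLE_SUFFIX pseudo column:
    SELECT *
    FROM `my_dataset.snapshot_*`
    WHERE _TABLE_SUFFIX BETWEEN '20240101' AND '20240107';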

If the data types of the columns or the schema of the table change from time to time, it becomes hard to extract data from the set of shard tables.

Even though a partitioned table over the same data and date range provides the same performance, we need it when the date ranges are not whole months or years as mentioned above for shard tables; it adds more flexibility while selecting the date range.

The 'where' clause may not play a role in performance when we scan normal tables, but it does when we scan partitioned tables with a specific date range.
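
For a date-partitioned equivalent, the 'where' clause on the partitioning column is what prunes the scan; note the range does not have to align to month or year boundaries (names are illustrative):

    SELECT *
    FROM my_dataset.snapshot_partitioned
    WHERE snapshot_date BETWEEN '2024-01-03' AND '2024-01-21';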

4. Clustering The Table

We can now store the data in a table in a 'sorted' manner by specific columns. If we then read the data filtered by those columns, BigQuery is smart enough to touch only the matching clusters, making the scan cheaper and faster. It is advisable to cluster on columns that are frequently grouped or filtered on. For the time being, clustering is supported only on partitioned tables, with a date-type column used as the partition.

Here is one example where I created the same data set as 2 different tables: one clustered on the 2 columns which I will later filter by, and the other with only a partition.

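A sketch of the two-table setup; table and column names are illustrative:

    -- Partitioned only:
    CREATE OR REPLACE TABLE my_dataset.events_partitioned
    PARTITION BY event_date AS
    SELECT * FROM my_dataset.events_source;

    -- Partitioned and clustered on the 2 columns we will filter by:
    CREATE OR REPLACE TABLE my_dataset.events_clustered
    PARTITION BY event_date
    CLUSTER BY country, user_id AS
    SELECT * FROM my_dataset.events_source;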

If we try to scan the data in the same manner as below, we may see the data to be processed estimated at 2 GB for both tables. But if we look at the result after the query runs, we can see the difference.

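Running the same filter against both tables shows the effect; the editor estimates the same bytes for each, but the clustered table actually reads fewer (same illustrative names):

    SELECT COUNT(*) AS events
    FROM my_dataset.events_clustered
    WHERE event_date = '2024-01-15'
      AND country = 'US'
      AND user_id = 'user_0042';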

To see the difference we may query BigQuery's audit log, as below.

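One way to compare the bytes actually billed per job is the INFORMATION_SCHEMA.JOBS_BY_PROJECT view, sketched below; the exported Cloud audit logs expose similar fields:

    SELECT job_id, query, total_bytes_processed, total_bytes_billed
    FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
    WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
    ORDER BY creation_time DESC;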

The difference from clustering will be greater when we scan data in TBs or when each clustered group has fewer rows: the smaller each group or cluster, the smaller the scan compared to the table that is only partitioned.




