BigQuery - Optimization V1

Coming from a SQL background, when I started working with Google BigQuery it was really amazing to see the performance. The serverless architecture of BigQuery processes TBs of data in just a couple of seconds.

Though BigQuery uses syntax that is very similar to SQL, it actually takes a significantly different approach to processing large amounts of data. As billing depends on the size of the data scanned, we need to minimize the data each query scans. Below are some tips we can use while querying our datasets.

The topic is covered within the scope of BigQuery's Standard SQL, working in the web UI.

1. Specify column names in the 'select' clause

Rather than using 'select * from ...' to read the data, it is always advisable to specify the column list, e.g. 'select col1, col2, col3, ... from ...'. We can see the difference in the example below; the table accessed is the same in both queries.

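As a minimal sketch of the difference, assuming the public bigquery-public-data.samples.natality table as the example:

    -- Scans (and bills) every column of the table:
    SELECT *
    FROM `bigquery-public-data.samples.natality`;

    -- Scans (and bills) only the three named columns:
    SELECT year, state, weight_pounds
    FROM `bigquery-public-data.samples.natality`;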

2. Using 'Sub Queries' or 'Intermediary Tables' properly

Sub queries can be written with the help of the "with" clause. Using the "with" clause to define sub queries at the beginning of the select statement helps when the same sub queries are used many times inside the actual select statement.

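A sketch of the pattern, again assuming the natality sample table; the "with"-defined sub query is written once and referenced twice in the main statement:

    WITH state_births AS (
      SELECT state, COUNT(*) AS births
      FROM `bigquery-public-data.samples.natality`
      WHERE year = 2005
      GROUP BY state
    )
    SELECT state, births
    FROM state_births
    WHERE births > (SELECT AVG(births) FROM state_births);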

The above will help when the sub query content generally satisfies the conditions below:

  • It is only relevant to the specific query or select statement.
  • It scans a small number of tables without complex calculations.
  • It does not use aggregations or filter conditions that shrink the result to far fewer rows than the tables or datasets it reads.
  • It is not run multiple times a day to scan a significant amount of data.

But when we need a particular sub query that either reads from multiple underlying tables and views and returns only a few records, or is called many times a day and scans a lot of data, then it is better to use an intermediary table to store the result of the sub query. We can refresh the intermediary table on a schedule according to our usage. The query below, for example, scans around 5 GB of data every time it runs. We can create an intermediary table in our dataset with an expiration time, so the table is deleted when no longer in use.

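A sketch of such an intermediary table; the name my_dataset.state_births_snapshot and the 30-day expiration are illustrative choices, not the original query:

    CREATE OR REPLACE TABLE my_dataset.state_births_snapshot
    OPTIONS (
      expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
    ) AS
    SELECT state, year, COUNT(*) AS births
    FROM `bigquery-public-data.samples.natality`
    GROUP BY state, year;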

And when we scan its result instead, the reduction in data processed is noteworthy, as below.

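Reading the small intermediary table instead of re-running the large scan might look like this (same illustrative names):

    SELECT state, births
    FROM my_dataset.state_births_snapshot
    WHERE year = 2005;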

3. Using "Shard Table" and "Partitioned Table"

Shard tables are simply grouped tables whose names start with the same prefix; we generally keep all the tables with dates as suffixes. In general these tables are used to keep daily snapshots, so that all the tables can be grouped together as one single entity. By using wildcard selection we can select the data from all the underlying tables.

The advantage of shard tables over (date-)partitioned tables is that when we need to select only a particular day's, month's, or year's data, we don't need a filter condition; wildcard selection alone does it. (The shard table below stores a snapshot of a view, with a date suffix added every day.)

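A sketch of wildcard selection, assuming daily snapshot tables named like my_dataset.snapshot_20240101:

    -- One month of snapshots, selected purely by the name pattern:
    SELECT *
    FROM `my_dataset.snapshot_202401*`;

    -- A narrower range of shards via the _TABLE_SUFFIX pseudo column:
    SELECT *
    FROM `my_dataset.snapshot_*`
    WHERE _TABLE_SUFFIX BETWEEN '20240101' AND '20240107';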

If the data types of the columns or the schema of the table change from time to time, it becomes hard to extract data from the set of shard tables.

Even though a partitioned table over the same data and date range provides the same performance, we need it when the date ranges are not whole months or years as mentioned above for shard tables; it adds more flexibility while selecting the date range.

The 'where' clause may not play a role in performance when we scan normal tables, but it does when we scan partitioned tables with a specific date range.
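
For a date-partitioned equivalent, the 'where' clause on the partitioning column is what prunes the scan; note the range does not have to align to month or year boundaries (names are illustrative):

    SELECT *
    FROM my_dataset.snapshot_partitioned
    WHERE snapshot_date BETWEEN '2024-01-03' AND '2024-01-21';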

4. Clustering The Table

We can now store the data in a table in a 'sorted' manner by specific columns. If we then read the data filtered by those columns, BigQuery is smart enough to touch only the matching clusters, making the scan cheaper and faster. It is advisable to cluster on columns that are frequently grouped or filtered on. For the time being, clustering is supported only on partitioned tables, with a date-type column used as the partition.

Here is one example where I created the same data set as 2 different tables: one clustered on the 2 columns which I will later filter by, and the other with only a partition.

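A sketch of the two-table setup; table and column names are illustrative:

    -- Partitioned only:
    CREATE OR REPLACE TABLE my_dataset.events_partitioned
    PARTITION BY event_date AS
    SELECT * FROM my_dataset.events_source;

    -- Partitioned and clustered on the 2 columns we will filter by:
    CREATE OR REPLACE TABLE my_dataset.events_clustered
    PARTITION BY event_date
    CLUSTER BY country, user_id AS
    SELECT * FROM my_dataset.events_source;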

If we try to scan the data in the same manner as below, we may see the data to be processed estimated at 2 GB for both tables. But if we look at the result after the query runs, we can see the difference.

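Running the same filter against both tables shows the effect; the editor estimates the same bytes for each, but the clustered table actually reads fewer (same illustrative names):

    SELECT COUNT(*) AS events
    FROM my_dataset.events_clustered
    WHERE event_date = '2024-01-15'
      AND country = 'US'
      AND user_id = 'user_0042';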

To see the difference we may query BigQuery's audit log, as below.

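One way to compare the bytes actually billed per job is the INFORMATION_SCHEMA.JOBS_BY_PROJECT view, sketched below; the exported Cloud audit logs expose similar fields:

    SELECT job_id, query, total_bytes_processed, total_bytes_billed
    FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
    WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
    ORDER BY creation_time DESC;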

The difference from clustering will be greater when we scan data in TBs or when each clustered group has fewer rows: the smaller each group or cluster, the smaller the scan compared to the table that is only partitioned.




