BigQuery

BigQuery

Partitioning vs Clustering

It's possible to combine both partitioning and clustering in a table, but there are important differences between both techniques that you need to be aware of in order to decide what to use for your specific scenario:

No alt text provided for this image

You may choose clustering over partitioning when partitioning results in a small amount of data per partition, when partitioning would result in over 4000 partitions or if your mutation operations modify the majority of partitions in the table frequently (for example, writing to the table every few minutes and writing to most of the partitions each time rather than just a handful).

BigQuery has?automatic reclustering: when new data is written to a table, it can be written to blocks that contain key ranges that overlap with the key ranges in previously written blocks, which weaken the sort property of the table. BigQuery will perform automatic reclustering in the background to restore the sort properties of the table.

  • For partitioned tables, clustering is maintaned for data within the scope of each partition.

Best Practices

Cost reduction

  • Avoid?SELECT *?. Reducing the amount of columns to display will drastically reduce the amount of processed data and lower costs.
  • Price your queries before running them.
  • Use clustered and/or partitioned tables if possible.
  • Use?streaming inserts?with caution. They can easily increase cost.
  • Materialize query results?in different stages.

Query performance

  • Filter on partitioned columns.
  • Denormalize data.
  • Use?nested or repeated columns.
  • Use external data sources appropiately. Constantly reading data from a bucket may incur in additional costs and has worse performance.
  • Reduce data before using a?JOIN.
  • Do not threat?WITH?clauses as?prepared statements.
  • Avoid?oversharding tables.
  • Avoid JavaScript user-defined functions.
  • Use?approximate aggregation functions?rather than complete ones such as?HyperLogLog++.
  • Order statements should be the last part of the query.
  • Optimize join patterns.
  • Place the table with the?largest?number of rows first, followed by the table with the?fewest?rows, and then place the remaining tables by decreasing size.
  • This is due to how BigQuery works internally: the first table will be distributed evenly and the second table will be broadcasted to all the nodes. Check the?Internals section?for more details.

要查看或添加评论,请登录

Filipe Balseiro的更多文章

  • Introduction to Streaming - Apache Kafka

    Introduction to Streaming - Apache Kafka

    References Alvaro Navas Notes Data Engineering Zoomcamp Repository What is a streaming data pipeline? A data pipeline…

  • Spark - Setting up a Dataproc Cluster on GCP

    Spark - Setting up a Dataproc Cluster on GCP

    Dataproc is Google's cloud-managed service for running Spark and other data processing tools such as Flink, Presto…

    6 条评论
  • Apache Spark

    Apache Spark

    References Alvaro Navas Notes Data Engineering Zoomcamp Repository Installing Spark Installation instructions for…

    3 条评论
  • DBT- Data Build Tool (Part II)

    DBT- Data Build Tool (Part II)

    References Alvaro Navas Notes Data Engineering Zoomcamp Repository Testing and documenting dbt models Although testing…

    2 条评论
  • DBT- Data Build Tool (Part I)

    DBT- Data Build Tool (Part I)

    References Alvaro Navas Notes Data Engineering Zoomcamp Repository What is dbt? dbt stands for data build tool. It's a…

    3 条评论
  • DataCamp - Data Engineering with Python

    DataCamp - Data Engineering with Python

    Data Engineers Data engineers deliver: The correct data In the right form To the right people As efficiently as…

  • Youtubers Popularity

    Youtubers Popularity

    Working with Youtube's API to collect channel and video statistics from 10 youtubers I follow and upload the data to an…

    12 条评论
  • Google Data Analytics Professional Certificate Capstone Project: Cyclistic

    Google Data Analytics Professional Certificate Capstone Project: Cyclistic

    Case Study: Help a bike-share company to convert casual riders into annual members In this article I showcase my…

社区洞察

其他会员也浏览了