Performance essentials - BigQuery & Distributed data processing systems

As Data Engineers designing and building data platforms, one thing we consistently strive for is cost efficiency and high performance. With most data platforms revolving around data warehouses that host massive volumes of data, it is important to understand how the compute and storage layers work together to deliver high performance at low cost. In this blog, we will cover some essential BigQuery concepts that every engineer and architect should know in order to design and develop efficient data storage solutions on the cloud.

What does it mean to have separate compute and storage in modern data warehouses?

Modern data warehouses like BigQuery and Amazon Redshift are built on the concept of separating storage and compute: the data does not reside on the same machines where it is processed. Because the data is stored elsewhere, the workers that process it can scale efficiently and can be torn down as soon as processing is complete, without any risk to the data itself.

In traditional distributed data processing frameworks like Hadoop, the data is expected to reside on the same machines where processing happens in order to reduce latency (data locality). BigQuery does not require this, because moving data between servers is handled by Google's high-speed internal network, Jupiter, which operates at petabit-scale bandwidth.


What happens when users submit a query to BigQuery?

When a user submits a query from the BigQuery console, BigQuery first performs a dry run, which estimates the amount of data the query is going to process.

BigQuery then executes the query in 4 distinct steps:

  1. Parsing & Planning - The query is first translated into an internal representation, which involves understanding the syntax, identifying the involved tables, columns and operations. Based on this understanding, BigQuery creates a query plan, which is a blueprint for how the query will be executed efficiently. The query plan outlines the steps involved, including data retrieval, filtering, aggregation, and joining operations.
  2. Distributed execution - BigQuery relies on a heavily distributed architecture where the workload is broken into smaller tasks and executed across many machines in Google's data centres. The magic of BigQuery lies in dividing both the query and the data into smaller tasks and shards, which are then processed on independent workers. BigQuery also leverages multiple cores and threads on a single machine to apply different sub-tasks of the query to a single shard of data for improved efficiency.
  3. Shuffling & Combining results - The results produced by individual workers need to be combined. Where required, BigQuery stores intermediate results until subsequent stages need them and then combines them across machines. This movement of data between machines during and after processing is called 'shuffling'.
  4. Result Delivery - Once all the partial results are combined, the final output is sent back to the user or application that submitted the query.


Let's look at this using a SQL example:

Suppose we have a large dataset of web logs stored in a columnar format, and we want to run the following SQL query to find the number of visits per user:

SELECT user_id, COUNT(*) AS visit_count
FROM web_logs
WHERE status_code = 200
GROUP BY user_id;

Query Execution Process

  1. Query Submission: The query is submitted to the root server. The root server is responsible for parsing the query and initiating its execution.
  2. Query Decomposition: The root server decomposes the query and determines which parts of the dataset need to be accessed. It identifies the relevant shards (or partitions) of the web_logs table that contain the data to be queried.
  3. Routing to Mixers: The root server routes the query to a set of intermediate nodes called Mixers. These Mixers are responsible for further distributing the query to the appropriate leaf nodes.
  4. Data Access and Initial Computation (Leaf Nodes): Each leaf node reads the relevant columns (e.g., user_id and status_code) from its assigned shard. The columnar storage format allows leaf nodes to read only the necessary columns, reducing I/O. Leaf nodes apply the WHERE clause (status_code = 200) to filter the data. They perform partial aggregation by counting visits per user_id within their shard (see the sketch after this list).
  5. Aggregation and Result Compilation (Mixers): The partial results from the leaf nodes are sent back to the Mixers. The Mixers aggregate these results, combining counts for the same user_id from different shards.
  6. Final Aggregation and Result Return (Root Server): The aggregated results from the Mixers are sent back to the root server. The root server performs any final aggregation needed (if the data was split across multiple Mixers) and returns the final result to the client.
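
To make the leaf/mixer split concrete, the partial aggregation above can be written out as two logical SQL steps. The table names shard_of_web_logs and all_partial_results below are purely illustrative placeholders for a single leaf node's shard and for the collected partial results; they are not real tables you can query.

-- Step 1 (conceptually, on each leaf node): partial aggregation over one shard.
-- shard_of_web_logs is an illustrative placeholder, not a real table.
SELECT user_id, COUNT(*) AS partial_count
FROM shard_of_web_logs
WHERE status_code = 200
GROUP BY user_id;

-- Step 2 (conceptually, on the Mixers and root server): merge the partial counts.
-- all_partial_results stands in for the shuffled outputs of step 1.
SELECT user_id, SUM(partial_count) AS visit_count
FROM all_partial_results
GROUP BY user_id;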


Where do slots, BigQuery's computational units, come into the picture here?

Slots are BigQuery's smallest computational units; each slot is a combination of CPU, RAM and network resources. Slots are used throughout the query execution process, including at the mixer and leaf node levels.

As discussed above, BigQuery executes a query by breaking it into a number of stages, and each stage is assigned a certain number of slots. At the mixer level, slots are used for operations such as query rewriting, result aggregation, and data shuffling. Mixers optimize the query and distribute the work to leaf nodes; the slots at this level handle the intermediate processing required before data is sent to the leaf nodes for further execution.

Leaf nodes are responsible for reading data from storage (Colossus) and performing initial computations like filtering and partial aggregation. Slots are crucial at this level as they execute the actual data processing tasks.
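
If you want to see how many slots your own queries actually consumed, the INFORMATION_SCHEMA jobs views expose this per job. The sketch below is a minimal example assuming your jobs run in the US multi-region and that you have permission to query JOBS_BY_PROJECT; adjust the region qualifier and the time window for your environment.

-- Slot consumption and bytes processed for recent queries in this project.
-- Assumes the US multi-region; replace region-us with your own region.
SELECT
  job_id,
  user_email,
  total_slot_ms,           -- total slot-milliseconds consumed by the job
  total_bytes_processed,   -- bytes scanned by the job
  TIMESTAMP_DIFF(end_time, start_time, MILLISECOND) AS elapsed_ms
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = 'QUERY'
  AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY total_slot_ms DESC
LIMIT 20;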


What is shuffling?

Shuffling is the process of moving or redistributing data across workers after it has initially been distributed for query processing. This is common in aggregation and JOIN scenarios, where data needs to be reorganised so that each worker has all the records it needs.

Let's take an example query:

SELECT customers.customer_id, COUNT(orders.order_id) AS total_orders
FROM customers
JOIN orders ON customers.customer_id = orders.customer_id
GROUP BY customers.customer_id;        

  • Initial data distribution: the orders and customers datasets are spread across different workers, and each worker holds a subset of the data.
  • To perform the JOIN, BigQuery has to reshuffle the data so that all records belonging to the same customer_id are processed together. Records with the same key (i.e., customer_id) are sent to the same worker; during this process, data is transferred across the network from one worker to another, which can be resource-intensive and time-consuming.
  • Once the data is shuffled, each worker can perform the join operation locally, matching customer_id from both datasets. After the join, the workers perform the GROUP BY operation to count the total orders per customer (a rewrite that reduces this shuffle is sketched below).
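
One way to shrink this shuffle is to aggregate orders down to one row per customer_id before the join, so far less data has to move between workers. This is only a sketch of the rewrite, using the same table and column names as the example above and assuming customer_id is unique in the customers table; whether it actually helps depends on the shape of your data.

-- Pre-aggregate orders so that only one row per customer_id is shuffled,
-- instead of every individual order row (illustrative rewrite of the query above).
SELECT
  c.customer_id,
  o.total_orders
FROM customers AS c
JOIN (
  SELECT customer_id, COUNT(order_id) AS total_orders
  FROM orders
  GROUP BY customer_id
) AS o
ON c.customer_id = o.customer_id;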


How does shuffling impact performance?

Shuffling can be one of the most resource-intensive operations in a distributed system because it involves significant data movement across the network. This can lead to increased query execution time and higher costs, especially if large volumes of data are involved.


What are the best practices to minimise shuffling?

Shuffling, as discussed above, is a resource-intensive process and can lead to significant performance bottlenecks. The logic is simple: the more data that has to move between servers over the network (however fast that network is), the slower the processing will be. The best way to avoid shuffle-related bottlenecks is therefore to reduce the amount of data that needs to travel between workers.

  1. Perform GROUP BY operations on integer keys instead of strings where possible, and prefer columns with high cardinality (many unique values). High cardinality reduces the chance of skew, so data is distributed more uniformly across workers, and integers are smaller than strings, which reduces the overall volume of data shuffled between workers.
  2. Use broadcast JOINs. When joining two tables, place the larger table on the left-hand side of the join: a broadcast join sends the smaller (right-hand side) table to all the workers where the larger table is being processed. This ensures that only the smaller table is moved, and each worker holds a small dataset in memory to join against (see the sketch after this list).
  3. Where possible, perform filtering and aggregation before JOINs, as this reduces the amount of data that needs to be shuffled over the network.
  4. Consider materialising intermediate results in a temporary table when the same query logic is run repeatedly, which avoids shuffling the same data again and again.
  5. Use partitioning and clustering to reduce the amount of data that needs to be scanned. With partitioned storage, BigQuery does not need to scan and shuffle the entire table; it reads only the partitions selected by the filter in the query.
  6. Optimise aggregation queries by delaying aggregations until as late as possible. This might appear to contradict point #3, but the two serve different purposes: aggregating early, before JOINs, reduces the volume of data that needs to be joined, while delaying aggregation ensures all joins are performed before complex aggregations are applied. Both techniques can be used to balance shuffling against the associated performance bottlenecks. The decision to pre-aggregate or delay aggregation depends on the specific use case and the nature of the datasets involved: if the primary goal is to reduce data volume early, pre-aggregation is beneficial; if the goal is to ensure accuracy and completeness after complex transformations, delaying aggregation is more appropriate.
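
To make points #2, #3 and #5 concrete, here is a small sketch. The dataset, table and column names and the daily partitioning scheme are illustrative assumptions, not part of the example above; the pattern is a date-partitioned, clustered fact table, filtering before the join, and the larger table on the left-hand side of the join.

-- Illustrative DDL: a date-partitioned, clustered copy of an orders table (assumed names).
CREATE TABLE IF NOT EXISTS mydataset.orders_partitioned
PARTITION BY DATE(order_timestamp)
CLUSTER BY customer_id AS
SELECT * FROM mydataset.orders;

-- Prune partitions with the date filter, filter the smaller table before the join,
-- and keep the larger table on the left-hand side.
SELECT
  o.customer_id,
  COUNT(o.order_id) AS total_orders
FROM mydataset.orders_partitioned AS o       -- larger table on the left
JOIN (
  SELECT customer_id
  FROM mydataset.customers
  WHERE country = 'AU'                       -- filter early, before the join
) AS c
ON o.customer_id = c.customer_id
WHERE DATE(o.order_timestamp) BETWEEN '2024-01-01' AND '2024-01-31'
GROUP BY o.customer_id;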


How do you evaluate the performance of a query on the BigQuery console?

When users execute a query, BigQuery immediately displays a few performance evaluation metrics in the console, on a tab named 'Execution details'. A screenshot is shown below:


As discussed previously, the SQL query is divided into multiple stages to improve parallelism, and you can see how the query has been split up in the 'Execution details' tab, as shown in the screenshot below:



Additionally, users can look at 4 key performance metrics, namely WAIT, READ, COMPUTE and WRITE. Let's understand what each of them means and what it tells us from a performance point of view:

  1. Wait Time: Significance: Wait time refers to the duration a stage spends waiting for resources to become available. This can include waiting for slots (computational resources) to be allocated or for data dependencies from previous stages to be resolved. Performance Implications: High wait times can indicate resource contention or inefficient query scheduling. It may suggest that the query is competing with other queries for resources or that there is an imbalance in resource allocation.
  2. Read Time: Significance: Read time is the duration spent reading data from storage. This includes accessing data from BigQuery's underlying storage system, which is optimized for columnar data access. Performance Implications: Long read times can indicate that a large volume of data is being processed or that the data is not well-partitioned or clustered. Optimizing data layout or reducing the amount of data read can improve performance.
  3. Compute Time: Significance: Compute time is the time spent processing data, including executing operations like filtering, joining, aggregating, and transforming data. Performance Implications: High compute times may suggest complex operations or inefficient query logic. Query optimization techniques, such as simplifying expressions or using more efficient functions, can help reduce compute time.
  4. Write Time: Significance: Write time refers to the duration spent writing the results of a stage to temporary storage or final output. This includes writing intermediate results that are used by subsequent stages. Performance Implications: Long write times can indicate bottlenecks in data output, possibly due to large result sets or inefficient data handling. Ensuring that only necessary data is written and optimizing output operations can help reduce write time.


What are 'Show maximum time' and 'Show average time'?

The "Show average time" and "Show maximum time" options in the Execution Details tab of BigQuery provide insights into the performance of each stage of a query by highlighting the average and maximum time taken by those stages. By comparing average and maximum times, you can identify stages with high variability in execution time, which might suggest uneven data distribution or resource contention. Understanding these metrics helps in pinpointing stages that could benefit from optimization, such as better data partitioning, query rewriting, or resource allocation adjustments.

Show Average Time

  • Significance: The average time for a stage indicates the mean duration taken by all workers to complete that stage. It provides a general sense of how long a stage typically takes across the distributed system.
  • Performance Insights: By examining the average time, you can identify stages that consistently take longer than expected, which might indicate inefficiencies or areas for optimization. It helps in understanding the typical workload and performance characteristics of a stage.

Show Maximum Time

  • Significance: The maximum time reflects the longest duration taken by any worker to complete a stage. This can highlight outliers or bottlenecks where certain workers are taking significantly longer to process data.
  • Performance Insights: High maximum times can indicate data skew, where some workers are overloaded with more data than others, or potential issues with resource contention. Identifying these stages can help in diagnosing and addressing performance bottlenecks (see the sketch after this list for pulling these stage timings with SQL).
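
These per-stage averages and maximums can also be pulled programmatically from the jobs metadata. The sketch below assumes the US multi-region and that the job_stages column of INFORMATION_SCHEMA.JOBS_BY_PROJECT exposes the stage statistics (wait/read/compute/write, average and maximum) with snake_case field names mirroring the query plan returned by the Jobs API; verify the exact field names in the INFORMATION_SCHEMA documentation for your region before relying on them. The job id is a hypothetical placeholder.

-- Per-stage wait/read/compute/write times (avg and max) for a single job.
-- Assumes US multi-region and snake_case job_stages fields; adjust as needed.
SELECT
  stage.name,
  stage.wait_ms_avg,    stage.wait_ms_max,
  stage.read_ms_avg,    stage.read_ms_max,
  stage.compute_ms_avg, stage.compute_ms_max,
  stage.write_ms_avg,   stage.write_ms_max,
  stage.shuffle_output_bytes
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT,
  UNNEST(job_stages) AS stage
WHERE job_id = 'your_job_id'   -- hypothetical placeholder; use a real job id
  AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY);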


Besides this, there is another tab called 'Execution graph' which plays an equally important role: it visualises the execution as a graph, showing the time each stage has taken and the slots allocated for each stage, and supports both real-time monitoring and post-execution analysis of the job.


I hope the above essentials help you in designing and developing efficient data platforms with BigQuery. We will discuss best practices for reading data from and writing data to BigQuery in another blog.

Thanks for reading!
