Microsoft Fabric - Google BigQuery

In today's data-driven world, organizations often find themselves with vast amounts of data scattered across numerous platforms and systems. This fragmented data landscape poses significant challenges, as valuable insights are often hidden within silos, making it difficult to synthesize a comprehensive view. To harness the true potential of their data, it is imperative for organizations to centralize these disparate data sources, enabling efficient analysis and decision-making.

As a Cloud Solution Architect (CSA) at Microsoft, I am responsible for designing cloud-based solutions that unify and streamline data management across an organization. I support customers in leveraging our technology to achieve more, ensuring that data is accessible, secure, and optimally structured for analysis.

I recently spoke with a few customers whose data is stored in Google Cloud Platform (GCP), principally within BigQuery, and who analyze that data in Power BI. This architectural pattern appears to be more widely adopted in the French market than worldwide.

In this context I have assessed the existing data environments, identified integration points, and tried different strategies for seamless data migration and interoperability. My goal is to enable a cohesive data ecosystem where disparate data sources are harmonized, allowing for robust business intelligence and analytics capabilities.

BigQuery is a serverless data warehouse. For the purpose of my tests, I have loaded the following open dataset into it: Most Popular Google Search Terms now Available in BigQuery.

The data contains the daily top 25 search terms in the United States, with score, ranking, time, and designated market area. The data is partitioned by day and has a size of approximately 2.83 GB.
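To explore the dataset outside of Power BI, a short script with the google-cloud-bigquery Python client is enough. This is a minimal sketch: the project ID is a placeholder, and the table name assumes the public bigquery-public-data.google_trends.top_terms dataset that the linked announcement refers to (adjust it if you copied the data into your own project).

```python
# pip install google-cloud-bigquery
from google.cloud import bigquery

# Authentication relies on Application Default Credentials; the project ID is a placeholder.
client = bigquery.Client(project="my-gcp-project")

# Top 25 terms (by best rank) for the most recent day in the public Google Trends dataset.
sql = """
SELECT term, MIN(rank) AS best_rank, MAX(score) AS max_score
FROM `bigquery-public-data.google_trends.top_terms`
WHERE refresh_date = (SELECT MAX(refresh_date)
                      FROM `bigquery-public-data.google_trends.top_terms`)
GROUP BY term
ORDER BY best_rank
LIMIT 25
"""

for row in client.query(sql).result():
    print(row.best_rank, row.term, row.max_score)
```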

We will, among other things, compare these specific queries:

The data is stored in the US, and I have used a Fabric capacity in the West US 3 region (the same location as my default tenant region).

In this article I will examine various solutions and evaluate their advantages and disadvantages:

  • 1 - Power BI
  • 1.1 - Import Mode
  • 1.2 - Direct Query
  • 1.3 - Optimizations
  • 2 - OneLake Storage
  • 2.1 - Notebook: Code First approach
  • 2.2 - Dataflow Gen 2: Low Code approach
  • 2.3 - Optimizations
  • 3 - Google Cloud Storage
  • 3.1 - Google Job
  • 3.2 - Optimizations

Disclaimer: I have used this simple sample dataset containing only one table, but please be aware that not all data is identical in terms of size, granularity, and complexity. Do not take conclusions drawn from this sample as definitive. Instead, conduct your own tests on your specific data, considering your individual business needs, requirements, domain knowledge, language, and technology. I would also like to mention that I am not an expert on Google BigQuery; I have more knowledge of Microsoft data services.

1 - Power BI

1.1 - Import Mode

Import Mode is generally preferred for performance purposes and is usually the best mode for analyzing data. It brings data into Power BI, storing it in a highly optimized, compressed format within the Power BI service. This results in faster query responses and improved performance when analyzing large datasets. Import Mode is particularly beneficial when dealing with complex calculations, transformations, and large volumes of data, as it leverages Power BI's powerful in-memory engine.

At the time of writing, two connectors to access BigQuery data are available for Power BI Desktop and Dataflow Gen 2:

  • Google BigQuery connector
  • Google BigQuery (Microsoft Entra ID), currently in Beta, which lets users sign in to Google BigQuery using their Microsoft Entra ID

I have created a simple semantic model in import mode with the Google BigQuery connector:

I have published the report to the service; the refresh took around 9 minutes to complete:

I have looked at the refresh's Capacity Unit (CU) consumption in the Metrics App; the refresh consumed around 12k CU(s):

I have queried the data from DAX Studio using the XMLA endpoint; with a cold cache it took 17 milliseconds:

With a hot cache it took 16 milliseconds:

Again, this is a very simple query, and you should try it on your own data.
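If you prefer to script this kind of check instead of using DAX Studio, Semantic Link (SemPy) can run a DAX query against the published model from a Fabric notebook. A minimal sketch, assuming a placeholder model name "BQ Import Model" and a placeholder 'top_terms' table:

```python
# Runs in a Microsoft Fabric notebook, where the semantic-link (SemPy) package is available.
import time
import sempy.fabric as fabric

DATASET = "BQ Import Model"  # placeholder semantic model name

# Simple DAX query comparable to the one used in DAX Studio; 'top_terms' is a placeholder table.
DAX_QUERY = """
EVALUATE
ROW("RowCount", COUNTROWS('top_terms'))
"""

start = time.perf_counter()
result = fabric.evaluate_dax(dataset=DATASET, dax_string=DAX_QUERY)
elapsed_ms = (time.perf_counter() - start) * 1000

print(result)
# Note: this round trip includes network overhead, unlike DAX Studio's server timings.
print(f"Round trip: {elapsed_ms:.0f} ms")
```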

It is crucial to evaluate the trade-offs, including storage limitations depending on the capacity SKU (e.g., 25 GB for F64) and restrictions on data refresh. Can your business users tolerate this refresh period before accessing the data? Organizations need to assess their unique use cases to determine the most suitable approach.


1.2 - Direct Query

Direct Query mode in Power BI allows users to connect directly to a data source without importing the data into Power BI storage. This method provides several advantages, including real-time data access, the ability to handle large datasets that might exceed Power BI's import capacity, reuse of the data source's underlying security, and no data duplication.

I have created a simple semantic model in Direct Query mode with the Google BigQuery connector:

I have queried the data from DAX Studio using the XMLA endpoint; with a cold cache it took 1,220 milliseconds:

Query caching is only available on Power BI Premium or Power BI Embedded, and only for Import semantic models. It is not applicable to DirectQuery or LiveConnect semantic models.

I reviewed the query execution details in BigQuery: the execution took 1 second, and the bytes billed amounted to 970 MB (Estimate and control costs | BigQuery | Google Cloud):
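Since every visual interaction in Direct Query mode translates into a billed BigQuery query, estimating bytes processed up front is useful. The google-cloud-bigquery client supports dry runs, which price a query without executing it; the project, dataset, and table names below are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # placeholder project

# Dry run: the query is validated and priced, but not executed or billed.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

sql = """
SELECT term, SUM(score) AS total_score
FROM `my-gcp-project.my_dataset.top_terms`
GROUP BY term
"""

job = client.query(sql, job_config=job_config)
print(f"This query would process about {job.total_bytes_processed / 1024**2:.0f} MB")
# With on-demand pricing, billing is based on bytes processed, so this gives a rough cost estimate.
```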

Not having to spend time on refreshes can be very tempting, but Direct Query also comes with limitations. Performance can be affected by the underlying data source's speed and load, as every interaction in Power BI triggers a query to the data source. There are also restrictions on the types of transformations that can be applied within Power BI, and some advanced Power BI features may be unavailable. Organizations should carefully consider these factors when choosing Direct Query mode, evaluating whether the benefits of real-time data access outweigh the potential performance trade-offs and feature limitations, as well as the cost generated by all of the end users' queries (a single Power BI visual can generate more than one query).


1.3 - Optimizations

Some optimizations could be implemented for both of the previous solutions (a scripted refresh sketch follows the list):

  • Optimizing the underlying data source: BigQuery Optimize query computation
  • Reducing network latency
  • Incremental Refresh
  • Composite Model
  • Manual and/or automatic aggregations
  • Optimizing the Capacity based on SKU Limits
  • Evaluation configuration settings:
  • Maximum number of active connections per source
  • Maximum number of concurrent jobs: the default value is 6, but it can be configured between 1 and 30
  • Large semantic model storage format
  • Assume Referential Integrity Relationship
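Most of the settings above are configured in Power BI Desktop or in the semantic model settings, but refreshes can also be scripted when experimenting with parallelism. A minimal sketch using Semantic Link from a Fabric notebook; the model name is a placeholder, and max_parallelism (which mirrors the enhanced refresh API parameter) is related to, but distinct from, the "maximum number of concurrent jobs" data source setting:

```python
# Runs in a Microsoft Fabric notebook (semantic-link / SemPy is available by default).
import sempy.fabric as fabric

# Trigger an enhanced refresh of the import-mode semantic model ("BQ Import Model" is a placeholder).
fabric.refresh_dataset(
    dataset="BQ Import Model",
    refresh_type="full",
    max_parallelism=10,  # mirrors the enhanced refresh API's maxParallelism
    retry_count=1,
)
print("Refresh request submitted")
```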


2 - OneLake Storage

Microsoft Fabric offers numerous advantages that streamline and enhance data management and analytics. It provides a unified platform for your data across your organization, grounding your AI on your data:

OneLake unifies data storage across different environments, facilitating seamless data sharing and collaboration.

Direct Lake mode for Power BI enables direct access to data stored in OneLake without the need for intermediate data movement; this results in faster data insights and more responsive, data-driven decision-making.

Mirroring simplifies linking external databases into Fabric, with full replicas created in just a couple of clicks. Once a database is mirrored, real-time updates are automatically replicated into OneLake and stored as Delta Parquet. Here is an example of using Mirroring and Direct Lake versus the Direct Query native connector:

Source: Announcing the General Availability of Mirroring for Snowflake in Microsoft Fabric. But... Mirroring is not supported for BigQuery.

Let's discuss some of the challenges that arise with the two prior solutions (Import mode and Direct Query mode):

  • In import mode, the data is duplicated
  • In both modes, the data can be enriched and transformed differently within multiple distinct semantic models
  • In either mode, incorporating other workloads such as streaming, data integration, AI, and others may be challenging (unless you use libraries such as SemPy in PySpark, or OneLake integration for semantic models).

BigQuery's proprietary format doesn't work with Power BI's native data engine. As a result, Power BI must use older methods like APIs to access BigQuery data, leading to performance issues, data limits, and extra costs. These issues present security and governance obstacles, resulting in multiple versions of the truth and overwhelming Google's data sources with numerous queries, aside from the fact that advanced experiences such as the Copilots are limited.

Let's explore two different methods to overcome these limitations by importing data into OneLake.


2.1 - Notebook: Code First approach

The BigQuery Sync (BQSync) project provides an accelerator to help synchronize or migrate data from Google BigQuery to Microsoft Fabric. The primary target use cases for this accelerator are:

  • BigQuery customers who wish to continue to leverage their existing data estate while optimizing their Power BI experience and reducing overall analytics TCO
  • BigQuery customers who wish to migrate all or part of their data estate to Microsoft Fabric

I have uploaded the Installer Notebook to an empty workspace and created a new empty Lakehouse, to which I have uploaded the GCP service account credential JSON file in OneLake:

I have attached the Lakehouse and run the entire Notebook. I have then opened the newly imported Notebook “BQ-Sync-Notebook” within the same workspace, attached the Lakehouse, and uploaded an edited version of the “fabric_bq_config.json” file:

I have executed the entire Notebook:

I have looked at the CU consumption in the Metrics App; the execution consumed around 10k CU(s):

While using a Notebook along with a few libraries (Loading data from Google Big Query by Jakub Vála) would also allow me to import data to OneLake (see the sketch further below), the BQSync solution offers additional features such as:

  • A control table to manage ingestion of in-scope BigQuery tables is created in Fabric
  • A single parameterized pipeline ingests and manages the data identified in the control table
  • A metadata-driven ingestion framework identifies and manages data sync from BigQuery to Fabric
  • Sync modes: Full Load, Watermark, and Partitions
  • Optimal load types (overwrite, append, and merge) are identified via BigQuery table metadata
  • Query parallelism

Some advanced examples: FabricBQSync/Setup/Config/fabric-bq-config.json at main · microsoft/FabricBQSync (github.com)
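For comparison, the plain-notebook approach mentioned before the feature list boils down to a few lines of PySpark. This is a minimal sketch under stated assumptions: the Apache Spark BigQuery connector has been added to the Fabric Spark environment (it is not there by default), a service account key has been uploaded to the Lakehouse Files area, and the project, table, and path names are placeholders.

```python
# Minimal PySpark sketch: read a BigQuery table and land it as a Delta table in the attached Lakehouse.
# 'spark' is the SparkSession pre-created by the Fabric notebook.
# Assumes the spark-bigquery connector is available on the Spark pool.

bq_table = "my-gcp-project.my_dataset.top_terms"           # placeholder BigQuery table
credentials_path = "/lakehouse/default/Files/gcp-sa.json"  # placeholder service account key file

df = (
    spark.read.format("bigquery")
    .option("parentProject", "my-gcp-project")  # project billed for the read
    .option("credentialsFile", credentials_path)
    .option("table", bq_table)
    .load()
)

# Write to the attached Lakehouse as a Delta table.
df.write.format("delta").mode("overwrite").saveAsTable("top_terms")
```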

I have verified the number of rows within the Lakehouse table:
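A quick way to do that check from a notebook attached to the Lakehouse (the table name is a placeholder for the table created by BQSync):

```python
# 'spark' is the SparkSession pre-created by the Fabric notebook.
row_count = spark.sql("SELECT COUNT(*) AS n FROM top_terms").first()["n"]
print(f"{row_count:,} rows in the Lakehouse table")
```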

I have created a custom semantic model in Direct Lake mode:

After a framing operation to evict resident columns from memory, I have queried the data from DAX Studio using the XMLA endpoint; with a cold cache it took 141 milliseconds:

With a hot cache it took 17 milliseconds:

When a query is made to a Direct Lake semantic model, it may revert to Direct Query mode. This allows it to fetch data directly from the SQL analytics endpoint of the Lakehouse or Warehouse. Such queries always provide the most current data since they are not limited by the timeframe of the last framing operation. A query always falls back when the semantic model queries a view in the SQL analytics endpoint, or a table in the SQL analytics endpoint that enforces row-level security (RLS). Also, a query might fall back when the semantic model exceeds the guardrails of the capacity. Whenever feasible, you should design your solution or scale your capacity to prevent Direct Query fallback, as it can lead to reduced query performance.

Fabric supports an additional optimization called V-Order. V-Order is an optimization technique applied during the writing process of the Parquet file format. Implementing V-Order results in a smaller file size, which consequently enhances read speed. This optimization is particularly beneficial for Direct Lake semantic models as it readies data for rapid memory loading, thus reducing the burden on capacity resources. Additionally, it improves query performance by minimizing the amount of memory that needs to be scanned. Delta tables created by Fabric items like data pipelines, dataflows, and notebooks automatically use V-Order. However, Parquet files uploaded to a Fabric Lakehouse or referenced by a shortcut might not include this optimization. While these non-optimized Parquet files are readable, their performance may be slower compared to files with V-Order. It's worth noting that Parquet files with V-Order still follow the open-source Parquet format, so they are readable by non-Fabric tools. For more information, see Delta Lake table optimization and V-Order.
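From a Fabric notebook you can check or set V-Order behavior for the current Spark session before writing. A minimal sketch; the configuration key shown is the one documented in the Fabric V-Order article but may vary across runtime versions, and the table name is a placeholder:

```python
# 'spark' is the SparkSession pre-created by the Fabric notebook.

# Check whether V-Order writing is enabled for this session (Fabric-specific setting).
print(spark.conf.get("spark.sql.parquet.vorder.enabled", "not set"))

# Enable V-Order for subsequent writes in this session.
spark.conf.set("spark.sql.parquet.vorder.enabled", "true")

# It can also be pinned at the table level through a Delta table property.
spark.sql("ALTER TABLE top_terms SET TBLPROPERTIES ('delta.parquet.vorder.enabled' = 'true')")
```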


2.2 - Dataflow Gen 2: Low Code approach

Microsoft Fabric Dataflow Gen 2 provides a low-code interface with a visual tool for creating multi-step data ingestion and transformation via Power Query Online.

I have created a simple Dataflow Gen 2 to import the table using the Google BigQuery connector:

The refresh took around 9 minutes to complete:

The Framing of the custom semantic model built on top of the table from the Lakehouse took 2 seconds:

I have looked at the Dataflow Gen 2 refresh's CU consumption in the Metrics App; the refresh consumed around 9k CU(s):


2.3 - Optimizations

Some optimizations could be implemented for both previous solutions:

  • BigQuery Optimize query computation
  • Network
  • Incremental Refresh
  • Capacity (SKU Limits)
  • Dataflow Gen 2 staging
  • Metadata table maintenance
  • Delta table maintenance (number of Parquet files)
  • Best performance with V-Order
  • Assume Referential Integrity Relationship
  • Direct Lake guardrails (Limits)
  • Strive to reduce cardinality (the number of unique values) in every column of each Delta table
  • As with any data table, Delta tables should only store columns that are required
  • The OPTIMIZE command can also apply V-Order to compact and rewrite the Parquet files (see the maintenance sketch after this list)
  • You can use VACUUM to remove files that are no longer referenced and/or that are older than a set retention threshold
  • Fast copy in Dataflow Gen 2 is not available for the Google connector
  • ...
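As referenced in the list above, the OPTIMIZE and VACUUM maintenance commands can be run from a Fabric notebook. A minimal sketch; the table name is a placeholder and the retention value is the common 7-day default:

```python
# 'spark' is the SparkSession pre-created by the Fabric notebook.

# Compact small Parquet files and apply V-Order to the rewritten files.
spark.sql("OPTIMIZE top_terms VORDER")

# Remove files that are no longer referenced and older than the retention threshold (168 hours = 7 days).
spark.sql("VACUUM top_terms RETAIN 168 HOURS")
```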


3 - Google Cloud Storage

Except for the Direct Query mode, all previous solutions involve duplicating data outside Google Cloud Platform. Given customer constraints, this might not always be feasible. Therefore, why not leverage Microsoft Fabric's Shortcut features to access Delta Parquet files stored in Google Cloud Storage (GCS)?

3.1 - Google Job

The solution utilizes GCP's native data platforms to export BigQuery data into Delta Parquet format, which is then stored in a GCS bucket. Microsoft Fabric can natively read the data from the Cloud Storage bucket, enabling seamless integration with Power BI models. This enables you to take advantage of the advanced data analysis techniques offered by Power BI, such as Power BI's Copilot, while also allowing access to other Microsoft Fabric experiences with their respective Copilots.

Following the BigQuery-to-Fabric steps by Joey Brakefield, I have created a Dataproc cluster and submitted a job on it to export the data as Delta Parquet files to a GCS bucket. The job execution took around 2 minutes:
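The core of such a Dataproc job is a short PySpark script along these lines. This is a minimal sketch rather than the exact script from the linked guide: it assumes the spark-bigquery connector and the Delta Lake package are available when the job is submitted, and the project, table, bucket, and version numbers are placeholders.

```python
# Dataproc PySpark job: export a BigQuery table to Delta Parquet files in a GCS bucket.
# Example submission (package version is a placeholder):
#   gcloud dataproc jobs submit pyspark export_to_delta.py --cluster=my-cluster \
#       --properties=spark.jars.packages=io.delta:delta-spark_2.12:3.1.0
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("bq-to-delta-gcs")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Read the source table (recent Dataproc images ship the spark-bigquery connector).
df = (
    spark.read.format("bigquery")
    .option("table", "my-gcp-project.my_dataset.top_terms")  # placeholder table
    .load()
)

# Write Delta files to the bucket that the Fabric GCS shortcut will point to (placeholder path).
df.write.format("delta").mode("overwrite").save("gs://my-bucket/exports/top_terms")

spark.stop()
```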

I have created a shortcut to the GCS bucket: Create a Google Cloud Storage (GCS) shortcut

Over time, the increasing size of the source data could make this inter-cloud dependency quite expensive. Microsoft Fabric's caching and Delta Parquet's inherent metadata-read-first features help to lower the expenses associated with cross-cloud egress charges.

Shortcut caching is currently only supported for GCS, S3, and S3-compatible shortcuts, and it is disabled by default:

I have created a custom Semantic model:

I have queried the data from DAX Studio using the XMLA endpoint; with a cold cache and the shortcut cache disabled, it took 2,970 milliseconds:

With a hot cache and the shortcut cache disabled, it took 17 milliseconds:

I have activated the shortcut cache and queried the data again after clearing the semantic model cache with a refresh (which involves a framing operation); it took 17 milliseconds:

The Delta Parquet files stored in GCS lack the V-Order optimization, making them less efficient for read operations.


3.2 - Optimizations

Some optimizations could be implemented:

  • BigQuery Optimize query computation
  • Incremental Refresh
  • GCP Cluster Optimizations
  • GCP Job Optimizations
  • GCP Metadata Table Maintenance
  • GCP Delta Table Maintenance
  • Network
  • Capacity (SKU Limits)
  • Assume Referential Integrity Relationship
  • Direct Lake guardrails (Limits)
  • ...


4 - Conclusion

We have seen different solutions for enabling efficient analysis and decision-making when analyzing GCP data in Power BI.

The Import Mode method is ideal for static datasets that do not require frequent updates, while Direct Query is particularly useful for large datasets or scenarios requiring up-to-date information. The Composite Model offers a flexible solution, allowing the combination of Import and Direct Query methods to leverage the strengths of both.

The BQSync solution offers a code-centric approach for synchronizing Google Cloud Platform data with OneLake, whereas Dataflow presents a low-code alternative. Both are compelling options to augment analytics workloads, including the application of AI skills and other advanced functionalities. Additionally, they provide centralized security and monitoring because once they're configured on OneLake, all other workloads adhere to the established security protocols and centralized logging. The total cost of the entire solution is known upfront, with only a few incremental queries sent to GCP.

Lastly, the Shortcut solution offers an in-between alternative keeping the data in the GCP platform while allowing the use of advanced analytics workloads in Microsoft Fabric.

Comments on the previous table:

  • The expenses in GCP will increase when there is a high volume of queries
  • The cost in Fabric will increase based on the size of the semantic model in import mode or based on the size of the compressed data within the OneLake
  • For optimal performance in querying, Import Mode is still the best choice. GCP Shortcut does not provide V-Order optimization, unlike the BQSync and Dataflow solutions
  • Only Direct Query mode prevents data duplication
  • API Requests refer to the Microsoft Fabric API for GraphQL capabilities
  • Analytics Workloads refer to the other analytics experiences available in Microsoft Fabric, which work best when the data is stored in OneLake
  • SemPy: Semantic Link overview for Python, or OneLake integration for semantic models

Please don't take my conclusions (gently biased by my own expertise) for granted; instead, do your own tests on your own data.

With BigQuery tables for Apache Iceberg currently in Preview, new alternatives may become available: BigQuery tables for Apache Iceberg | Google Cloud

Thank you to Alex Antonelli for assisting me with the BQSync solution and to Eric Nguyen Phu Hung for sharing his customers' insightful questions, which contributed to this quick analysis.

We have recorded the demonstrations in a video on the Data Chouette YouTube channel: https://www.youtube.com/watch?v=_OSrZ7CMAxM
