How to Enable a Multi-Cloud Data Mesh: Microsoft Fabric with Google BigQuery
Generated with Microsoft Designer


Introduction

Many of my Health and Life Sciences customers are looking for ways to harness the power and scalability of Microsoft Fabric while preserving previous investments in a multi-cloud approach that uses Google Cloud Platform's (GCP) BigQuery for some of their data analytics needs. That approach can make data feel isolated, forcing enterprises to use the source cloud's native analytics stack. Enter Microsoft Fabric: Fabric makes data hosted in other cloud platforms appear local, so you can continue using native analytics applications, including the world's most popular data visualization tool, Power BI. Let's look at how Fabric enables clouds to "talk to each other" without the need to migrate data by running through an example scenario: creating a Power BI dashboard that combines Fabric-sourced and GCP-sourced data in one report with minimal data movement and cost. So, we know we have data in two separate clouds, but how do we "mesh" the data together without incurring extraneous costs?

A data mesh is a methodology that creates data products for different business data domains in a decentralized, agile manner. What we're going to talk about today are the building blocks of an always up-to-date, two-way data transit system that can populate any data strategy, including data hubs, meshes, and fabrics, without the need for costly replication.

Data methodologies

Microsoft Fabric embraces all of these methodologies and gives customers the ability to adopt any or all of them when building out an analytics platform on Fabric by leveraging a hybrid approach.


Fabric supports all data methodologies by default


The Scenario

BigQuery (BQ) is a serverless data warehouse; however, some BQ customers also want to use Fabric's comprehensive suite of fully integrated data and analytics experiences, either because GCP's equivalents feel disjointed or because there aren't similarly capable products in GCP's current offering. Here are some of the Copilot-powered experiences you'd find in Fabric.

· Power BI – citizen data model development and data visualization

· Data Engineering/Lakehouse – Spark, instant-start live pools, medallion architecture by default

· Data Warehousing – full T-SQL DDL and infinite data modeling for Gold-tier business data products

· Real-Time Analytics – data streaming for IoT devices, telemetry, and logs; analyze massive volumes of semi-structured data with high performance and low latency

· Data Science – end-to-end building, training, deploying, and managing of AI models

We'll be focusing mostly on the Data Engineering, Data Warehousing, and Power BI experiences in this blog, but know that you can leverage any number of these Fabric experiences with externally accessible, non-proprietary formatted data.

Microsoft Fabric experiences in this article

Since we are focused on delivering our data product, a Power BI dashboard, as part of our mesh methodology, we already know Power BI can connect to many data sources like BigQuery. But there's a catch: BigQuery uses a proprietary, non-open-source storage format that is not compatible with Power BI's native data engine. This means Power BI must fall back to legacy approaches like APIs to access BigQuery data, which can result in performance issues, data limitations, and additional costs.

The Solution

Fortunately, there is a way to overcome this challenge and use Microsoft Fabric with BigQuery data natively, without sacrificing quality or functionality. The solution involves using GCP's native data platforms to export data from BigQuery to an open format, namely Delta Parquet, a columnar storage format that supports schema evolution and efficient data compression. The Delta Parquet files can be stored in a Google Cloud Storage (GCS) bucket, GCP's object storage service. Once the data is in the Cloud Storage bucket, Microsoft Fabric can read it natively as if it were local to Fabric, making it available to Power BI models without any issues. This way, you can use all the novel ways to analyze data that Power BI offers, such as Power BI's Copilot, a feature that uses natural language processing to generate insights and suggestions from your data. This intercloud dependency could become very costly as the source data size balloons over time, but Microsoft Fabric's caching and Delta Parquet's metadata-read-first behavior reduce cross-cloud egress charges.

The solution allowing for data to move across clouds
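
To make the export side concrete, here is a minimal PySpark sketch of the idea, assuming a Dataproc cluster that already has the Spark BigQuery connector and the Delta Lake library available. The table, bucket, and path names are illustrative placeholders, not the exact values used in my repo.

```python
from pyspark.sql import SparkSession

# Illustrative placeholders -- substitute your own dataset, table, and bucket.
BQ_TABLE = "bigquery-public-data.cms_medicare.part_d_prescriber_2014"
GCS_DELTA_PATH = "gs://my-export-bucket/delta/part_d_prescriber_2014"

spark = (
    SparkSession.builder
    .appName("bq-to-delta-export")
    # Enable Delta Lake; the library version must match the cluster's Spark build.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Read the source table through the Spark BigQuery connector.
df = spark.read.format("bigquery").option("table", BQ_TABLE).load()

# Land it as Delta Parquet in GCS, where a OneLake shortcut can pick it up.
df.write.format("delta").mode("overwrite").save(GCS_DELTA_PATH)
```

Once the Delta files are in the bucket, the Fabric side needs only a OneLake shortcut pointing at that path; the Delta transaction log lets engines read table metadata first and pull only the Parquet files a query actually touches.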


The Use Case

To demonstrate how this solution works in practice, I have created a GitHub repository that shows how to use part of the publicly hosted CMS Dataset on BigQuery to augment my peers' existing GitHub repository, which uses Medicare Part D data over many years to present a dashboard; the result integrates data from GCP's BigQuery alongside the data native to Fabric. The CMS Dataset on BigQuery contains various data sets from the Centers for Medicare and Medicaid Services (CMS), such as Medicare claims, provider information, and quality measures; we're just using a small slice, prescribers for 2014, as the interoperability proof of concept. My peers' Fabric Medicare Part D GitHub repository contains information on prescription drug spending and utilization for Medicare beneficiaries. The dashboard that I have created shows how to compare and contrast the trends and patterns in these data sets using Power BI's visualizations and filters alongside data from GCP's BigQuery.

The Steps

Below is a bullet-point list of what is covered in detail in my GitHub repo. Please follow this link for additional implementation details.

NOTE: If you want to replicate this guide exactly, you'll first need to deploy my peers' GitHub repository for implementing Medicare Part D on Fabric.

· To initialize this POC, I spun up BigQuery and used its public dataset browser to reference the Medicare Part D prescribers for 2014 table inside BQ's bigquery-public-data project (cms_medicare → part_d_prescriber_2014). I ran some simple SELECT statements in BQ just to ensure this was the data I wanted to eventually integrate into the low-data-movement, mixed-mode Power BI report.

· Next, we spin up a Dataproc cluster, GCP's managed Spark and Hadoop service that lets you run Apache Spark jobs on GCP. The specifics of this step are critical because of Dataproc's image versions. I leaned HEAVILY on this guide and code and adapted my code to pull from the BQ public datasets. Key observation: the libraries mentioned in the original blog are a couple of years old, so I used the older Dataproc image versions so that I didn't have to refactor the code on the fly.

· The next step is to write a Spark job that reads the BigQuery tables you want to export and writes them to Delta Parquet files in a Cloud Storage bucket. Again, this step is found here. We'll use the Spark BigQuery connector, a library that allows Spark to read and write data from BigQuery, along with the Delta Lake connector, a library that allows Spark to read and write Delta Parquet files. The PySpark job uses both connectors and either seeds the target or copies changes from a given BQ table into Delta format in a GCS bucket (see the sketch after this list).

  • NOTE: This setup allows for parameterized, continuous export of locked-away, proprietary BQ data into open-source Delta Parquet. Not only does this allow other platforms like Fabric to read the data without ETL, but it also enables a slew of other capabilities in your enterprise, namely in Data Science (ML model training, etc.) and Data Engineering (data caching and/or cross-cloud interconnects to keep multi-cloud costs to a minimum).

· The final step is to connect Microsoft Fabric to the Cloud Storage bucket that contains the Delta Parquet files. You can use the Google Cloud Storage shortcut for OneLake, a connector that allows Fabric to access data stored in Cloud Storage natively, to populate a Fabric Lakehouse of your choosing. In my case, my lakehouse was called "CMS_Lakehouse" (already created by my peers' repository) and my shortcut name was "BQHospitalSampleDeltaExport". Once you have created a shortcut, you can use the default semantic model or bring your Lakehouse data into another model.

· Because we are augmenting the existing model, we will need to update the model generated from the Part D repo to include data from GCP to simulate using data from other clouds in a mixed-mode Power BI report. To do this, we'll modify the semantic model to include data from the Lakehouse backed by the GCS data over in Google Cloud.
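
To illustrate the seed-or-copy-changes behavior mentioned in the Spark job step above, here is a hedged sketch of how such a job could branch. The merge keys and paths are assumptions for illustration, the Spark session is assumed to be configured for Delta as in the earlier sketch, and the actual job in the repo is parameterized and may use different keys and logic.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-delta-incremental").getOrCreate()

# Illustrative parameters -- the real job takes these as arguments.
BQ_TABLE = "bigquery-public-data.cms_medicare.part_d_prescriber_2014"
GCS_DELTA_PATH = "gs://my-export-bucket/delta/part_d_prescriber_2014"
MERGE_KEYS = ["npi", "drug_name"]  # assumed business keys for this table

source_df = spark.read.format("bigquery").option("table", BQ_TABLE).load()

if not DeltaTable.isDeltaTable(spark, GCS_DELTA_PATH):
    # First run: seed the Delta table in GCS.
    source_df.write.format("delta").mode("overwrite").save(GCS_DELTA_PATH)
else:
    # Subsequent runs: upsert changes into the existing Delta table.
    target = DeltaTable.forPath(spark, GCS_DELTA_PATH)
    condition = " AND ".join(f"t.{k} = s.{k}" for k in MERGE_KEYS)
    (
        target.alias("t")
        .merge(source_df.alias("s"), condition)
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )
```

Because the upsert rewrites only the Parquet files that actually change, downstream readers such as the OneLake shortcut see an always up-to-date table without a full re-export on every run.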

Conclusion

In this blog post, I have shown you how to use Microsoft Fabric with Google BigQuery data natively by using GCP's Dataproc and Cloud Storage services to export data from BigQuery to an open format, Delta Parquet, and then using Microsoft Fabric to access and analyze the data as if it were native. This solution allows you to leverage the existing investments your organization has made in a multi-cloud environment and BigQuery alongside the versatility and functionality of Microsoft Fabric. You can also use this solution to integrate data from other sources, such as your own databases, files, or APIs, and create comprehensive and insightful data analytics solutions for your Health and Life Sciences customers. If you want to learn more about this solution, check out my GitHub repository, which provides the code and instructions for the use case described here. I hope you find this blog post useful and informative, and I welcome your feedback and questions.


Final Dashboard: Leveraging BigQuery with Native Delta Parquet Caching and Performance

