Exploring Features and Current Limitations of Microsoft Fabric
This article was co-authored by Arnab Dasgupta and was mentored by Sandeep Subash
1. Overview
Microsoft Fabric is an end-to-end analytics platform that provides a single, integrated environment for data professionals and the business to collaborate on data projects. Fabric offers a set of integrated services that let us ingest, store, process, and analyze data in one place, and it integrates with the tools the business needs to make decisions. It is a unified software-as-a-service (SaaS) offering, with all data stored in a single open format in OneLake, which is accessible by every analytics engine in the platform. Microsoft Fabric offers scalability, cost-effectiveness, accessibility from anywhere with an internet connection, and continuous updates and maintenance provided by Microsoft.
Microsoft Fabric covers the features of existing equivalent technologies and adds a few new ones, which are listed in this article. Certain features are planned for a future Microsoft release and are currently unavailable in Fabric; these are covered here as well.
2. Microsoft Fabric Pricing
For this PoC, we used storage, warehouse tables, pipelines, notebooks, and reporting features. The table below compares these features when using individual Azure resources versus Microsoft Fabric.
The highlighted capacity is the basic configuration of Fabric that was used for the PoC.
Below is the basic costing for each of the individual Azure components:
3. OneLake
OneLake, a data-lake-as-a-service solution, is a single, logical data lake for the entire organization. Like OneDrive, it is designed to store all our analytical data. OneLake consists of a centralized tenant containing multiple workspaces, and each workspace contains one or more lakehouses or warehouses.
Benefits of using OneLake: Lakehouse
1. Centralized tenant with distributed ownership for an organization: Fabric OneLake eliminates the overhead of managing several resources for storing data. Instead, each OneLake workspace can be assigned to a business group, and, if needed, data can be shared among multiple workspaces by granting the required access. These workspaces are part of a centralized tenant (with no infrastructure to manage) which provides a natural governance and compliance boundary under the control of a tenant admin. Each workspace belongs to a capacity that is tied to a specific region and is billed separately.
2. Integrated with other Azure services: We can use just the OneLake of Microsoft Fabric for storage and governance, and have our code and files developed and managed in other Azure services such as:
3. OneLake integrated with Windows: Just like OneDrive, OneLake can be managed and explored from Windows. We can upload, download, and modify our files in Windows, and the changes are synced and reflected in our OneLake objects. We can view data from every workspace we have access to. The screenshot below is for a user who has access to both workspaces, wrkspc_deptA and wrkspc_deptB. To access and manage OneLake locally from Windows, we need to download and install "OneLake file explorer for Windows".
4. Connect to multiple sources: We have the provision to pull data from multiple sources into OneLake.
5. Shortcuts: OneLake gives us a valuable way to maintain a single copy of data. This is where shortcuts come into the picture: a shortcut connects to data across business domains without any actual data movement. The data stays at the source location, and we create a reference to it. The source can be within the same workspace, in a different workspace within OneLake, or outside OneLake (ADLS Gen2, AWS S3, GCP storage, or Dataverse). Regardless of where the shortcut is created from, the reference makes the files and folders appear as though they are stored locally. The highlighted option is a shortcut pulled from ADLS Gen2.
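Conceptually, a shortcut behaves much like a filesystem symbolic link: the bytes stay at the source, and the reference makes them readable in place. As a local illustration only (this is not the OneLake API; the paths and file contents are invented), a Python sketch:

```python
import pathlib
import tempfile

# Stand-in for the remote source (e.g. an ADLS Gen2 container).
base = pathlib.Path(tempfile.mkdtemp())
source = base / "adls_gen2" / "sales.csv"
source.parent.mkdir()
source.write_text("id,amount\n1,100\n")

# The "shortcut" in the workspace: a reference, not a copy.
shortcut = base / "lakehouse" / "sales.csv"
shortcut.parent.mkdir()
shortcut.symlink_to(source)

# Reading through the shortcut returns the source data directly.
print(shortcut.read_text() == source.read_text())  # True
```

Only one physical copy of `sales.csv` exists; the second path is purely a reference, which is the property shortcuts give us across workspaces and clouds.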
6. Unified governance policies are easier to enforce: Since all workspaces reside in the same tenant, it is much easier to set governance policy at the tenant level.
7. Data mesh as a service: Business groups can operate autonomously within a shared data lake, eliminating the need to manage separate storage resources. A business domain can have multiple workspaces, which typically align with specific projects or teams.
Note: Multiple access tiers are not yet available in OneLake and may be included in a future Microsoft release. Until then, we can use an ADLS Gen2 shortcut and keep files in the hot, cool, and archive access tiers of ADLS Gen2. Note that archive-tier files must be rehydrated before they can be read.
Benefits of using OneLake: Warehouse
A warehouse in Fabric is built on a relational schema to support SQL queries on structured data. It is integrated with Power BI to help us readily create visualizations for reporting.
1. Cross-database queries by creating shortcuts: With data arriving from multiple sources via shortcuts, we can query data from different databases and sources and pull out analyses and metrics.
2. Cross-warehouse / lakehouse queries: We can query tables that are part of different warehouses and lakehouse SQL endpoints. Delta tables created through Spark in a lakehouse are automatically discoverable as tables in the SQL endpoint.
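For illustration only (SQLite here, not the Fabric SQL endpoint; all database and table names are invented), the idea of querying two databases from a single connection with multi-part naming looks like this:

```python
import os
import sqlite3
import tempfile

d = tempfile.mkdtemp()
con = sqlite3.connect(os.path.join(d, "wh_a.db"))
con.execute("CREATE TABLE orders (id INTEGER, product_id INTEGER)")
con.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10), (2, 20)])

# Attach a second database, the way a warehouse can see another
# warehouse's or a lakehouse's tables once they share an endpoint.
path_b = os.path.join(d, "wh_b.db")
con.execute(f"ATTACH DATABASE '{path_b}' AS wh_b")
con.execute("CREATE TABLE wh_b.products (id INTEGER, name TEXT)")
con.executemany("INSERT INTO wh_b.products VALUES (?, ?)",
                [(10, "pen"), (20, "book")])

# Multi-part naming, similar to warehouse.schema.table in Fabric.
rows = con.execute(
    "SELECT o.id, p.name FROM orders o "
    "JOIN wh_b.products p ON o.product_id = p.id ORDER BY o.id"
).fetchall()
print(rows)  # [(1, 'pen'), (2, 'book')]
```

The join resolves across both databases in one statement, which is the experience the shared SQL endpoint gives us in Fabric.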
3. Visual query: If we want to avoid SQL coding, we can drag and drop sources and transformations to build a visual query and analyze the data.
4. Data Engineering: Spark Integration
Microsoft Fabric has integrated Spark and notebooks, much like the coding experience in Databricks. It lets us configure libraries, JARs, and wheel packages in a way similar to Databricks. We can integrate our notebooks with Visual Studio Code and benefit from its features as well. We can write our code in Java, Scala, Python, and SQL.
1. Visualizing our data: We can visualize our data in Spark using reports. In the results section, we select the "Chart" option, pick the visualization type plus the key and values, and the chart is created for analysis. We can download these charts in JPEG, PNG, or SVG format.
2. Starter pools: Starter pools are a fast and easy way to use Spark on the Microsoft Fabric platform within seconds. Spark sessions can be used right away, instead of waiting for Spark to set up the nodes. Starter pools use medium nodes that dynamically scale up based on Spark job requirements. We are charged for capacity consumption only while our notebook or Spark job definition is executing, not for the time the clusters sit idle in the pool.
3. Data Wrangler: Spark notebooks coexist with the "Data Wrangler" functionality, where we can drag and drop transformations, view the results side by side, and convert them into notebook code. The input passed to Data Wrangler is a source pandas DataFrame, which we can pull through OneLake from the same or a different workspace.
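Data Wrangler ultimately emits plain pandas code for the steps chosen in the UI. A sketch of the kind of code it generates (the DataFrame contents, column names, and chosen steps are invented for illustration):

```python
import pandas as pd

# Stand-in for a table pulled from OneLake into pandas.
df = pd.DataFrame({
    "customer": ["Alice", "Bob", None, "Dana"],
    "amount": [120.0, 75.5, 40.0, 310.25],
})

def clean_data(frame: pd.DataFrame) -> pd.DataFrame:
    # Typical generated steps: drop rows with a missing key,
    # then derive a rounded column.
    out = frame.dropna(subset=["customer"]).copy()
    out["amount_rounded"] = out["amount"].round(0)
    return out

cleaned = clean_data(df)
print(len(cleaned))  # 3
```

Because the output is ordinary pandas, the generated function can be reused directly in the notebook after the interactive session ends.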
5. Data Factory: Pipeline and Dataflow
Data Factory combines the simplicity of Power Query with the scale and power of Azure Data Factory in the form of data pipelines. It covers the items shown in the snapshot below.
Dataflow GEN2:
Creation: Please refer to the below snapshot for creating a new dataflow inside the required workspace.
Dataflow Gen2 is mainly used for ETL. It uses Power BI's M query language for transformations, and it also offers a visual query option where we can achieve the same result visually without writing any M code. It can connect to both on-premises and cloud environments; for on-premises connectivity, a gateway needs to be created and installed, similar to a self-hosted integration runtime (SHIR) in ADF. The result can also be saved to tables in a KQL database, as it is tightly integrated into the Fabric workspace.
Advantages:
1. It does not use Spark compute for transformations, unlike Dataflow Gen1 in ADF or Synapse, and is therefore cost-effective.
2. It performs no action until a refresh is triggered for the dataflow.
3. It provides a no-code ETL platform integrated with the lakehouse, warehouse, and KQL database, to name a few, even across multiple workspaces in the Fabric capacity.
4. It can be called from a data pipeline for orchestration, or scheduled for refresh directly if needed.
Note: To load the result of a dataflow into a table in a custom schema inside a Fabric warehouse, the table must already exist there and must be accessed as an existing table rather than a new one. A new table is auto-created during loading only when targeting the default dbo schema.
Data Pipeline:
Creation: Please refer to the below snapshot for creating a new data pipeline inside the required workspace.
About: This is the Azure Data Factory / Synapse Analytics data pipeline integrated into the Fabric environment.
Advantages:
1. The notifications section is a new and useful alert feature added only in Fabric data pipelines; it is not present in conventional Azure Data Factory or Synapse data pipelines.
2. The Outlook feature allows the pipeline to send custom mails, with a customized mail body, to targeted users via Outlook.
3. The Teams feature allows the pipeline to send customized notifications to a target Teams channel or group using a credential.
Note: The notebook or dataflow to be called, and the source/target lakehouse or warehouse, must be in the same workspace as the pipeline.
6. Power BI
Power BI ensures that business owners can access all the data in Fabric quickly and intuitively to make better data-driven decisions. It is fully integrated with the entire suite of Fabric products.
Direct Lake:
Direct Lake mode is a new dataset capability for analyzing very large data volumes in Power BI. It loads files directly from the data lake without querying a lakehouse endpoint and without importing or duplicating data into a Power BI dataset. Direct Lake is a fast path to load data from the lake straight into the Power BI engine, ready for analysis. The diagram below compares the classic Import and DirectQuery modes with the new Direct Lake mode.
Note: Direct Lake is supported only on Power BI Premium P and Microsoft Fabric F SKUs. It is not supported on Power BI Pro, Premium Per User, or Power BI Embedded A/EM SKUs.
7. Real Time Analytics
Please find the below snapshot of Microsoft Fabric Synapse Real-Time Analytics Home Page.
This is a fully managed big-data analytics platform optimized for streaming and time-series data. It uses the Kusto Query Language (KQL), which offers exceptional performance for searching structured, semi-structured, and unstructured data. It is fully integrated with the entire suite of Fabric products for data loading, data transformation, and advanced visualization scenarios.
Below is a snapshot of sample KQL query and its graphical result in the Fabric environment.
8. Synapse Data Science
This section enables end-to-end data science workflows for data enrichment and business insights. Everything from data exploration, preparation, and cleansing through experimentation, modelling, model scoring, and serving predictive insights to BI reports can be done here.
Please refer to the below snapshot of the Data Science home page. Here we can create machine learning experiments, models, and notebooks, and can also import existing notebooks.
Please find the below sample from a data science notebook.
Please find the below sample experiment and model respectively in Fabric interface.
9. Data Governance: Purview Integration
Microsoft Purview integration with Fabric: Microsoft Purview provides a unified data governance solution that helps manage and govern the data loaded into tables in Microsoft Fabric. It helps us discover data, classify sensitive data, and create data lineage. We can create a data loss prevention (DLP) policy at the workspace level for workspaces hosted on Premium capacity, and we can set up alert mail notifications as needed.
The Purview hub insights report enables administrators to visualize and analyze, in greater detail, the extent and distribution of endorsement and sensitivity labelling throughout their organization's Fabric data estate.
10. Deployment and GIT Integration:
Git integration:
Git integration in Microsoft Fabric enables teams to bring their development processes, tools, and best practices into the Fabric platform. The integration with source control is at the workspace level. It allows developers working in the Fabric environment to:
Below is a snapshot of the steps of Git Integration in Fabric Environment.
Deployment in Fabric:
Fabric's deployment pipelines tool provides a production environment where we can collaborate to manage the lifecycle of organizational content. Deployment pipelines let us develop and test content in the service before it reaches end users. Currently, the supported content types include reports, paginated reports, dashboards, datasets, and dataflows.
Please find the below snapshot to get a brief idea about deployment pipelines.
11. Summarized Benefits of Fabric
13. Architecture and Design of POC
*Dummy data was used to do the POC.
We created a lakehouse and uploaded a structured file to it, and also created a OneLake shortcut from an ADLS path to the lakehouse. The ADLS path also contains a CSV file.
A PySpark notebook is used to cleanse the data and store the result as delta files. These are then used to prepare the respective dimension and fact tables in the lakehouse.
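The dimension/fact split the notebook performs can be sketched as follows. This sketch runs Spark-free in pandas for brevity; the PoC itself used PySpark with delta tables, and the data and column names below are dummy values:

```python
import pandas as pd

# Dummy raw data standing in for the cleansed delta files.
raw = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "product": ["pen", "book", "book", "pen"],
    "qty": [3, 1, 1, 5],
})

# Cleanse: remove exact duplicate rows.
clean = raw.drop_duplicates().reset_index(drop=True)

# Dimension table: distinct products with a surrogate key.
dim_product = clean[["product"]].drop_duplicates().reset_index(drop=True)
dim_product["product_key"] = dim_product.index + 1

# Fact table: orders keyed to the product dimension.
fact_sales = (clean.merge(dim_product, on="product")
                   [["order_id", "product_key", "qty"]])
print(len(clean), len(dim_product), len(fact_sales))  # 3 2 3
```

The same pattern (deduplicate, build a keyed dimension, join it back onto the fact rows) carries over directly to the PySpark DataFrame API.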
A Dataflow Gen2 is used to prepare the transformations needed to make the data ready for Power BI to consume. The transformation results are saved as tables in the warehouse, and views pointing to those tables are created in the warehouse's custom schema. A star-schema model is prepared in the warehouse from the final views.
The resulting model is consumed by a Power BI report using Direct Lake.
14. Step by Step Process of POC
Create a fabric resource in the resource group:
Now, search for “Microsoft Fabric”
Give the Fabric capacity name, region, and size as per our needs and click "Review and Create"
Log in to the Fabric portal
Workspace creation: We can use an existing workspace, or create a new one as below:
Add admin to our workspace:
Create new lakehouse:
Uploading files in Lakehouse:
File available in Onelake – Lakehouse:
Create shortcut with ADLS GEN2 as source:
Configure shortcut:
The output can be seen below with shortcut symbol
Create a Notebook:
Create the notebook and populate the tables as below:
Create new Pipeline:
Create new Dataflow:
Sample Model in Data Warehouse Section:
Create New Report in Power BI Section:
Sample Report:
*Microsoft's official documentation was referenced while developing this article.