Exploring Features and Current Limitations of Microsoft Fabric

This article was co-authored by Arnab Dasgupta and mentored by Sandeep Subash


1. Overview

Microsoft Fabric is an end-to-end analytics platform that provides a single, integrated environment for data professionals and the business to collaborate on data projects. Fabric provides a set of integrated services that enable us to ingest, store, process, and analyze data in one place, and it integrates with the tools the business needs to make decisions. It is a unified software-as-a-service (SaaS) offering, with all the data stored in a single open format in OneLake, which is accessible by every analytics engine in the platform. Microsoft Fabric offers scalability, cost-effectiveness, accessibility from anywhere with an internet connection, and continuous updates and maintenance from Microsoft.

Microsoft Fabric covers the features of the existing equivalent technologies and adds a few of its own, which are listed in this article. Certain features are planned for a future release and are currently unavailable in Fabric; these are covered here as well.

Fabric Home Page

2. Microsoft Fabric Pricing

For this PoC, we used the storage, warehouse table, pipeline, notebook, and reporting features. The table below compares these features when using individual Azure resources against Microsoft Fabric.

The highlighted capacity is the basic Fabric configuration that was used for the PoC.

Fabric Pricing

Below is the basic costing for the individual Azure components:

Equivalent Tech Stack Pricing

3. OneLake

OneLake, a data-lake-as-a-service solution, is a single, logical data lake for the entire organization. Like OneDrive, it is designed to store all our analytical data. OneLake consists of a centralized tenant containing multiple workspaces, and each workspace contains one or more lakehouses or warehouses.

OneLake Structure
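OneLake exposes an ADLS Gen2-compatible endpoint, so any item in any workspace can be addressed with a familiar abfss:// URI. Below is a minimal sketch of how such paths are composed; the workspace and lakehouse names are illustrative assumptions, not part of the PoC.

```python
# Sketch: composing a OneLake ABFS path for a lakehouse file.
# The workspace ("wrkspc_deptA") and lakehouse ("Sales") names are made up;
# the host and path pattern follow OneLake's ADLS Gen2-compatible endpoint.

def onelake_path(workspace: str, item: str, item_type: str, relative: str) -> str:
    """Build an abfss:// URI for a file inside a OneLake item."""
    return (
        f"abfss://{workspace}@onelake.dfs.fabric.microsoft.com/"
        f"{item}.{item_type}/{relative}"
    )

print(onelake_path("wrkspc_deptA", "Sales", "Lakehouse", "Files/raw/orders.csv"))
```

Such a URI can then be handed to any ADLS-aware client or Spark reader.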

Benefits of using OneLake: Lakehouse

1. Centralized tenant with distributed ownership for an organization: Fabric OneLake eliminates the overhead of managing several storage resources. Instead, each OneLake workspace can be assigned to the respective business group, and if needed, data can be shared across workspaces by granting the required access. These workspaces are part of a centralized tenant (with no infrastructure to manage) that provides a natural governance and compliance boundary under the control of a tenant admin. Each workspace belongs to a capacity that is tied to a specific region and billed separately.

2. Integrated with other Azure services: We can use just the OneLake part of Microsoft Fabric for storage and governance, and develop and manage our code and files in other Azure services such as:

  • Azure Synapse Analytics
  • Azure Storage Explorer
  • Azure Databricks
  • Azure HDInsight

3. OneLake integrated with Windows: Just like OneDrive, OneLake can be managed and explored from Windows. We can upload, download, and modify our files locally, and the changes are synced and reflected in our OneLake objects. We can view data from every workspace we have access to. The screenshot below is for a user with access to both workspaces, wrkspc_deptA and wrkspc_deptB. To access and manage OneLake locally from Windows, we need to download and install "OneLake file explorer for Windows".

OneLake Integrated with Windows
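Because the synced content is exposed as ordinary folders, any regular file API can work with it once OneLake file explorer is installed. A small sketch; the local root path shown in the comments is a hypothetical example:

```python
from pathlib import Path

# After syncing, OneLake content appears as normal folders, e.g. under a local
# root such as "C:\\Users\\<user>\\OneLake - Microsoft\\wrkspc_deptA" (illustrative).
# Any file API works; here we recursively list the CSV files under such a root.

def list_csv_files(root: str) -> list[str]:
    """Return sorted paths of all CSV files below the given folder."""
    return sorted(str(p) for p in Path(root).rglob("*.csv"))

# Hypothetical usage against a synced workspace root:
# print(list_csv_files(r"C:\Users\me\OneLake - Microsoft\wrkspc_deptA"))
```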

4. Connect to multiple sources: We can pull data into OneLake from multiple sources.

  • Shortcuts (OneLake objects, AWS, ADLS Gen2, GCP, Dataverse)
  • AWS (Redshift, S3, RDS for SQL Server)
  • Azure (Blob Storage, Cosmos DB, Azure Data Explorer, ADLS Gen1/Gen2, Azure SQL Database, Azure Database for PostgreSQL, Table Storage, Synapse Analytics)
  • GCP (Cloud Storage)
  • Google Analytics
  • KQL database
  • Snowflake
  • Salesforce objects/reports
  • Microsoft Fabric (dataflows, datamarts, warehouses, lakehouses, KQL databases)
  • SharePoint objects and HTTP links
  • On-premises and local files (Excel, text, XML, JSON, folder, PDF, Parquet)
  • Other services (Webtrends Analytics, Quickbase, Dynamics 365, Adobe Analytics, Microsoft Exchange Online, Apache Impala, Spark database, Dataverse, Microsoft 365, etc.)

5. Shortcut: OneLake gives us a most valuable way to maintain a single copy of data. This is where shortcuts come into the picture: we connect to data across business domains without any actual data movement. The data stays at the source location, and we create a reference to it. The source can be within the same workspace, in a different workspace within OneLake, or outside OneLake (ADLS Gen2, AWS S3, GCP storage, or Dataverse). Wherever the shortcut is created, the reference makes the files and folders appear as though they are stored locally. The highlighted option is a shortcut pulled from ADLS Gen2.

Data pulled via Shortcut

6. Unified governance policies easier to enforce: Since all the workspaces reside in the same tenant, it is much easier to set governance policies at the tenant level.

7. Data mesh as a service: Business groups can operate autonomously within a shared data lake, eliminating the need to manage separate storage resources. A business domain can have multiple workspaces, which typically align with specific projects or teams.

Note: Multiple access tiers are not yet available in OneLake and may be included in a future Microsoft release. Until then, we can use an ADLS Gen2 shortcut and keep files in the hot, cool, and archive access tiers of ADLS Gen2. Also note that archive-tier files must be rehydrated before they can be read.

Benefits of using OneLake: Warehouse

The warehouse in Fabric is built on a relational schema to support SQL queries on structured data. It is integrated with Power BI so we can readily create visualizations for reporting.

1. Cross-database queries by creating shortcuts: With data arriving from multiple sources via shortcuts, we can query data across different databases and sources and pull out analysis and metrics.

2. Cross-warehouse/lakehouse queries: We can query tables that belong to different warehouse and lakehouse SQL endpoints. Delta tables created through Spark in a lakehouse are automatically discoverable as tables in the SQL endpoint.

3. Visual query: If we want to avoid SQL coding, we can use the visual query editor to drag and drop sources and transformations and build a visual query to analyze the data.

Visual Query
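The cross-warehouse querying described above relies on T-SQL three-part naming, warehouse.schema.table, so a single statement can join objects across warehouses and lakehouse SQL endpoints in the same workspace. The helper below only assembles such a query string; every object name in it is a hypothetical example.

```python
# Sketch: building a cross-warehouse T-SQL statement with three-part naming.
# All warehouse, schema, and table names are illustrative assumptions; the
# generated SQL would be run against the Fabric warehouse SQL endpoint.

def cross_warehouse_join(left: str, right: str, key: str) -> str:
    """Return a T-SQL join between two fully qualified (3-part) table names."""
    return (
        f"SELECT l.*, r.*\n"
        f"FROM {left} AS l\n"
        f"JOIN {right} AS r ON l.{key} = r.{key};"
    )

sql = cross_warehouse_join(
    "SalesWarehouse.dbo.FactOrders",     # table in one warehouse
    "FinanceLakehouse.dbo.DimCustomer",  # table exposed by a lakehouse SQL endpoint
    "CustomerID",
)
print(sql)
```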

4. Data Engineering: Spark Integration

Microsoft Fabric has Spark and notebooks integrated, much like the coding experience in Databricks. It lets us configure libraries, JARs, and wheel packages in a way similar to Databricks. We can also integrate our notebooks with Visual Studio Code and benefit from its features. Code can be written in Java, Scala, Python, and SQL.

Visualizing data

1. Visualizing our data: We can visualize our data in Spark using charts. In the results section, we can select the "Chart" option, choose the visualization type and the keys and values, and have the chart created for analysis. These charts can be downloaded in JPEG, PNG, or SVG format.

2. Starter pools: Starter pools are a fast and easy way to use Spark on the Microsoft Fabric platform within seconds: Spark sessions can be used right away instead of waiting for Spark to set up the nodes. They use medium nodes that dynamically scale up based on the Spark job's requirements. We are charged for capacity consumption only when we execute a notebook or Spark job definition, not for the time clusters sit idle in the pool.

Starter Pool

3. Data Wrangler: Spark notebooks coexist with the "Data Wrangler" functionality, where we can drag and drop transformations, preview the results in parallel, and convert them into notebook code. The input to Data Wrangler is a source pandas DataFrame that we can pull through OneLake from the same or a different workspace.

Data Wrangler
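Since Data Wrangler emits plain pandas code, the generated snippet can be read, edited, and rerun like any other cell. Below is a hand-written equivalent of the kind of code it produces; the sample DataFrame and its column names are assumptions for illustration.

```python
import pandas as pd

# Illustrative stand-in for a Data Wrangler-generated cleaning function.
# The sample frame and columns ("order_id", "amount", "region") are assumptions.
df = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "amount": [100.0, None, 250.0, 75.0],
    "region": [" east ", "west", "west", "EAST"],
})

def clean(frame: pd.DataFrame) -> pd.DataFrame:
    out = frame.drop_duplicates(subset="order_id", keep="first")  # dedupe on the key
    out = out.dropna(subset=["amount"])                           # drop rows missing a measure
    out["region"] = out["region"].str.strip().str.lower()         # normalize text values
    return out.reset_index(drop=True)

print(clean(df))
```

Each drag-and-drop step in the Wrangler UI maps to one such pandas operation in the generated cell.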

5. Data Factory: Pipeline and Dataflow

Data Factory combines the simplicity of Power Query with the scale and power of Azure Data Factory in the form of data pipelines. It covers the items shown in the snapshot below.

Data Factory Home Portal

Dataflow GEN2:

Creation: Please refer to the snapshot below for creating a new dataflow inside the required workspace.

Dataflow Gen2 Option

Dataflow Gen2 is mainly used for ETL. It uses Power BI's M query language for transformations and also offers a visual query option that achieves the same result without writing any M code. It can connect to both on-premises and cloud environments; for on-premises connections, a gateway needs to be created and installed, similar to the self-hosted integration runtime (SHIR) in ADF. The results can also be saved to tables in a KQL database, as it is tightly integrated into the Fabric workspace.

Advantages:

1. It does not use Spark compute for transformations, unlike the Gen1 dataflows of ADF or Synapse, making it cost-effective.

2. It performs no action until a refresh is triggered for the dataflow.

3. It provides a no-code ETL platform integrated with the lakehouse, warehouse, and KQL database, to name a few, even across multiple workspaces within the Fabric capacity.

4. It can be called from a data pipeline for orchestration, or scheduled for refresh directly if needed.

Note: To load the result of a dataflow into a table in a custom schema inside a Fabric warehouse, the table must already exist and be accessed as an existing table rather than a new one. A new table is auto-created during loading only when targeting the default dbo schema.

Data Pipeline:

Creation: Please refer to the snapshot below for creating a new data pipeline inside the required workspace.

Data Pipeline Option

About: This is the Azure Data Factory/Synapse Analytics data pipeline integrated into the Fabric environment.

Created Pipeline

Advantages:

1. The notifications section is a new and useful alerting feature available only in Fabric data pipelines; it is not present in conventional Azure Data Factory or Synapse data pipelines.

Notifications

2. The Outlook feature allows the pipeline to send custom mails to targeted users via Outlook, along with a customized mail body.

3. The Teams feature allows the pipeline to send customized notifications to a target Teams channel or group using a credential.

Note: The notebook or dataflow being called, and the source/target lakehouse or warehouse, should be in the same workspace as the pipeline.


6. Power BI

Power BI ensures that business owners can access all the data in Fabric quickly and intuitively to make better decisions with data. It is fully integrated with the entire suite of Fabric products.

Direct Lake:

Direct Lake mode is a new dataset capability for analyzing very large data volumes in Power BI. It loads files directly from the data lake, without querying a lakehouse endpoint and without importing or duplicating data into a Power BI dataset: a fast path to load data from the lake straight into the Power BI engine, ready for analysis. The diagram below compares the classic Import and DirectQuery modes with the new Direct Lake mode.

Note: Direct Lake is supported on Power BI Premium P and Microsoft Fabric F SKUs only. It is not supported on Power BI Pro, Premium Per User, or Power BI Embedded A/EM SKUs.


7. Real Time Analytics

Please find below a snapshot of the Microsoft Fabric Synapse Real-Time Analytics home page.

Real-Time Analytics Home Page

This is a fully managed big data analytics platform optimized for streaming and time-series data. It uses the Kusto Query Language (KQL), with exceptional performance for searching structured, semi-structured, and unstructured data. It is fully integrated with the entire suite of Fabric products for data loading, data transformation, and advanced visualization scenarios.

Below is a snapshot of a sample KQL query and its graphical result in the Fabric environment.

Graphical Result Analysis

8. Synapse Data Science

This section enables end-to-end data science workflows for data enrichment and business insights. Everything from data exploration, preparation, and cleansing to experimentation, modelling, model scoring, and serving predictive insights to BI reports can be done here.

Please refer to the snapshot below of the Data Science home page. Here we can create machine learning experiments, models, and notebooks, as well as import existing notebooks.

Data Science Home Page

Please find below a sample from a data science notebook.

Sample Data Science Notebook

Please find below a sample experiment and model, respectively, in the Fabric interface.

Sample Experiment

9. Data Governance: Purview Integration

Microsoft Purview integration with Fabric: Microsoft Purview provides a unified data governance solution to help manage and govern the data loaded into tables in Microsoft Fabric. It helps us discover data, classify sensitive data, and build data lineage. We can create data loss prevention (DLP) policies at the workspace level for workspaces hosted on Premium capacity, and configure alert mail notifications as needed.

Purview Integration

The Purview hub insights report enables administrators to visualize and analyze in greater detail the extent and distribution of endorsement and sensitivity labelling throughout their organization's Fabric data estate.

Purview hub Insights Report

10. Deployment and GIT Integration:

Git integration:

Git integration in Microsoft Fabric enables teams to bring their development processes, tools, and best practices into the Fabric platform. The integration with source control is at the workspace level. It allows developers working in the Fabric environment to:

  • Backup and version their work
  • Revert to previous stages as needed
  • Collaborate with others or work alone using Git branches
  • Leverage the capabilities of familiar source control tools to manage Fabric items.

Below is a snapshot of the steps of Git Integration in Fabric Environment.

Git Integration

Deployment in Fabric:

Fabric's deployment pipelines tool provides us with a production environment where we can collaborate to manage the lifecycle of organizational content. Deployment pipelines enable us to develop and test content in the service before it reaches end users. As of now, the supported content types include reports, paginated reports, dashboards, datasets, and dataflows.

Please find below a snapshot giving a brief idea of deployment pipelines.

Deployment Pipelines

11. Summarized Benefits of Fabric

  • Cost savings: Since all the tech services are packaged into the same workspace(s) and share the same utilization-based cost, significant cost savings are realized.
  • Integrated way of working: Multiple parts of the tech stack are integrated in one workspace, so we don't have to manually integrate and connect them as in the current/traditional flow.
  • Data mesh made easy: We can share the same data/workspaces (used by individual projects) and establish a data mesh architecture framework. The bronze, silver, and gold layer data can also be split into different lakehouses, with access granted to the respective groups accordingly.
  • Unified data governance: We can manage table-level access at the tenant level, which comprises all the workspaces within it.
  • Time savings: In a regular architecture where we integrate multiple components (for example, calling a Databricks notebook or Synapse component from an ADF pipeline), a notebook executed for the first time after a certain interval takes a while to become active. In Microsoft Fabric, all the features are integrated and starter pools allow activation within a few seconds even on the first run, which greatly reduces the overall execution time of the flow.
  • Data Wrangler: Programming knowledge is not necessary to create and use Spark notebooks. Given a source pandas DataFrame, one can use the Data Wrangler feature to drag and drop transformations and convert everything to Python code with a click.
  • Shortcuts: We can directly query the data in OneLake, ADLS Gen2, AWS S3, GCP storage, and Dataverse (more storage services may be added in future releases) without having to physically move it.


13. Architecture and Design of the POC

POC Design

*Dummy data was used for the POC.

We created a lakehouse and uploaded a structured file to it, and also created a OneLake shortcut from an ADLS path to the lakehouse. The ADLS path contains a CSV file.

A PySpark notebook is used to cleanse the data and store the result as delta files. These are then used to prepare the respective dimension and fact tables in the lakehouse.

A Dataflow Gen2 prepares the transformations needed to make the data ready for Power BI to consume. The transformation results are saved as tables in the warehouse, views pointing to those tables are created in the warehouse's custom schema, and a star schema model is prepared in the warehouse from the final views.

The resulting model is consumed by a Power BI report using Direct Lake.


14. Step-by-Step Process of the POC

Create a Fabric resource in the resource group:

Now, search for "Microsoft Fabric".

Give the Fabric capacity a name, region, and size as per our needs and click "Review and Create".

Log in to the Fabric portal:

https://app.fabric.microsoft.com/

Workspace creation: We can use an existing workspace if we have one, or create a new workspace as below:

Add admin to our workspace:

Create new lakehouse:

Uploading files in Lakehouse:

File available in Onelake – Lakehouse:

Create shortcut with ADLS GEN2 as source:

Configure shortcut:

The output can be seen below with shortcut symbol

Create a Notebook:

Create the notebook and populate the tables as below:

Create new Pipeline:

Create new Dataflow:

Sample Model in Data Warehouse Section:

Create New Report in Power BI Section:

Sample Report:


*Microsoft official documentation was consulted while developing this article.
