Deep dive into Microsoft Fabric
Mateusz Sawicki
I'm not another AI expert but I'm pretty good at data engineering
Introduction
During the Microsoft Build 2023 conference, Fabric was introduced. It is a new end-to-end analytics platform offered by the giant of Redmond. The service offers the most comprehensive analytical capabilities – from ETL, through data modelling and visualization, up to machine learning models and AI. After the launch of this product, social media were on fire. Thousands of data professionals, including Microsoft MVPs, started to publish content about this hot new topic. Even I wrote some posts about it, one of them being OneMeme. For anyone who has not seen it – OneMeme is a meme explaining the concept of Microsoft Fabric.
It is an obvious reference to OneLake, which is the foundation of Fabric. When I published my OneMeme on LinkedIn, I also promised that I would write an article about Microsoft Fabric. In this publication, I am going to explain what Fabric is, what its capabilities and advantages are, and why you should learn this tool as soon as possible. All thoughts and insights come from the perspective of a business intelligence professional, so I focus mostly on ETL, data modelling and visualization, simply because those are the areas where I have relevant experience.
Why the new tool?
It is a question that many asked when Microsoft released Fabric. My initial feeling was doubt. I couldn't understand why they were introducing a new data analytics platform when Power BI, Azure Synapse, and all the other Azure data services were already available. After overcoming my initial hesitation, I delved into the realm of Fabric and began to understand that it made sense and could bring value to businesses. The most crucial concept behind Fabric is to unify Microsoft's analytical services in one place. Anyone who has worked with the Azure data platform and Power BI knows that building an analytical solution can be architecturally challenging.
Designing an analytical solution on Azure can be truly overwhelming. Developers and architects have to determine which services will be necessary, how to integrate them, and consider the cost implications. These challenges are precisely what Fabric addresses. Firstly, it provides a unified platform where all the necessary building blocks are readily available. There's no need to spend time figuring out which services are needed. Secondly, the elements are automatically connected, eliminating the need for developers to create identities, users, and assign permissions, among other tasks. Anyone who has attempted this knows how frustrating it can be, particularly in larger enterprises where decision-making can be an ongoing and arduous process. Thirdly, OneLake enables users to store their data in one place, reducing data replication and the duplication of similar tasks. Fourthly, data governance is much more effective when all users use one service. Additionally, Fabric offers some excellent governance capabilities. Fifthly, Fabric pricing is straightforward – users only have to pay for capacity and Power BI licenses (the price list is available on Microsoft's website), which are easy to estimate, along with storage, which is negligible for most companies. In the next few paragraphs, I will elaborate on each of the aforementioned advantages of Microsoft Fabric.
All-in-one tool
Microsoft Fabric is a complete analytics solution tailored for businesses. It offers a broad spectrum of services, such as Data Factory, Data Science, Real-Time Analytics, and business intelligence, all under one roof. The goal of this platform is to streamline analytics needs by offering a unified solution, thereby avoiding the need to combine separate services from multiple vendors. Fabric is built on the backbone of Software as a Service (SaaS), which makes integration straightforward. Microsoft Fabric combines elements from Power BI, Azure Synapse, and Azure Data Explorer into a single cohesive environment, offering a full suite of analytics experiences designed to work in tandem.
Data Engineering allows developers to build what's known as a data lakehouse. It's founded on Apache Spark, and the native data format in Fabric is Delta (you can learn more about the Delta format in this article by Nikola Ilić), which aligns perfectly with data lakehouse concepts. There are notebooks available where you can write code to interact with Spark in Python, SQL, R, or Scala, and these can be scheduled with ease.
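What makes a Delta table more than a pile of parquet files is its transaction log: a `_delta_log` folder of JSON entries recording which files were added or removed in each commit. As a rough conceptual sketch (not the real Delta implementation, and the entries below are simplified, hypothetical examples), replaying the log tells you which files currently make up the table:

```python
import json

# Simplified, hypothetical entries mimicking a Delta table's _delta_log.
# Real log files contain one JSON action per line (commitInfo, add, remove, ...).
log_text = "\n".join([
    json.dumps({"commitInfo": {"operation": "WRITE", "timestamp": 1690000000000}}),
    json.dumps({"add": {"path": "part-0000.parquet", "size": 1024, "dataChange": True}}),
    json.dumps({"remove": {"path": "part-old.parquet", "dataChange": True}}),
])

def active_files(log_text):
    """Replay the log: files that were added and not later removed form the table."""
    files = set()
    for line in log_text.splitlines():
        action = json.loads(line)
        if "add" in action:
            files.add(action["add"]["path"])
        elif "remove" in action:
            files.discard(action["remove"]["path"])
    return files

print(active_files(log_text))  # {'part-0000.parquet'}
```

This append-only log is what gives Delta tables ACID transactions and time travel on top of plain parquet storage.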
Data Factory is a tool recognized by Azure data engineers for handling ETL/ELT workloads. It enables users to build data integration pipelines with ease.
Data Science helps build, deploy, and manage machine learning models. It links with Azure Machine Learning to offer built-in experiment tracking and a model registry.
Data Warehouse corresponds to the Azure Synapse Dedicated SQL Pool, a contemporary cloud data warehouse utilizing an MPP architecture. It's a little funny, because the data in the warehouse "tables" is stored in the Delta file format, which is a characteristic feature of a lakehouse. Nevertheless, this tool lets developers construct a data warehouse that behaves like a relational database, where data is stored neatly in tables.
Real-Time Analytics is an engine designed for observational data analytics. It can handle data sourced from various platforms such as apps, IoT devices, and human interactions.
Power BI is the foundation for Fabric's interface. It is well-known for its data modeling and visualization capabilities.
Services connected under the hood
In Microsoft Fabric, all services are interconnected by default. There's no need to struggle with assigning sufficient permissions between services as is the case with Azure. Developers are spared from setting up managed identities, VNets, and other boring administrative tasks. When you utilize Fabric, you can set aside infrastructure concepts like resource groups, RBAC (Role-Based Access Control), Azure Resource Manager, redundancy, or regions. That being said, it's still necessary nowadays to be familiar with these concepts. Not all companies will transition to Fabric immediately, and proficient developers should understand how to handle Azure workloads.
One data storage
Microsoft Fabric is built upon the foundational layer of OneLake, which serves as the repository for data. OneLake, constructed atop Azure Data Lake Storage Gen2, eliminates the necessity of creating this resource in Azure – in fact, you don't even need an Azure account to operate within Fabric. The Fabric license encompasses OneLake storage for all data used in a given tenant. This approach aims to eliminate data silos, a ubiquitous problem in many organizations. However, the real game-changers introduced by OneLake are shortcuts. In Microsoft OneLake, shortcuts empower developers to consolidate data across domains, clouds, and accounts, thus creating a single virtualized data lake for the entire enterprise. All Fabric tools can connect directly to existing data sources such as ADLS, S3, and of course OneLake via a unified namespace. OneLake manages permissions and credentials, so each Fabric experience doesn't require separate configuration for each data source. Shortcuts are objects in OneLake that link to other storage locations, which can be internal or external to OneLake. They behave like symbolic links and operate independently of their target: if a shortcut is deleted, the target remains unaffected; conversely, if the target path is altered or deleted, the shortcut may break.
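To make the symbolic-link behaviour concrete, here is a toy model of shortcut resolution. This is my own conceptual sketch, not the Fabric API; the paths and names are hypothetical. A shortcut is just a named pointer to a target, so reading through it reaches the external data without copying it, and deleting the shortcut leaves the target untouched:

```python
from dataclasses import dataclass

# Conceptual model of a OneLake shortcut: a named pointer to a storage target.
@dataclass
class Shortcut:
    name: str
    target: str  # e.g. an ADLS, S3, or another OneLake path

# A lakehouse namespace mixing native data and an external source via a shortcut.
lake = {
    "Sales/raw": "onelake://tenant/sales/raw",                      # native OneLake data
    "Sales/clicks": Shortcut("clicks", "s3://bucket/clickstream"),  # external via shortcut
}

def resolve(path):
    """Follow a shortcut to its target; native paths resolve to themselves."""
    entry = lake[path]
    return entry.target if isinstance(entry, Shortcut) else entry

print(resolve("Sales/clicks"))  # s3://bucket/clickstream

# Deleting the shortcut removes only the pointer; the S3 data is unaffected.
del lake["Sales/clicks"]
```

The key design point this illustrates: every Fabric engine sees one namespace, and the indirection to the real storage location is handled centrally.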
OneLake's capabilities don't end with shortcuts. Another standout feature is Direct Lake. Direct Lake allows parquet-formatted files to be loaded directly from the data lake, bypassing the need to query a lakehouse endpoint or to import or duplicate data into a Power BI dataset. It offers a fast path to load data from the lake straight into Power BI's VertiPaq engine for analysis. Unlike DirectQuery, it doesn't translate queries into other query languages or execute them on other database systems, thus delivering performance akin to import mode. As there's no explicit import process, any changes at the data source are reflected immediately, effectively merging the benefits of both DirectQuery and import modes while avoiding their drawbacks. Direct Lake can be an optimal choice for analyzing sizable datasets and datasets with frequent updates at the source. The feature is supported by V-Order, a write-time optimization for the parquet file format that enables fast reads by Microsoft Fabric compute engines such as Power BI, SQL, Spark, and others. V-Order applies special sorting, row group distribution, dictionary encoding, and compression to parquet files, reducing the network, disk, and CPU resources compute engines need to read them, resulting in cost-efficiency and improved performance. While V-Order sorting can increase write times by 15% on average, it offers up to 50% better compression.
Data governance with domains
Power BI already boasts an impressive array of data governance capabilities such as data sensitivity labels, workspace permissions, row-level security, and Purview, among others. The Purview hub is also an integral element of Fabric, offering centralized administration and governance across all experiences, with permissions automatically applied across all the underlying services. Data sensitivity labels are likewise automatically inherited across the suite's items. While these features are invaluable, a new element known as 'domains' has recently emerged. A domain serves as a logical grouping mechanism for all data relevant to a specific area or field within an organization. To organize data into domains, workspaces are associated with specific domains. Once a workspace is connected with a domain, all items within that workspace also become associated with that domain, receiving a domain attribute as part of their metadata. Domains offer a higher level of workspace grouping that can be used, for instance, to create development, test, and production environments or to partition data among different organizational units within an enterprise. This feature enables organizations to manage their data in accordance with their unique regulations, restrictions, and requirements.
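The inheritance rule above is simple enough to sketch in a few lines. This is a toy model of my own (not Fabric's API, and the workspace and item names are made up): an item never carries a domain directly; it picks the attribute up from whichever workspace it lives in.

```python
# Toy model of domain inheritance: assigning a workspace to a domain
# effectively stamps every item in that workspace with the same domain.
workspaces = {
    "FinanceDev": {"domain": "Finance", "items": ["lakehouse_fin", "report_pnl"]},
    "SalesProd":  {"domain": "Sales",   "items": ["warehouse_sales"]},
}

def item_domain(item_name):
    """An item's domain attribute is inherited from its containing workspace."""
    for ws in workspaces.values():
        if item_name in ws["items"]:
            return ws["domain"]
    return None  # item not found in any workspace

print(item_domain("report_pnl"))  # Finance
```

Because the attribute lives at the workspace level, moving a workspace to a different domain re-labels all of its items at once, which is what makes domains practical for partitioning data across organizational units.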
Simplified pricing
Anyone who's had the opportunity to develop an analytical solution on Azure knows that navigating the pricing structure for various services can be quite complex. In fact, I can assert without hyperbole that crafting cost-effective solutions on Azure is somewhat of an art form. If someone can accurately predict the cost of an analytical solution on Azure, they're likely a veritable maestro of the Azure Data Platform. Microsoft Fabric significantly simplifies this pricing puzzle. Its licensing model aligns with that of Power BI Premium or Embedded capacities. Costs are incurred for capacity and storage in OneLake. Capacities are available for purchase via the Azure portal, offering flexibility to enterprises through pay-as-you-go hourly or monthly options. This is particularly convenient for users who do not require round-the-clock access. In the forthcoming months, Microsoft plans to augment the capabilities of Fabric capacities, including the introduction of Azure Reservations. Similar to features found in services like Synapse, reservations can lead to cost reductions, making it worth considering booking capacity for extended periods. As for OneLake storage pricing, it is comparable to Azure ADLS (Azure Data Lake Storage), where users pay per GB, which is often negligible. This model is easy to estimate and can save the accounting department from dealing with unexpected, exorbitant bills.
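The "easy to estimate" claim can be shown with a back-of-the-envelope calculation: capacity hours plus storage, and nothing else. The rates below are hypothetical placeholders I picked for illustration; the real numbers are on Microsoft's published price list and vary by SKU and region.

```python
# Back-of-the-envelope Fabric-style cost estimate: capacity hours + OneLake storage.
# Both rates are ASSUMED placeholder values, not real Microsoft prices.
CAPACITY_RATE_PER_HOUR = 4.0       # hypothetical pay-as-you-go rate for one capacity SKU
STORAGE_RATE_PER_GB_MONTH = 0.025  # hypothetical per-GB OneLake storage rate

def monthly_estimate(capacity_hours, storage_gb):
    """Capacity is billed per hour of use; storage per GB-month."""
    return capacity_hours * CAPACITY_RATE_PER_HOUR + storage_gb * STORAGE_RATE_PER_GB_MONTH

# A capacity paused outside business hours (8h x 22 workdays) with 500 GB stored:
print(round(monthly_estimate(8 * 22, 500), 2))  # 716.5
```

Two input numbers and two rates: that is the whole pricing model, which is exactly the contrast with estimating a multi-service Azure architecture.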
However, one aspect of Fabric's pricing I find less appealing is the division of paid capacity. When you purchase capacity, you receive a set number of vCores, divided equally between front-end and back-end tasks. In situations of high-demand workloads, such as data warehousing or ML tasks, you may need additional back-end capacity without impacting front-end operations. This can lead to higher costs than equivalent solutions built with other services.
Some reasonable conclusions
This article may come across as a one-sided celebration of a shiny new tool, and in all honesty, it is. I'm a huge fan of Microsoft Fabric, an all-in-one analytical solution for enterprises. I've explored numerous features and conveniences that Fabric offers in comparison to Azure services. Don't get me wrong; I hold a deep admiration for Azure, particularly for Synapse and ADF. They're modern, comprehensive solutions that will be used for a long time, given the significant investments enterprises have made in them to become data-driven organizations.

We have to remember, however, that Fabric is built on top of Azure services and simplifies their use. While I criticized the complicated pricing structure of Azure in the previous section, I want to clarify that many projects might be more cost-effective to implement in the traditional Azure way rather than with Fabric. For instance, building a lakehouse with Synapse Serverless or Databricks is likely to be more affordable than implementing the same solution with Fabric, and you don't need to be an Azure pricing magician to realize this. There are likely numerous other solutions that could be built without Fabric to meet an enterprise's data requirements.

My motivation in writing this article was to highlight the strengths of this new tool, consolidate my knowledge, and encourage readers to familiarize themselves with Fabric.