Getting to Know Microsoft Fabric: An Introduction
Vidushraj Chandrasekaran
Data Engineer @ Axiata Digital Labs | GCP Certified Data Engineer | MS Certified Data Engineer | 5x Azure | Data Engineering | BSc (Hons) in EEE | AMIE(SL) | AEng(ECSL)
Microsoft Fabric is a comprehensive analytics platform offering a unified environment where data professionals and business users can collaborate seamlessly on data initiatives. Fabric comprises a suite of integrated services for data ingestion, storage, processing, and analysis within a single ecosystem.
Fabric includes the following services: Data Engineering, Data Factory, Data Science, Data Warehouse, Real-Time Analytics, and Power BI.
Fabric is a cohesive software-as-a-service (SaaS) solution, consolidating all data within a unified format in OneLake. Accessible to all analytics engines on the platform, OneLake serves as the centralized repository for data.
OneLake
Explore Fabric's experiences
Fabric administration is centralized in the admin center.
Lakehouses merge the flexible storage of a data lake with the analytics capabilities of a data warehouse. The lakehouse is the foundation of Microsoft Fabric: it is built on top of OneLake's scalable storage and uses Apache Spark and SQL as compute engines for big data processing.
Lakehouse = Flexible & Scalable storage of a data lake + Ability to query & analyze data of a warehouse.
Lakehouse = Data Lake + Data Warehouse
Data Lake: Stores raw data in its native format, at scale, using a flexible schema-on-read approach.
Data Warehouse: Stores structured, modeled data behind a fixed schema, optimized for SQL queries and reporting.
Shortcuts enable you to integrate data into your lakehouse while keeping it stored in external storage.
Ways to ingest data into a lakehouse include uploading local files, Dataflows Gen2, data pipelines (such as the Copy Data activity), and notebooks.
Ways to transform the data include Apache Spark code in notebooks and Dataflows Gen2.
Microsoft Fabric offers Spark cluster support, facilitating the analysis and processing of data at scale within a Lakehouse environment.
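For example, here is a minimal sketch of loading a file into a Spark DataFrame from a Fabric notebook (the path and options are assumptions for illustration; in Fabric notebooks the spark session is pre-created, and Files/ resolves to the attached lakehouse's file area):

```python
# Load a hypothetical CSV file from the lakehouse into a DataFrame.
df = (spark.read.format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("Files/sales/orders.csv"))  # hypothetical path

df.printSchema()
df.show(10)
```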
Below are the main settings to consider when configuring Apache Spark: the node family, the runtime version, and Spark properties.
The Spark catalog is a metastore for relational data objects such as views and tables. The Spark runtime can use the catalog to seamlessly integrate code written in any Spark-supported language with SQL expressions that may be more natural to some data analysts or developers.
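As a sketch of how this integration works, a DataFrame can be registered in the catalog as a temporary view and then queried with SQL (table and column names here are hypothetical):

```python
# Register a DataFrame as a session-scoped view in the Spark catalog.
df.createOrReplaceTempView("orders")

# SQL and DataFrame code now operate on the same catalog object.
totals = spark.sql("""
    SELECT CustomerID, SUM(Amount) AS TotalSpend
    FROM orders
    GROUP BY CustomerID
""")
totals.show()
```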
The preferred format in Microsoft Fabric is Delta, the storage format of Delta Lake, a relational data technology on Spark.
Tables within a Microsoft Fabric lakehouse utilize the Delta Lake storage format, a prevalent choice in Apache Spark environments.
Delta Lake is an open-source storage layer that adds relational database semantics to Spark-based data lake processing.
Benefits of Delta tables:
Relational CRUD semantics (insert, update, and delete operations)
ACID transactions
Data versioning and time travel
Support for both batch and streaming data
Standard Parquet-based format and interoperability
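For instance, data versioning lets you read an earlier state of a table. Here is a minimal sketch, assuming a Delta table named products already exists under the lakehouse's Tables area:

```python
# Read the current state of the table.
df = spark.read.format("delta").load("Tables/products")

# Time travel: read the same table as it was at version 0.
df_v0 = (spark.read.format("delta")
         .option("versionAsOf", 0)
         .load("Tables/products"))
```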
Managed vs External tables
Managed Table: The table definition in the metastore and the underlying data files are both managed by the Spark runtime for the Fabric Lakehouse. Deleting the table will also delete the underlying files from the Tables storage location for the lakehouse.
External Table: The table definition in the metastore is mapped to an alternative file storage location. Deleting an external table from the lakehouse metastore does not delete the associated data files.
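A short sketch contrasting the two (the DataFrame, table names, and path are illustrative):

```python
# Managed table: Spark controls both the metadata and the data files,
# which live under the lakehouse's Tables storage.
df.write.format("delta").saveAsTable("sales_managed")

# External table: the metadata points to a path you manage; dropping
# the table leaves these files in place.
(df.write.format("delta")
   .option("path", "Files/external/sales")
   .saveAsTable("sales_external"))
```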
Core pipeline concepts include activities (the executable tasks in a pipeline, such as the Copy Data activity), parameters (which make pipelines reusable), and pipeline runs (the individual executions of a pipeline).
Dataflows Gen2 is used to ingest and transform data from multiple sources and then land the cleaned data in another destination. Dataflows are a type of cloud-based ETL tool for building and executing scalable data transformation processes, and Dataflows Gen2 provides an excellent option for data transformations in Microsoft Fabric.
Fabric notebooks are the best choice if you are comfortable writing code, need to perform complex transformations, or are working with large volumes of data.
Consider the following optimizations for even more performant data ingestion: V-Order (an optimized Parquet write layout for faster reads) and optimize write (which reduces the number of small files produced by writes), as sketched below.
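Here is a minimal sketch of enabling both from a notebook; the property names follow Microsoft's Fabric documentation at the time of writing, so verify them against your Spark runtime version:

```python
# V-Order: an optimized Parquet write layout for faster reads.
spark.conf.set("spark.sql.parquet.vorder.enabled", "true")

# Optimize write: consolidates small files produced during writes.
spark.conf.set("spark.microsoft.delta.optimizeWrite.enabled", "true")
```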
Fabric's Data Wrangler then lets data practitioners explore the data and generate transformation code for their specific needs. The Fabric lakehouse, blending data lakes and data warehouses, offers an ideal platform to manage and analyze this data. Lakehouses in Fabric are built on the Delta Lake format, which natively supports ACID transactions.
Data transformation involves altering the structure or content of data to meet specific requirements. Tools for data transformation in Fabric include Dataflows (Gen2) and notebooks. Dataflows are a great option for smaller semantic models and simple transformations. Notebooks are a better option for larger semantic models and more complex transformations. Notebooks also allow you to save your transformed data as a managed Delta table in the lakehouse, ready for reporting.
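As a sketch, with hypothetical table and column names, a notebook transformation saved as a managed Delta table might look like this:

```python
from pyspark.sql import functions as F

# Aggregate raw orders into a reporting-ready daily sales table.
orders = spark.read.table("orders")

daily_sales = (orders
               .groupBy("OrderDate")
               .agg(F.sum("Amount").alias("TotalSales")))

# Save as a managed Delta table in the lakehouse, ready for reporting.
daily_sales.write.format("delta").mode("overwrite").saveAsTable("daily_sales")
```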
Data orchestration refers to the coordination and management of multiple data-related processes, ensuring they work together to achieve a desired outcome. The primary tool for data orchestration in Fabric is pipelines. A pipeline is a series of steps that move data from one place to another, in this case, from one layer of the medallion architecture to the next. Pipelines can be automated to run on a schedule or triggered by an event.
Secure your lakehouse by ensuring that only authorized users can access data. In Fabric, you can do this by setting permissions at the workspace or item level.
Workspace permissions control access to all items within a workspace.
Item-level permissions control access to specific items within a workspace, and can be used when you're collaborating with colleagues who aren't in the same workspace, or when they only need access to a single, specific item.
Data warehouses are analytical stores built on a relational schema to support SQL queries.
The process of building a modern data warehouse typically consists of: data ingestion, data storage, data processing, and data analysis and delivery.
Design a data warehouse
Fact tables: Contain the numerical data that you want to analyze.
Dimension tables: Contain descriptive information about the data in the fact tables.
It's common for a dimension table to include two key columns: a surrogate key (a unique identifier, often an integer, generated within the warehouse) and an alternate key (the natural or business key from the source system).
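A sketch of a dimension table definition showing both key columns (all names are illustrative):

```python
# Create a customer dimension with a surrogate key and an alternate key.
spark.sql("""
    CREATE TABLE IF NOT EXISTS dim_customer (
        CustomerKey    BIGINT,  -- surrogate key, generated in the warehouse
        CustomerAltKey STRING,  -- alternate (business) key from the source
        CustomerName   STRING,
        City           STRING
    ) USING DELTA
""")
```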
Special types of dimension tables include time dimensions, which provide calendar attributes for aggregating data by time periods, and slowly changing dimensions, which track changes to dimension attributes over time.
Normalization reduces duplication. In a data warehouse, however, dimension data is generally de-normalized to reduce the number of joins required to query the data.
Fabric's Lakehouse is a collection of files, folders, tables, and shortcuts that acts like a database over a data lake.
There are a few ways to ingest data into a Fabric data warehouse, including pipelines, dataflows, cross-database querying, and the COPY INTO command.
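As one hedged sketch, COPY INTO can be issued against the warehouse's SQL endpoint from Python, for example with pyodbc; the server, database, storage URL, and table below are all placeholders, and the exact COPY INTO options depend on your file type and authentication method:

```python
import pyodbc

# Connect to the Fabric warehouse's SQL endpoint (placeholder values).
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<your-warehouse-sql-endpoint>;"
    "Database=<your-warehouse>;"
    "Authentication=ActiveDirectoryInteractive"
)

# Bulk-load Parquet files from external storage into a warehouse table.
conn.execute("""
    COPY INTO dbo.Sales
    FROM 'https://<account>.blob.core.windows.net/data/sales/*.parquet'
    WITH (FILE_TYPE = 'PARQUET')
""")
conn.commit()
```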
Security
Warehouse security can be managed with workspace roles and item permissions, along with granular SQL controls such as column-level security, row-level security, and dynamic data masking.
The data warehouse in Microsoft Fabric is powered by Synapse Analytics and offers a rich set of features that make it easier to manage and analyze data.
Data ingestion/extract is about moving raw data from various sources into a central repository.
All Fabric data items like data warehouses and lakehouses store their data automatically in OneLake in Delta Parquet format.
Stage your data
You may have to build and work with auxiliary objects involved in a load operation such as tables, stored procedures, and functions. These auxiliary objects are commonly referred to as staging. Staging objects act as temporary storage and transformation areas. They can share resources with a data warehouse, or live in their own storage area.
Staging serves as an abstraction layer, simplifying and facilitating the load operation to the final tables in the data warehouse.
Types of data loads: a full (initial) load, which populates the warehouse for the first time, and an incremental load, which applies only the changes since the last load.
There are several types of slowly changing dimensions in a data warehouse, with type 1 and type 2 being the most frequently used: a type 1 change overwrites the existing attribute values, keeping no history, while a type 2 change adds a new row for each change, preserving history.
The mechanism for detecting changes in source systems is crucial for determining when records are inserted, updated, or deleted. Change Data Capture (CDC), change tracking, and triggers are all features available for managing data tracking in source systems such as SQL Server.
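Here is a minimal sketch of a type 1 load expressed as a Delta merge in a notebook, assuming a dimension table dim_customer and an updates DataFrame updates_df (both hypothetical); a type 2 load would instead insert a new row for each change and close out the old one:

```python
from delta.tables import DeltaTable

# Target dimension table, matched on the alternate (business) key.
target = DeltaTable.forName(spark, "dim_customer")

(target.alias("t")
    .merge(updates_df.alias("s"), "t.CustomerAltKey = s.CustomerAltKey")
    .whenMatchedUpdateAll()      # type 1: new values overwrite, no history
    .whenNotMatchedInsertAll()   # brand-new members are inserted
    .execute())
```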
When it comes to loading data in a data warehouse, there are several considerations to keep in mind.
Power BI Report Optimization Techniques
The Deployment Pipeline tool enables users to manage the development lifecycle of content within their tenant. The feature is available within the Power BI Service with a Premium Capacity license.
Microsoft Fabric is a SaaS solution for end-to-end data analytics.
Understand Fabric concepts: tenant, capacity, domain, workspace, and item
A Fabric tenant is a dedicated space for organizations to create, store, and manage Fabric items. There's often a single instance of Fabric for an organization, and it's aligned with Microsoft Entra ID. The Fabric tenant maps to the root of OneLake and is at the top level of the hierarchy.
Capacity is a dedicated set of resources that is available for use at a given time. A tenant can have one or more capacities associated with it. Capacity defines the ability of a resource to perform an activity or to produce output, and different items consume different amounts of capacity at a given time. Fabric offers capacity through Fabric SKUs and trials.
A domain is a logical grouping of workspaces. Domains are used to organize items in a way that makes sense for your organization. You can group things together in a way that makes it easier for the right people to have access to the right workspaces. For example, you might have a domain for sales, another for marketing, and another for finance.
A workspace is a collection of items that bring together different functionality in a single tenant. It acts as a container that leverages capacity for the work that is executed and provides controls for who can access the items in it. For example, in a sales workspace, users associated with the sales organization can create a data warehouse, run notebooks, create semantic models, create reports, etc.
Fabric items are the building blocks of the Fabric platform. They're the objects that you create and manage in Fabric. There are different types of items, such as data warehouses, data pipelines, semantic models, reports, and dashboards.
Describe admin tasks: admins manage the Fabric tenant and its capacities, control user access and security, and monitor and govern how the platform is used.
Fabric has a few built-in governance features to help you manage and control your data. Endorsement is a way for you as an admin to designate specific Fabric items as trusted and approved for use across the organization.
Admins can also make use of the scanner API to scan Fabric items for sensitive data, and the data lineage feature to track the flow of data through Fabric.
Resource: