Getting to Know Microsoft Fabric: An Introduction

Microsoft Fabric is a comprehensive analytics platform that offers a unified environment where data professionals and business users can collaborate on data projects. Fabric comprises a suite of integrated services for data ingestion, storage, processing, and analysis within a single ecosystem.

Fabric includes the following services:

  1. Data engineering
  2. Data integration
  3. Data warehousing
  4. Real-time analytics
  5. Data science
  6. Business intelligence

Fabric is a cohesive software-as-a-service (SaaS) solution, consolidating all data within a unified format in OneLake. Accessible to all analytics engines on the platform, OneLake serves as the centralized repository for data.

OneLake

  • OneLake embodies Fabric's lake-centric architecture: a single, unified, logical data lake for the whole organization, so teams work from one copy of the data instead of moving it between systems.
  • Fabric's data warehousing, data engineering (Lakehouses and Notebooks), data integration (pipelines and dataflows), real-time analytics, and Power BI all use OneLake as their native store without needing any extra configuration.
  • OneLake leverages Azure Data Lake Storage (ADLS) as its foundation, accommodating data storage in various formats such as Delta, Parquet, CSV, JSON, and others.

Explore Fabric's experiences

  1. Synapse Data Engineering
  2. Synapse Data Warehouse
  3. Synapse Data Science
  4. Synapse Real-Time Analytics
  5. Data Factory
  6. Power BI

Fabric administration is centralized in the admin center.

Lakehouses merge the storage flexibility of a data lake with the analytics capabilities of a data warehouse. The lakehouse is a foundational item in Microsoft Fabric: it is built on top of OneLake's scalable storage and uses Apache Spark and SQL compute engines for big data processing.

Lakehouse = Flexible & Scalable storage of a data lake + Ability to query & analyze data of a warehouse.

Lakehouse = Data Lake + Data Warehouse

Data Lake:

  • Scalable, distributed file storage.
  • Flexible schema-on-read semantics
  • Big data technology compatibility

Data Warehouse:

  • Relational schema modeling
  • SQL-based querying
  • Proven basis for reporting and analytics

Shortcuts enable you to integrate data into your lakehouse while keeping it stored in external storage.

Ways to ingest data into a lakehouse

  1. Upload: Manual process
  2. Dataflows (Gen2): Using Power Query Online
  3. Notebooks (see the PySpark sketch after this list)
  4. Data Factory pipelines
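
For option 3 (notebooks), a minimal PySpark sketch of ingestion might look like the following. The file path and table name are hypothetical, and spark is the session a Fabric notebook provides automatically when attached to a lakehouse.

    # Read a raw CSV file that was previously uploaded to the lakehouse Files area
    # (the path Files/raw/sales.csv is an assumption for this sketch).
    df = spark.read.format("csv") \
        .option("header", "true") \
        .option("inferSchema", "true") \
        .load("Files/raw/sales.csv")

    # Writing to a table name (no path) creates a managed Delta table in the lakehouse.
    df.write.format("delta").mode("overwrite").saveAsTable("raw_sales")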

Ways to transform the data

  1. Apache Spark (see the sketch after this list)
  2. SQL analytic endpoint
  3. Dataflows (Gen2)
  4. Data pipelines
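
As a sketch of option 1 (Apache Spark), the following notebook cell cleans and aggregates the hypothetical raw_sales table from the ingestion example above and writes the result as a new Delta table; the column names are assumptions.

    from pyspark.sql import functions as F

    sales = spark.read.table("raw_sales")

    daily_totals = (
        sales
        .where(F.col("Quantity") > 0)                      # drop obviously invalid rows
        .withColumn("OrderDate", F.to_date("OrderDate"))   # normalize the date column
        .groupBy("OrderDate", "ProductID")
        .agg(F.sum("SalesAmount").alias("TotalSales"))
    )

    # Persist the curated result as another Delta table, ready for downstream use.
    daily_totals.write.format("delta").mode("overwrite").saveAsTable("daily_sales")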

Microsoft Fabric offers Spark cluster support, facilitating the analysis and processing of data at scale within a Lakehouse environment.

The following are the main settings to consider when configuring Apache Spark:

  • Node family: The type of virtual machine used for the Spark cluster nodes (memory-optimized nodes generally give the best performance)
  • Runtime version: The version of Spark (and its dependent subcomponents) that runs on the pool
  • Spark properties: Optional configuration settings that fine-tune the behavior of the Spark session

The Spark catalog is a metastore for relational data objects such as views and tables. The Spark runtime can use the catalog to seamlessly integrate code written in any Spark-supported language with SQL expressions that may be more natural to some data analysts or developers.
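
For example, a DataFrame can be registered as a view in the catalog and then queried with SQL in the same session; the table and column names below are hypothetical.

    # Register a DataFrame as a temporary view so SQL can see it through the catalog.
    products = spark.read.table("products")
    products.createOrReplaceTempView("v_products")

    # Query the same data with SQL; the result comes back as an ordinary DataFrame.
    top_categories = spark.sql("""
        SELECT Category, COUNT(*) AS ProductCount
        FROM v_products
        GROUP BY Category
        ORDER BY ProductCount DESC
    """)
    top_categories.show()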

The preferred table format in Microsoft Fabric is Delta, the storage format used by Delta Lake, a relational data technology built on Spark.

Tables within a Microsoft Fabric lakehouse utilize the Delta Lake storage format, a prevalent choice in Apache Spark environments.

Delta Lake is an open-source storage layer that adds relational database semantics to Spark-based data lake processing.

Benefits of Delta Tables:

  1. Relational tables that support querying and data modification.
  2. Support for ACID transactions.
  3. Data versioning and time travel (see the sketch after this list).
  4. Support for batch and streaming data.
  5. Standard formats and interoperability
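
As an illustration of benefits 2 and 3, the sketch below inspects a table's history and reads an earlier version from a notebook. The table name is carried over from the earlier hypothetical example, the Tables/ path assumes the default location of a managed lakehouse table, and version 0 must actually exist in the table's history.

    from delta.tables import DeltaTable

    # Every committed write adds a new version to the Delta transaction log.
    DeltaTable.forName(spark, "daily_sales").history() \
        .select("version", "timestamp", "operation").show()

    # Time travel: read the table as it was at an earlier version.
    v0 = spark.read.format("delta") \
        .option("versionAsOf", 0) \
        .load("Tables/daily_sales")
    v0.show()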

Managed vs External tables

Managed Table: The table definition in the metastore and the underlying data files are both managed by the Spark runtime for the Fabric Lakehouse. Deleting the table will also delete the underlying files from the Tables storage location for the lakehouse.

External Table: The table definition in the metastore is mapped to an alternative file storage location. Deleting an external table from the lakehouse metastore does not delete the associated data files.
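
A minimal sketch of the difference from a notebook, assuming the hypothetical raw_sales table from the ingestion example exists; the external path is written relative to the attached lakehouse's Files area as an assumption, and a full ABFS URI to other storage could be used instead.

    df = spark.read.table("raw_sales")

    # Managed table: Spark owns both the metastore entry and the files
    # (stored under the lakehouse's Tables area); dropping it deletes the data.
    df.write.format("delta").saveAsTable("sales_managed")

    # External table: the metastore entry points at a storage location you manage,
    # so dropping the table leaves these files in place.
    df.write.format("delta") \
        .option("path", "Files/external/sales_ext") \
        .saveAsTable("sales_external")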

Core pipeline concepts

  1. Activities: Executable tasks in a pipeline. Data transformation activities include Copy data, Dataflow, Notebook, Stored procedure (to run SQL code), and Delete data activities. Control flow activities implement loops and conditional branching, or manage variable and parameter values.
  2. Parameters: Values supplied to a pipeline when it runs, making a single pipeline reusable across different scenarios.
  3. Pipeline runs: Each execution of a pipeline, started on demand or on a schedule.

Dataflows Gen2 are used to ingest and transform data from multiple sources and then land the cleaned data in another destination. Dataflows are a type of cloud-based ETL tool for building and executing scalable data transformation processes, and Dataflows Gen2 provide an excellent option for data transformations in Microsoft Fabric.

Fabric notebooks are the best choice if you:

  • Handle large external data
  • Need complex transformations

Consider the following optimization features for even more performant data ingestion (a session-level configuration sketch follows the list).

  • V-Order - Enables faster and more efficient reads by various compute engines, such as Power BI, SQL, and Spark. V-order applies special sorting, distribution, encoding, and compression on parquet files at write-time.
  • Optimize write - Improves the performance & reliability by reducing the number of files written & increasing their size. It's useful for scenarios where the Delta tables have suboptimal or nonstandard file sizes, or where the extra latency is tolerable.
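
Both behaviors can be adjusted per Spark session. The sketch below uses the property names documented for Fabric Spark at the time of writing (they may differ across runtime versions), and the table names carry over from the earlier hypothetical examples.

    # Enable V-Order and Delta optimized write for the current session
    # (both are typically enabled by default on Fabric Spark pools).
    spark.conf.set("spark.sql.parquet.vorder.enabled", "true")
    spark.conf.set("spark.microsoft.delta.optimizeWrite.enabled", "true")

    # Subsequent Delta writes in this session pick up these settings.
    spark.read.table("raw_sales") \
        .write.format("delta").mode("overwrite").saveAsTable("sales_optimized")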

Fabric's Data Wrangler lets data professionals explore data and generate transformation code for their specific needs. The Fabric lakehouse, blending data lakes and data warehouses, offers an ideal platform to manage and analyze this data. Lakehouses in Fabric are built on the Delta Lake format, which natively supports ACID transactions.

Medallion Architecture

Data transformation involves altering the structure or content of data to meet specific requirements. Tools for data transformation in Fabric include Dataflows (Gen2) and notebooks. Dataflows are a great option for smaller semantic models and simple transformations. Notebooks are a better option for larger semantic models and more complex transformations. Notebooks also allow you to save your transformed data as a managed Delta table in the lakehouse, ready for reporting.

Data orchestration refers to the coordination and management of multiple data-related processes, ensuring they work together to achieve a desired outcome. The primary tool for data orchestration in Fabric is pipelines. A pipeline is a series of steps that move data from one place to another, in this case, from one layer of the medallion architecture to the next. Pipelines can be automated to run on a schedule or triggered by an event.

Medallion architecture comparison

Secure your lakehouse by ensuring that only authorized users can access data. In Fabric, you can do this by setting permissions at the workspace or item level.

Workspace permissions control access to all items within a workspace.

Item-level permissions control access to specific items within a workspace and can be used when you're collaborating with colleagues who aren't in the same workspace, or who only need access to a single, specific item.

Data warehouses are analytical stores built on a relational schema to support SQL queries.

The process of building a modern data warehouse typically consists of:

  1. Data ingestion
  2. Data storage
  3. Data processing
  4. Data analysis & delivery

Design a data warehouse

Fact tables: Contain the numerical data (measures) that you want to analyze.

Dimension tables: Contain descriptive information about the data in the fact tables.

It's common for a dimension table to include two key columns (see the sketch after this list):

  • A surrogate key is a unique identifier for each row in the dimension table.
  • An alternate key is often a natural or business key that identifies a specific instance of an entity in the transactional source system - such as a product code or a customer ID.
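
For illustration, a product dimension might carry both kinds of key. The sketch below creates such a table with Spark SQL against a lakehouse (in a Fabric warehouse the equivalent DDL would be written in T-SQL), and every table and column name is hypothetical.

    spark.sql("""
        CREATE TABLE IF NOT EXISTS dim_product (
            ProductKey   BIGINT,        -- surrogate key, generated by the load process
            ProductID    STRING,        -- alternate (business) key from the source system
            ProductName  STRING,
            Category     STRING,
            ListPrice    DECIMAL(10,2)
        ) USING DELTA
    """)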

Special types of dimension tables:

  1. Time dimensions
  2. Slowly changing dimensions

By performing normalization, we can reduce duplication. In a data warehouse, however, dimension data is generally denormalized to reduce the number of joins required to query the data.
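
Because the dimensions are denormalized, a typical analytical query needs only one join per dimension between the fact table and each dimension table. A hedged sketch, with hypothetical fact_sales and dim_date tables alongside the dim_product table from the previous example:

    sales_by_category = spark.sql("""
        SELECT d.Category,
               t.Year,
               SUM(f.SalesAmount) AS TotalSales
        FROM fact_sales AS f
        JOIN dim_product AS d ON f.ProductKey = d.ProductKey
        JOIN dim_date    AS t ON f.DateKey    = t.DateKey
        GROUP BY d.Category, t.Year
    """)
    sales_by_category.show()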

Fabric's lakehouse is a collection of files, folders, tables, and shortcuts that acts like a database over a data lake.

There are a few ways to ingest data into a Fabric data warehouse, including pipelines, dataflows, cross-database querying, and the COPY INTO command.

Security

  • Role-based access control (RBAC) to control access to the warehouse and its data.
  • SSL encryption to secure the communication between the warehouse and the client applications.
  • Azure Storage Service Encryption to protect the data at rest.
  • Azure Monitor and Azure Log Analytics to monitor the warehouse activity and audit the access to the data.
  • Multifactor authentication (MFA) to add an extra layer of security to user accounts.
  • Microsoft Entra ID integration to manage the user identities and access to the warehouse.

The data warehouse in Microsoft Fabric is powered by Synapse Analytics and offers a rich set of features that make it easier to manage and analyze data.

Data ingestion/extract is about moving raw data from various sources into a central repository.

All Fabric data items like data warehouses and lakehouses store their data automatically in OneLake in Delta Parquet format.

Stage your data

You may have to build and work with auxiliary objects involved in a load operation such as tables, stored procedures, and functions. These auxiliary objects are commonly referred to as staging. Staging objects act as temporary storage and transformation areas. They can share resources with a data warehouse, or live in their own storage area.

Staging serves as an abstraction layer, simplifying and facilitating the load operation to the final tables in the data warehouse.

Types of Data Loads

  1. Full (initial) load: The process of populating the data warehouse for the first time. All tables are truncated and reloaded, so old data is lost. It may take longer to complete because of the amount of data involved, but it's easier to implement since no history is preserved.
  2. Incremental load: The process of updating the data warehouse with only the changes since the last update. History is preserved and tables are updated with new information. It takes less time than the initial load but is more complex to implement (a notebook sketch follows this list).
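
A hedged sketch of an incremental load in a notebook: only rows modified since the last load are read from a staging table and upserted into the target Delta table. The table, key, and watermark column names are assumptions, and the pattern presumes the target already contains at least one loaded row.

    from delta.tables import DeltaTable
    from pyspark.sql import functions as F

    # Determine how far the target table is already loaded (the watermark).
    last_loaded = spark.read.table("fact_sales") \
        .agg(F.max("ModifiedDate")).collect()[0][0]

    # Pull only rows that changed since the last load from the staging table.
    changes = spark.read.table("stg_sales") \
        .where(F.col("ModifiedDate") > last_loaded)

    # Upsert the changes so the load stays idempotent if it's rerun.
    target = DeltaTable.forName(spark, "fact_sales")
    (target.alias("t")
        .merge(changes.alias("s"), "t.SalesOrderID = s.SalesOrderID")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())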

There are several types of slowly changing dimensions in a data warehouse, with type 1 and type 2 being the most frequently used; a type 2 sketch follows the list below.

  • Type 0 SCD: The dimension attributes never change.
  • Type 1 SCD: Overwrites existing data, doesn't keep history.
  • Type 2 SCD: Adds new records for changes, and keeps full history for a given natural key.
  • Type 3 SCD: History is added as a new column.
  • Type 4 SCD: A new dimension is added.
  • Type 5 SCD: When certain attributes of a large dimension change over time, but using type 2 isn't feasible due to the dimension’s large size.
  • Type 6 SCD: Combination of type 2 and type 3.
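
As a hedged sketch of the type 2 pattern using Delta merge in a notebook: the current row for a changed customer is closed off, and a new current row is inserted. The two-pass approach, the tracked attribute (Address), and every table and column name are illustrative assumptions rather than the only way to implement SCD type 2 (surrogate key generation is omitted for brevity).

    from delta.tables import DeltaTable

    dim = DeltaTable.forName(spark, "dim_customer")
    updates = spark.read.table("stg_customer")   # latest snapshot from the source

    # Pass 1: expire the current row when a tracked attribute has changed.
    (dim.alias("d")
        .merge(updates.alias("u"), "d.CustomerID = u.CustomerID AND d.IsCurrent = true")
        .whenMatchedUpdate(
            condition="d.Address <> u.Address",
            set={"IsCurrent": "false", "EndDate": "current_date()"})
        .execute())

    # Pass 2: insert a new current row for customers with no open row
    # (brand-new customers and the ones expired in pass 1).
    (dim.alias("d")
        .merge(updates.alias("u"), "d.CustomerID = u.CustomerID AND d.IsCurrent = true")
        .whenNotMatchedInsert(values={
            "CustomerID": "u.CustomerID",
            "Address": "u.Address",
            "StartDate": "current_date()",
            "EndDate": "null",
            "IsCurrent": "true"})
        .execute())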

The mechanism for detecting changes in source systems is crucial for determining when records are inserted, updated, or deleted. Change Data Capture (CDC), change tracking, and triggers are all features available for managing data tracking in source systems such as SQL Server.

When it comes to loading data in a data warehouse, there are several considerations to keep in mind.

  1. Load volume & frequency - Assess data volume & load frequency to optimize performance.
  2. Governance - Any data that lands in OneLake is governed by default.
  3. Data mapping - Manage mapping from source to staging to warehouse.
  4. Dependencies - Understand dependencies in the data model for loading dimensions.
  5. Script design - Design efficient import scripts considering column names, filtering rules, value mapping, and database indexing.

Power BI Report Optimization Techniques

  1. Use Performance Analyzer.
  2. Troubleshoot DAX performance by using DAX Studio.
  3. Optimize the data model.
  4. Optimize a data model by using Best Practice Analyzer.

The Deployment Pipeline tool enables users to manage the development lifecycle of content within their tenant. The feature is available within the Power BI Service with a Premium Capacity license.

Microsoft Fabric is a SaaS solution for end-to-end data analytics.

Microsoft Fabric Architecture

Understand Fabric concepts: tenant, capacity, domain, workspace, and item

A Fabric tenant is a dedicated space for organizations to create, store, and manage Fabric items. There's often a single instance of Fabric for an organization, and it's aligned with Microsoft Entra ID. The Fabric tenant maps to the root of OneLake and is at the top level of the hierarchy.

Capacity is a dedicated set of resources that is available at a given time to be used. A tenant can have one or more capacities associated with it. Capacity defines the ability of a resource to perform an activity or to produce output. Different items consume different amounts of capacity at a given time. Fabric offers capacity through Fabric SKUs and trials.

A domain is a logical grouping of workspaces. Domains are used to organize items in a way that makes sense for your organization. You can group things together in a way that makes it easier for the right people to have access to the right workspaces. For example, you might have a domain for sales, another for marketing, and another for finance.

A workspace is a collection of items that bring together different functionality in a single tenant. It acts as a container that leverages capacity for the work that is executed and provides controls for who can access the items in it. For example, in a sales workspace, users associated with the sales organization can create a data warehouse, run notebooks, create semantic models, create reports, etc.

Fabric items are the building blocks of the Fabric platform. They're the objects that you create and manage in Fabric. There are different types of items, such as data warehouses, data pipelines, semantic models, reports, and dashboards.

Describe admin tasks:

  1. Security and access control
  2. Data governance
  3. Customization and configuration
  4. Monitoring and optimization

Fabric has a few built-in governance features to help you manage and control your data. Endorsement is a way for you as an admin to designate specific Fabric items as trusted and approved for use across the organization.

Admins can also make use of the scanner API to scan Fabric items for sensitive data, and the data lineage feature to track the flow of data through Fabric.


Resource:

  1. https://learn.microsoft.com/en-us/collections/jkqrh725262g?WT.mc_id=cloudskillschallenge_b696c18d-7201-4aff-9c7d-d33014d93b25&sharingId=DD14128009B49F23



