Fabric for Analysts Part 1: Data Ingestion & Preparation
Jon Stjernegaard Vøge
Microsoft MVP | Data Speaker, Trainer & Consultant | Founder @ Fabric Symposium
Introduction
While Microsoft Fabric promises a collaborative workspace for Data Analysts, Engineers and Scientists alike, the person with the most to gain, and the least to lose, from the data mesh-like architectural patterns in Fabric is the Data Analyst.
Do you, as a Data Analyst, recognize any of the following challenges?
Then Microsoft Fabric might be the solution for you.
This first part of the article series, 'Microsoft Fabric for Data Analysts', explores how you as a Data Analyst can get started with ingesting your own data into Microsoft Fabric.
The article first establishes the common needs and challenges regarding Data Ingestion faced by the Data Analyst, then outlines three different methods for ingesting and preparing data in Microsoft Fabric, and finally provides a small flow chart to help you decide which approach to use.
Data Ingestion & Preparation Needs and Challenges
Data Analysts commonly want to integrate new data into a model currently used for reporting. Sometimes the data is already in the cloud, and sometimes the data is a small spreadsheet of handwritten data that came out of a meeting the other day.
However, in the process of solving this need, the analyst will often face one or more of the following challenges:
In Microsoft Fabric, however, the collaborative workspace, low-code tools and OneLake integration give the Analyst several tools to solve Data Ingestion tasks on their own.
Ingest and prepare data with Dataflow Gen2
One of the typical scenarios for data ingestion is that the analyst wants to integrate some new data, which already lives somewhere in the cloud, into the model.
This could be:
As an Analyst, one tool you can use for this kind of Data Ingestion in Microsoft Fabric is Dataflow Gen2, which is part of the Data Factory Experience in Fabric.
Dataflow Gen2, the revamped version of Dataflows from the Power BI Service (and essentially Power Query in the cloud), keeps the existing data connector functionality while improving performance and adding the option of output Data Destinations.
Creating a Dataflow Gen2
To get started with Dataflow Gen2, go to your Fabric Enabled Workspace (regardless of which Fabric Experience you are entering from). Click “Create New”, select “More Options”, and choose Dataflow Gen2 from the list of Data Factory artifacts:
Connecting to Data
In the Dataflow Gen2 editor, you can connect to data from hundreds of data connectors. It works well with databases, files stored in OneDrive/SharePoint and even custom API calls, and transformations work just as you know them from Power Query on your desktop.
Data Transformation
Likewise, you can apply data transformations already at this stage of the data journey, as part of the ingestion. The editor behaves exactly as in Power Query/Dataflows, letting you clean, enrich, and structure data as you’re used to:
If you want a more visual, no-code approach to transformation, the excellent Diagram View / Visual Query builder provides an easy-to-understand overview of data lineage within the dataflow, as well as the ability to add transformations to your queries:
But you may also want to keep your ingestion clean, and only use the transformation tools to filter out unnecessary data, leaving any complex transformation for later down the line.
Adding a Data Destination
What’s even more exciting, however, is the new option to output data to a Data Destination. From an easy-to-navigate GUI it’s possible to overwrite or append data to new or existing tables in either an Azure SQL or Kusto database, or in a Microsoft Fabric Warehouse or Lakehouse:
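If you are curious what these Overwrite and Append options correspond to in code, here is a minimal PySpark sketch of the two write modes against a Lakehouse table. It assumes a Fabric notebook with a default Lakehouse attached, and the table and column names are purely illustrative, not part of the Dataflow itself.

```python
# Minimal sketch: what "Overwrite" vs "Append" means for a Lakehouse table.
# Assumes a Fabric notebook with a default Lakehouse attached; the table name
# "staging_orders" and the sample rows are purely illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already provided in Fabric notebooks

df = spark.createDataFrame(
    [("2024-01-01", 100.0), ("2024-01-02", 250.5)],
    ["order_date", "amount"],
)

# Overwrite: replace the destination table's contents with the incoming data
df.write.format("delta").mode("overwrite").saveAsTable("staging_orders")

# Append: keep the existing rows and add the incoming ones
df.write.format("delta").mode("append").saveAsTable("staging_orders")
```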
Publish and Schedule your Dataflow Refresh
Finally, after selecting the output destination for each table being loaded, Publish your Dataflow Gen2 with the big green button, and watch it perform its first refresh.
If you want this ingestion process to happen automatically on a schedule, the fastest way is to open the Dataflow Settings and set up a Refresh Schedule, similar to how you would schedule Dataflow and Dataset refreshes in the Power BI Service:
However, if you want more control and governance around your data refresh, you may further employ the help of Data Factory Pipelines, as outlined in the section below.
Ingest data and/or orchestrate Dataflow Gen2 with Data Factory Pipelines
Another Microsoft Fabric tool that you can use to ingest data already hosted in the cloud, and which is approachable for you as an Analyst, is Data Pipelines from the Data Factory Experience in Fabric.
Data Pipelines are close in functionality to Azure Data Factory and Azure Synapse Pipelines, without boasting full feature parity… yet. You can use them for complex ETL processes for ingesting data, or as orchestration pipelines for executing various activities (including refreshing your Dataflow Gen2!).
Creating a Data Pipeline
To get started with Data Pipelines, again go to your Fabric Enabled Workspace, click “Create New”, select “More Options”, and choose Data Pipelines from the list of Data Factory artifacts:
Data Pipelines are built around ‘Activities’, each of which may be configured with settings and parameters to alter its behaviour, and which can be linked together to create sequences of conditional activities.
Some of the activities require more complex setup, and in some cases coding, to function well, but others can be set up in a completely no-code fashion.
Ingesting data with Data Pipelines
If you wish to use Data Pipelines for data ingestion, your activity of choice would be the ‘Copy Data’ activity:
You can configure it manually:
Or if you prefer, use the Copy Assistant for a more guided experience:
And once your Copy Activity is on your Canvas, the pipeline can be scheduled to run at an interval you determine, and will ingest data into your selected destination automatically:
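As a side note, if you ever want to kick off such a pipeline on demand from outside Fabric rather than waiting for the schedule, a small sketch using Python against the Fabric REST API's on-demand item job endpoint could look like the following. The endpoint, the "Pipeline" job type, the IDs and the token are assumptions you should verify against your own tenant and the current API documentation.

```python
# Hedged sketch: trigger a Data Pipeline run on demand via the Fabric REST API.
# The endpoint, the "Pipeline" jobType and all IDs below are assumptions to be
# checked against your tenant and the current API documentation.
import requests

WORKSPACE_ID = "<your-workspace-id>"      # placeholder
PIPELINE_ID = "<your-pipeline-item-id>"   # placeholder
TOKEN = "<azure-ad-access-token>"         # e.g. acquired via MSAL

url = (
    f"https://api.fabric.microsoft.com/v1/workspaces/{WORKSPACE_ID}"
    f"/items/{PIPELINE_ID}/jobs/instances?jobType=Pipeline"
)

response = requests.post(url, headers={"Authorization": f"Bearer {TOKEN}"})
response.raise_for_status()  # a 202 response means the run was accepted
print("Pipeline run requested:", response.status_code)
```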
Improve Governance by using Data Pipelines to Orchestrate ingestion & preparation
The true power of Data Pipelines does not, however, show itself in single-activity pipelines like the one created here. Data Pipelines truly shine when used to conditionally sequence multiple activities and orchestrate several actions.
You may, for example, run another activity after the Copy Data activity succeeds, e.g. running a Stored Procedure on the Data Warehouse:
Or decide to run our previously created Dataflow Gen2 if the Copy Data activity has failed:
And you can add further governance by automatically communicating the success or failure of your flows through the Outlook and Teams activities, or even by triggering a Power Automate flow with the Web Activity:
While the above examples only scratch the surface of what a true master of Data Pipelines may achieve with dynamic pipelines, it is hopefully clear how we can use them to provide more granular control of our data ingestion and the governance around it.
Manually loading files into OneLake
On the opposite end of the governance spectrum lies the concept of manually ingesting offline files into OneLake.
Sometimes it is just as valid a use case to bring files from your Local Machine or an on-prem Network Drive into your data solution. And while it is hardly a well-governed and structured approach, it is easily accomplished in Fabric.
With Microsoft Fabric you have two options to do this:
Using OneLake locally installed as ‘The OneDrive of Data’ for manual file upload
Just like OneDrive, OneLake can be installed and used in the File Explorer of your computer. Simply download and install the OneLake file explorer application, and make sure it is running so you can navigate your OneLake.
Once installed, you can browse your OneLake in a neat folder structure, allowing you to see all the items stored in there:
To upload local files to your OneLake you must have a Fabric Lakehouse available in your tenant (we’ll discuss the creation of Warehouses and Lakehouses in the next article of the series).
Then simply navigate to the ‘Files’ folder of your Lakehouse and copy your local file (preferably in CSV format) into the folder:
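If you would rather script that copy step instead of dragging the file by hand, a minimal Python sketch could look like the one below. It assumes the default OneLake file explorer mount under your user profile, and the workspace, Lakehouse and file names are purely illustrative.

```python
# Minimal sketch: copy a local CSV into the OneLake file explorer mount.
# The mount location and the workspace/lakehouse/file names below are
# assumptions; adjust them to match your own environment.
import shutil
from pathlib import Path

local_file = Path(r"C:\Data\meeting_notes.csv")  # hypothetical local file

# Typical layout: <user profile>\OneLake\<workspace>\<lakehouse>.Lakehouse\Files
onelake_files = (
    Path.home()
    / "OneLake"
    / "My Workspace"             # hypothetical workspace name
    / "MyLakehouse.Lakehouse"    # hypothetical Lakehouse item name
    / "Files"
)

shutil.copy(local_file, onelake_files / local_file.name)
print(f"Copied {local_file.name} to {onelake_files}")
```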
And after synchronizing, the files will be available within your Lakehouse:
The final step for ingestion here is to load the CSV file into a new or existing table, after which your manually ingested data will be available to use in Data Models built on the Lakehouse:
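If you prefer a code-first alternative to the Lakehouse UI's load-to-table option, a minimal PySpark sketch run from a Fabric notebook with the Lakehouse attached could look like this; the file and table names are illustrative.

```python
# Minimal sketch: load a CSV from the Lakehouse Files area into a Delta table.
# Assumes a Fabric notebook with the Lakehouse attached as default; the file
# and table names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # pre-configured in Fabric notebooks

df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("Files/meeting_notes.csv")  # path relative to the attached Lakehouse
)

# Writing as a Delta table makes it appear under the Lakehouse "Tables" section
df.write.format("delta").mode("overwrite").saveAsTable("meeting_notes")
```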
Using manual file upload directly in Fabric Lakehouse
As an alternative to the approach above, it is possible to upload files directly to the Fabric Lakehouse from your browser.
To do this, navigate to your Lakehouse and find the “Upload Files” option by right-clicking the ‘Files’ folder:
Select the file you wish to upload, allow a few seconds, and then refresh the Files folder; the file should show up in its contents:
Same as before, you can now load said file into a table, and get on with your data modeling!
Summary and Decision Tree
This article has shown three different approaches to ingesting and preparing data in Microsoft Fabric as a Data Analyst:
Each approach has strengths and weaknesses, with varying degrees of complexity, flexibility and durability. Below is a quick flow chart which hopefully can help you decide which tool to use for ingesting data into Microsoft Fabric:
"Everyone has the right to a cool Enterprise Architecture!" Manager -Enterprise Architecture at Kvik A/S
1 年Emil Sejer
Technology Advisor focused on putting perspective on Data & Technology | Data with Purpose | Platforms with Impact | People with Power
1 年This is really interesting and shows how Microsoft Fabric offers a wide range of tools for self-service. This can be the difference between data-related ideas that drowns in planning and prioritization and now being able to execute on ideas in the context that they are conceived. I look forward to more articles in your series Jon