Warping through Data pipelines

A comprehensive guide to using Data pipelines in Microsoft Fabric

Want to retrieve numerous tables from databases across multiple data sources in a fairly simple, reliable, and scalable manner?

Data Factory (incl. Data Pipelines) role in Microsoft Fabric

Then read along for this deep-dive into Data pipelines in Microsoft Fabric. In this blog post, I hope you'll learn what Data pipelines are, what components they consist of, and why they are extremely useful. Along the way, we will explore the basics and make sure you understand the features available to you.

Data pipelines are found in the Data Factory, Data Engineering, and Data Warehouse experiences in Microsoft Fabric. The breadth of experiences they appear in is a sign of how integral they are to any data workflow.

Experiences in Microsoft Fabric with Data pipelines


Data ingestion

To begin building out a full-fledged data platform, we need to start with some actual data. Sure, we can quickly load data into our datasets or store it in a dataflow for reusability. But if we want to utilize the full power of a data platform, we need a proper way of ingesting and storing data from our sources.

I'm from a Power BI background - why aren't dataflows enough?

Power BI Dataflows are fine for what they were built for: reusing the same data across multiple datasets without reloading it from the source multiple times. Sure, with a few hacks here and there, we can store their data in a data lake, enable incremental loads, and even set up historical snapshot capturing. Still, they lack the inherent scalability and flexibility of dedicated data ingestion systems. Moreover, their makeshift approach to historical data management, coupled with limited error handling, logging, and monitoring capabilities, can lead to inefficiencies and potential data issues.

With Dataflows Gen2, a lot of these issues are easier to solve. However, ultimately, we are still limited to overwriting or appending each table to one destination - and setting this configuration up manually one table at a time. Fine if you have a handful of tables, dreadful if you have 200 tables.

Data pipelines at a glance

At a glance, Data pipelines consist of 4 components: schedule, activities, variables/parameters, and monitoring/logging. They also come with a connection component, but as you will learn later, that component is not exclusive to Data pipelines.

Data Pipeline Components

Activities

The cornerstone of Data pipelines is the activities, which allow us to do anything from copying data, deleting data, and controlling the workflow itself, to orchestrating other compute jobs and sending out messages with Outlook and Teams. Activities come in 4 types:

  • Data movement
  • Transform
  • Orchestrate
  • Control flow

Data Movement and Transform activities are the bread and butter of Data pipelines, and the most essential activity must be the "copy activity". Its name doesn't give away much about its potency, and neither does clicking through its UI, because its true power lies somewhere else.

Copy data config tabs

The true power of the copy activity lies in its connectors, each of which comes with a vast set of configurations specific to its type of source. If you are connecting to a REST API, you get features like dynamic endpoint URL handling, throttling setup, pagination rules, and more, while if you are connecting to a SQL Server, you can copy full tables, run dynamic queries, execute stored procedures, and more. Each connector is its own Swiss Army knife of tools for reading and writing data for that specific source.
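
To appreciate how much plumbing a connector handles for you, here is a rough Python sketch of what manually ingesting a paginated REST source could look like without the copy activity. The endpoint URL, page parameters, token, and response shape (the "value" key) are all placeholder assumptions, not a real API:

```python
import requests

BASE_URL = "https://example.com/api/v1/orders"  # hypothetical endpoint
TOKEN = "<access-token>"                        # placeholder credential

def fetch_all_pages(page_size: int = 500) -> list[dict]:
    """Pull every page from the endpoint using a simple page-number scheme."""
    headers = {"Authorization": f"Bearer {TOKEN}"}
    rows, page = [], 1
    while True:
        resp = requests.get(
            BASE_URL,
            headers=headers,
            params={"page": page, "pageSize": page_size},
            timeout=30,
        )
        resp.raise_for_status()               # crude error handling
        batch = resp.json().get("value", [])  # assumed response shape
        if not batch:                         # empty page -> no more data
            break
        rows.extend(batch)
        page += 1
    return rows

if __name__ == "__main__":
    print(f"Fetched {len(fetch_all_pages())} rows")
```

In a Data pipeline, the REST connector's pagination rules, throttling settings, and retry policies replace all of this hand-written plumbing, and the destination write comes for free.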

Furthermore, in Microsoft Fabric (as opposed to Azure Data Factory), pipelines also enable reading from and writing back to delta tables through the Fabric storage artifacts, Lakehouse and Warehouse, as well as supporting KQL databases.

Sources and destinations (destinations marked with green line)


Aside from the copy activity, some notable mentions include the following:

Data handling activities

  • Stored procedure lets you execute stored procedures in your linked databases. It is especially useful when you have pre-defined operations in a database that need to be triggered as part of your data integration flow.
  • Script allows users to run custom scripts, queries, and non-query statements against a data source. Nifty for a quick SQL UPDATE statement, CREATE TABLE, or other SQL DML/DDL (a sketch of this kind of statement follows after this list).
  • Web activity can be utilized to make HTTP requests to REST endpoints. It supports various HTTP methods and can be used to start or monitor other workflows or to fetch data from a web endpoint.
  • Lookup activity retrieves a dataset from supported sources such as Data Lake, Azure SQL, Fabric, etc. This activity is typically used to get reference data or configurations that will be used by subsequent activities in the pipeline.
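
As a rough illustration of the kind of one-off statement the Script activity runs for you, here is a minimal Python sketch using pyodbc. The connection string, table, and column names are placeholders, and in a real pipeline you would paste only the SQL into the Script activity:

```python
import pyodbc

# Placeholder connection string - server, database, and auth method are assumptions
CONN_STR = (
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=<your-server>.database.windows.net;"
    "DATABASE=<your-database>;"
    "Authentication=ActiveDirectoryInteractive;"
)

with pyodbc.connect(CONN_STR) as conn:
    cursor = conn.cursor()
    # DDL: create a small audit table if it does not exist yet
    cursor.execute(
        """
        IF OBJECT_ID('dbo.LoadAudit', 'U') IS NULL
            CREATE TABLE dbo.LoadAudit (
                LoadId    INT IDENTITY(1, 1) PRIMARY KEY,
                TableName NVARCHAR(128) NOT NULL,
                LoadedAt  DATETIME2 DEFAULT SYSUTCDATETIME()
            );
        """
    )
    # DML: record that a table was loaded
    cursor.execute("INSERT INTO dbo.LoadAudit (TableName) VALUES (?);", "dbo.Sales")
    conn.commit()
```

Inside the pipeline, the Script activity gives you the same capability declaratively: you supply the statement, pick the connection, and the service handles execution and logging.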

Orchestration activities help make sure that individual workflows are scheduled and synchronized. If we utilize the orchestration capabilities in Data pipelines properly, we can avoid issues like datasets refreshing too long after the last data warehouse transform activities have finished, or, even worse, during them.

Example of dynamic orchestration of a medallion architecture

Data pipelines offer a lot of flexibility for orchestration, with REST API calls through the Web activity as a fallback. Out of the box, we can orchestrate the following: Dataflows, Notebooks, Stored procedures (SQL), Webhooks, Azure Functions, Azure Batch, other pipelines, and more.

Examples of orchestration activities

There is no specific activity to refresh a Power BI dataset. Still, we can orchestrate a dataset refresh anyway by using the Web activity to call the Power BI REST API.
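
As a sketch of what that Web activity call amounts to, here is the equivalent request in Python. The workspace and dataset IDs are placeholders, and acquiring the Azure AD access token (for example via a service principal) is assumed to happen elsewhere:

```python
import requests

WORKSPACE_ID = "<workspace-guid>"    # placeholder
DATASET_ID = "<dataset-guid>"        # placeholder
ACCESS_TOKEN = "<aad-access-token>"  # assumed to be acquired elsewhere

# Power BI REST API: trigger a refresh of a dataset in a workspace
url = (
    f"https://api.powerbi.com/v1.0/myorg/groups/{WORKSPACE_ID}"
    f"/datasets/{DATASET_ID}/refreshes"
)
response = requests.post(
    url,
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    json={"notifyOption": "NoNotification"},
)
response.raise_for_status()
print("Refresh triggered:", response.status_code)  # 202 Accepted on success
```

In the pipeline, the Web activity issues this same POST, with the token typically supplied through the connection or a dynamic expression.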

Control flow activities, finally, add a whole world of flexibility and dynamism to the pipelines. Instead of our workflows being static and linear, we can make them dynamic, branching, and iterative, and even implement feedback loops, especially when paired with parameters, variables, and activity dependencies!

Control flow activities in Data Pipelines

Parameters, Variables, Activity output, and Dynamic expressions

The different activities offer very flexible capabilities, and together with the control flow features they make it possible to create complex and intricate workflows. Still, everything so far has a static, hardcoded feel to it. This is where parameters and variables enter the picture.

They are both used to hold and manipulate values during the workflow process. Still, they serve different purposes and have different scopes and lifetimes.

Parameters in Azure Data Factory are immutable and used for making pipelines and datasets more dynamic and reusable. Their values are set at the start and remain constant throughout the execution. They can exist at multiple scopes, such as pipeline, dataset, and linked service levels, and can be optional or required. Despite being immutable, parameters can still act as pointers to other input fields, making them less static in practice.

Variables, in contrast, are mutable and scoped to a single pipeline. They serve as temporary storage for values that can change during the pipeline's execution. Activities like "Set Variable" and "Append Variable" can modify their values, making them useful for iterative processes and calculations.
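
A rough Python analogy, purely illustrative and not actual pipeline syntax: parameters behave like function arguments that stay fixed for the whole run, while variables behave like local variables you can reassign along the way.

```python
def run_pipeline(source_schema: str, target_lakehouse: str) -> None:
    """source_schema and target_lakehouse play the role of pipeline parameters:
    set once when the run starts and never changed afterwards."""
    processed_tables = []  # plays the role of a pipeline variable
    for table in ["Customers", "Orders", "Invoices"]:
        # ...copy logic would go here...
        processed_tables.append(table)  # like a "Set Variable"/"Append Variable" step
    print(f"Loaded {len(processed_tables)} tables from {source_schema} "
          f"into {target_lakehouse}")

run_pipeline(source_schema="dbo", target_lakehouse="Bronze")
```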

To wrap it up, we can use dynamic expressions, which let us utilize the values from parameters and variables in our actual loads. So a variable could control which table we load data from, parameters could change what database or REST API endpoint we use as a source, and so on.

Dynamic expression using a variable as query setting

Finally, dynamic expressions can even use the output of prior activities as their values, enabling us to build pipelines that are driven by data.

Dynamic expression using an output to control the targeted table
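
To make the pattern concrete outside the pipeline editor, here is a minimal Python sketch of the same idea: a lookup returns a list of tables, and that output drives which tables get copied. Both functions are stand-ins for what the Lookup and Copy activities would actually do:

```python
def lookup_tables() -> list[dict]:
    """Stand-in for a Lookup activity returning load metadata."""
    return [
        {"schema": "dbo", "table": "Customers"},
        {"schema": "dbo", "table": "Orders"},
        {"schema": "sales", "table": "Invoices"},
    ]

def copy_table(schema: str, table: str) -> None:
    """Stand-in for a parameterized Copy activity."""
    print(f"Copying {schema}.{table} into the Lakehouse...")

# ForEach over the lookup output, much like iterating
# @activity('Lookup').output.value in the pipeline itself
for item in lookup_tables():
    copy_table(item["schema"], item["table"])
```

In a real pipeline, a ForEach activity iterates over the Lookup output and passes each item into a parameterized copy, so adding table number 201 is another row of metadata rather than another hand-built activity.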


Activity dependencies

Activities don't have to work in isolation; they can trigger each other through activity dependencies. If an activity does not have any upstream dependencies, it will run when the pipeline is executed (to the extent parallelization allows it). But if we want activities to run in a specific sequence or pattern, we can use activity dependencies to control the flow.

There are 4 different types of dependency, explained below:

On skip will trigger when the upstream activity is skipped because one of its own dependencies prevented it from running, for example if the activity before it failed. If a dependent activity's conditions aren't met, it is marked as "Skipped" rather than executed, which helps identify activities in a pipeline that did not run due to unmet dependency conditions.

On success will trigger the next activity if the current activity runs without errors.

On fail will trigger the next activity if the current activity fails.

On completion will trigger the next activity whether the current one fails or succeeds, as long as it has run and is "complete".

4 types of activity dependency

Activity dependencies do not need to be strictly linear. We can create intricate workflows with feedback loops, retries, and more. If not carefully considered, such workflows may become overly complex, though.

Activity dependency hell

Connections

Connections in Microsoft Fabric allow us to connect our data pipelines to actual data sources. Three types can be found on the connection management page: On-premises, Virtual network, and Cloud. At the time of writing, only Cloud connections work with Data Pipelines, but the others are on the roadmap. (https://aka.ms/FabricRoadmap)

Connections are shared between multiple tools in Microsoft Fabric, including Power BI and Data Pipelines. This is an important distinction because once a connection is created, modifying it may have consequences beyond a single tool. For example, if I change the connection for my data warehouse from read/write access to read-only access, my pipelines will be blocked from writing to that data warehouse. As a rule of thumb, create separate connections for read and write workloads, and give read-only access to connections that are only used for reading data.

Schedule (Trigger)

Having a neat Data pipeline is excellent, but without running it, nothing will happen. We can run a pipeline manually as well as set up a schedule, but compared to Azure Data Factory, our choice of triggers is more limited. The available trigger options are:

  • Scheduled run (by the minute, hourly, daily or weekly)
  • Execute Pipeline (Data pipelines triggering other data pipelines)
  • Manual run (Clicking Run in the UI inside the pipeline)

Schedules are limited to those minute, hourly, daily, or weekly recurrences. Pipelines can still orchestrate other pipelines, but none of the following triggers known from Azure Data Factory are implemented:

  • Event triggers (Custom [Event grid], Storage)
  • Tumbling window trigger
  • API

So, despite some limitations, we still have a useful range of triggers, and especially being able to trigger Data pipelines from another pipeline and pass parameters between them allows us to design advanced patterns and hierarchical pipelines.

Monitoring and logging

We can monitor all of our pipeline runs natively in the user experience. Go to the pipeline and click "View run history" to see the pipeline-specific run history:

Run history for a single pipeline

Alternatively, in the monitoring hub, we can get a complete overview across all of Microsoft Fabric:

Monitoring hub

Having been through many of the different activities, connections, monitoring options, and schedules in Data Pipelines, I hope you are left with the impression that each and every one of them is its own building block - building blocks that can be combined, mixed, stacked, and used together to unlock numerous use cases and create robust, powerful data handling and orchestration solutions.

Some of the most typical use cases for Data Pipelines involve:

  • Automated Data movement for data ingestion or integration - especially made powerful with data security features and fault tolerance configurations.
  • Orchestration of entire end-to-end data pipelines - utilizing dependency management, monitoring, logging, metadata capture, and concurrency management*
  • Light transformations and schema mapping - utilizing the built-in mapping capabilities and dynamic querying. Building too much business logic into Data pipelines is not advisable, as it can hurt both the performance and explainability of your workflows; Data pipelines are not a full-scale transformation engine like Spark, dataflows, or SQL in the warehouse.

* Be aware that concurrency in Data Pipelines is limited by the size of your capacity - and runs will fail rather than queue if the limit is exceeded.

References:

  • Monitor copy activity - Azure Data Factory & Azure Synapse | Microsoft Learn
  • Microsoft Fabric Roadmap - https://aka.ms/FabricRoadmap

