Warping through Data pipelines
Mathias Halkjær Petersen
Data & Business Intelligence | Microsoft Data Platform MVP—Sharing my stories, tips and tricks
A comprehensive guide to using Data pipelines in Microsoft Fabric
Want to retrieve numerous tables from databases across multiple data sources in a fairly simple, reliable, and scalable manner?
Then read along for this deep dive into Data pipelines in Microsoft Fabric. In this blog post, I hope you'll learn what Data pipelines are, what components they consist of, and why they are extremely useful. We will explore the basics of Data pipelines, making sure that you understand the features available to you.
Data pipelines are found in the Data Factory, Data Engineering, and Data Warehouse experiences in Microsoft Fabric. The breadth of experiences they appear in is a sign of how integral they are to any data workflow.
Data ingestion
To begin building out a full-fledged data platform, we need to start with some actual data. Sure, we can quickly load some data into our datasets or store them in a dataflow for reusability. But if we want to utilize the full power of a data platform, we need a proper way of ingesting and storing data from our sources.
I'm from a Power BI background - why aren't dataflows enough?
Power BI Dataflows are fine for what they are built for: reusing the same data across multiple datasets without reloading it from the source multiple times. Sure, with a few hacks here and there, we can store their data in a data lake, enable incremental loads, and even set up historical snapshot capturing. Still, they lack the inherent scalability and flexibility of dedicated data ingestion systems. Moreover, their makeshift approaches to historical data management, coupled with limited error handling, logging, and monitoring capabilities, can lead to inefficiencies and potential data issues.
With Dataflows Gen2, a lot of these issues are easier to solve. However, ultimately, we are still limited to overwriting or appending each table to one destination - and setting this configuration up manually one table at a time. Fine if you have a handful of tables, dreadful if you have 200 tables.
Data pipelines at a glance
At a glance, data pipelines consist of 4 components: schedule, activities, variables/parameters, and monitoring/logging. They also come with a connection component, but as you will learn later, that component is not exclusive to Data pipelines.
Activities
The cornerstone of Data pipelines is the activities, which allow us to do anything from copying data, deleting data, controlling the workflow itself, and orchestrating other compute jobs, to sending out messages with Outlook and Teams. Activities consist of 4 types:
Data Movement and Transform activities are the bread and butter of Data pipelines, and the most essential of them must be the "copy activity". Its name doesn't give away much of its potency, and neither does clicking through its UI, because its true power lies somewhere else.
The true power of the copy activity lies in its connectors and how each of them has a vast set of configurations and setups specific to that type of source. So if you are connecting to a REST API, you can enjoy features like dynamic endpoint URL handling, throttling setup, pagination rules, and more, while if you are connecting to SQL Server, you can do things like copying full tables, dynamic querying, running stored procedures, and more. Each connector is its own Swiss Army knife of tools to read and write data for that specific source.
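To make the pagination point concrete, here is a conceptual Python sketch of what the REST connector's pagination rules handle for you declaratively inside the copy activity. Everything in it - the endpoint, the field names, and the lack of authentication - is a made-up placeholder, not an actual Fabric API.

```python
# Conceptual sketch only: the copy activity's pagination rules do this for you
# declaratively. The endpoint and field names below are hypothetical.
import requests

BASE_URL = "https://api.example.com/v1/orders"  # hypothetical REST source

def read_all_pages(url: str) -> list[dict]:
    """Follow a 'next page' link until the API stops returning one."""
    rows: list[dict] = []
    while url:
        payload = requests.get(url, timeout=30).json()
        rows.extend(payload["value"])      # hypothetical field holding the records
        url = payload.get("nextLink")      # hypothetical field holding the next page URL
    return rows

print(f"Retrieved {len(read_all_pages(BASE_URL))} rows across all pages")
```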
Furthermore, in Microsoft Fabric (as opposed to Azure Data Factory), pipelines also enable reading from and writing back to Delta tables through the Fabric storage artifacts, Lakehouse and Warehouse, as well as supporting KQL databases.
Aside from the copy activity, there are several other notable data movement and transformation activities worth exploring.
Orchestration activities help make sure that individual workflows are scheduled and synchronized. If we utilize the orchestration capabilities in Data pipelines properly, we can avoid issues like datasets updating too late after the last data warehouse transform activities, or, even worse, during them.
Data pipelines offer a lot of flexibility for orchestration, with the Web activity as a fallback for anything reachable through a REST API call. Out of the box, we can orchestrate the following: Dataflows, Notebooks, Stored procedures (SQL), Webhooks, Azure Functions, Azure Batch, other pipelines, and more.
There is no specific activity to refresh a Power BI dataset. Still, we can effectively orchestrate a Power BI dataset refresh by using the Web activity to call the Power BI REST API instead, as sketched below.
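To show what that Web activity would be calling, here is a minimal sketch of the refresh request expressed in Python; the same URL and POST method go into the Web activity's settings. The workspace ID, dataset ID, and token are placeholders, and acquiring an Azure AD token with the right permissions is assumed to happen elsewhere.

```python
# Minimal sketch of the Power BI REST API call the Web activity would make.
# The IDs and token below are placeholders.
import requests

WORKSPACE_ID = "<workspace-guid>"   # a.k.a. group ID
DATASET_ID = "<dataset-guid>"
ACCESS_TOKEN = "<azure-ad-token>"   # assumed to be acquired elsewhere (e.g. a service principal)

url = (
    "https://api.powerbi.com/v1.0/myorg/"
    f"groups/{WORKSPACE_ID}/datasets/{DATASET_ID}/refreshes"
)

# POST queues an asynchronous dataset refresh; 202 Accepted means it was queued.
response = requests.post(url, headers={"Authorization": f"Bearer {ACCESS_TOKEN}"})
response.raise_for_status()
print("Refresh queued:", response.status_code)
```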
Control flow activities, finally, add a whole world of flexibility and dynamism to the pipelines. Instead of our workflows being static and linear, we can make them dynamic, branching, and iterative, and even implement feedback loops, especially when paired with parameters, variables, and activity dependencies!
Parameters, Variables, Activity output, and Dynamic expressions
The different activities offer very flexible capabilities, and together with control flow features, it is possible to create complex and intricate workflows, but they all still have a static or hardcoded feel to them. This is where parameters and variables enter the picture.
They are both used to hold and manipulate values during the workflow process. Still, they serve different purposes and have different scopes and lifetimes.
Parameters in Azure Data Factory are immutable and used for making pipelines and datasets more dynamic and reusable. Their values are set at the start and remain constant throughout the execution. They can exist at multiple scopes, such as pipeline, dataset, and linked service levels, and can be optional or required. Despite being immutable, parameters can still act as pointers to other input fields, making them less static in practice.
Variables, in contrast, are mutable and scoped within a single pipeline. They serve as temporary storage for values that can change during the pipeline's execution. Activities like "Set Variable" and "Append Variable" can modify their values, making them useful for iterative processes and calculations.
To wrap it up, we can use dynamic expressions, which let us utilize the values from parameters and variables in our actual loads. So a variable could control which table we load data from, parameters could change what database or REST API endpoint we use as a source, and so on.
Finally, dynamic expressions can even use the output of prior activities as their values, enabling us to build data-driven pipelines. A few illustrative expressions are sketched below.
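The examples below illustrate the Azure Data Factory-style expression language that Data pipelines use, collected here as plain Python strings for readability. The parameter, variable, and activity names (SchemaName, TableName, LookupWatermark) are made up for the sketch.

```python
# Illustrative pipeline expressions, gathered as plain strings for reference.
# All parameter/variable/activity names here are hypothetical.
expression_examples = {
    # A pipeline parameter - set when the pipeline (or a parent pipeline) starts:
    "source_schema": "@pipeline().parameters.SchemaName",

    # A variable - can be changed mid-run with the Set Variable activity:
    "source_table": "@variables('TableName')",

    # Output of a prior activity - e.g. the first row returned by a Lookup:
    "watermark": "@activity('LookupWatermark').output.firstRow.LastModified",

    # Functions, parameters, and variables can be combined in one expression:
    "source_query": (
        "@concat('SELECT * FROM ', "
        "pipeline().parameters.SchemaName, '.', variables('TableName'))"
    ),
}

for name, expression in expression_examples.items():
    print(f"{name}: {expression}")
```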
Activity dependencies
Activities don't only work in isolation; they can also trigger each other through activity dependencies. If an activity does not have any upstream dependencies, it will run as soon as the pipeline is executed (to the extent parallelization allows it). But if we want activities to run in a specific sequence or pattern, we can use activity dependencies.
There are 4 different types of dependency, explained below:
On skip will trigger when the upstream activity is skipped because one of its own dependencies prevented it from running, for example if the activity before it failed. If a dependent activity's conditions aren't met, it is marked as "Skipped" rather than executed; this status helps identify activities in a pipeline that did not run due to unmet dependency conditions.
On success will trigger the next activity if the current activity runs without errors.
On fail will trigger the next activity if the current activity fails.
On completion will trigger the next activity whether the current one fails or succeeds, as long as it has run and is "complete".
Activity dependencies do not need to be strictly linear. We can create intricate workflows with feedback loops, retries, and more, though such workflows may become overly complex if not carefully considered. A minimal sketch of how dependencies appear in a pipeline's underlying definition follows below.
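As a rough illustration, here is how activity dependencies can appear in a pipeline's JSON definition, shown as Python dicts for readability. The activity names are made up, only the dependency-related fields are shown, and this follows the Azure Data Factory-style schema that Fabric pipelines expose - treat it as a sketch rather than a full definition.

```python
# Sketch of the dependsOn blocks two downstream activities might carry.
# Activity names are hypothetical; only dependency-related fields are shown.
notify_on_failure = {
    "name": "Notify team",
    "dependsOn": [
        # "On fail": runs only if the upstream copy fails.
        {"activity": "Copy sales data", "dependencyConditions": ["Failed"]},
    ],
}

load_warehouse = {
    "name": "Load warehouse",
    "dependsOn": [
        # "On success" on two upstream activities: runs only if BOTH succeed.
        {"activity": "Copy sales data", "dependencyConditions": ["Succeeded"]},
        {"activity": "Copy customer data", "dependencyConditions": ["Succeeded"]},
    ],
}

# The four dependency conditions map to: Succeeded, Failed, Skipped, Completed.
print(notify_on_failure["dependsOn"][0]["dependencyConditions"])
```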
Connections
Connections in Microsoft Fabric allow us to connect our data pipelines to actual data sources. Three types can be found on the connection management page: On-premises, Virtual network, and Cloud. At the time of writing, only Cloud connections work with Data Pipelines, but the others are on the roadmap. (https://aka.ms/FabricRoadmap)
Connections are shared between multiple tools in Microsoft Fabric, including Power BI and Data Pipelines. This is an important distinction to make, because once a connection is created, modifying it may have consequences beyond the single tool. For example, if I update my data warehouse connection from authenticating with read/write access to read-only access, my pipelines will be blocked from writing to that data warehouse. As a rule of thumb, separate connections should be made for read and write workloads, and read-only access should be given to data load connections.
Schedule (Trigger)
Having a neat Data pipeline is excellent, but without running it, nothing will happen. We can run a pipeline manually, as well as set up a schedule, but compared to Azure Data Factory, our choice of triggers is less extensive. Schedules can only run by the minute, hourly, daily, or weekly, and while pipelines can still orchestrate other pipelines, the remaining trigger types found in Azure Data Factory are not implemented.
So, despite some limitations, we still have a workable range of triggers, and especially being able to trigger Data pipelines from another pipeline and pass parameters between them allows us to design advanced patterns and hierarchical pipelines.
Monitoring and logging
We can monitor all of our pipeline runs natively in the user experience. Go to the pipeline and click "View run history" to see the pipeline-specific run history. Alternatively, in the monitoring hub, we can get a complete overview across all of Microsoft Fabric.
After going through many of the different activities, connections, monitoring options, and schedules in Data Pipelines, I hope you are left with the impression that each and every one of them is its own building block - building blocks that can be combined, mixed, stacked, and used together to unlock numerous use cases and create robust and powerful data handling and orchestration solutions.
Some of the most typical use cases for Data Pipelines involve:
* Be aware that concurrency in Data Pipelines is limited by the size of your capacity - and will fail rather than queue if the limit is exceeded.