Orchestrate workflows using Managed Airflow in Azure Data Factory
Barani Dakshinamoorthy
Data Engineer at PGB Pensioendiensten, Data-Integration and ETL Consultant at eZee-Solutions.com
Microsoft Azure has reached a new milestone this year by introducing "Apache Airflow" into the Azure Data Factory (ADF) environment, bringing the best of both worlds together in one place. The integration is a natural fit for orchestrating data pipelines across multiple data streams. Simply by creating a DAG file (Directed Acyclic Graph), which is nothing more than a collection of tasks and their dependencies, one can start authoring, scheduling, and monitoring workflows.
One might wonder why Airflow is needed at all when ADF already offers scheduling, triggering, and data flow orchestration within the workspace. Airflow comes as a bonus for ADF users when handling complex data pipelines across multiple data sources, and it offers numerous benefits, as detailed below.
1. Airflow enables micro-level task management, with greater transparency in task dependencies. When managing several workflows, it gives fine control over daily batch processing. It also allows integration with services outside Azure, which would otherwise be hard to implement.
2. Airflow comes with an excellent UI that helps author and inspect workflows visually.
3. It integrates very well with Azure Monitor, where one can view task delays or workflow errors using Azure's Airflow metrics.
4. One does not need to worry about managing the underlying infrastructure with respect to updates, scalability, availability, and security, as it is a managed service from Microsoft Azure. Most periodic updates, such as patches or version upgrades, happen automatically in the environment.
5. Furthermore, Azure's Managed Airflow supports open-source integration with hundreds of operators and sensors. These packages (providers) are maintained by the community and are readily available for use. Below are some examples of Airflow providers.
- Amazon provider
- Snowflake provider
- Google provider
- Azure provider
- Databricks provider
- Fivetran provider
Airflow also integrates well with Azure services such as Azure Data Factory pipelines, Azure Key Vault, Azure Batch, and more. As we have seen, there is plenty of reason to make friends with Managed Airflow in Azure Data Factory.
The big picture of Azure's open data integration, and where Airflow fits into the ADF data integration landscape, is shown below (image from Microsoft).
Tip:
For those who are new to Airflow, here is a short brief about its origin.
Apache Airflow is an open-source tool originally developed at Airbnb and now
maintained by the Apache Software Foundation. It is used to author, schedule,
and monitor data workflows. This is done using DAG (Directed Acyclic Graph)
files, which are collections of tasks organized according to their
relationships and dependencies.
A DAG file is written as a Python script, which defines the tasks and their
dependencies (the DAG structure) as code. If you are not fluent in Python,
this can be a tedious task. Nevertheless, in a later section of this article,
I explain how to automate and standardize Python-based DAGs by storing
"generic" DAG templates in a SQL database and substituting actual values
into placeholders.
Now that we know more about Airflow, let's get started with using it in the Azure portal.
Setup
Once your Azure environment is set up by signing up for an Azure subscription, one can create a Data Factory workspace from the available Azure services. In the Manage section within the workspace, the new "Airflow" integration runtime appears in preview, as shown below.
Managed Airflow
In the steps below, we set up the Airflow environment, create DAG files, launch the Airflow UI, and finally monitor the environment using Azure Monitor.
1. Create the Airflow environment
From the Airflow (preview) section, one can initiate the new Airflow environment setup, as shown below.
Once you have named your integration runtime, pay attention to the authentication type. The default Airflow authentication type is Azure AD, which does not require a username and password to log in to the managed Airflow UI. When using Basic authentication, on the other hand, credentials are needed, and they should be stored safely because they are required later to log in to the Airflow UI. In this setup, we use Basic authentication.
To perform the various integration operations, one must choose the appropriate provider(s). In this case, we use the Azure provider to access ADF pipelines. One can add one or more providers to an environment, as shown below.
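For example, the Azure provider can be requested by adding its package name to the environment's Airflow requirements during setup (shown here for illustration; add further provider packages on separate lines as needed):

```text
apache-airflow-providers-microsoft-azure
```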
More information about the Microsoft Azure provider can be found at the link below.
Clicking the Create button initiates the creation of the Airflow integration runtime. It takes a couple of minutes for the setup to complete. When it is ready, the status changes to "Running", as shown below.
From this point onwards, we are ready to start using the Airflow integration to orchestrate data pipelines within Azure Data Factory.
Tip:
Before we start creating DAG files, we should thoroughly understand some of
the principles of Airflow, including its concepts, objects, and applications.
Airflow fundamentals:
https://airflow.apache.org/docs/apache-airflow/stable/tutorial/fundamentals.html
2. Create DAG files
Now that we have Managed Airflow running in the ADF workspace, let's put it to work by creating DAG files.
A directed acyclic graph (DAG) file is created as a Python script consisting of one or more tasks and their dependencies. In Airflow, a task is an executable instance of an operator, such as the Bash operator, Python operator, Email operator, HTTP operator, MS SQL operator, and so on. Within the graph, each task becomes a node that executes a unit of work, and each edge represents a dependency between tasks. One can see the relationships between these tasks visually in the Airflow UI. We will create a couple of DAG files and start using the basic operators.
DAG File 1: Create HelloWorld
In this example, we use a PythonOperator to execute a Python function, as shown below.
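A minimal sketch of such a DAG, assuming Airflow 2.x; the dag_id, task name, and printed message are illustrative placeholders rather than the exact code from the article's screenshot:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def say_hello():
    # The unit of work executed by the task
    print("Hello World from Managed Airflow!")


with DAG(
    dag_id="hello_world",            # illustrative name
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,          # run on demand from the Airflow UI
    catchup=False,
) as dag:
    t1 = PythonOperator(
        task_id="say_hello",
        python_callable=say_hello,
    )
```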
DAG File 2: Create Task Dependency
Similarly, by introducing more task calls, we can include task dependencies, as shown below.
Both T2 and T3 depend on T1; only after T1 executes successfully is control passed on to them.
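A sketch of this pattern, assuming three illustrative BashOperator tasks t1, t2, and t3:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="task_dependency_demo",   # illustrative name
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    t1 = BashOperator(task_id="t1", bash_command="echo 'T1 done'")
    t2 = BashOperator(task_id="t2", bash_command="echo 'T2 done'")
    t3 = BashOperator(task_id="t3", bash_command="echo 'T3 done'")

    # T2 and T3 start only after T1 completes successfully
    t1 >> [t2, t3]
```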
DAG File 3: Running existing ADF pipeline
Before running an ADF pipeline from Airflow, one needs to configure a connection in the Airflow UI (under the Admin -> Connections option) and choose the appropriate 'connection type', which in this case is 'Azure Data Factory'.
One needs to acquire the following keys and attributes to connect to ADF from Airflow:
client_id, client_secret, tenant_id, subscription_id, resource_group_name,
data_factory_name, and pipeline_name.
One can obtain these keys and pipeline details by registering an application in Azure AD and granting it access to the ADF workspace, as shown below.
Below is the DAG segment that runs an existing pipeline via Airflow using the above keys and attributes.
One can also create dependencies between these tasks, as shown below.
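A sketch of such a segment, using the AzureDataFactoryRunPipelineOperator from the Microsoft Azure provider; the connection ID, pipeline, resource group, and factory names below are placeholders for your own values:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.microsoft.azure.operators.data_factory import (
    AzureDataFactoryRunPipelineOperator,
)

with DAG(
    dag_id="run_adf_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    run_pipeline = AzureDataFactoryRunPipelineOperator(
        task_id="run_existing_pipeline",
        azure_data_factory_conn_id="azure_data_factory_conn",  # connection created under Admin -> Connections
        pipeline_name="my_adf_pipeline",                       # placeholder
        resource_group_name="my_resource_group",               # placeholder
        factory_name="my_data_factory",                        # placeholder
        wait_for_termination=True,   # wait until the ADF pipeline run finishes
    )
```

With wait_for_termination set to True, the task keeps running until the ADF pipeline run completes, so any downstream tasks start only after a successful run.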
Tip:
For more details about running an existing data pipeline, please refer to
the link below as well.
https://learn.microsoft.com/en-us/azure/data-factory/tutorial-run-existing-pipeline-with-airflow
Once the DAG files are ready, they should be uploaded into a 'dags' folder in Azure Blob Storage, as shown below.
Finally, these DAG files are put to work by importing them into the running Airflow environment (using the Import files option on the Airflow environment page). One may choose to include multiple tasks and their dependencies in a single DAG file, or create one DAG file per task.
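As one illustrative way to do the upload programmatically, the azure-storage-blob SDK can copy a DAG file into the storage container that the Airflow environment points to; the connection string, container name, and file name below are assumptions:

```python
from azure.storage.blob import BlobServiceClient

# Placeholders: use your own storage connection string and container name
CONN_STR = "<storage-account-connection-string>"
CONTAINER = "airflow"

service = BlobServiceClient.from_connection_string(CONN_STR)
blob = service.get_blob_client(container=CONTAINER, blob="dags/hello_world.py")

# Upload the DAG file under the 'dags' folder of the container
with open("hello_world.py", "rb") as data:
    blob.upload_blob(data, overwrite=True)
```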
Tip:
One could design a "data model" to hold the tasks (nodes) and their
dependencies (edges) and automate the creation of DAG files from a
SQL database.
Using pre-defined templates per project, the DAG file structure can be
customized. SQL stored procedures then substitute the nodes and edges
from the template with actual values, finalizing the DAG file, as
sketched below.
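A minimal Python sketch of this template idea; the template text, the table-driven inputs, and the helper name are hypothetical, and the placeholder substitution could equally be performed inside a SQL stored procedure as described above:

```python
from string import Template

# Generic DAG template with placeholders for the nodes (tasks) and edges (dependencies)
DAG_TEMPLATE = Template('''\
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(dag_id="$dag_id", start_date=datetime(2023, 1, 1),
         schedule_interval=None, catchup=False) as dag:
$task_defs
$dependencies
''')


def render_dag(dag_id, tasks, edges):
    """tasks: {task_id: bash_command}; edges: [(upstream, downstream), ...]"""
    task_defs = "\n".join(
        f'    {name} = BashOperator(task_id="{name}", bash_command="{cmd}")'
        for name, cmd in tasks.items()
    )
    dependencies = "\n".join(f"    {up} >> {down}" for up, down in edges)
    return DAG_TEMPLATE.substitute(
        dag_id=dag_id, task_defs=task_defs, dependencies=dependencies
    )


# In practice the task and dependency rows would come from the SQL data model
print(render_dag(
    "generated_dag",
    {"t1": "echo T1", "t2": "echo T2", "t3": "echo T3"},
    [("t1", "t2"), ("t1", "t3")],
))
```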
3. Launch the Airflow UI
From the running Airflow environment, one can launch the Airflow UI to monitor end-to-end data pipelines, as shown below.
Airflow provides single sign-on using Azure AD authentication. With Basic authentication, one needs to sign in using the username and password, as shown below.
Once you enter the Airflow UI, you will see the DAG files listed with their run status, as shown below.
By clicking on a DAG name, one can see the task dependencies; for instance, the dependencies in DAG files 2 and 3 are shown below.
DAG file 2:
DAG file 3:
4. Define Metrics and alerts
By selecting the metrics of interest and plotting them, one can create monitoring dashboards and alerts.
One should pay attention to the cost incurred by these services. Once you are finished with processing, it is advisable to delete the Airflow environment.
Tip:
The pricing of Managed Airflow in ADF can be found at the link below.
https://learn.microsoft.com/en-us/azure/data-factory/airflow-pricing
Conclusion
Managed Airflow in ADF opens the door to many custom integration possibilities from outside Azure for orchestrating data pipelines across multiple data streams. It gives full access to Airflow's orchestration features, going beyond what can be achieved with ADF alone.
ADF users simply need to create workflows as DAGs (Directed Acyclic Graphs) and upload them into the Airflow environment. These DAG files are picked up by Airflow's powerful user interface, which makes it easy to visualize pipelines and their dependencies, track progress, and resolve issues. In addition to managed resources, Airflow natively integrates with Azure Active Directory for single sign-on, providing a more secure solution. Finally, I leave you with this wonderful quote …
?"Those who have a 'why' to live, can bear with almost any 'how'. "?–Viktor Frankl