Orchestrate workflows using Managed Airflow in Azure Data Factory

Microsoft Azure has reached a new milestone this year by introducing Apache Airflow into the Azure Data Factory (ADF) environment. This brings the best of both worlds into one place, and the integration is a natural fit for orchestrating data pipelines across multiple data streams. Just by creating a DAG (Directed Acyclic Graph) file, which is nothing more than a collection of tasks and their dependencies, one can start authoring, scheduling, and monitoring workflows.

One might wonder why Airflow is needed at all when one is already adept at scheduling and triggering tasks within the workspace and organizing data flows within ADF. Airflow comes as a bonus to ADF users when it comes to handling complex data pipelines across multiple data sources. It offers numerous benefits, as detailed below.

1. Airflow enables fine-grained task management, with greater transparency into task dependencies. When managing several workflows, it gives control over daily batch processing. Airflow also allows integrations outside Azure services, which were otherwise hard to implement.

2. Airflow comes with a rich UI that helps in authoring and inspecting workflows visually.

3. It integrates very well with Azure Monitor, where one can view task delays or workflow errors using Azure's Airflow metrics.

4. One does not need to worry about managing the underlying infrastructure with respect to updates, scalability, availability, and security, as it is a managed service from Microsoft Azure. Most periodic updates, such as patches or version upgrades, happen automatically in the environment.

5. Furthermore, Azure's Managed Airflow supports open-source integrations with hundreds of operators and sensors. These packages (providers) are maintained by the community and are readily available for use. Below are some examples of Airflow providers:

- Amazon provider
- Snowflake provider
- Google provider
- Azure provider
- Databricks provider
- Fivetran provider

It also integrates well with Azure services such as Azure Data Factory pipelines, Azure Key Vault, and Azure Batch. As we have seen, there is plenty of reason to get acquainted with Managed Airflow in Azure Data Factory.


The big picture of Azure's open data integration, and where Airflow fits into the ADF data integration landscape, is shown below (image from Microsoft).

[Image: Azure open data integration landscape. Courtesy: Microsoft Azure]
Tip: 
For those who are new to Airflow, here is a short note on its origin. 
Apache Airflow is an open-source tool originally developed at Airbnb and 
now maintained by the Apache Software Foundation. It is used to author, 
schedule, and monitor data workflows. This is done using DAG (Directed 
Acyclic Graph) files, which are collections of tasks organized according 
to their relationships and dependencies.

A DAG file is a Python script that defines the tasks and their 
dependencies (the DAG structure) as code. If you are not proficient in 
Python, writing these by hand can be tedious. Nevertheless, in a later 
section of this article, I explain how to automate and standardize 
Python-based DAGs by storing "generic" DAG templates in a SQL database 
and substituting actual values into the placeholders.        

Now that we know more about Airflow, let's get started with using it in the Azure portal.

Setup

Once your Azure environment is set up by signing up for an Azure subscription, one can create a Data Factory workspace from the available Azure services. In the Manage section within the workspace, the new integration runtime "Airflow" is available in preview, as shown below.

[Image: Airflow (preview) under the Manage section of the Data Factory workspace]

Managed Airflow

In the steps below, we set up the Airflow environment, create DAG files, launch the Airflow UI, and finally monitor the environment using Azure Monitor.

1. Set up Airflow

From the Airflow (preview) section, one could initiate the new Airflow environment setup, as shown below.

[Image: Creating a new Airflow environment]

Once you have named your integration runtime, pay attention to the authentication type. The default authentication type is Azure AD, which does not require a username and password to log in to the managed Airflow UI. With Basic authentication, on the other hand, credentials are required, so store them safely; they will be needed later to log in to the Airflow UI. In this setup, we use Basic authentication.

To perform various integration operations, one must choose the appropriate provider(s). In this case, we use the Azure provider to access ADF pipelines. One can add one or more providers to an environment, as shown below.

[Image: Adding provider requirements to the Airflow environment]

More information about the Microsoft Azure provider can be found at the link below.

apache-airflow-providers-microsoft-azure


Clicking the Create button initiates the creation of the Airflow integration runtime. It takes a couple of minutes for the setup to complete. When it is ready, the status changes to "Running", as shown below.

[Image: Airflow integration runtime in "Running" status]

From this point onwards, we are ready to start using the Airflow integration to orchestrate data pipelines within Azure Data Factory.

Tip: 
Before we start creating DAG files, we should thoroughly understand some of 
the principles of Airflow, including its concepts, objects, and applications.

Airflow fundamentals:
https://airflow.apache.org/docs/apache-airflow/stable/tutorial/fundamentals.html        

2. Create DAG files

Now that we have Managed Airflow running in the ADF workspace, let's put it to work by creating DAG files.

A directed acyclic graph (DAG) file is a Python script that consists of one or more tasks with dependencies. In Airflow, a task is an instance of an operator such as the Bash operator, Python operator, Email operator, HTTP operator, MS SQL operator, and so on. Within the graph, each task becomes a node that executes a unit of work, and each edge represents a dependency between tasks. One can see the relationships between these tasks visually in the Airflow UI. We will create a couple of DAG files using basic operators.

DAG File 1: Create HelloWorld

In this example, we use a PythonOperator to execute a Python callable, as shown below.

[Image: DAG File 1]
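
For reference, here is a minimal sketch of what such a "HelloWorld" DAG could look like. It is not the exact code from the screenshot; the DAG id, task id, and function name are illustrative assumptions.

```python
# Minimal "HelloWorld" DAG sketch using a PythonOperator (names are illustrative).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def say_hello():
    # The unit of work executed by the task.
    print("Hello World from Managed Airflow!")


with DAG(
    dag_id="hello_world",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,  # trigger manually from the Airflow UI
    catchup=False,
) as dag:
    hello_task = PythonOperator(
        task_id="say_hello",
        python_callable=say_hello,
    )
```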

DAG File 2: Create Task Dependency

Similarly, by introducing more tasks, we can define task dependencies, as shown below.

[Image: DAG File 2]

Both T2 and T3 depend on T1. Only after successful execution of task T1 does control pass on to them.
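
A hedged sketch of such a dependency structure is shown below, using BashOperator tasks as stand-ins; the task ids and commands are assumptions, not the code from the image.

```python
# Sketch of a DAG in which T2 and T3 both depend on T1 (task ids/commands are illustrative).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="task_dependency_demo",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    t1 = BashOperator(task_id="T1", bash_command="echo 'extract'")
    t2 = BashOperator(task_id="T2", bash_command="echo 'transform'")
    t3 = BashOperator(task_id="T3", bash_command="echo 'load'")

    # T2 and T3 run only after T1 has completed successfully.
    t1 >> [t2, t3]
```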

DAG File 3: Running an existing ADF pipeline

Before running an ADF pipeline from Airflow, one needs to configure a connection in the Airflow UI (under Admin -> Connections) and choose the appropriate connection type, which in this case is 'Azure Data Factory'.


One needs to acquire the following keys and attributes to connect to ADF from Airflow:

client_id, client_secret, tenant_id, subscription_id, resource_group_name,
data_factory_name, and pipeline_name.


One can obtain these keys and data pipeline details by registering an application in Azure AD and granting it access to the Data Factory, as shown below.

[Images: App registration and pipeline details]

Below is the DAG segment that runs an existing pipeline via Airflow using the above keys and attributes.

[Image: DAG segment that triggers the existing ADF pipeline]

One could create dependencies between tasks, as shown below.

[Image: DAG File 3]
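
As a hedged sketch (not the author's exact code), a DAG that triggers an existing ADF pipeline with the Azure provider could look roughly like the following. The connection id, pipeline name, resource group, and factory name are placeholders that must match the connection created under Admin -> Connections, and the exact operator arguments may vary with the provider version.

```python
# Sketch: trigger an existing ADF pipeline via the Microsoft Azure provider.
# conn_id, pipeline/factory/resource-group names are placeholders (assumptions).
from datetime import datetime

from airflow import DAG
from airflow.providers.microsoft.azure.operators.data_factory import (
    AzureDataFactoryRunPipelineOperator,
)

with DAG(
    dag_id="run_adf_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    run_pipeline = AzureDataFactoryRunPipelineOperator(
        task_id="run_adf_pipeline",
        azure_data_factory_conn_id="azure_data_factory_default",  # connection created in the UI
        pipeline_name="<pipeline_name>",
        resource_group_name="<resource_group_name>",
        factory_name="<data_factory_name>",
    )
    # Additional tasks can be chained before or after with >> to build dependencies.
```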
Tip: 
For more details about running an existing data pipeline, please refer to 
the link below as well.

https://learn.microsoft.com/en-us/azure/data-factory/tutorial-run-existing-pipeline-with-airflow        

Once the DAG files are ready, they should be uploaded into a 'Dags' folder in the Azure storage account, as shown below.

[Images: Uploading DAG files into the storage account's Dags folder]
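
For reference, here is a hedged sketch of uploading a DAG file with the Azure Blob Storage Python SDK; the connection string, container name, and file names are placeholders, and the Azure portal or Storage Explorer works just as well.

```python
# Sketch: upload a DAG file into the 'Dags' folder of a Blob container (names are placeholders).
from azure.storage.blob import BlobServiceClient

blob_service = BlobServiceClient.from_connection_string("<storage_connection_string>")
container = blob_service.get_container_client("<container_name>")

with open("hello_world.py", "rb") as dag_file:
    # Store the file under the folder that the Airflow environment imports from.
    container.upload_blob(name="Dags/hello_world.py", data=dag_file, overwrite=True)
```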

Finally, these DAG files are put to work by importing them into the running Airflow environment (using the Import files option on the Airflow environment page). One may choose to include multiple tasks with their dependencies in a single DAG file, or create one DAG file per task.

[Image: Import files option on the Airflow environment page]
Tip: 
One could think of a "data model" to hold the tasks (nodes) and their 
dependencies (edges) and automate the creation of DAG files from a 
SQL database.

Using pre-defined templates per project, one can customize the DAG file 
structure. Using SQL stored procedures, the nodes and edges in the 
template are substituted with actual values, finalizing the DAG file.        
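
As a rough illustration of that templating idea (the template text, task structure, and example values below are my own assumptions, not the author's schema), a list of nodes and edges retrieved from SQL tables could be rendered into a DAG file like this:

```python
# Sketch: render a DAG file from a generic template by substituting nodes (tasks)
# and edges (dependencies). The template and the example values are illustrative.
DAG_TEMPLATE = """\
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(dag_id="{dag_id}", start_date=datetime(2023, 1, 1), schedule_interval=None, catchup=False) as dag:
{task_block}
{dependency_block}
"""


def render_dag(dag_id, tasks, edges):
    """tasks: list of (task_id, bash_command); edges: list of (upstream, downstream)."""
    task_block = "\n".join(
        f'    {task_id} = BashOperator(task_id="{task_id}", bash_command="{command}")'
        for task_id, command in tasks
    )
    dependency_block = "\n".join(f"    {up} >> {down}" for up, down in edges)
    return DAG_TEMPLATE.format(
        dag_id=dag_id, task_block=task_block, dependency_block=dependency_block
    )


# Example: nodes and edges that could come from SQL tables via a stored procedure.
print(render_dag("generated_dag", [("T1", "echo 1"), ("T2", "echo 2")], [("T1", "T2")]))
```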

3. Launch the Airflow UI

From the running Airflow environment, one can launch the Airflow UI to monitor end-to-end data pipelines, as shown below.

[Image: Launching the Airflow UI from the environment]

Airflow provides single sign-on using Azure AD authentication. With Basic authentication, one needs to log in with the username and password, as shown below.

[Image: Airflow UI login page]

Once in the Airflow UI, one sees the DAGs listed with their run status, as shown below.

[Image: DAG list with run status in the Airflow UI]

By clicking on a DAG name, one can see the task dependencies; for instance, the dependencies of DAG files 2 and 3 are shown below.

DAG file 2:

[Image: Task dependencies of DAG file 2]

DAG file 3:

[Image: Task dependencies of DAG file 3]

4. Define Metrics and alerts

By selecting the metrics of interest and plotting them, one can create a monitoring dashboard and alerts.

[Image: Airflow metrics dashboard in Azure Monitor]

One should pay attention to the cost incurred by these services. Once you have finished processing, it is advisable to delete the Airflow environment.

[Image: Deleting the Airflow environment]
Tip: 
The pricing of Managed Airflow in ADF can be found at this link.
https://learn.microsoft.com/en-us/azure/data-factory/airflow-pricing        

Conclusion

Managed Airflow in ADF opens the door to many custom integration possibilities outside Azure for orchestrating data pipelines across multiple data streams. It gives full access to Airflow's orchestration features, going beyond what can be achieved with ADF alone.

ADF users simply need to create workflows as DAGs (Directed Acyclic Graphs) and upload them into the Airflow environment. These DAG files are picked up by Airflow's powerful user interface, which makes it easy to visualize pipelines and their dependencies, track progress, and resolve issues. In addition to managing the underlying resources, Managed Airflow integrates natively with Azure Active Directory for single sign-on, providing a more secure solution. Finally, I leave you with this wonderful quote:

?"Those who have a 'why' to live, can bear with almost any 'how'. "?–Viktor Frankl
