What is Apache Airflow?
Apache Airflow is a platform for programmatically creating, scheduling, and monitoring workflows. It is fully open source and well suited to architecting and orchestrating complex data pipelines and launching tasks.
It has several advantages. First of all, it is a dynamic platform, since anything that can be done with Python code can be done on Airflow.
It is also extensible, thanks to many plugins allowing interaction with most common external systems. It is also possible to create new plugins to meet specific needs.
In addition, Airflow provides elasticity: data engineering teams can use it to run thousands of different tasks every day.
Workflows are architected and expressed as Directed Acyclic Graphs (DAGs), where each node represents a specific task. Airflow is designed as a “code-first” platform, which lets teams iterate on workflows very quickly. This philosophy offers a high degree of scalability compared to other pipeline tools.
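As a quick illustration, here is a minimal sketch of what a DAG definition file can look like (assuming Airflow 2.4 or later; the DAG id, schedule, and task names are purely illustrative):

```python
# Minimal, illustrative DAG definition (names and schedule are hypothetical).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_sales_report",      # hypothetical pipeline name
    start_date=datetime(2023, 1, 1),
    schedule="@daily",                # run once per day
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo 'extract data'")
    report = BashOperator(task_id="report", bash_command="echo 'send report'")

    extract >> report  # directed edge: extract must finish before report runs
```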
What is Airflow used for?
Airflow can be used for any batch data pipeline, so its use cases are as numerous as they are diverse. Due to its scalability, this platform particularly excels at orchestrating tasks with complex dependencies on multiple external systems.
By writing pipelines in code and using the various plugins available, it is possible to integrate Airflow with any dependent system and manage orchestration and monitoring from a single platform.
As an example, Airflow can be used to aggregate daily sales team updates from Salesforce to send a daily report to company executives.
In addition, the platform can be used to organize and launch Machine Learning tasks running on external Spark clusters. It can also load website or application data to a data warehouse once an hour.
What are the different components of Airflow?
The Airflow architecture is based on several components. Here are the main ones.
The DAGs
In Airflow, pipelines are represented as DAGs (Directed Acyclic Graphs) defined in Python.
A graph is a structure composed of objects (nodes) in which certain pairs of objects are connected by edges. The graphs are “directed” because each edge is oriented and therefore represents a one-way link.
They are “acyclic” because the graph contains no cycle: a node B downstream of a node A can never also be upstream of A. This guarantees that pipelines cannot enter infinite loops.
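The sketch below shows how these one-way edges are declared in code (assuming Airflow 2.4 or later; the DAG and task ids are illustrative):

```python
# Directed edges between tasks; a reverse edge would create a cycle,
# and Airflow would refuse to load the DAG.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(dag_id="edge_example", start_date=datetime(2023, 1, 1), schedule=None) as dag:
    a = EmptyOperator(task_id="a")
    b = EmptyOperator(task_id="b")
    c = EmptyOperator(task_id="c")

    a >> b >> c  # one-way links: a runs before b, which runs before c
    # c >> a would close the loop a -> b -> c -> a and is therefore invalid
```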
Tasks
Each node in a DAG represents a task, and the DAG as a whole represents the sequence of tasks that makes up a pipeline. The work performed by each task is defined by an operator.
The operators
Operators are the building blocks of the Airflow platform. Each operator defines an individual task (a node of a DAG) and determines how that task will be executed.
The DAG ensures that the operators are scheduled and executed in a specific order, while the operators define the jobs to be executed at each step of the process.
There are three main categories of operators. First, action operators perform a function; examples include the PythonOperator and the BashOperator.
Transfer operators move data from a source to a destination, like the S3ToRedshiftOperator.
Finally, Sensors wait for a condition to be met. For example, the FileSensor can be used to wait for a file to appear in a given folder before the rest of the pipeline continues.
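The sketch below puts a sensor and two action operators side by side in one DAG (assuming Airflow 2.4 or later; the task ids, file path, and callable are hypothetical, and transfer operators such as S3ToRedshiftOperator ship in separate provider packages):

```python
# Illustrative combination of a sensor and two action operators.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from airflow.sensors.filesystem import FileSensor

with DAG(dag_id="operator_examples", start_date=datetime(2023, 1, 1), schedule=None) as dag:
    wait_for_file = FileSensor(          # sensor: waits until the file exists
        task_id="wait_for_file",
        filepath="/data/incoming/sales.csv",
        poke_interval=60,
    )

    clean = PythonOperator(              # action: runs a Python callable
        task_id="clean",
        python_callable=lambda: print("cleaning data"),
    )

    notify = BashOperator(               # action: runs a shell command
        task_id="notify",
        bash_command="echo 'pipeline finished'",
    )

    wait_for_file >> clean >> notify
```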
Each operator is defined individually. However, operators can communicate information to each other using XComs.
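With the TaskFlow API, passing a value from one task to another through XComs can be sketched as follows (assuming Airflow 2.4 or later; the function names and the returned value are illustrative):

```python
# One task returns a value (pushed to XCom); the next task receives it (pulled from XCom).
from datetime import datetime

from airflow.decorators import dag, task

@dag(start_date=datetime(2023, 1, 1), schedule=None, catchup=False)
def xcom_example():
    @task
    def extract():
        return {"rows": 42}        # the return value is pushed to XCom

    @task
    def load(payload):
        print(payload["rows"])     # the value is pulled from XCom automatically

    load(extract())

xcom_example()
```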
Hooks
In Airflow, Hooks provide the interface to third-party systems. They handle connections to external APIs and databases such as Hive, S3, GCS, MySQL, and Postgres.
Confidential information, such as login credentials, is kept out of the Hooks themselves. It is stored, encrypted, in the metadata database associated with the current Airflow instance.
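A sketch of this separation, assuming the Postgres provider package is installed (the connection id "my_postgres" and the query are hypothetical):

```python
# The hook is configured with a connection id only; the actual credentials
# stay in Airflow's encrypted metadata database.
from airflow.providers.postgres.hooks.postgres import PostgresHook

def fetch_daily_totals():
    hook = PostgresHook(postgres_conn_id="my_postgres")
    return hook.get_records("SELECT region, SUM(amount) FROM sales GROUP BY region")
```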
Plugins
Airflow plugins can be described as a combination of Hooks and Operators. They are used to accomplish specific tasks involving an external application.
An example would be transferring data from Salesforce to Redshift. There is an extensive open-source collection of plugins created by the user community, and each user can create plugins to meet their specific needs.
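One common way to implement this pattern is a custom operator that delegates to hooks internally; the sketch below is purely illustrative, and every name in it (the operator class, connection ids, parameters) is hypothetical:

```python
# Skeleton of a custom operator that would move data from Salesforce to Redshift.
from airflow.models import BaseOperator

class SalesforceToRedshiftOperator(BaseOperator):
    def __init__(self, salesforce_conn_id, redshift_conn_id, query, target_table, **kwargs):
        super().__init__(**kwargs)
        self.salesforce_conn_id = salesforce_conn_id
        self.redshift_conn_id = redshift_conn_id
        self.query = query
        self.target_table = target_table

    def execute(self, context):
        # A real implementation would use a Salesforce hook to run the query
        # and a Redshift/Postgres hook to load the results into target_table.
        self.log.info("Copying results of %s into %s", self.query, self.target_table)
```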
Connections
Connections allow Airflow to store the information it needs to reach external systems, such as API credentials or tokens.
They are managed directly from the platform’s user interface. The data is encrypted and stored as metadata in a Postgres or MySQL database.
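In pipeline code, a stored connection is referenced by its id rather than by its credentials; a minimal sketch (the connection id "my_api" is hypothetical):

```python
# Look up a connection stored in the metadata database by its id.
from airflow.hooks.base import BaseHook

conn = BaseHook.get_connection("my_api")
print(conn.host, conn.login)   # the secrets themselves are stored encrypted
```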