Airflow: ETL Workflow Management Platform

Airflow is becoming very popular for ETL workflow management. (It can be used for other kinds of workflow management too, but our focus here is ETL.) It is a platform to programmatically author, schedule, and monitor workflows. You write a workflow as a DAG of tasks. When you develop ETL pipelines, you have many tasks for which you want to define dependencies and schedules. Many traditional tools, such as Autosys, Cronacle, and Control-M, do the same job. So why is there Airflow? In this article, let's answer this question and take a deep dive into its key advantages and disadvantages.

First, let's see how you use Airflow for pipeline orchestration. You develop ETL code (it may be SQL, Python, bash, or anything else you need) and group a bunch of code together to achieve a specific piece of functionality. That unit is called a task. You group logically interdependent tasks together into a workflow. In Airflow, such workflows are called DAGs, that is, directed acyclic graphs. There are no loops or cycles, so a workflow has a definite execution path. Next, you schedule those DAGs and monitor them.
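To make the "no cycles, definite execution path" idea concrete, here is a minimal sketch in plain Python. It does not use Airflow itself; it relies only on the standard library's graphlib module (Python 3.9+), and the task names are hypothetical. It shows that an acyclic set of dependencies always yields a valid run order, while a loop is rejected.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical ETL workflow: each task maps to the set of tasks it depends on.
workflow = {
    "extract_orders": set(),
    "extract_customers": set(),
    "transform_sales": {"extract_orders", "extract_customers"},
    "load_warehouse": {"transform_sales"},
}


def execution_order(dependencies):
    """Return one valid run order in which every task follows its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


def has_cycle(dependencies):
    """Return True if the dependencies contain a loop, so the graph is not a DAG."""
    try:
        execution_order(dependencies)
    except CycleError:
        return True
    return False


order = execution_order(workflow)
print(order)
```

The two extract tasks have no dependency on each other, so either may come first; the transform and load tasks always come after them, in that order.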

Now let's see what advantages you get from using Airflow:

Open Source - Airflow is open source software. You don't have to pay anything for a license. You can download it for free and start using it. This is a huge advantage because it saves significant license/software costs.

Pure Python - You write Airflow workflow/DAG code in Python. You are writing your workflow in a programming language instead of using some custom UI or SQL-like script. This gives you a lot of flexibility in handling schedules and dependencies. If you like programming, you will definitely prefer Airflow over other workflow management tools.

Dynamic - Many times you need to create dynamic pipelines based on incoming data or business needs. Here Airflow comes in handy! You can generate ETL pipelines at run time instead of defining them statically. This is difficult to handle with traditional scheduling tools. Programming gives you immense power to control flow, change behavior at run time, parametrize, and handle exceptions and errors.
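The idea of generating a pipeline from data rather than hard-coding it can be sketched as follows. This is plain Python, not the Airflow API: the function build_pipeline and the task names are hypothetical, and the dictionary stands in for what would be operators and dependencies in a real DAG file. The point is only that an ordinary loop decides how many tasks exist.

```python
def build_pipeline(sources):
    """Build a task-dependency mapping for whatever sources are passed in.

    Each source gets its own extract and load task; a single final
    report task waits for every load to finish.
    """
    dependencies = {}
    for source in sources:
        extract, load = f"extract_{source}", f"load_{source}"
        dependencies[extract] = set()       # extract tasks have no upstream
        dependencies[load] = {extract}      # each load waits for its extract
    dependencies["build_report"] = {f"load_{s}" for s in sources}
    return dependencies


# The list could just as well come from a config file or a database query.
pipeline = build_pipeline(["orders", "customers"])
for task, upstream in pipeline.items():
    print(task, "<-", sorted(upstream))
```

Adding a third source to the list adds two more tasks and one more upstream dependency for the report, with no other change to the code.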

Easy Integration of Heterogeneous Tasks - Airflow has various types of operators, such as Python, SQL, BigQuery, Docker, MySQL, Hive, and many more. This gives you the ability to run different kinds of programs/code in a single DAG. With other scheduling tools (such as Cronacle, Autosys) you may need to create different steps to handle this, but in Airflow different types of code/functionality can be written as separate tasks and called in a single DAG. This makes the workflow easy to code, maintain, and support. Integration of diverse tasks is really cool.

User Interface - Airflow provides a good graphical interface where you can monitor and administer DAGs. In terms of functionality, it looks similar to the UIs of other scheduling tools (like Cronacle, Autosys). Here you can execute and monitor DAGs, look at past runs, and get schedules and log files. You can also view your DAG's Python code here.

Airflow has some disadvantages too. You may not be able to see dependencies across DAGs visually (the way you see step/job dependencies and flow in Autosys or Cronacle); the Graph view in the UI shows task dependencies within a single DAG. If you want to check cross-DAG dependencies, you may have to open up the DAG code and look for them. If you have used Autosys or Cronacle, you may be disappointed here!

Because a DAG is coded in Python, there are many parameters that need to be set appropriately. Some key parameters are catchup, the start date, and the number of retries. Set them carefully for your needs; otherwise the DAG may execute past runs or not behave as you expect. In a test environment, if you have DAGs with cross dependencies, you may not be able to test the dependency manually. You have to trigger them automatically as per schedule (by manipulating the schedules) and test that way.
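Why catchup and the start date deserve care can be illustrated with a simplified model. The sketch below is plain Python, not Airflow code, and the function runs_to_schedule is hypothetical. It assumes a fixed schedule interval and that a run becomes due only once its interval has fully elapsed; the real scheduler's behavior depends on the Airflow version and configuration.

```python
from datetime import datetime, timedelta


def runs_to_schedule(start_date, now, interval, catchup):
    """Simplified model: list the logical dates of runs that would be created.

    A run for an interval is only due once that interval has fully elapsed.
    With catchup enabled, every missed interval since start_date is run;
    with catchup disabled, only the most recent completed interval is run.
    """
    due = []
    logical_date = start_date
    while logical_date + interval <= now:
        due.append(logical_date)
        logical_date += interval
    if catchup:
        return due
    return due[-1:]


start = datetime(2024, 1, 1)
now = datetime(2024, 1, 5, 6, 0)
daily = timedelta(days=1)

with_catchup = runs_to_schedule(start, now, daily, catchup=True)
without_catchup = runs_to_schedule(start, now, daily, catchup=False)
print(len(with_catchup), len(without_catchup))
```

In this model, a start date a year in the past with catchup enabled would queue several hundred daily runs as soon as the DAG is switched on, which is the kind of surprise the paragraph above warns about.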

Overall, Airflow is a good tool for workflow management. The ability to code in Python is the best feature Airflow has!

Please feel free to write about your experience with Airflow in the comment section. Happy coding and learning!

Author: Gopal Kumar Roy