Which Airflow syntax do you prefer?

When connecting operators in an Airflow DAG, we have two broad options - use the chevron notation (the >> and << operators) or use functions. With functions, we again have two options - set_upstream and set_downstream.

Initially, when I wrote a DAG generator, I used the chevron notation. But when I had to write a more complex DAG generator, I went for the function approach.

Consider the following two examples.

Using chevron

# Note: list >> list is not valid - Python lists do not support the >> operator,
# so the list-to-list links need cross_downstream.
from airflow.models.baseoperator import cross_downstream

t_t01 >> t_t02 >> t_t03 >> t_t04 >> [t_t05, t_t06, t_t07, t_t08, t_t09, t_t10, t_t11]
cross_downstream([t_t05, t_t06, t_t07, t_t08, t_t09, t_t10, t_t11], [t_t12, t_t13])
cross_downstream([t_t12, t_t13], [t_t14, t_t15, t_t16])
[t_t14, t_t15, t_t16] >> t_end

Using functions

t_t02.set_upstream(t_t01)

t_t03.set_upstream(t_t02)

t_t04.set_upstream(t_t03)

t_t05.set_upstream(t_t04)
t_t06.set_upstream(t_t04)
t_t07.set_upstream(t_t04)
t_t08.set_upstream(t_t04)
t_t09.set_upstream(t_t04)
t_t10.set_upstream(t_t04)
t_t11.set_upstream(t_t04)

t_t12.set_upstream(t_t05)
t_t12.set_upstream(t_t06)
t_t12.set_upstream(t_t07)
t_t12.set_upstream(t_t08)
t_t12.set_upstream(t_t09)
t_t12.set_upstream(t_t10)
t_t12.set_upstream(t_t11)
t_t13.set_upstream(t_t05)
t_t13.set_upstream(t_t06)
t_t13.set_upstream(t_t07)
t_t13.set_upstream(t_t08)
t_t13.set_upstream(t_t09)
t_t13.set_upstream(t_t10)
t_t13.set_upstream(t_t11)

t_t14.set_upstream(t_t12)
t_t14.set_upstream(t_t13)
t_t15.set_upstream(t_t12)
t_t15.set_upstream(t_t13)
t_t16.set_upstream(t_t12)
t_t16.set_upstream(t_t13)

t_end.set_upstream(t_t14)
t_end.set_upstream(t_t15)
t_end.set_upstream(t_t16)        

The functions method is verbose, but I prefer it.
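Some of that verbosity can be trimmed: as far as I know, set_upstream also accepts a list of tasks, which collapses each fan-in block into a single call. Here is a minimal sketch using a stand-in Task class (so the wiring can be checked without an Airflow installation; real code would use operators):

```python
class Task:
    """Stand-in for an Airflow operator - just enough to model set_upstream."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream_ids = set()

    def set_upstream(self, other):
        # Like Airflow's BaseOperator.set_upstream, accept a task or a list of tasks.
        tasks = other if isinstance(other, list) else [other]
        for t in tasks:
            self.upstream_ids.add(t.task_id)


t_t05, t_t06, t_t12 = Task("t_t05"), Task("t_t06"), Task("t_t12")

# One call instead of one line per upstream task:
t_t12.set_upstream([t_t05, t_t06])
print(sorted(t_t12.upstream_ids))  # ['t_t05', 't_t06']
```

With the list form, each of the fan-in sections above shrinks to a single set_upstream call per downstream task.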

One of the reasons I prefer the function approach is that I was able to write an application that reads a DAG file and depicts the relationships using a Sankey chart or pyvis. The set_upstream calls provided clear markers for this parsing.
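As a sketch of that idea (a simplified version of what such an application might do - it only handles the single-task form of set_upstream, and the function and sample names are illustrative), the calls can be pulled out of a DAG file with a regular expression and turned into an edge list ready for a Sankey chart or pyvis:

```python
import re

# Matches lines like: t_t12.set_upstream(t_t05)
EDGE_RE = re.compile(r"(\w+)\.set_upstream\((\w+)\)")


def extract_edges(dag_source):
    """Return (upstream, downstream) pairs found in the DAG source text."""
    return [(up, down) for down, up in EDGE_RE.findall(dag_source)]


sample = """
t_t02.set_upstream(t_t01)
t_t03.set_upstream(t_t02)
"""
print(extract_edges(sample))  # [('t_t01', 't_t02'), ('t_t02', 't_t03')]
```

Each pair then becomes one flow in the Sankey chart, or one edge passed to pyvis.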

I know you will say 'what a stupid idea'. Airflow renders the DAG on its canvas. Correct. Now imagine you have to share the output with someone. With Airflow, we have to keep grabbing screenshots . . .

#airflow #sankeychart

More articles by Bipin Patwardhan