Which Data Pipeline Orchestration Tool Is Right For You? (ML4Devs Newsletter, Issue 16)
Satish Chandra Gupta
Data/ML Practitioner · Advisor/Consultant for data & AI strategy, data infra, machine learning, and LLMs/GPT · Ex-Amazon, Microsoft Research
You care about the data. Actually, you really care about the insights from that data, or the ML models you train with that data. And that forces you to care about the data.
You may also care about how you clean, curate, and transform the raw data into usable data, but only as a means to that end. The data pipeline itself, therefore, has at most nuisance value for you: an unavoidable chore.
Considering this pyramid of what you really care about and what you are forced to care about, where do you place data pipeline orchestration tools? How much of your mindshare does that choice occupy? My guess: very little, almost nothing, perhaps even total indifference.
Yet, it is one of the most critical decisions in building your data and ML infra. Let me lay out the choices so you can make informed decisions.
But first, let’s quickly recap data, ML, and MLOps pipelines:
Data pipelines ingest raw data and clean, curate, and transform it into usable data (ETL/ELT). ML pipelines take that data through feature engineering, model training, and evaluation. MLOps pipelines automate building, testing, deploying, and monitoring ML applications.
The boundaries between these three are fluid and overlapping. In the future, these may converge and become one.
Let’s examine the choices open source and various cloud ecosystems offer.
Open Source Data Pipeline Orchestration Tools
The Apache ecosystem was, and continues to be, an important part of the data stack. No wonder two of the three tools listed here are Apache projects.
Apache Oozie
Apache Oozie has been around for quite a while for executing workflow DAGs. It integrates nicely with the Hadoop ecosystem. If your organization is already using it, moving away will be a big endeavor. It still has some life left and can carry you some distance.
But do not set up new projects or new infra with it. For that, look at the next tool below.
Apache Airflow
At the moment, nobody gets fired for choosing Apache Airflow. It is the default choice, and a pretty good one too. No wonder both AWS and Google Cloud offer managed Airflow services.
You can’t go wrong by choosing Apache Airflow, but do take a look at, and keep an eye on, the next tool on the list.
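To make this concrete, here is a minimal sketch of an Airflow DAG: three placeholder Python tasks chained into a daily extract-transform-load run. The task names and bodies are illustrative, not from any real pipeline.

```python
# A minimal Airflow 2.x DAG: extract -> transform -> load, run daily.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("pull raw data from the source")


def transform():
    print("clean and reshape the raw data")


def load():
    print("write curated data to the warehouse")


with DAG(
    dag_id="daily_etl",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",  # Airflow 2.4+ also accepts `schedule`
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # The >> operator declares the DAG's dependency edges.
    t_extract >> t_transform >> t_load
```

Drop this file into Airflow's dags/ folder and the scheduler picks it up automatically.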
Flyte
As you expand pipeline orchestration from ETL to machine learning tasks, check whether Flyte suits your case better. It fills some of the gaps in Airflow with respect to ML jobs.
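For comparison, here is a minimal sketch of the same three-step pipeline in Flyte, where tasks are type-annotated Python functions; the task logic is a placeholder.

```python
# A minimal Flyte workflow: tasks are typed Python functions,
# and the workflow composes them into a DAG.
from flytekit import task, workflow


@task
def extract() -> str:
    return "raw data"


@task
def transform(raw: str) -> str:
    return raw.upper()  # stand-in for real cleaning logic


@task
def load(curated: str) -> None:
    print(f"loading: {curated}")


@workflow
def daily_etl() -> None:
    load(curated=transform(raw=extract()))


if __name__ == "__main__":
    daily_etl()  # Flyte workflows can also run locally for quick testing
```

The type annotations are not decoration: Flyte uses them to type-check the data flowing between tasks, which is one of the gaps it fills for ML jobs.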
Data Pipeline Orchestration Tools on AWS
Amazon is a customer-first company, and its offerings reflect that. AWS services are easy to start with, but the one you pick first may not remain suitable as your use case expands.
I suggest not blindly picking the easiest AWS tool, but pausing to think a little about your future needs. Data pipelines have an uncanny ability to grow quickly and become very complex, and the then-unavoidable migration will carry a good amount of outage risk.
AWS Step Functions
AWS Step Functions is an apt tool for automating business-process workflows, but it can be used for building data pipelines too. If your data pipeline is simple and consists of a few steps, this is probably the easiest to start with. Still, I am very reluctant to advise you to do so.
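For a flavor of what that looks like, here is a hedged sketch that defines a two-step state machine in Amazon States Language and drives it with boto3; the Lambda function and IAM role ARNs are placeholders you would replace with real resources.

```python
# Sketch: create and start a Step Functions state machine with boto3.
import json

import boto3

sfn = boto3.client("stepfunctions")

# Amazon States Language definition: two Lambda tasks in sequence.
definition = {
    "StartAt": "Extract",
    "States": {
        "Extract": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:extract",  # placeholder
            "Next": "Load",
        },
        "Load": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:load",  # placeholder
            "End": True,
        },
    },
}

machine = sfn.create_state_machine(
    name="simple-etl",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/sfn-etl-role",  # placeholder
)

sfn.start_execution(
    stateMachineArn=machine["stateMachineArn"],
    input=json.dumps({"date": "2022-01-01"}),
)
```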
AWS Data Pipeline
AWS Data Pipeline is a service to move data from AWS and on-premises data sources to AWS compute services, run transformations, and store the results in a data warehouse or a data lake. You can knit together an AWS Data Pipeline with S3, RDS, DynamoDB, and Redshift as data stores, and EC2 and EMR as compute services. It is easy to use, yet powerful and versatile. But read the next two options first.
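For a taste of the API, here is a hedged boto3 sketch that creates and activates a pipeline. The definition below is a minimal skeleton; a real pipeline would add activities (such as CopyActivity) and data nodes (such as S3DataNode).

```python
# Sketch: create and activate an AWS Data Pipeline with boto3.
import boto3

dp = boto3.client("datapipeline")

created = dp.create_pipeline(name="daily-copy", uniqueId="daily-copy-2022")
pipeline_id = created["pipelineId"]

# A minimal definition: just the Default object with an on-demand schedule.
dp.put_pipeline_definition(
    pipelineId=pipeline_id,
    pipelineObjects=[
        {
            "id": "Default",
            "name": "Default",
            "fields": [
                {"key": "scheduleType", "stringValue": "ondemand"},
                {"key": "failureAndRerunMode", "stringValue": "CASCADE"},
            ],
        }
    ],
)

dp.activate_pipeline(pipelineId=pipeline_id)
```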
AWS Glue Workflow
AWS Glue Workflow is another AWS tool for ETL workflows. It is unclear to me why Amazon has two tools with largely overlapping use cases. I am biased toward AWS Data Pipeline by default, and would use Glue Workflow only if the whole data infra is built on AWS Glue and Amazon Athena.
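If you do go the Glue route, a workflow is stitched together from jobs and triggers. A hedged boto3 sketch, assuming a Glue job named extract-job already exists:

```python
# Sketch: build and run an AWS Glue workflow with boto3.
import boto3

glue = boto3.client("glue")

glue.create_workflow(Name="nightly-etl", Description="Nightly ETL over raw data")

# An on-demand trigger that starts the first job of the workflow.
glue.create_trigger(
    Name="start-nightly-etl",
    WorkflowName="nightly-etl",
    Type="ON_DEMAND",
    Actions=[{"JobName": "extract-job"}],  # assumes this Glue job exists
)

run = glue.start_workflow_run(Name="nightly-etl")
print(run["RunId"])
```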
Amazon Managed Workflows for Apache Airflow (MWAA)
If you want to stick with Apache Airflow, MWAA may suit you the most. It is a secure, highly available, managed workflow orchestration service for Apache Airflow. Because the DAGs are plain Airflow, it keeps you relatively vendor-independent, and you don’t need to master a new tool.
Data Pipeline Orchestration Tools on Google Cloud
Google is a technology-first company, and it offers only two clearly differentiated choices. Both are built on open-source technology.
Cloud Data Fusion
Cloud Data Fusion is a fully managed GUI tool for defining ETL/ELT data pipelines. It is based on CDAP, an open-source framework for building data analytics applications. If your team is not tech-heavy and builds mainly analytics applications, Cloud Data Fusion will suffice.
Cloud Composer
Cloud Composer is a fully managed workflow orchestration service built on Apache Airflow. If your workflows are inching toward data science and span hybrid and multi-cloud environments, then Cloud Composer (Airflow under the hood) is the better choice.
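Deploying to Composer is just copying your Airflow DAG files into the environment's Cloud Storage bucket, which Composer watches. A minimal sketch, with a placeholder bucket name:

```python
# Sketch: deploy an Airflow DAG to Cloud Composer by uploading it
# to the environment's GCS bucket (Composer watches the dags/ folder).
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("us-central1-my-env-1234-bucket")  # placeholder: your env's bucket
blob = bucket.blob("dags/daily_etl.py")
blob.upload_from_filename("daily_etl.py")
```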
Data Pipeline Orchestration Tools on Azure
Microsoft is a sales-first company. It is just so good at selling that it has beaten Google's arguably better cloud stack, and given Amazon's customer obsession and first-mover advantage a run for their money, despite being a late entrant to cloud services.
Azure Data Factory
Azure Data Factory is a data integration and transformation service for constructing code-free ETL and ELT pipelines. It is often used to process data from diverse sources and deliver integrated data to Azure Synapse Analytics. If you are on Azure and doing mainly data analytics, using it is a no-brainer.
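Pipelines are authored code-free in the ADF UI, but you can still trigger them programmatically. A hedged sketch using the Azure SDK for Python, with placeholder resource names:

```python
# Sketch: trigger an existing Data Factory pipeline run from Python.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

subscription_id = "00000000-0000-0000-0000-000000000000"  # placeholder
client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

run = client.pipelines.create_run(
    resource_group_name="my-rg",         # placeholder
    factory_name="my-data-factory",      # placeholder
    pipeline_name="copy-to-synapse",     # placeholder
    parameters={},
)
print(run.run_id)  # use this id to poll the run's status
```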
Oozie on HDInsight
Oozie on HDInsight is the classical Hadoop stack on Azure, running Oozie, Spark, HBase, Storm, etc. You can lift and shift your Hadoop workloads and their orchestration to Azure using this option.
DIY Apache Airflow
You can deploy Apache Airflow on Azure too, but there is no managed service.
Other Choices
There are some interesting combinations that simplify data pipelines for common use cases. These alternatives may not cover the most complex and general pipelines, but might nonetheless be the best fit for your use case. One such pattern is to transform the data while, or right after, ingesting it into a data warehouse or lakehouse. It consists of two parts: a tool that extracts data from the sources and loads it into the warehouse or lakehouse, and transformations that run as SQL inside the warehouse itself (the ELT pattern), as sketched below.
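Here is a toy sketch of that ELT pattern, with SQLite standing in for the warehouse; in practice, an ingestion tool does the load step, and the transformation is SQL run inside the warehouse.

```python
# Toy ELT sketch: land raw rows first, then transform with SQL
# inside the "warehouse" (SQLite standing in for a real warehouse).
import sqlite3

conn = sqlite3.connect(":memory:")

# Extract & Load: land the raw data as-is.
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount_cents INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, 1250, "paid"), (2, 900, "refunded"), (3, 4300, "paid")],
)

# Transform: curated table built with SQL inside the warehouse.
conn.execute(
    """
    CREATE TABLE curated_orders AS
    SELECT id, amount_cents / 100.0 AS amount_dollars
    FROM raw_orders
    WHERE status = 'paid'
    """
)

print(conn.execute("SELECT * FROM curated_orders").fetchall())
```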
Summary
Seeing this myriad of choices can be confusing and distressing. Here are some defaults that may help you pick a fairly safe option: if you want open source or vendor-neutrality, go with Apache Airflow, keeping an eye on Flyte for ML-heavy workloads. On AWS, prefer AWS Data Pipeline by default, or MWAA if you want to stay on Airflow. On Google Cloud, use Cloud Composer, or Cloud Data Fusion for GUI-driven analytics teams. On Azure, use Azure Data Factory for mainly analytics workloads.
ML4Devs is a biweekly newsletter for software developers. The aim is to curate resources for practitioners to design, develop, deploy, and maintain ML applications at scale to drive measurable positive business impact. Each issue discusses a topic from a developer’s viewpoint.
Enjoyed this? Originally published at ML4Devs.com. Don't miss the next issue: get it in your email.