Which Data Pipeline Orchestration Tool Is Right For You? (ML4Devs Newsletter, Issue 16)
Satish Chandra Gupta
Data/ML Practitioner · Advisor/Consultant for data & AI strategy, data infra, machine learning, and LLMs/GPT · Ex-Amazon, Microsoft Research
You care about the data. Actually, you really care about the insights from that data, or the ML models you train with that data. And that forces you to care about the data.
You may also care about how you clean, curate, and transform the raw data into usable data, but only as a means to that end. The data pipeline itself, therefore, has at most nuisance value for you: an unavoidable chore.
Considering this pyramid of what you really care about and what you are forced to care about, where do you place data pipeline orchestration tools? How much of your mindshare does that choice occupy? My guess: very little, almost nothing, perhaps even total indifference.
Yet, it is one of the most critical decisions in building your data and ML infra. Let me lay out the choices so you can make informed decisions.
But first, let’s quickly recap data, ML, and MLOps pipelines:
Data pipelines ingest raw data and clean, curate, and transform it into usable data (ETL/ELT). ML pipelines take that data through feature engineering, model training, and evaluation. MLOps pipelines automate building, testing, deploying, and monitoring ML applications.
The boundaries between these three are fluid and overlapping. In the future, these may converge and become one.
Let’s examine the choices open source and various cloud ecosystems offer.
Open Source Data Pipeline Orchestration Tools
The Apache ecosystem was, and continues to be, an important part of the data stack. No wonder two of the three tools listed here are Apache projects.
Apache Oozie
Apache Oozie has been around for quite a while for executing workflow DAGs. It integrates nicely with the Hadoop ecosystem. If your organization is already using it, moving away will be a big endeavor. It still has some life left and can carry you some distance.
But do not set up new projects or new infra with it. For that, look at the next tool below.
Apache Airflow
At the moment, nobody gets fired for choosing Apache Airflow. It is the default choice, and a pretty good one too. No wonder both AWS and Google Cloud offer managed Airflow services.
You can’t go wrong by choosing Apache Airflow, but do take a look at, and keep an eye on, the next tool on the list.
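To make this concrete, here is a minimal sketch of an Airflow DAG: three placeholder Python tasks chained into a daily extract-transform-load run. The task names and bodies are illustrative, not from any real pipeline.

```python
# A minimal Airflow 2.x DAG: extract -> transform -> load, run daily.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("pull raw data from the source")


def transform():
    print("clean and reshape the raw data")


def load():
    print("write curated data to the warehouse")


with DAG(
    dag_id="daily_etl",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",  # Airflow 2.4+ also accepts `schedule`
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # The >> operator declares the DAG's dependency edges.
    t_extract >> t_transform >> t_load
```

Drop this file into Airflow's dags/ folder and the scheduler picks it up automatically.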
Flyte
As you expand pipeline orchestration from ETL to machine learning tasks, check whether Flyte suits your case better. It fills some of the gaps in Airflow with respect to ML jobs.
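For comparison, here is a minimal sketch of the same three-step pipeline in Flyte, where tasks are type-annotated Python functions; the task logic is a placeholder.

```python
# A minimal Flyte workflow: tasks are typed Python functions,
# and the workflow composes them into a DAG.
from flytekit import task, workflow


@task
def extract() -> str:
    return "raw data"


@task
def transform(raw: str) -> str:
    return raw.upper()  # stand-in for real cleaning logic


@task
def load(curated: str) -> None:
    print(f"loading: {curated}")


@workflow
def daily_etl() -> None:
    load(curated=transform(raw=extract()))


if __name__ == "__main__":
    daily_etl()  # Flyte workflows can also run locally for quick testing
```

The type annotations are not decoration: Flyte uses them to type-check the data flowing between tasks, which is one of the gaps it fills for ML jobs.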
Data Pipeline Orchestration Tools on AWS
Amazon is a customer-first company, and its offerings reflect that. AWS services are easy to start with, but the one you pick first may not remain suitable as your use case expands.
I suggest not blindly picking the easiest AWS tool, but pausing to think a little about your future needs. Data pipelines have an uncanny ability to grow quickly and become very complex, and the then-unavoidable migration will carry a good amount of outage risk.
AWS Step Functions
AWS Step Functions is an apt tool for automating business-process workflows, but it can be used for building data pipelines too. If your data pipeline is simple and consists of a few steps, this is probably the easiest to start with. Still, I am very reluctant to advise you to do so.
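For a flavor of what that looks like, here is a hedged sketch that defines a two-step state machine in Amazon States Language and drives it with boto3; the Lambda function and IAM role ARNs are placeholders you would replace with real resources.

```python
# Sketch: create and start a Step Functions state machine with boto3.
import json

import boto3

sfn = boto3.client("stepfunctions")

# Amazon States Language definition: two Lambda tasks in sequence.
definition = {
    "StartAt": "Extract",
    "States": {
        "Extract": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:extract",  # placeholder
            "Next": "Load",
        },
        "Load": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:load",  # placeholder
            "End": True,
        },
    },
}

machine = sfn.create_state_machine(
    name="simple-etl",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/sfn-etl-role",  # placeholder
)

sfn.start_execution(
    stateMachineArn=machine["stateMachineArn"],
    input=json.dumps({"date": "2022-01-01"}),
)
```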
AWS Data Pipeline
AWS Data Pipeline is a service to move data from AWS and on-premises data sources to AWS compute services, run transformations, and store the results in a data warehouse or a data lake. You can knit together an AWS Data Pipeline with S3, RDS, DynamoDB, and Redshift as data stores, and EC2 and EMR as compute services. It is easy to use, yet powerful and versatile. But read the next two options first.
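For a taste of the API, here is a hedged boto3 sketch that creates and activates a pipeline. The definition below is a minimal skeleton; a real pipeline would add activities (such as CopyActivity) and data nodes (such as S3DataNode).

```python
# Sketch: create and activate an AWS Data Pipeline with boto3.
import boto3

dp = boto3.client("datapipeline")

created = dp.create_pipeline(name="daily-copy", uniqueId="daily-copy-2022")
pipeline_id = created["pipelineId"]

# A minimal definition: just the Default object with an on-demand schedule.
dp.put_pipeline_definition(
    pipelineId=pipeline_id,
    pipelineObjects=[
        {
            "id": "Default",
            "name": "Default",
            "fields": [
                {"key": "scheduleType", "stringValue": "ondemand"},
                {"key": "failureAndRerunMode", "stringValue": "CASCADE"},
            ],
        }
    ],
)

dp.activate_pipeline(pipelineId=pipeline_id)
```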
AWS Glue Workflow
AWS Glue Workflow is another AWS tool for ETL workflows. It is unclear to me why Amazon has two tools with largely overlapping use cases. I am biased toward AWS Data Pipeline by default, and would use Glue Workflow only if the whole data infra is built on AWS Glue and Amazon Athena.
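If you do go the Glue route, a workflow is stitched together from jobs and triggers. A hedged boto3 sketch, assuming a Glue job named extract-job already exists:

```python
# Sketch: build and run an AWS Glue workflow with boto3.
import boto3

glue = boto3.client("glue")

glue.create_workflow(Name="nightly-etl", Description="Nightly ETL over raw data")

# An on-demand trigger that starts the first job of the workflow.
glue.create_trigger(
    Name="start-nightly-etl",
    WorkflowName="nightly-etl",
    Type="ON_DEMAND",
    Actions=[{"JobName": "extract-job"}],  # assumes this Glue job exists
)

run = glue.start_workflow_run(Name="nightly-etl")
print(run["RunId"])
```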
Amazon Managed Workflows for Apache Airflow (MWAA)
If you want to stick with Apache Airflow, MWAA may suit you the most. It is a secure, highly available, managed workflow orchestration service for Apache Airflow. Because the DAGs are plain Airflow, it keeps you relatively vendor-independent, and you don’t need to master a new tool.
Data Pipeline Orchestration Tools on Google Cloud
Google is a technology-first company, and it offers only two clearly differentiated choices. Both are built on open-source technology.
Cloud Data Fusion
Cloud Data Fusion is a fully managed GUI tool for defining ETL/ELT data pipelines. It is based on CDAP, an open-source framework for building data analytics applications. If your team is not tech-heavy and builds mainly analytics applications, Cloud Data Fusion will suffice.
Cloud Composer
Cloud Composer is a fully managed workflow orchestration service built on Apache Airflow. If your workflows are inching toward data science and span hybrid and multi-cloud environments, then Cloud Composer (Airflow under the hood) is the better choice.
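Deploying to Composer is just copying your Airflow DAG files into the environment's Cloud Storage bucket, which Composer watches. A minimal sketch, with a placeholder bucket name:

```python
# Sketch: deploy an Airflow DAG to Cloud Composer by uploading it
# to the environment's GCS bucket (Composer watches the dags/ folder).
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("us-central1-my-env-1234-bucket")  # placeholder: your env's bucket
blob = bucket.blob("dags/daily_etl.py")
blob.upload_from_filename("daily_etl.py")
```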
Data Pipeline Orchestration Tools on Azure
Microsoft is a sales-first company. It is just so good at selling that it has beaten Google's arguably better cloud stack, and given Amazon's customer obsession and first-mover advantage a run for their money, despite being a late entrant to cloud services.
Azure Data Factory
Azure Data Factory is a data integration and transformation service for constructing code-free ETL and ELT pipelines. It is often used to process data from diverse sources and deliver integrated data to Azure Synapse Analytics. If you are on Azure and doing mainly data analytics, using it is a no-brainer.
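Pipelines are authored code-free in the ADF UI, but you can still trigger them programmatically. A hedged sketch using the Azure SDK for Python, with placeholder resource names:

```python
# Sketch: trigger an existing Data Factory pipeline run from Python.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

subscription_id = "00000000-0000-0000-0000-000000000000"  # placeholder
client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

run = client.pipelines.create_run(
    resource_group_name="my-rg",         # placeholder
    factory_name="my-data-factory",      # placeholder
    pipeline_name="copy-to-synapse",     # placeholder
    parameters={},
)
print(run.run_id)  # use this id to poll the run's status
```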
Oozie on HDInsight
Oozie on HDInsight is the classical Hadoop stack on Azure, running Oozie, Spark, HBase, Storm, etc. You can lift and shift your Hadoop workloads and their orchestration to Azure using this option.
DIY Apache Airflow
You can deploy Apache Airflow on Azure too, but there is no managed service.
Other Choices
There are some interesting combinations that simplify data pipelines for common use cases. These alternatives may not cover the most complex and general pipelines, but might nonetheless be the best fit for your use case. One such pattern is to transform the data while, or right after, ingesting it into a data warehouse or lakehouse. It consists of two parts: a tool that extracts data from the sources and loads it into the warehouse or lakehouse, and transformations that run as SQL inside the warehouse itself (the ELT pattern), as sketched below.
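Here is a toy sketch of that ELT pattern, with SQLite standing in for the warehouse; in practice, an ingestion tool does the load step, and the transformation is SQL run inside the warehouse.

```python
# Toy ELT sketch: land raw rows first, then transform with SQL
# inside the "warehouse" (SQLite standing in for a real warehouse).
import sqlite3

conn = sqlite3.connect(":memory:")

# Extract & Load: land the raw data as-is.
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount_cents INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, 1250, "paid"), (2, 900, "refunded"), (3, 4300, "paid")],
)

# Transform: curated table built with SQL inside the warehouse.
conn.execute(
    """
    CREATE TABLE curated_orders AS
    SELECT id, amount_cents / 100.0 AS amount_dollars
    FROM raw_orders
    WHERE status = 'paid'
    """
)

print(conn.execute("SELECT * FROM curated_orders").fetchall())
```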
Summary
Seeing this myriad of choices can be confusing and distressing. Here are some defaults that may help you pick a fairly safe option: if you want open source or vendor-neutrality, go with Apache Airflow, keeping an eye on Flyte for ML-heavy workloads. On AWS, prefer AWS Data Pipeline by default, or MWAA if you want to stay on Airflow. On Google Cloud, use Cloud Composer, or Cloud Data Fusion for GUI-driven analytics teams. On Azure, use Azure Data Factory for mainly analytics workloads.
ML4Devs is a biweekly newsletter for software developers. The aim is to curate resources for practitioners to design, develop, deploy, and maintain ML applications at scale to drive measurable positive business impact. Each issue discusses a topic from a developer’s viewpoint.
Enjoyed this? Originally published at ML4Devs.com. Don't miss the next issue: get it in your email.