Data Pipelines are the arteries that bring fresh, cleansed data to your AI/Machine Learning engine's heart. If you are a data-driven AI/Machine Learning practitioner, you are already familiar with one or more of the following open-source frameworks that help with Data Pipelines: LinkedIn's Azkaban, Spotify's Luigi, Pinterest's Pinball, or Airbnb's Airflow.
If you are beginning this journey, you should take a look at this excellent article by Robert Chang of Airbnb. Also, check out this talk by Maxime Beauchemin, where he discusses how to use Airflow to author workflows as Directed Acyclic Graphs (DAGs) of tasks.
So what is a Data Pipeline DAG? Visually, a node in the graph represents a Pipeline task, and an arrow represents the dependency of one Pipeline task on another. Because the data for a given task only needs to be computed once and the result then carries forward, the graph is directed and acyclic. This is why Airflow jobs are commonly referred to as "DAGs" - Directed Acyclic Graphs.
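The directed-acyclic property is what guarantees every task can run exactly once, in an order that respects its dependencies. A minimal sketch of that idea in plain Python, using the standard library's `graphlib` and a hypothetical four-task pipeline (the task names here are illustrative, not from any real workflow):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# A toy pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "extract": set(),
    "clean": {"extract"},
    "train": {"clean"},
    "report": {"clean", "train"},
}

# Because the graph is directed and acyclic, a valid execution order exists.
# TopologicalSorter raises CycleError if a cycle sneaks into the graph.
order = list(TopologicalSorter(pipeline).static_order())
print(order)
```

Every scheduler for a pipeline like this is, at its core, computing an ordering of exactly this kind before dispatching tasks.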
One of the cool things about Airbnb's open-sourced tool Airflow is its UI. It helps visualize and manage complex Data Pipelines, and it lets users write (Python) code as configuration to visualize a Pipeline's DAG. The author of a Data Pipeline must define the structure of dependencies among tasks in order to visualize them.
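For the curious, a code-as-configuration DAG definition might look roughly like the following sketch. The DAG id, schedule, and task commands are all hypothetical; the classic Airflow idiom of declaring operators and chaining them with `>>` is what defines the dependency structure the UI renders:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Hypothetical three-task pipeline; ids and commands are illustrative only.
dag = DAG(
    "example_pipeline",
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
)

extract = BashOperator(task_id="extract", bash_command="echo extract", dag=dag)
clean = BashOperator(task_id="clean", bash_command="echo clean", dag=dag)
load = BashOperator(task_id="load", bash_command="echo load", dag=dag)

# The >> operator declares dependencies: extract runs before clean before load.
extract >> clean >> load
```

Airflow parses files like this one to build the DAG it then schedules and displays.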
As noted in this thoughtful article:
"Code as a workflow also allows you to reuse parts of DAG’s if you need to, reducing code duplication and making things simpler in the long run. This reduces the complexity of the overall system and frees up developer time to work on more important and impactful tasks"
Making Data Pipelines simpler is a key focus of the AWS managed service AWS Glue. You simply point AWS Glue to your data stored on AWS, and AWS Glue discovers your data and stores the associated metadata (e.g., table definitions and schemas) in the AWS Glue Data Catalog.
Once cataloged, your data is immediately searchable, queryable, and available for further wrangling. AWS Glue generates the code to execute your data transformations and data loading processes.
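The "point Glue at your data" step can be done from the console, or programmatically via boto3. A rough sketch under assumed names - the crawler name, IAM role, database, and S3 path below are all hypothetical placeholders, and this requires AWS credentials to actually run:

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")  # region is an assumption

# Point a crawler at an S3 prefix; Glue infers the schema and catalogs it.
glue.create_crawler(
    Name="sales-crawler",
    Role="AWSGlueServiceRole-demo",
    DatabaseName="sales_catalog",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/sales/"}]},
)
glue.start_crawler(Name="sales-crawler")

# Once the crawler finishes, the discovered tables are queryable via the catalog.
for table in glue.get_tables(DatabaseName="sales_catalog")["TableList"]:
    print(table["Name"])
```

After the crawler populates the Data Catalog, services like Athena and Glue ETL jobs can query those tables directly.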
How do you deal with your Data Pipelines today? Do share your thoughts on how you see this evolving - drop me a note privately or via the comment section below.
About the Author:
Madhu cherishes the opportunity to learn and collaborate; he has three decades of experience nurturing beachhead market ideas worldwide. The views expressed here are Madhu's own and do not reflect those of his employer.