Apache Airflow for Data Engineering pipeline Part 1

To build any big data product, you need data to work with. Just as physical pipelines let us gather and transport oil or water from one place to another, data pipelines let us gather the data we need from a variety of different sources and then transform it into whatever shape we need.

A pipeline is a representation of a data processing job and it comprises a set of operations that read input data, transform that data, and then write the output.

[Image: Data pipeline]
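To make that concrete, here is a minimal, purely illustrative sketch in plain Python. The records and the in-memory "warehouse" are stand-ins for whatever real source and target a pipeline would connect; none of the names come from a real system.

```python
# A data pipeline at its simplest: read input, transform it, write the output.

def read(source):
    # Read: pull raw records from the data source.
    return list(source)

def transform(records):
    # Transform: apply whatever processing the target needs
    # (here, an illustrative 20% uplift on an "amount" field).
    return [{**r, "amount": r["amount"] * 1.2} for r in records]

def write(records, target):
    # Write: deliver the processed records to the data target.
    target.extend(records)
    return target

raw = [{"order": 1, "amount": 100}, {"order": 2, "amount": 250}]
warehouse = []
write(transform(read(raw)), warehouse)
print(warehouse)
```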

What is Airflow, and why do we need it?

Apache Airflow is an open-source platform written in Python that can greatly improve how these pipelines are processed. It is a good tool for data engineers to author, schedule, and monitor workflows, essentially a task scheduler on steroids.

Apache Airflow gives you an excellent way to run complex data pipelines.
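As a first taste of what that looks like, here is a minimal DAG sketch, assuming Airflow 2.x; the DAG id, task names, and callables are illustrative, not taken from a real project.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    # Illustrative task: pull raw data from the source.
    print("pulling raw data from the source")

def load():
    # Illustrative task: write transformed data to the target.
    print("writing transformed data to the target")

with DAG(
    dag_id="example_pipeline",        # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",       # run once a day
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> load_task  # extract runs before load
```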


What is Data Pipeline?

At its base level, a data pipeline is an application that sits between raw data and a transformed data set, so between a data source and a data target.


But inside the application, you can think of it as an assembly line: a series of step-by-step functions or tasks carried out on data to get a result. At a higher level, it's just data in and data out. When first thinking about what needs to go into your data pipeline, you need to know what your data sources are (quite often there are many) and what your data targets are. What format does the resulting data need to be in, and what will it be used for? That last question is the one we need to answer before the pipeline gets built.

What is Data Pipelines Used For?

So why do we build them? Quite often we're sharing processing logic: we have many different applications, marketing insights, APIs, machine learning models, etc., that need the data processed in some way. Instead of having each application and tool prepare its own data, we can integrate multiple tools without reinventing the data processing code for each one.

We can prepare data for visualization (diagrams, graphs, charts, etc.), for database migrations, for ingesting and integrating data into applications, for converting data formats such as CSV or XML to JSON, or for loading data into a relational database or a NoSQL store. And we can use them for real-time jobs, analytics, and time-sensitive data. So, data pipelines come in different forms.
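The format-conversion case, for example, can be as small as this sketch using only the Python standard library; the file names are assumptions for the example.

```python
import csv
import json

# Convert a CSV file to JSON: read rows as dictionaries, dump them as JSON.
with open("customers.csv", newline="") as src:
    rows = list(csv.DictReader(src))

with open("customers.json", "w") as target:
    json.dump(rows, target, indent=2)
```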

Common Types of Data Pipelines:

The type of data pipeline varies depending on the purpose it serves. It could be used for batch-driven processing, which is what we want if we're dealing with very large volumes of data. We don't need the results in real time, but we need the computational processing to happen at specific, periodic times. This often happens when we have multiple systems to query, so we aggregate the data into a single warehouse using a batch job and can then query this previously disconnected data together.
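Here is a hedged sketch of that batch idea, assuming pandas is available and using SQLite as a stand-in warehouse; all file and table names are illustrative.

```python
import sqlite3

import pandas as pd

# Batch job: pull data from two previously disconnected systems, combine it,
# and load it into a single warehouse table so it can be queried together.
crm_orders = pd.read_csv("crm_orders.csv")      # hypothetical system 1
shop_orders = pd.read_json("shop_orders.json")  # hypothetical system 2

combined = pd.concat([crm_orders, shop_orders], ignore_index=True)

# SQLite stands in for the warehouse in this example.
with sqlite3.connect("warehouse.db") as conn:
    combined.to_sql("orders", conn, if_exists="replace", index=False)
```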

We have real-time data pipelines for higher-velocity data, when we need to know what's happening right now. We might be looking at server loads, request-response times, device telemetry, etc. If we need to scale quickly, spinning up a new data pipeline instance to handle bursts in traffic, for example, we often turn to cloud providers who build these capabilities for us. When our requirements are more static, or we can provision in advance for the workloads we expect, we can turn to open-source or self-hosted solutions. It all depends on our particular needs. So what are some of the characteristics of data pipeline structure?


Pipeline Data Structure:

We want our pipeline structured in a way that lets our data expand: new fields, new documents, new data sources, without having to re-engineer a rigid schema each time we have a new requirement. So, no schema is imposed. We could be using tabular data or hierarchical data, and we get real benefits from this kind of flexible schema. That's why we quite often take a modern NoSQL approach to our data pipeline structures. What are some of the key benefits of data pipelines? Well, we engineer our data requirements so that the pipeline can embed data within apps.
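As a rough illustration of that flexible-schema idea, here is a sketch assuming a local MongoDB instance and pymongo; the connection string, database, and collection names are assumptions for the example.

```python
from pymongo import MongoClient

# Two documents with different fields land in the same collection without any
# schema migration: the store imposes no fixed schema.
client = MongoClient("mongodb://localhost:27017")
events = client["pipeline_demo"]["events"]

events.insert_one({"source": "web", "user": "a1", "page": "/pricing"})
events.insert_one({"source": "mobile", "user": "b2", "device": "ios", "app_version": "3.2"})
```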

Key Benefits of Data Pipelines:

We can process large volumes of data. We can capture metadata. We can leverage built-in components. And we can apply custom logic to our data.


Understanding the Airflow UI and Logs:

[Image: Airflow UI]

The Airflow UI consists of several features, shown in the picture above, that make it easy to monitor and troubleshoot your data pipelines. Let's go through them.

1. DAG: In the DAG section of the Airflow UI, we have the main list of DAGs currently in Airflow. In this list, we get various options for each DAG: we can edit it, toggle it on or off, and click the DAG name itself to get more information and drill down into the full details of the DAG.

2. Schedule: Then we get the schedule definition in cron notation, or a preset such as @daily; None if the DAG doesn't run on a schedule and has to be kicked off manually; @once, so it runs a single time when it's kicked off and that's it; or whatever the repeating schedule for the DAG is.

3. Owner: The owner of the DAG. We can have user management in the admin section, with different owners, or we can just go with the single user airflow, depending on our needs (see the sketch after this list for how the owner and schedule are set in a DAG definition).

4. Recent Tasks: Then there is a listing of the recent tasks and their status: whether a task succeeded, is currently running, or has failed. Next to it is the last run time, in ISO 8601 format in UTC; for example, the last run here was 2018-05-09 at 00:56 in the morning.

5. DAG Runs: The DAG runs give us information about successes, what's currently running, and failures, plus quick links into various aspects of the DAG. We can get into its tree view or graph view, task duration, task tries, etc., and there is a little play button to trigger the DAG manually.
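To tie those UI columns back to code, here is a hedged sketch (Airflow 2.x assumed) showing where the Owner and Schedule values come from; the DAG id, cron expression, and task are illustrative.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# The "Owner" column in the DAG list reflects default_args["owner"]; the
# "Schedule" column shows schedule_interval (cron notation, a preset such as
# @daily, or None for manual-only DAGs).
default_args = {"owner": "airflow"}

with DAG(
    dag_id="ui_columns_demo",            # hypothetical DAG name
    default_args=default_args,
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 6 * * *",       # shown in the Schedule column
    catchup=False,
) as dag:
    BashOperator(task_id="say_hello", bash_command="echo hello")
```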

In the next section, we will look at the important parts of Airflow and discuss how to do things in it.





