Apache Airflow for Data Engineering pipeline Part 1

To build any big data product, you need data to work with. Just as physical pipelines let us gather and transport oil or water from one place to another, data pipelines let us gather the data we need from a variety of different sources and then transform it into whatever shape we need.

A pipeline is a representation of a data processing job and it comprises a set of operations that read input data, transform that data, and then write the output.

[Image: Data pipeline]
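To make that concrete, here is a minimal, purely illustrative sketch in plain Python. The records and the in-memory "warehouse" are stand-ins for whatever real source and target a pipeline would connect; none of the names come from a real system.

```python
# A data pipeline at its simplest: read input, transform it, write the output.

def read(source):
    # Read: pull raw records from the data source.
    return list(source)

def transform(records):
    # Transform: apply whatever processing the target needs
    # (here, an illustrative 20% uplift on an "amount" field).
    return [{**r, "amount": r["amount"] * 1.2} for r in records]

def write(records, target):
    # Write: deliver the processed records to the data target.
    target.extend(records)
    return target

raw = [{"order": 1, "amount": 100}, {"order": 2, "amount": 250}]
warehouse = []
write(transform(read(raw)), warehouse)
print(warehouse)
```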

What is Airflow, and why do we need it?

Apache Airflow is an open-source platform written in Python that can greatly improve how these pipelines are processed. It is a good tool for data engineers to author, schedule, and monitor workflows, essentially a task scheduler on steroids.

Apache Airflow gives you an excellent way to run complex data pipelines.
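As a first taste of what that looks like, here is a minimal DAG sketch, assuming Airflow 2.x; the DAG id, task names, and callables are illustrative, not taken from a real project.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    # Illustrative task: pull raw data from the source.
    print("pulling raw data from the source")

def load():
    # Illustrative task: write transformed data to the target.
    print("writing transformed data to the target")

with DAG(
    dag_id="example_pipeline",        # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",       # run once a day
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> load_task  # extract runs before load
```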


What is Data Pipeline?

At its base level, a data pipeline is an application that sits between raw data and a transformed data set, so between a data source and a data target.


But inside the application, you can think of it as an assembly line: a series of step-by-step functions or tasks carried out on data to get a result. At a higher level, it's just data in and data out. When first thinking about what needs to go into your data pipeline, you need to know what your data sources are (quite often there are many) and what your data targets are. What format does the resulting data need to be in, and what will it be used for? That last question is the one we need to answer before the pipeline gets built.

What is Data Pipelines Used For?

So why do we build them? Quite often we're sharing processing logic: we have many different applications, marketing insights, APIs, machine learning models, etc., that need the data processed in some way. Instead of having each application and tool prepare its own data, we can integrate multiple tools without reinventing the data processing code for each one.

We can prepare data for visualization (diagrams, graphs, charts, etc.), for database migrations, for ingesting and integrating data into applications, for converting data formats such as CSV or XML to JSON, or for loading data into a relational database or a NoSQL store. And we can use them for real-time jobs, analytics, and time-sensitive data. So, data pipelines come in different forms.
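The format-conversion case, for example, can be as small as this sketch using only the Python standard library; the file names are assumptions for the example.

```python
import csv
import json

# Convert a CSV file to JSON: read rows as dictionaries, dump them as JSON.
with open("customers.csv", newline="") as src:
    rows = list(csv.DictReader(src))

with open("customers.json", "w") as target:
    json.dump(rows, target, indent=2)
```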

Common Types of Data Pipelines:

The type of data pipeline varies depending on the purpose it serves. It could be used for batch-driven processing, which is what we want if we're dealing with very large volumes of data. We don't need the results in real time, but we need the computational processing to happen at specific, periodic times. This often happens when we have multiple systems to query, so we aggregate the data into a single warehouse using a batch job and can then query this previously disconnected data together.
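Here is a hedged sketch of that batch idea, assuming pandas is available and using SQLite as a stand-in warehouse; all file and table names are illustrative.

```python
import sqlite3

import pandas as pd

# Batch job: pull data from two previously disconnected systems, combine it,
# and load it into a single warehouse table so it can be queried together.
crm_orders = pd.read_csv("crm_orders.csv")      # hypothetical system 1
shop_orders = pd.read_json("shop_orders.json")  # hypothetical system 2

combined = pd.concat([crm_orders, shop_orders], ignore_index=True)

# SQLite stands in for the warehouse in this example.
with sqlite3.connect("warehouse.db") as conn:
    combined.to_sql("orders", conn, if_exists="replace", index=False)
```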

We have real-time data pipelines for higher-velocity data, when we need to know what's happening right now. We might be looking at server loads, request-response times, device telemetry, etc. If we need to scale quickly, spinning up a new data pipeline instance to handle bursts in traffic, for example, we often turn to cloud providers who build these capabilities for us. When our requirements are more static, or we can provision in advance for the workloads we expect, we can turn to open-source or self-hosted solutions. It all depends on our particular needs. So what are some of the characteristics of data pipeline structure?


Pipeline Data Structure:

We want our pipeline structured in a way that lets our data expand: new fields, new documents, new data sources, without having to re-engineer a rigid schema each time we have a new requirement. So, no schema is imposed. We could be using tabular data or hierarchical data, and we get real benefits from this kind of flexible schema. That's why we quite often take a modern NoSQL approach to our data pipeline structures. What are some of the key benefits of data pipelines? Well, we engineer our data requirements so that the pipeline can embed data within apps.
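As a rough illustration of that flexible-schema idea, here is a sketch assuming a local MongoDB instance and pymongo; the connection string, database, and collection names are assumptions for the example.

```python
from pymongo import MongoClient

# Two documents with different fields land in the same collection without any
# schema migration: the store imposes no fixed schema.
client = MongoClient("mongodb://localhost:27017")
events = client["pipeline_demo"]["events"]

events.insert_one({"source": "web", "user": "a1", "page": "/pricing"})
events.insert_one({"source": "mobile", "user": "b2", "device": "ios", "app_version": "3.2"})
```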

Key Benefits of Data Pipelines:

We can process large volumes of data. We can capture metadata. We can leverage built-in components. And we can apply custom logic to our data.


Understanding the Airflow UI and Logs:

[Image: Airflow UI]

The Airflow UI consists of several features, shown in the picture above, that make it easy to monitor and troubleshoot your data pipelines. Let's go through them.

1. DAG: In the DAG section of the Airflow UI, we have the main list of DAGs currently in Airflow. In this list, we get various options for each DAG: we can edit it, toggle it on or off, and click the DAG name itself to get more information and drill down into the full details of the DAG.

2. Schedule: Then we get the schedule definition in cron notation, or a preset such as @daily; None if the DAG doesn't run on a schedule and has to be kicked off manually; @once, so it runs a single time when it's kicked off and that's it; or whatever the repeating schedule for the DAG is.

3. Owner: The owner of the DAG. We can have user management in the admin section, with different owners, or we can just go with the single user airflow, depending on our needs (see the sketch after this list for how the owner and schedule are set in a DAG definition).

4. Recent Tasks: Then there is a listing of the recent tasks and their status: whether a task succeeded, is currently running, or has failed. Next to it is the last run time, in ISO 8601 format in UTC; for example, the last run here was 2018-05-09 at 00:56 in the morning.

5. DAG Runs: The DAG runs give us information about successes, what's currently running, and failures, plus quick links into various aspects of the DAG. We can get into its tree view or graph view, task duration, task tries, etc., and there is a little play button to trigger the DAG manually.
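To tie those UI columns back to code, here is a hedged sketch (Airflow 2.x assumed) showing where the Owner and Schedule values come from; the DAG id, cron expression, and task are illustrative.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# The "Owner" column in the DAG list reflects default_args["owner"]; the
# "Schedule" column shows schedule_interval (cron notation, a preset such as
# @daily, or None for manual-only DAGs).
default_args = {"owner": "airflow"}

with DAG(
    dag_id="ui_columns_demo",            # hypothetical DAG name
    default_args=default_args,
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 6 * * *",       # shown in the Schedule column
    catchup=False,
) as dag:
    BashOperator(task_id="say_hello", bash_command="echo hello")
```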

In the next section, we will look at the important parts of Airflow and discuss how to do things in it.





