Non-Cloud (On-premises) Data Pipelines: A Beginner's Guide

Cloud-based data integration tools have gained significant popularity in recent years. With their simplified UIs, developing data pipelines, deploying to production, and setting up alerts and monitoring have become easy. But how do you handle on-premises data as efficiently as cloud solutions do? Are there robust ecosystem tools for this?

The answer is "Yes". There are many tools, but we prefer Apache Airflow and Kestra, so in this article we will talk about both. Airflow is developed in Python, Kestra in Java, so this is also Java vs. Python. Can we guess the winner?

Selecting the right orchestration tool involves an in-depth technical evaluation:

  • Cloud services are managed, whereas an on-premises solution is not. How long does it take to install and configure the tool, both in development and in production? This plays a major role in selecting any tool.
  • Development time needed to write a pipeline.
  • Deployment process and performance, i.e. how long do task executions take? Does it perform well? Does performance scale?

Administration (Installation & Configuration)

Airflow

Standalone model:

  1. The installation starts with the Python package installer (pip).
  2. Edit the configuration variables.
  3. Run airflow standalone, which starts an instance with a SQLite database and exposes the user interface on port 8080.

Standalone Airflow doesn't scale, i.e. it can't run jobs in parallel: standalone mode comes with the Sequential executor, which is not ideal for production. So, we need to switch to the
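The steps above can be sketched as follows (commands per the Airflow quick start; the official docs also recommend pinning a constraints file, omitted here for brevity):

```shell
# 1. Install Airflow with pip
pip install apache-airflow

# 2. Optionally edit airflow.cfg (created under $AIRFLOW_HOME on first run)

# 3. Start a single-process instance backed by SQLite;
#    the UI is then served on http://localhost:8080
airflow standalone
```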

Multicomponent model:

This is a docker-compose installation that installs Redis and Postgres services to manage workflows, and it comes with Celery as the execution engine.
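A typical setup, assuming the reference compose file published in the Airflow documentation (check the docs for the URL matching your Airflow version):

```shell
# Fetch the reference docker-compose file from the Airflow docs
curl -LfO 'https://airflow.apache.org/docs/apache-airflow/stable/docker-compose.yaml'

# Initialize the metadata database and the default user, then start the
# webserver, scheduler, Celery workers, Redis, and Postgres
docker compose up airflow-init
docker compose up -d
```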

Kestra

Standalone model:

A simple docker-compose up -d starts a Kestra instance alongside a Postgres container, exposing the web interface on port 8080.
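A minimal sketch of such a compose file, assuming Kestra's standalone server image and a Postgres backend (service names, image tag, and credentials here are illustrative and the Kestra datasource configuration is abbreviated; use the full file from the Kestra docs):

```yaml
# Illustrative only -- not the official Kestra compose file
services:
  postgres:
    image: postgres
    environment:
      POSTGRES_DB: kestra
      POSTGRES_USER: kestra
      POSTGRES_PASSWORD: k3str4

  kestra:
    image: kestra/kestra:latest
    command: server standalone
    ports:
      - "8080:8080"
    depends_on:
      - postgres
```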

Multicomponent/High availability mode:

It leverages Kafka, a high-throughput messaging system, and Elasticsearch, a powerful search and analytics engine, for handling workflows and data.

Ease of code development/Code Syntax

Airflow is based on Python and the concept of Directed Acyclic Graph (DAG). Here is a basic hello-world DAG:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator  # the bash_operator module path is deprecated

# Define the default arguments for the DAG
default_args = {
    'start_date': datetime(2024, 1, 1),
}

# Instantiate a DAG object
dag = DAG(
    'hello_world',
    default_args=default_args,
    description='A simple DAG to say hello to the world',
    schedule_interval=None,  # run only when triggered manually
)

# Define the task
hello_task = BashOperator(
    task_id='hello_task',
    bash_command='echo "Hello, World!"',
    dag=dag,
)

# A single task needs no dependency operators (e.g. upstream >> downstream)
hello_task
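Under the hood, a DAG is just a dependency graph that resolves to an execution order. A minimal sketch with Python's standard-library graphlib, using hypothetical task names, shows how an Airflow-style chain like extract >> transform >> load is ordered:

```python
from graphlib import TopologicalSorter

# Dependencies expressed as task -> set of upstream tasks,
# mirroring an Airflow chain: extract >> transform >> load
dag = {
    "transform": {"extract"},
    "load": {"transform"},
}

# Resolve the graph into a valid execution order
order = list(TopologicalSorter(dag).static_order())
print(order)  # extract runs first, load last
```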

Kestra syntax is based on YAML. Here is a basic hello-world flow:

id: hello_world
namespace: demo
description: A simple Kestra workflow to say hello to the world

tasks:
  - id: hello_task
    type: io.kestra.plugin.scripts.shell.Commands
    commands:
      - echo "Hello, World!"
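Chaining tasks is equally declarative: tasks in the list run sequentially by default, and a trigger plays the role of Airflow's schedule_interval. A hypothetical two-step flow (task and trigger type names follow Kestra's plugin naming; verify them against your Kestra version):

```yaml
id: hello_pipeline
namespace: demo

tasks:
  # Tasks run one after another in list order
  - id: extract
    type: io.kestra.plugin.scripts.shell.Commands
    commands:
      - echo "extracting"
  - id: load
    type: io.kestra.plugin.scripts.shell.Commands
    commands:
      - echo "loading"

triggers:
  # Cron-based schedule: run every day at 06:00
  - id: daily
    type: io.kestra.plugin.core.trigger.Schedule
    cron: "0 6 * * *"
```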

Python code is less readable and less approachable than YAML, but it has the advantage of being highly customizable.

Performance/Benchmarks

Task Details:

Extract data from SQL, CSV, and web sources, transform it, and finally push the combined data to a final SQL table, all on the same hardware.
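The benchmark's exact sources and schema aren't given, but the shape of the task can be sketched with Python's standard library, using SQLite and in-memory stand-ins for the CSV and web sources:

```python
import csv
import io
import sqlite3

# Stand-ins for the three sources (the article's actual sources aren't specified)
csv_data = io.StringIO("id,value\n1,10\n2,20\n")
web_rows = [(3, 30)]  # e.g. rows parsed from an HTTP response

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_sql (id INTEGER, value INTEGER)")
conn.executemany("INSERT INTO source_sql VALUES (?, ?)", [(4, 40)])
conn.execute("CREATE TABLE final (id INTEGER, value INTEGER)")

# Extract: pull rows from all three sources
rows = [(int(r["id"]), int(r["value"])) for r in csv.DictReader(csv_data)]
rows += web_rows
rows += conn.execute("SELECT id, value FROM source_sql").fetchall()

# Transform: an illustrative step -- double each value
transformed = [(i, v * 2) for i, v in rows]

# Load: push the combined data into the final SQL table
conn.executemany("INSERT INTO final VALUES (?, ?)", transformed)
print(conn.execute("SELECT COUNT(*), SUM(value) FROM final").fetchone())  # (4, 200)
```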

A key reason to use an orchestration engine is concurrent execution.
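Why concurrency matters for this workload: the three extract steps are independent, so an engine that runs them in parallel finishes in roughly the time of the slowest step rather than the sum of all three. A small sketch with Python's concurrent.futures (task names are hypothetical):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def task(name):
    time.sleep(0.2)  # simulate I/O-bound work (DB query, HTTP call, file read)
    return name

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=3) as pool:
    # map preserves input order even though tasks run concurrently
    results = list(pool.map(task, ["extract_csv", "extract_sql", "extract_web"]))
elapsed = time.perf_counter() - start

print(results)
# With three workers the 0.2 s tasks overlap, so total wall time is
# roughly 0.2 s instead of the 0.6 s a sequential run would take
```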

Kestra, designed for high throughput, performs better than Airflow in micro-batch processing. Kestra’s backend in Java might also contribute to its better performance compared to Airflow’s Python foundation.

Conclusion:

When looking at the return on investment when choosing an orchestration tool, there are several points to consider:

  • Installing Kestra is easier than installing Airflow: it doesn’t require Python dependencies, and it comes with a ready-to-use docker-compose file that uses few services, without the need to understand what an executor is in order to run tasks in parallel.
  • Creating pipelines with Kestra is simple, thanks to its syntax. You don’t need knowledge of a specific programming language because Kestra is designed to be language-agnostic. The declarative YAML design makes Kestra flows more readable than their Airflow DAG equivalents, allowing developers to significantly reduce development time.
  • Kestra demonstrated better execution times than Airflow in every configuration we tested.

