Non-Cloud (On-premises) Data Pipelines: A Beginner's Guide

Cloud-based data integration tools have gained significant popularity in recent years. With their simplified UIs, developing data pipelines, deploying to production, and setting up alerts and monitoring have become easy. But how do you handle on-premises data as efficiently as cloud solutions do? Are there robust ecosystem tools for this?

The answer is "Yes". There are many tools, but we prefer Apache Airflow and Kestra, so in this article we will talk about both. Airflow is developed in Python, Kestra in Java, so this is also Java vs. Python. Can we guess the winner?

Selecting the right orchestration tool involves an in-depth technical evaluation:

  • Cloud services are managed, whereas an on-premises solution is not. How long does it take to install and configure the tool, both in development and in production? This plays a major role in selecting any tool.
  • Development time needed to write a pipeline.
  • Deployment process and performance, i.e. how long do task executions take? Does it perform well? Does performance scale?

Administration (Installation & Configuration)

Airflow

Standalone model:

  1. The installation starts with the Python package installer (pip).
  2. Edit the configuration variables.
  3. Run airflow standalone, which starts an instance with a SQLite database and exposes the user interface on port 8080.

Standalone Airflow doesn't scale, i.e. it can't run jobs in parallel: standalone mode comes with the Sequential executor, which is not ideal for production. So, we need to switch to the
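The steps above can be sketched as follows (commands per the Airflow quick start; the official docs also recommend pinning a constraints file, omitted here for brevity):

```shell
# 1. Install Airflow with pip
pip install apache-airflow

# 2. Optionally edit airflow.cfg (created under $AIRFLOW_HOME on first run)

# 3. Start a single-process instance backed by SQLite;
#    the UI is then served on http://localhost:8080
airflow standalone
```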

Multicomponent model:

This is a docker-compose installation that installs Redis and Postgres services to manage workflows, and it comes with Celery as the execution engine.
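A typical setup, assuming the reference compose file published in the Airflow documentation (check the docs for the URL matching your Airflow version):

```shell
# Fetch the reference docker-compose file from the Airflow docs
curl -LfO 'https://airflow.apache.org/docs/apache-airflow/stable/docker-compose.yaml'

# Initialize the metadata database and the default user, then start the
# webserver, scheduler, Celery workers, Redis, and Postgres
docker compose up airflow-init
docker compose up -d
```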

Kestra

Standalone model:

A simple docker-compose up -d starts a Kestra instance alongside a Postgres container, exposing the web interface on port 8080.
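A minimal sketch of such a compose file, assuming Kestra's standalone server image and a Postgres backend (service names, image tag, and credentials here are illustrative and the Kestra datasource configuration is abbreviated; use the full file from the Kestra docs):

```yaml
# Illustrative only -- not the official Kestra compose file
services:
  postgres:
    image: postgres
    environment:
      POSTGRES_DB: kestra
      POSTGRES_USER: kestra
      POSTGRES_PASSWORD: k3str4

  kestra:
    image: kestra/kestra:latest
    command: server standalone
    ports:
      - "8080:8080"
    depends_on:
      - postgres
```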

Multicomponent/High availability mode:

It leverages Kafka, a high-throughput messaging system, and Elasticsearch, a powerful search and analytics engine, for handling workflows and data.

Ease of code development/Code Syntax

Airflow is based on Python and the concept of Directed Acyclic Graph (DAG). Here is a basic hello-world DAG:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator  # the bash_operator module path is deprecated

# Define the default arguments for the DAG
default_args = {
    'start_date': datetime(2024, 1, 1),
}

# Instantiate a DAG object
dag = DAG(
    'hello_world',
    default_args=default_args,
    description='A simple DAG to say hello to the world',
    schedule_interval=None,  # run only when triggered manually
)

# Define the task
hello_task = BashOperator(
    task_id='hello_task',
    bash_command='echo "Hello, World!"',
    dag=dag,
)

# A single task needs no dependency operators (e.g. upstream >> downstream)
hello_task
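Under the hood, a DAG is just a dependency graph that resolves to an execution order. A minimal sketch with Python's standard-library graphlib, using hypothetical task names, shows how an Airflow-style chain like extract >> transform >> load is ordered:

```python
from graphlib import TopologicalSorter

# Dependencies expressed as task -> set of upstream tasks,
# mirroring an Airflow chain: extract >> transform >> load
dag = {
    "transform": {"extract"},
    "load": {"transform"},
}

# Resolve the graph into a valid execution order
order = list(TopologicalSorter(dag).static_order())
print(order)  # extract runs first, load last
```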

Kestra syntax is based on YAML. Here is a basic hello-world flow:

id: hello_world
namespace: demo
description: A simple Kestra workflow to say hello to the world

tasks:
  - id: hello_task
    type: io.kestra.plugin.scripts.shell.Commands
    commands:
      - echo "Hello, World!"
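Chaining tasks is equally declarative: tasks in the list run sequentially by default, and a trigger plays the role of Airflow's schedule_interval. A hypothetical two-step flow (task and trigger type names follow Kestra's plugin naming; verify them against your Kestra version):

```yaml
id: hello_pipeline
namespace: demo

tasks:
  # Tasks run one after another in list order
  - id: extract
    type: io.kestra.plugin.scripts.shell.Commands
    commands:
      - echo "extracting"
  - id: load
    type: io.kestra.plugin.scripts.shell.Commands
    commands:
      - echo "loading"

triggers:
  # Cron-based schedule: run every day at 06:00
  - id: daily
    type: io.kestra.plugin.core.trigger.Schedule
    cron: "0 6 * * *"
```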

Python code is less readable and less approachable than YAML, but it has the advantage of being highly customizable.

Performance/Benchmarks

Task Details:

Extract data from SQL, CSV, and web sources, transform it, and finally push the combined data to a final SQL table, all on the same hardware.
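The benchmark's exact sources and schema aren't given, but the shape of the task can be sketched with Python's standard library, using SQLite and in-memory stand-ins for the CSV and web sources:

```python
import csv
import io
import sqlite3

# Stand-ins for the three sources (the article's actual sources aren't specified)
csv_data = io.StringIO("id,value\n1,10\n2,20\n")
web_rows = [(3, 30)]  # e.g. rows parsed from an HTTP response

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_sql (id INTEGER, value INTEGER)")
conn.executemany("INSERT INTO source_sql VALUES (?, ?)", [(4, 40)])
conn.execute("CREATE TABLE final (id INTEGER, value INTEGER)")

# Extract: pull rows from all three sources
rows = [(int(r["id"]), int(r["value"])) for r in csv.DictReader(csv_data)]
rows += web_rows
rows += conn.execute("SELECT id, value FROM source_sql").fetchall()

# Transform: an illustrative step -- double each value
transformed = [(i, v * 2) for i, v in rows]

# Load: push the combined data into the final SQL table
conn.executemany("INSERT INTO final VALUES (?, ?)", transformed)
print(conn.execute("SELECT COUNT(*), SUM(value) FROM final").fetchone())  # (4, 200)
```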

A key reason to use an orchestration engine is concurrent execution.
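Why concurrency matters for this workload: the three extract steps are independent, so an engine that runs them in parallel finishes in roughly the time of the slowest step rather than the sum of all three. A small sketch with Python's concurrent.futures (task names are hypothetical):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def task(name):
    time.sleep(0.2)  # simulate I/O-bound work (DB query, HTTP call, file read)
    return name

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=3) as pool:
    # map preserves input order even though tasks run concurrently
    results = list(pool.map(task, ["extract_csv", "extract_sql", "extract_web"]))
elapsed = time.perf_counter() - start

print(results)
# With three workers the 0.2 s tasks overlap, so total wall time is
# roughly 0.2 s instead of the 0.6 s a sequential run would take
```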

Kestra, designed for high throughput, performs better than Airflow in micro-batch processing. Kestra’s backend in Java might also contribute to its better performance compared to Airflow’s Python foundation.

Conclusion:

When looking at the return on investment when choosing an orchestration tool, there are several points to consider:

  • Installing Kestra is easier than installing Airflow: it doesn’t require Python dependencies, and it comes with a ready-to-use docker-compose file that uses few services, without the need to understand what an executor is in order to run tasks in parallel.
  • Creating pipelines with Kestra is simple, thanks to its syntax. You don’t need knowledge of a specific programming language because Kestra is designed to be language-agnostic. The declarative YAML design makes Kestra flows more readable than their Airflow DAG equivalents, allowing developers to significantly reduce development time.
  • Kestra demonstrated better execution times than Airflow in every configuration we tested.

