Non-Cloud (On-Premises) Data Pipelines: A Beginner's Guide
Cloud-based data integration tools have gained significant popularity these days. Developing data pipelines, deploying them to production, and setting up alerts and monitoring have all become easy with simplified UIs. But how do you handle on-premises data as efficiently as cloud solutions do? Are there robust ecosystem tools for this?
The answer is "Yes". There are many tools, but we prefer Apache Airflow and Kestra, so in this article we would like to talk about both. Airflow is written in Python, Kestra in Java. It is Java vs. Python, and we can guess the winner.
Selecting the right orchestration tool involves an in-depth technical evaluation:
Administration (Installation & Configuration)
Airflow
Standalone model:
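Airflow's standalone mode is the quickest way to try it locally: a single command initializes a SQLite metadata database, creates an admin user, and runs all components in one process. A minimal sketch, assuming a Python virtual environment:

pip install apache-airflow        # install Airflow
export AIRFLOW_HOME=~/airflow     # where Airflow keeps its config and SQLite metadata DB
airflow standalone                # initializes the DB, creates an admin user, starts webserver and scheduler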
Multicomponent model:
This is a docker-compose installation that brings up Redis and Postgres services to manage workflow state and comes with Celery as the execution engine.
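A minimal sketch following the official Airflow docker-compose guide (the download URL is version-specific; adjust it to the Airflow version you deploy):

curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.9.2/docker-compose.yaml'
mkdir -p ./dags ./logs ./plugins ./config
echo -e "AIRFLOW_UID=$(id -u)" > .env    # run containers as the host user
docker compose up airflow-init           # initialize the metadata database and create the first account
docker compose up -d                     # start webserver, scheduler, Postgres, Redis, and Celery workers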
Kestra
Standalone model:
A simple docker-compose up -d starts a Kestra instance alongside a Postgres container, exposing the web interface on port 8080.
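A minimal sketch, assuming the docker-compose.yml published in the Kestra repository:

curl -o docker-compose.yml https://raw.githubusercontent.com/kestra-io/kestra/develop/docker-compose.yml
docker compose up -d     # starts Kestra plus a Postgres container
# The web UI is then available at http://localhost:8080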
Multicomponent/High-availability mode:
In this mode, Kestra leverages Kafka, a high-throughput messaging system, as its queue, and Elasticsearch, a powerful search and analytics engine, as its repository for workflow executions and data.
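Switching Kestra's backend from the default database to Kafka and Elasticsearch is a matter of configuration. A sketch of the relevant application settings (key names follow Kestra's documentation and may vary by version; the hosts are placeholders):

kestra:
  queue:
    type: kafka
  repository:
    type: elasticsearch
  elasticsearch:
    client:
      http-hosts: "http://localhost:9200"   # placeholder Elasticsearch endpoint
  kafka:
    client:
      properties:
        bootstrap.servers: "localhost:9092" # placeholder Kafka brokers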
Ease of code development/Code Syntax
Airflow is based on Python and the concept of a Directed Acyclic Graph (DAG). Here is a basic hello-world DAG:
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Define the default arguments for the DAG
default_args = {
    'start_date': datetime(2024, 1, 1),
}

# Instantiate a DAG object
dag = DAG('hello_world',
          default_args=default_args,
          description='A simple DAG to say hello to the world',
          schedule_interval=None,
          )

# Define the task
hello_task = BashOperator(
    task_id='hello_task',
    bash_command='echo "Hello, World!"',
    dag=dag,
)

# Define the task dependencies (if any); a single task needs none
hello_task
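With more than one task, dependencies are declared with the bitshift operators, for example extract_task >> transform_task >> load_task.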
Kestra syntax is based on YAML. Here is a basic hello-world flow:
id: hello_world
namespace: example   # Kestra flows require a namespace
description: A simple Kestra workflow to say hello to the world

tasks:
  - id: hello_task
    type: io.kestra.plugin.scripts.shell.Commands
    commands:
      - echo "Hello, World!"
Python code is less readable and less immediately understandable than YAML, but it has the advantage of being highly customizable.
Performance/Benchmarks
Task Details:
Extract data from SQL, CSV, and web sources, transform it, and finally push the combined data to a final SQL table, with both tools running on the same hardware.
The key benefit of using an orchestration engine is concurrent execution; the sketch below shows how independent tasks fan out in parallel.
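As an illustration, here is a minimal Airflow sketch (task names are hypothetical) in which the three extract tasks share no dependencies, so the scheduler can run them concurrently before the final load:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG('concurrent_etl_sketch',
         start_date=datetime(2024, 1, 1),
         schedule_interval=None) as dag:
    extract_sql = BashOperator(task_id='extract_sql', bash_command='echo "extract from SQL"')
    extract_csv = BashOperator(task_id='extract_csv', bash_command='echo "extract from CSV"')
    extract_web = BashOperator(task_id='extract_web', bash_command='echo "extract from web"')
    load_final = BashOperator(task_id='load_final', bash_command='echo "push combined data to final SQL table"')

    # No dependencies among the three extracts, so the scheduler runs them
    # in parallel; the load task waits for all of them to finish
    [extract_sql, extract_csv, extract_web] >> load_final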
Kestra, designed for high throughput, performs better than Airflow in micro-batch processing. Kestra’s backend in Java might also contribute to its better performance compared to Airflow’s Python foundation.
Conclusion:
When evaluating the return on investment of an orchestration tool, there are several points to consider: