Mastering Apache Airflow with Docker Compose: A Step-by-Step Guide
Adarsh Singh Dikhit
AI/ML Module Lead at ExamRoom.AI || Tableau || ThoughtSpot || LLM || CV || NLP || Truefoundry
In today's fast-paced world of data engineering, automating, scheduling, and monitoring workflows have become essential to managing complex data pipelines. One of the most powerful tools in this space is Apache Airflow, which enables users to programmatically author, schedule, and monitor workflows.
In this article, I’ll introduce Apache Airflow, explain its architecture, and guide you through setting it up with Docker Compose for streamlined deployment. I’ll also cover best practices to get the most out of Airflow in a containerized environment.
What is Apache Airflow?
Apache Airflow is an open-source platform designed to automate workflows by allowing users to author, schedule, and monitor workflows as Directed Acyclic Graphs (DAGs). Workflows are defined using Python code, which makes Airflow highly flexible and powerful for managing ETL pipelines, data processing tasks, and machine learning workflows.
Key Features of Apache Airflow:
- Workflows as code: pipelines are defined in Python, which makes them easy to version, test, and extend.
- Scheduling: DAGs can run on a fixed schedule (for example, daily) or be triggered manually.
- Monitoring: a built-in web UI shows task statuses, logs, and the overall health of your pipelines.
- Extensibility: a rich ecosystem of operators and integrations supports ETL, data processing, and machine learning workflows.
Airflow Architecture
The architecture of Airflow is composed of the following main components:
- Scheduler: parses DAGs and decides when each task should run.
- Web server: serves the Airflow UI used to monitor and manage DAGs.
- Workers: execute the individual tasks dispatched by the scheduler.
- Metadata database: stores DAG runs, task states, users, and other runtime metadata.
Setting Up Airflow with Docker Compose
Why Docker Compose?
Docker Compose simplifies setting up an environment for running Airflow by using containers for each component (scheduler, web server, workers, and metadata database). With Docker Compose, you can deploy the entire Airflow stack with a single command, making it easy to manage, scale, and customize.
By using Docker Compose, you also ensure that the environment is consistent across different systems. This setup is ideal for both local development and production environments.
Step-by-Step Guide to Running Airflow with Docker Compose
I’ve created a GitHub repository that provides the complete Docker Compose setup for running Apache Airflow. You can find the repository here: Apache Airflow with Docker Compose.
1. Clone the Repository
git clone https://github.com/adarsh-dikhit/apache_airflow_with_docker_compose.git
cd apache_airflow_with_docker_compose
2. Set Up Environment Variables
You need to create a .env file to set environment variables such as AIRFLOW_UID (the host user ID that the Airflow containers run as). A sample .env file is included in the repository.
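On Linux, you can generate the file from your current user ID. This is a minimal sketch, assuming the setup follows the convention of the official Airflow images, which fall back to AIRFLOW_UID=50000 when the variable is not set:
echo -e "AIRFLOW_UID=$(id -u)" > .env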
3. Initialize Airflow
Before starting Airflow, you need to initialize the Airflow database.
docker-compose up airflow-init
This command will initialize the metadata database, which Airflow uses to store DAG runs, task states, connections, and other metadata.
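Once the init container exits, you can review its output to confirm the database was set up correctly (airflow-init is the service name used in the command above):
docker-compose logs airflow-init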
4. Start the Airflow Services
Now that the database is set up, you can start all the Airflow services:
docker-compose up
This command will start the following services:
- The Airflow web server, which serves the UI
- The scheduler
- One or more workers
- The metadata database
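If you prefer to keep your terminal free, you can start the stack in the background and then check that the containers are up and healthy. These are standard Docker Compose commands and work regardless of how the individual services are named:
docker-compose up -d
docker-compose ps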
5. Access the Airflow UI
Once the services are up, you can access the Airflow web UI by navigating to http://localhost:8080 in your browser.
Use the default credentials provided in the docker-compose.yml file to log in.
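If you'd rather not rely on the default credentials, you can create your own admin user with the Airflow CLI. This is a sketch that assumes the web server service is named airflow-webserver, as in the official Airflow Compose file; adjust the service name and the user details to match the repository's docker-compose.yml:
docker-compose run --rm airflow-webserver airflow users create \
    --username admin --password admin \
    --firstname Admin --lastname User \
    --role Admin --email admin@example.com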
6. Working with DAGs
To add your custom DAGs, place your Python files defining the DAGs inside the dags/ directory in the repository. The DAGs will be automatically detected by Airflow and will appear in the UI.
You can also monitor task statuses, logs, and the overall health of your pipelines directly from the UI.
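If you prefer the command line, you can also confirm which DAGs Airflow has picked up. This assumes the Airflow 2.x CLI and, again, a web server service named airflow-webserver; adjust as needed:
docker-compose exec airflow-webserver airflow dags list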
Example DAG: Scheduling a Simple Task
Here’s a simple example of how a basic DAG can be written:
from airflow import DAG
from airflow.operators.empty import EmptyOperator  # EmptyOperator replaces the deprecated DummyOperator in Airflow 2.4+
from datetime import datetime

# Default arguments applied to every task in this DAG
default_args = {
    'owner': 'airflow',
    'start_date': datetime(2024, 10, 1),
    'retries': 1
}

# Run once a day; catchup=False prevents Airflow from backfilling runs for past dates
dag = DAG('example_dag', default_args=default_args, schedule_interval='@daily', catchup=False)

# Two placeholder tasks that mark the start and end of the workflow
start_task = EmptyOperator(task_id='start_task', dag=dag)
end_task = EmptyOperator(task_id='end_task', dag=dag)

# end_task runs only after start_task has completed successfully
start_task >> end_task
This DAG defines a basic workflow with two tasks, start_task and end_task, and schedules it to run daily. Drop this Python file into the dags/ folder and it will show up in the Airflow UI.
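Before waiting for the scheduler, you can dry-run a single task from the CLI to check that the DAG parses and the task executes. This assumes the same airflow-webserver service name as above; airflow tasks test runs the task locally without recording its state in the database:
docker-compose run --rm airflow-webserver airflow tasks test example_dag start_task 2024-10-01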
Scaling Airflow with Docker Compose
As your workflows grow, you may need to scale Airflow by adding more workers. This can be easily done with Docker Compose by scaling the worker service:
docker-compose up --scale worker=3
This command will start three worker containers, allowing tasks to run in parallel and improving throughput for large DAGs. Note that scaling workers only helps when the setup uses a distributed executor such as the CeleryExecutor, where workers pull tasks from a shared queue.
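To watch the scaled workers pick up tasks, you can follow their logs (worker is the service name used in the scale command above):
docker-compose logs -f worker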
Best Practices for Running Airflow with Docker Compose
- Keep your DAG files under version control and mount the dags/ directory as a volume, so changes are picked up without rebuilding images.
- Put environment-specific settings (such as AIRFLOW_UID and credentials) in the .env file instead of hard-coding them in docker-compose.yml.
- Pin the Airflow image version in docker-compose.yml so every environment runs the same release.
- Keep an eye on the scheduler, workers, and metadata database, and scale workers as the number and size of your DAGs grow.
Conclusion
With Docker Compose, you can easily deploy Apache Airflow, taking advantage of its powerful scheduling and orchestration capabilities. This setup ensures consistency, ease of deployment, and scalability, making it ideal for both local development and production environments.
I hope this guide helps you master Airflow with Docker Compose. Check out my GitHub repository for the full setup, and feel free to contribute or leave feedback.
GitHub Repository: Apache Airflow with Docker Compose
Let me know in the comments if you have any questions or suggestions. Happy automating!