A Beginner's Guide to Installing Apache Airflow
Apache Airflow is an open-source platform used for orchestrating complex workflows and data processing pipelines. Its flexibility, scalability, and ease of use have made it a popular choice among data engineers and developers. Installing Airflow for the first time might seem daunting, but with the right guidance, it can be a straightforward process. Here's a beginner's guide to installing Apache Airflow.
Pre-installation Checklist
Before you begin the installation process, there are a few prerequisites to consider:
1. Choose Your Installation Method
- Using pip: Suitable for getting started quickly or for development purposes.
- Docker: Provides an isolated environment and simplifies dependency management.
- Package Manager (e.g., apt, yum): Available for certain distributions but might not offer the latest version.
In this guide, we will cover only the pip method.
2. Python Version
Airflow 2.7 supports Python 3.8 through 3.11. Ensure that a compatible Python version is installed on your system.
3. Database
Choose a database backend for Airflow's metadata storage. Popular options include PostgreSQL, MySQL, and SQLite (best for testing or small-scale deployments).
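Once your backend is running, you can point Airflow at it through a connection string. A minimal sketch, assuming a local PostgreSQL instance with an airflow database and user already created:
$ export AIRFLOW__DATABASE__SQL_ALCHEMY_CONN="postgresql+psycopg2://airflow:airflow@localhost:5432/airflow"
If this variable is unset, Airflow defaults to a SQLite database inside AIRFLOW_HOME, which is fine for the walkthrough below.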
4. Optional Dependencies
Some features in Airflow might require additional dependencies (e.g., Apache Hadoop, Microsoft Azure, Google Cloud, etc.). Install these based on your workflow needs.
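Extras are added in brackets on the pip install line. For example, a setup that needs Google Cloud and Postgres support might look like the following (the extras shown are illustrative; pick the ones your pipelines actually use):
$ pip install "apache-airflow[google,postgres]==2.7.2" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.7.2/constraints-3.8.txt"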
Installation Steps
1. Install Airflow using pip
Create a Virtual Environment (Optional but Recommended)
Creating a virtual environment keeps your Python installation clean and helps manage dependencies.
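For example, using Python's built-in venv module (airflow-venv is just a placeholder name):
$ python3 -m venv airflow-venv
$ source airflow-venv/bin/activate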
Install Airflow
Use pip to install the latest stable version of Airflow:
$ export AIRFLOW_HOME="/workspaces/hands-on-introduction-data-engineering-4395021/airflow"
$ pip install "apache-airflow[celery]==2.7.2" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.7.2/constraints-3.8.txt"
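Note that the constraints file above is pinned to Python 3.8; if you run a different version, use the matching file. A small sketch that derives the right URL automatically:
$ PY_VER="$(python3 -c 'import sys; print(f"{sys.version_info.major}.{sys.version_info.minor}")')"
$ pip install "apache-airflow[celery]==2.7.2" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.7.2/constraints-${PY_VER}.txt"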
2. Initialize Airflow Database
Initialize the Database
Airflow uses a database to store its metadata. Initialize it with the command below, then run airflow db check to verify the connection:
$ airflow db init
$ airflow db check
[2023-12-08T21:54:33.209+0000] {db.py:1755} INFO - Connection successful.
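A fresh pip-based installation has no default login, so create an admin user before opening the UI. The name and email values below are placeholders; the command prompts for a password when --password is omitted:
$ airflow users create --username admin --firstname Admin --lastname User --role Admin --email admin@example.com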
3. Start the Web Server and Scheduler
Start the Scheduler
The scheduler is responsible for triggering tasks according to their dependencies and schedules.
$ airflow scheduler
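Running the scheduler in the foreground ties up your terminal. If you prefer, the -D (daemon) flag runs it in the background and, by default, writes the airflow-scheduler.pid file that the shutdown step below relies on:
$ airflow scheduler -D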
Start the Web Server
The web server provides a dashboard and interface to interact with Airflow.
$ airflow webserver --port 8080
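The webserver accepts the same -D flag if you want it in the background. Once it is up, the /health endpoint offers a quick sanity check:
$ airflow webserver --port 8080 -D
$ curl http://localhost:8080/health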
Kill the Web Server & Scheduler Process
$ cat $AIRFLOW_HOME/airflow-scheduler.pid | xargs kill
$ echo "" > $AIRFLOW_HOME/airflow-scheduler.pid
$ cat $AIRFLOW_HOME/airflow-webserver.pid | xargs kill
$ echo "" > $AIRFLOW_HOME/airflow-webserver.pid
4. Access the Airflow UI
Open your web browser and visit http://localhost:8080 to access the Airflow web interface. Log in with the admin account you created earlier (via airflow users create) and start creating your workflows.
Conclusion
Installing Apache Airflow is a crucial first step towards building and managing data pipelines efficiently. By following these steps, you can set up Airflow and start orchestrating your workflows. Remember, while this pip-based installation method is straightforward, ensure your system meets the prerequisites: a supported Python version (3.8 or newer for Airflow 2.7), an appropriate database backend, and any additional dependencies your workflows require.