End-to-End Data Engineering: OpenAQ API to Real-Time Dashboards Using Spark and Airflow
Objective
The main goal of this project is to gain hands-on experience with the tools commonly used in data engineering. To achieve this, we designed and implemented an ELT pipeline (more precisely an ELTL pipeline: Extract, Load, Transform, Load). And to get a better understanding of the collected data, a minimal dashboard is also available.
Data source choice
Like many software engineers, I have too many unfinished side projects. As it turned out, the main reason for this lack of motivation was that the data I used to pick did not really interest me. So for this project, I took the time to choose a solid data source that I am genuinely interested in. Thanks to Joseph Wibowo's advice, I explored the public-apis repo and found the OpenAQ API.
Using the OpenAQ API, I chose to explore the air quality in the area where I live, the Paris (Île-de-France) area.
System design
The first part of the system is an ELTL data pipeline that extracts data from the OpenAQ API and loads it into a data warehouse daily.
The second part is the data visualization app: a web dashboard that requests data from a REST API.
The following diagram summarizes the technology choices for the system.
For the cloud provider, the easiest choice was Google Cloud Platform (GCP), given its $300 free trial valid for 90 days. As a side note, the Data Talks Zoomcamp has excellent material for environment and software setup.
Extract - Load
OpenAQ exposes a standard REST API accessible with a free API key. For this step we used Spark to comfortably handle large volumes of data.
The extracted data is saved to a GCS bucket under a directory named after the extraction date.
You can find the full code in spark_api_to_gcs.py under the “spark” directory.
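To make the idea concrete, here is a minimal sketch of the extract step. The bucket name, location ID, and the OpenAQ endpoint shown here are assumptions for illustration only; the actual logic lives in spark_api_to_gcs.py.

```python
# Sketch of the Extract-Load step (illustrative; names and endpoint are assumptions).
import json
from datetime import date

import requests
from pyspark.sql import SparkSession

BUCKET = "gs://openaq-data-lake"   # hypothetical data-lake bucket
API_KEY = "YOUR_OPENAQ_API_KEY"    # free key obtained from OpenAQ
LOCATION_IDS = [12345]             # placeholder IDs of Ile-de-France stations

spark = SparkSession.builder.appName("openaq_api_to_gcs").getOrCreate()

# Pull the latest measurements for each monitored location.
records = []
for location_id in LOCATION_IDS:
    resp = requests.get(
        f"https://api.openaq.org/v3/locations/{location_id}/latest",
        headers={"X-API-Key": API_KEY},
        timeout=30,
    )
    resp.raise_for_status()
    records.extend(resp.json().get("results", []))

# Let Spark infer a schema from the raw JSON records, then write them to the
# data lake under a directory named after the extraction date.
raw_df = spark.read.json(spark.sparkContext.parallelize([json.dumps(r) for r in records]))
raw_df.write.mode("overwrite").json(f"{BUCKET}/{date.today().isoformat()}")
```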
Transform - Load
With the data stored in the GCS bucket (Data Lake), we can now transform and structure it into a star schema. This transformation is carried out using a PySpark job that processes the raw data and organizes it according to the following ERD.
The data is then stored in a BigQuery dataset. All of this is done in the spark_gcs_to_bq.py job.
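As an illustration, here is a simplified sketch of what such a transform job can look like. The column names, the two star-schema tables, and the dataset and bucket names are assumptions; the real logic lives in spark_gcs_to_bq.py and relies on the spark-bigquery connector being available on the cluster.

```python
# Sketch of the Transform-Load step (illustrative; schema and names are assumptions).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

BUCKET = "gs://openaq-data-lake"   # hypothetical data-lake bucket
TEMP_BUCKET = "openaq-spark-tmp"   # staging bucket required by the BigQuery connector
DATASET = "openaq"                 # hypothetical BigQuery dataset

spark = SparkSession.builder.appName("openaq_gcs_to_bq").getOrCreate()

# Read every daily extraction folder from the data lake.
raw = spark.read.json(f"{BUCKET}/*")

# Dimension table: one row per monitoring location.
dim_location = (
    raw.select("location_id", "location_name", "latitude", "longitude")
       .dropDuplicates(["location_id"])
)

# Fact table: one row per measurement, keyed by location and timestamp.
fact_measurement = raw.select(
    "location_id",
    F.to_timestamp("datetime").alias("measured_at"),
    "parameter",
    "value",
    "unit",
)

# Write both tables to the BigQuery dataset.
for name, df in [("dim_location", dim_location), ("fact_measurement", fact_measurement)]:
    (df.write.format("bigquery")
       .option("table", f"{DATASET}.{name}")
       .option("temporaryGcsBucket", TEMP_BUCKET)
       .mode("overwrite")
       .save())
```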
Pipeline orchestration
After implementing the Extract-Load and Transform-Load jobs, we now need an orchestration layer to run our pipeline daily.
Airflow is a great choice, considering its robust architecture and the fact that it is well maintained under the Apache Software Foundation.
Our Airflow pipeline is very simple and minimal. It is composed of two tasks: the Extract-Load job (spark_api_to_gcs.py) followed by the Transform-Load job (spark_gcs_to_bq.py), as sketched below.
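Here is a minimal sketch of what such a DAG can look like, assuming the two PySpark scripts are submitted with SparkSubmitOperator; the file paths and schedule are illustrative, not the project's exact configuration.

```python
# Sketch of the daily Airflow DAG (illustrative; paths and settings are assumptions).
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="openaq_daily_pipeline",
    schedule_interval="@daily",   # run the pipeline once per day
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    extract_load = SparkSubmitOperator(
        task_id="api_to_gcs",
        application="/opt/airflow/spark/spark_api_to_gcs.py",
    )
    transform_load = SparkSubmitOperator(
        task_id="gcs_to_bq",
        application="/opt/airflow/spark/spark_gcs_to_bq.py",
    )

    # The transform job only runs once the extraction has succeeded.
    extract_load >> transform_load
```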
Visualization App
Our final app is composed of a REST API and a dashboard frontend client.
For the backend REST API we use Flask. It is a great fit for our use case, being a lightweight framework with a gentle learning curve.
To avoid overloading the network with large payloads, we split the data into several chunks and send them using Server-Sent Events (SSE), as sketched below.
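For illustration, here is a minimal sketch of such an SSE endpoint in Flask. The route name, chunk size, and the fetch_measurements() helper are hypothetical placeholders standing in for the actual BigQuery query.

```python
# Sketch of a chunked Server-Sent Events endpoint (illustrative; names are assumptions).
import json

from flask import Flask, Response

app = Flask(__name__)
CHUNK_SIZE = 500  # number of rows pushed per event


def fetch_measurements():
    """Placeholder for the query that returns measurement rows as dicts."""
    return []


@app.route("/api/measurements/stream")
def stream_measurements():
    def generate():
        rows = fetch_measurements()
        # Send the result set in several chunks so the client can start
        # rendering before the full payload has arrived.
        for i in range(0, len(rows), CHUNK_SIZE):
            yield f"data: {json.dumps(rows[i:i + CHUNK_SIZE])}\n\n"
        # Signal the end of the stream to the client.
        yield "event: done\ndata: {}\n\n"

    return Response(generate(), mimetype="text/event-stream")
```

Streaming the data this way lets the dashboard start rendering as soon as the first chunks arrive instead of waiting for the whole payload.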
For data visualization, we use DC.js, which leverages the D3.js library to render charts and Crossfilter.js to handle cross-filtering of the data. In addition, we use Leaflet to render the map component.
Conclusion
Deploying the Flask backend was another challenge. For cost reasons, we opted for the free plan on the Render platform. This comes with a small drawback: the server takes up to 50 seconds to spin up.
This project was a great start in the data engineering field, and an excellent opportunity to gain a deeper understanding of every step of data pipeline design and implementation.