End-to-End Data Engineering: OpenAQ API to Real-Time Dashboards Using Spark and Airflow


Objective

The main goal of this project is to gain hands-on experience with the various tools used in the data engineering field. To achieve this, we designed and implemented an ELTL (Extract, Load, Transform, Load) pipeline. And to get a better understanding of the exploited data, a minimal dashboard is also available.

Data source choice

Like any software engineer, I have too many incomplete side projects. As it turns out, the main reason for this lack of motivation is that the data I used to pick did not interest me. So for this project, I took the time to choose a solid data source that I am genuinely interested in. Thanks to Joseph Wibowo's advice, I explored the public-apis repo and found the OpenAQ API.

Using the OpenAQ API, I chose to explore the air quality of the area where I live: Paris and the wider Île-de-France region.

System design

System design diagram

The system is first composed of an ELTL data pipeline that extracts data from the OpenAQ API daily and stores it in a data warehouse.

The second part of the system is the data visualization app: a web dashboard requesting data from a REST API.

The following diagram summarizes the technology choices for the system.

Technical system design

For the cloud provider, the easiest choice was Google Cloud Platform (GCP), given that it offers a $300 trial for 90 days. By the way, the Data Talks Zoomcamp has excellent material for environment and software setup.

  • Data Lake: Google Cloud Storage (GCS), to store raw, unstructured data extractions.
  • Data Warehouse: BigQuery (BQ), to store transformed and structured data.
  • Data processing middleware: PySpark, well suited to processing large volumes of data. Setup steps can be found here.
  • Pipeline orchestration: Airflow, a workflow management platform developed at Airbnb and maintained by the Apache Foundation.
  • REST API: Flask, a lightweight Python framework.
  • Dashboard: the dc.js library, to render charts and manage cross-filtering.

Extract - Load

OpenAQ exposes a standard REST API, accessible with a free API key. For this step we used Spark to easily handle large volumes of data.

The extracted data is saved to a GCS bucket under a directory named after the extraction date.

You can find the full code for spark_api_to_gcs.py under the “spark” directory.
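To give an idea of the shape of this step, here is a minimal sketch of an API-to-GCS extraction job. The endpoint, query parameters, and bucket name are illustrative assumptions, not the exact code from the repository:

```python
# Hypothetical sketch of the extraction step; endpoint, parameters, and
# bucket name are assumptions. Writing to gs:// paths requires the GCS
# connector to be configured on the Spark cluster.
import json
from datetime import date

import requests
from pyspark.sql import SparkSession

API_URL = "https://api.openaq.org/v3/measurements"  # assumed endpoint
API_KEY = "YOUR_OPENAQ_API_KEY"                     # free key from OpenAQ
BUCKET = "gs://air-quality-data-lake"               # illustrative bucket name

spark = SparkSession.builder.appName("api_to_gcs").getOrCreate()

# Query the day's measurements (parameters assumed).
resp = requests.get(
    API_URL,
    headers={"X-API-Key": API_KEY},
    params={"date_from": str(date.today()), "limit": 1000},
    timeout=30,
)
resp.raise_for_status()
records = resp.json().get("results", [])

# Parallelize the raw JSON records and write them to a date-named directory,
# mirroring the "extraction date as a name" layout described above.
df = spark.read.json(spark.sparkContext.parallelize([json.dumps(r) for r in records]))
df.write.mode("overwrite").json(f"{BUCKET}/raw/{date.today().isoformat()}")
```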

Transform - Load

With the data stored in the GCS bucket (Data Lake), we can now transform and structure it into a star schema. This transformation is carried out using a PySpark job that processes the raw data and organizes it according to the following ERD.

Warehouse's Entity-Relationship Diagram

The data is then stored in a BigQuery dataset. All of this is done in the spark_gcs_to_bq.py job.
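A hedged sketch of what this transform job might look like follows. The table and column names are illustrative assumptions (the real schema is defined in the ERD above), and the write uses the spark-bigquery connector:

```python
# Hypothetical sketch of the transform step; table and column names are
# illustrative assumptions, not the exact code from spark_gcs_to_bq.py.
from datetime import date

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

BUCKET = "gs://air-quality-data-lake"  # illustrative bucket name
BQ_DATASET = "air_quality"             # illustrative dataset name

spark = SparkSession.builder.appName("gcs_to_bq").getOrCreate()

# Read the raw extraction for the current day from the data lake.
raw = spark.read.json(f"{BUCKET}/raw/{date.today().isoformat()}")

# Split the flat records into a star schema: a dimension table for
# measurement locations and a fact table of measurements.
dim_location = (
    raw.select("locationId", "location", "coordinates.latitude", "coordinates.longitude")
    .dropDuplicates(["locationId"])
)
fact_measurement = raw.select(
    "locationId",
    "parameter",
    F.col("value").cast("double"),
    F.to_timestamp("date.utc").alias("measured_at"),
)

# Write both tables to BigQuery via the spark-bigquery connector
# (the indirect write method needs a temporary GCS bucket).
for name, df in [("dim_location", dim_location), ("fact_measurement", fact_measurement)]:
    (
        df.write.format("bigquery")
        .option("table", f"{BQ_DATASET}.{name}")
        .option("temporaryGcsBucket", "air-quality-tmp")  # assumed temp bucket
        .mode("overwrite")
        .save()
    )
```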

Pipeline orchestration

After implementing the Extract-Load and Transform-Load jobs, we need an orchestration middleware to execute our pipeline daily.

Airflow is a great choice, considering its robust architecture and the fact that it is well maintained by the Apache Foundation.

Our Airflow pipeline is simple and minimal. It is composed of two tasks (a sketch of the DAG follows the list):

  • api_to_gcs: query the OpenAQ API and save the raw response to GCS.
  • gcs_to_bq: transform and structure the data in GCS, then save it into a BigQuery dataset.
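A minimal sketch of such a daily DAG, assuming the two Spark jobs are triggered with spark-submit (the DAG id, paths, and operator choice are illustrative assumptions):

```python
# Hypothetical sketch of the daily DAG; task implementations and DAG id are
# illustrative assumptions, not the exact code from the repository.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="openaq_pipeline",       # assumed DAG id
    schedule_interval="@daily",     # run the pipeline once a day
    start_date=datetime(2023, 1, 1),
    catchup=False,
) as dag:
    # Extract-Load: query the OpenAQ API and save the raw data to GCS.
    api_to_gcs = BashOperator(
        task_id="api_to_gcs",
        bash_command="spark-submit /opt/spark/spark_api_to_gcs.py",  # assumed path
    )

    # Transform-Load: structure the raw data and load it into BigQuery.
    gcs_to_bq = BashOperator(
        task_id="gcs_to_bq",
        bash_command="spark-submit /opt/spark/spark_gcs_to_bq.py",   # assumed path
    )

    api_to_gcs >> gcs_to_bq
```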

Visualization App

Our final app is composed of a REST API and a dashboard frontend client.

For the backend REST API we use Flask. It is a good fit for our use case, being a lightweight framework with a gentle learning curve.

To avoid overloading the network with large payloads, we split the data into several chunks and send them using Server-Sent Events (SSE).
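A minimal sketch of such a chunked SSE endpoint in Flask is shown below. The route name, chunk size, and the fetch_measurements() helper are illustrative assumptions:

```python
# Hypothetical sketch of a chunked Server-Sent Events endpoint; route name,
# chunk size, and fetch_measurements() are illustrative assumptions.
import json

from flask import Flask, Response

app = Flask(__name__)
CHUNK_SIZE = 5000  # rows per SSE message (assumed)


def fetch_measurements():
    """Placeholder for the query returning measurement rows from BigQuery."""
    return [{"parameter": "pm25", "value": 12.3}]  # illustrative data


@app.route("/api/measurements/stream")
def stream_measurements():
    def generate():
        rows = fetch_measurements()
        # Send the data in chunks so the client can render progressively
        # instead of waiting for one large payload.
        for i in range(0, len(rows), CHUNK_SIZE):
            chunk = rows[i : i + CHUNK_SIZE]
            yield f"data: {json.dumps(chunk)}\n\n"
        # Signal the client that the stream is complete.
        yield "event: done\ndata: {}\n\n"

    return Response(generate(), mimetype="text/event-stream")
```

On the client side, an EventSource subscriber can append each chunk to its in-memory dataset as it arrives, which keeps the dashboard responsive.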

For data visualization, we use dc.js, which leverages the D3.js library to render graphs and Crossfilter.js to manage cross-filtering. In addition, we use Leaflet to render the map component.

You can try the dashboard here.


Conclusion

Deploying the Flask backend was another challenge. For financial reasons, we chose the free plan on the Render platform. This comes with a small drawback: the server takes up to 50 seconds to wake up after being idle.

This project was a great start in the data engineering field, and an excellent opportunity to get a deeper understanding of every step of data pipeline design and implementation.
