End-to-End Data Engineering: OpenAQ API to Real-Time Dashboards Using Spark and Airflow
Objective
The main goal of this project is to gain hands-on experience with the tools commonly used in data engineering. To achieve this, we designed and implemented an ELT pipeline (more precisely an ELTL pipeline: Extract, Load, Transform, Load). And to get a better understanding of the collected data, a minimal dashboard is also available.
Data source choice
Like many software engineers, I have too many unfinished side projects. As it turned out, the main reason for this lack of motivation was that the data I used to pick did not really interest me. So for this project, I took the time to choose a solid data source that I am genuinely interested in. Thanks to Joseph Wibowo's advice, I explored the public-apis repo and found the OpenAQ API.
Using the OpenAQ API, I chose to explore the air quality in the area where I live, the Paris (Île-de-France) area.
System design
The first part of the system is an ELTL data pipeline that extracts data from the OpenAQ API and loads it into a data warehouse daily.
The second part is the data visualization app: a web dashboard that requests data from a REST API.
The following diagram summarizes the technology choices for the system.
For the cloud provider, the easiest choice was Google Cloud Platform (GCP), given its $300 free trial valid for 90 days. As a side note, the Data Talks Zoomcamp has excellent material for environment and software setup.
Extract - Load
OpenAQ exposes a standard REST API accessible with a free API key. For this step we used Spark to comfortably handle large volumes of data.
The extracted data is saved to a GCS bucket under a directory named after the extraction date.
You can find the full code in spark_api_to_gcs.py under the “spark” directory.
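To make the idea concrete, here is a minimal sketch of the extract step. The bucket name, location ID, and the OpenAQ endpoint shown here are assumptions for illustration only; the actual logic lives in spark_api_to_gcs.py.

```python
# Sketch of the Extract-Load step (illustrative; names and endpoint are assumptions).
import json
from datetime import date

import requests
from pyspark.sql import SparkSession

BUCKET = "gs://openaq-data-lake"   # hypothetical data-lake bucket
API_KEY = "YOUR_OPENAQ_API_KEY"    # free key obtained from OpenAQ
LOCATION_IDS = [12345]             # placeholder IDs of Ile-de-France stations

spark = SparkSession.builder.appName("openaq_api_to_gcs").getOrCreate()

# Pull the latest measurements for each monitored location.
records = []
for location_id in LOCATION_IDS:
    resp = requests.get(
        f"https://api.openaq.org/v3/locations/{location_id}/latest",
        headers={"X-API-Key": API_KEY},
        timeout=30,
    )
    resp.raise_for_status()
    records.extend(resp.json().get("results", []))

# Let Spark infer a schema from the raw JSON records, then write them to the
# data lake under a directory named after the extraction date.
raw_df = spark.read.json(spark.sparkContext.parallelize([json.dumps(r) for r in records]))
raw_df.write.mode("overwrite").json(f"{BUCKET}/{date.today().isoformat()}")
```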
Transform - Load
With the data stored in the GCS bucket (Data Lake), we can now transform and structure it into a star schema. This transformation is carried out using a PySpark job that processes the raw data and organizes it according to the following ERD.
The data is then stored in a BigQuery dataset. All of this is done in the spark_gcs_to_bq.py job.
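As an illustration, here is a simplified sketch of what such a transform job can look like. The column names, the two star-schema tables, and the dataset and bucket names are assumptions; the real logic lives in spark_gcs_to_bq.py and relies on the spark-bigquery connector being available on the cluster.

```python
# Sketch of the Transform-Load step (illustrative; schema and names are assumptions).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

BUCKET = "gs://openaq-data-lake"   # hypothetical data-lake bucket
TEMP_BUCKET = "openaq-spark-tmp"   # staging bucket required by the BigQuery connector
DATASET = "openaq"                 # hypothetical BigQuery dataset

spark = SparkSession.builder.appName("openaq_gcs_to_bq").getOrCreate()

# Read every daily extraction folder from the data lake.
raw = spark.read.json(f"{BUCKET}/*")

# Dimension table: one row per monitoring location.
dim_location = (
    raw.select("location_id", "location_name", "latitude", "longitude")
       .dropDuplicates(["location_id"])
)

# Fact table: one row per measurement, keyed by location and timestamp.
fact_measurement = raw.select(
    "location_id",
    F.to_timestamp("datetime").alias("measured_at"),
    "parameter",
    "value",
    "unit",
)

# Write both tables to the BigQuery dataset.
for name, df in [("dim_location", dim_location), ("fact_measurement", fact_measurement)]:
    (df.write.format("bigquery")
       .option("table", f"{DATASET}.{name}")
       .option("temporaryGcsBucket", TEMP_BUCKET)
       .mode("overwrite")
       .save())
```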
Pipeline orchestration
After implementing the Extract-Load and Transform-Load jobs, we now need an orchestration layer to run our pipeline daily.
Airflow is a great choice, considering its robust architecture and the fact that it is well maintained under the Apache Software Foundation.
Our Airflow pipeline is very simple and minimal. It is composed of two tasks: the Extract-Load job (spark_api_to_gcs.py) followed by the Transform-Load job (spark_gcs_to_bq.py), as sketched below.
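Here is a minimal sketch of what such a DAG can look like, assuming the two PySpark scripts are submitted with SparkSubmitOperator; the file paths and schedule are illustrative, not the project's exact configuration.

```python
# Sketch of the daily Airflow DAG (illustrative; paths and settings are assumptions).
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="openaq_daily_pipeline",
    schedule_interval="@daily",   # run the pipeline once per day
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    extract_load = SparkSubmitOperator(
        task_id="api_to_gcs",
        application="/opt/airflow/spark/spark_api_to_gcs.py",
    )
    transform_load = SparkSubmitOperator(
        task_id="gcs_to_bq",
        application="/opt/airflow/spark/spark_gcs_to_bq.py",
    )

    # The transform job only runs once the extraction has succeeded.
    extract_load >> transform_load
```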
Visualization App
Our final app is composed of a REST API and a dashboard frontend client.
For the backend REST API we use Flask. It is a great fit for our use case, being a lightweight framework with a gentle learning curve.
To avoid overloading the network with large payloads, we split the data into several chunks and send them using Server-Sent Events (SSE), as sketched below.
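For illustration, here is a minimal sketch of such an SSE endpoint in Flask. The route name, chunk size, and the fetch_measurements() helper are hypothetical placeholders standing in for the actual BigQuery query.

```python
# Sketch of a chunked Server-Sent Events endpoint (illustrative; names are assumptions).
import json

from flask import Flask, Response

app = Flask(__name__)
CHUNK_SIZE = 500  # number of rows pushed per event


def fetch_measurements():
    """Placeholder for the query that returns measurement rows as dicts."""
    return []


@app.route("/api/measurements/stream")
def stream_measurements():
    def generate():
        rows = fetch_measurements()
        # Send the result set in several chunks so the client can start
        # rendering before the full payload has arrived.
        for i in range(0, len(rows), CHUNK_SIZE):
            yield f"data: {json.dumps(rows[i:i + CHUNK_SIZE])}\n\n"
        # Signal the end of the stream to the client.
        yield "event: done\ndata: {}\n\n"

    return Response(generate(), mimetype="text/event-stream")
```

Streaming the data this way lets the dashboard start rendering as soon as the first chunks arrive instead of waiting for the whole payload.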
For data visualization, we use DC.js, which leverages the D3.js library to render charts and Crossfilter.js to handle cross-filtering of the data. In addition, we use Leaflet to render the map component.
Conclusion
Deploying the Flask backend was another challenge. For cost reasons, we opted for the free plan on the Render platform. This comes with a small drawback: the server takes up to 50 seconds to spin up.
This project was a great start in the data engineering field, and an excellent opportunity to gain a deeper understanding of every step of data pipeline design and implementation.