Building an ETL App with Streamlit
This article shows how Streamlit can be used for different use cases; in this one, we will build an ETL app with it.
What is ETL?
ETL is a data integration process that Extracts, Transforms, and Loads data from one or more sources to a destination. The approach is popular because companies find it helpful in improving the quality, reliability, and accessibility of their data.
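To make the three stages concrete, here is a minimal sketch with pandas (the file paths and the particular cleaning steps are illustrative assumptions, not part of the app we build below):

```python
import pandas as pd

def run_etl(source_path: str, dest_path: str) -> pd.DataFrame:
    # Extract: read raw data from a source file
    raw = pd.read_csv(source_path)
    # Transform: clean it, e.g. drop fully empty rows and exact duplicates
    clean = raw.dropna(how="all").drop_duplicates()
    # Load: write the result to a destination file
    clean.to_csv(dest_path, index=False)
    return clean
```

The app in this article follows the same pattern, only with uploaded files as the source and a download button as the destination.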
There are many ETL tools that offer various solutions to users. Here, however, we will build our own ETL app tailored to our own use case; once you understand the logic and the concept, you can build your own app to meet your own requirements.
Streamlit ETL App
I will use VS Code to create this app, but you can use any Python IDE, or even your GitHub account, to create this Python file.
First, we need to install the requirements:
pip install streamlit
Now we can start building our app. Let's begin by importing the necessary Python libraries; since this is a simple app, Streamlit and pandas will be enough.
import streamlit as st
import pandas as pd
We can give our app a title with Streamlit's title function, then add a caption explaining our app's functionality with the caption function.
st.title("The Streamlit ETL App")
st.caption("""
With this app you will be able to Extract, Transform and Load the file types below:
\n1. CSV
\n2. JSON
\n3. XML
\nPS: You can upload multiple files.""")
Since we'll be receiving files from the user, Streamlit's file_uploader function will come in handy; its accept_multiple_files parameter lets users upload several files at once. We'll then use pandas to read the uploaded files, collect them in a list, and merge them into a single data frame.
uploaded_files = st.file_uploader("Choose a file", accept_multiple_files=True)
# let's create a function to check the file types and read them accordingly.
def extract(file_to_extract):
    if file_to_extract.name.split(".")[-1] == "csv":
        extracted_data = pd.read_csv(file_to_extract)
    elif file_to_extract.name.split(".")[-1] == 'json':
        extracted_data = pd.read_json(file_to_extract, lines=True)
    elif file_to_extract.name.split(".")[-1] == 'xml':
        extracted_data = pd.read_xml(file_to_extract)
    return extracted_data
# create an empty list which will be used to merge the files.
dataframes = []
This is just a simple, straightforward way to verify the file format and read it with the appropriate pandas function. Now we will retrieve the uploaded files and apply the logic to read, merge, and transform them.
Here I believe the extraction process is more standardized than the transformation process: the file types we receive are standard, but each file may require different transformations depending on its use case. Still, we can adopt a standard approach by leveraging the functionality of the pandas library; specifically, we'll focus on removing null values and duplicates.
if uploaded_files:
    for file in uploaded_files:
        file.seek(0)
        df = extract(file)
        dataframes.append(df)

if len(dataframes) >= 1:
    merged_df = pd.concat(dataframes, ignore_index=True, join='outer')
    remove_duplicates = st.selectbox("Remove duplicate values ?", ["No", "Yes"])
    remove_nulls = st.selectbox("Remove null values in the dataset ?", ["Yes", "No"])
    if remove_duplicates == "Yes":
        merged_df.drop_duplicates(inplace=True)
    if remove_nulls == "Yes":
        merged_df.dropna(how="all", inplace=True)
    show_result = st.checkbox("Show Result", value=True)
    if show_result:
        st.write(merged_df)
    csv = merged_df.to_csv().encode("utf-8")
    st.download_button(label="Download cleaned data as csv",
                       data=csv,
                       file_name="cleaned_data.csv",
                       mime="text/csv")
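As a standalone sketch of what the two cleaning options do, consider a tiny frame with one repeated row and one fully empty row:

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 1.0, None], "b": [2.0, 2.0, None]})

# drop_duplicates removes the repeated (1.0, 2.0) row
deduped = df.drop_duplicates()

# dropna(how="all") removes only rows where every value is null;
# how="any" would also drop rows with even a single missing cell
no_empty_rows = df.dropna(how="all")
```

We use how="all" in the app so that partially filled rows survive the cleaning step.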
Here is a breakdown of the code above:
Check if files were uploaded: the processing only runs when uploaded_files is not empty.
Iterate over uploaded files: each file is rewound with seek(0), read via extract(), and appended to the dataframes list.
Merge DataFrames (if applicable): pd.concat combines the list into a single merged_df.
User interaction: two selectboxes let the user decide whether to drop duplicate rows and fully null rows.
Display result (if requested): a checkbox toggles showing merged_df with st.write.
Export cleaned data as CSV: the result is encoded as UTF-8 and offered through st.download_button.
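Note that join='outer' matters when the uploaded files do not share identical columns; a small standalone example:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1], "b": [2]})
df2 = pd.DataFrame({"b": [3], "c": [4]})

# join='outer' keeps the union of the columns; cells a frame
# does not supply are filled with NaN
merged = pd.concat([df1, df2], ignore_index=True, join="outer")
```

With join='inner' only the shared column b would survive, silently dropping data, which is why the outer join is the safer default here.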
We could also add a try/except block to catch errors and warn the user when something goes wrong; however, we prefer to keep the code simple, and I believe it is enough to convey the concept.
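If you do want user-facing error handling, here is a hedged sketch of how the extract() function could be wrapped (extract_safe is a hypothetical name; in the Streamlit app you would catch the raised error and display it with st.error):

```python
import pandas as pd

def extract_safe(file_to_extract):
    # Hypothetical variant of the article's extract() that reports
    # unsupported types and parsing failures instead of crashing
    suffix = file_to_extract.name.split(".")[-1].lower()
    try:
        if suffix == "csv":
            return pd.read_csv(file_to_extract)
        if suffix == "json":
            return pd.read_json(file_to_extract, lines=True)
        if suffix == "xml":
            return pd.read_xml(file_to_extract)
        raise ValueError(f"Unsupported file type: .{suffix}")
    except ValueError:
        raise
    except Exception as exc:
        # In the app: st.error(f"Could not parse {file_to_extract.name}")
        raise RuntimeError(f"Could not parse {file_to_extract.name}") from exc
```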
Conclusion
This Streamlit-powered ETL tool is a nice example of how to create apps for different use cases. By combining Streamlit with other Python capabilities, what you can create is limited only by your imagination.
Here you can find the final code block:
import streamlit as st
import pandas as pd

st.title("The Streamlit ETL App")
st.caption("""
With this app you will be able to Extract, Transform and Load the file types below:
\n1. CSV
\n2. JSON
\n3. XML
\nPS: You can upload multiple files.""")

uploaded_files = st.file_uploader("Choose a file", accept_multiple_files=True)

def extract(file_to_extract):
    if file_to_extract.name.split(".")[-1] == "csv":
        extracted_data = pd.read_csv(file_to_extract)
    elif file_to_extract.name.split(".")[-1] == 'json':
        extracted_data = pd.read_json(file_to_extract, lines=True)
    elif file_to_extract.name.split(".")[-1] == 'xml':
        extracted_data = pd.read_xml(file_to_extract)
    return extracted_data

dataframes = []

if uploaded_files:
    for file in uploaded_files:
        file.seek(0)
        df = extract(file)
        dataframes.append(df)

if len(dataframes) >= 1:
    merged_df = pd.concat(dataframes, ignore_index=True, join='outer')
    remove_duplicates = st.selectbox("Remove duplicate values ?", ["No", "Yes"])
    remove_nulls = st.selectbox("Remove null values in the dataset ?", ["Yes", "No"])
    if remove_duplicates == "Yes":
        merged_df.drop_duplicates(inplace=True)
    if remove_nulls == "Yes":
        merged_df.dropna(how="all", inplace=True)
    show_result = st.checkbox("Show Result", value=True)
    if show_result:
        st.write(merged_df)
    csv = merged_df.to_csv().encode("utf-8")
    st.download_button(label="Download cleaned data as csv",
                       data=csv,
                       file_name="cleaned_data.csv",
                       mime="text/csv")
Running a Demo of the App
As a final touch, you can take your app a step further by deploying it on Streamlit Cloud and sharing it publicly. Don't worry, it is easier than creating the app. Start by pushing your app to GitHub; then, on Streamlit Cloud, connect to the repository from your GitHub account and deploy it.
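Streamlit Cloud installs your app's dependencies from a requirements.txt file in the repository. A minimal one for this app might look like the following (lxml is listed because pandas' read_xml relies on an external XML parser by default; exact version pinning is up to you):

```
streamlit
pandas
lxml
```

You can also test everything locally first by running streamlit run followed by your script's file name before pushing to GitHub.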
Voilà, your app is ready for public use!
For more detailed information, you can check the Streamlit documentation, which provides a step-by-step guide to deploying your app.