Building an ETL App with Streamlit

This article aims to give you an idea of how Streamlit can be used in different scenarios; the use case for this article is building an ETL app with Streamlit.

What is ETL?

In short, ETL is a data integration process that combines Extracting, Transforming and Loading data from one source into another. The ETL approach is popular because companies find it helpful for improving the quality, reliability and accessibility of their data.
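To make the three stages concrete, here is a minimal, generic sketch in Python. The function names and the CSV file paths are illustrative assumptions, not the app we build below.

import pandas as pd

def extract(path: str) -> pd.DataFrame:
    # Extract: read raw data from a source (here, a CSV file).
    return pd.read_csv(path)

def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Transform: clean or reshape the data (here, drop fully-null rows).
    return df.dropna(how="all")

def load(df: pd.DataFrame, target: str) -> None:
    # Load: write the result to a destination (here, another CSV file).
    df.to_csv(target, index=False)

# A full ETL run is the three stages chained together, e.g.:
# load(transform(extract("raw_data.csv")), "clean_data.csv")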

There are many ETL approaches and tools that offer various solutions. Here, however, we will build our own ETL app for our own specific use case, so once you understand the logic and the concept, you can build your own app to meet your own requirements.

Streamlit ETL App

I will use VS Code to create this app, but you can use any Python IDE, or even your GitHub account, to create this Python file.

First we need to install the requirements. Streamlit pulls in pandas as a dependency, but note that pandas' read_xml (which we use below) defaults to the lxml parser, so it is safest to install these explicitly:

pip install streamlit pandas lxml

Now we can start building our app. Let's start by importing the necessary Python libraries. This is going to be a simple app, so the Streamlit and Pandas libraries will be enough.

import streamlit as st
import pandas as pd        

We can give our app a title with Streamlit's title function. Then we can add a caption explaining our app's functionalities with Streamlit's caption function.

st.title("The Streamlit ETL App ??? ")

st.caption("""
         With this app you will be able to Extract, Transform and Load Below File Types:
         \n1.CSV
         \n2.JSON
         \n3.XML
         \nps: You can upload multiple files.""")        

Our app will have the following functionalities:

  • Read CSV, JSON and XML file types.
  • Extract the data from these file types.
  • Transform the data as required.
  • Save and download it as CSV file.

Since we'll be receiving files from the user, Streamlit's file_uploader function will come in handy; passing accept_multiple_files=True lets users upload several files at once. Next, we'll use the pandas library to read the uploaded files, collect the resulting data frames in a list, and then merge them.

uploaded_files = st.file_uploader("Choose a file", accept_multiple_files=True)

# Let's create a function to check the file type and read it accordingly.

def extract(file_to_extract):
    file_extension = file_to_extract.name.split(".")[-1].lower()

    if file_extension == "csv":
        extracted_data = pd.read_csv(file_to_extract)

    elif file_extension == "json":
        extracted_data = pd.read_json(file_to_extract, lines=True)

    elif file_extension == "xml":
        extracted_data = pd.read_xml(file_to_extract)

    else:
        # Without this branch, an unsupported file type would hit the
        # return below with extracted_data undefined (UnboundLocalError).
        raise ValueError(f"Unsupported file type: {file_extension}")

    return extracted_data

# Create an empty list which will be used to merge the files.

dataframes = []

This is just a simple, straightforward way to verify the file format and read it with the corresponding pandas function; the same pattern extends naturally to other formats, as the sketch below shows. After that, we will retrieve the uploaded files and apply the logic to read, merge and transform them.
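For example, if your files also arrive as Excel workbooks, you could add one more branch to extract. This branch is my own illustrative addition, not part of the app we build here; pd.read_excel is a standard pandas reader, and reading .xlsx files requires the openpyxl package.

# A hypothetical extra branch inside extract(), placed before the final else:
    elif file_extension == "xlsx":
        # pd.read_excel needs openpyxl installed for .xlsx files.
        extracted_data = pd.read_excel(file_to_extract)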

Here I believe the extraction process is more standardized than the transformation process: the file types we receive during extraction are standard, but each file may come with different requirements. Still, we can adopt a standard approach to transforming our data by leveraging the functionality provided by the pandas library. Specifically, we'll focus on removing null values and duplicates.

if uploaded_files:
    for file in uploaded_files:
        file.seek(0)
        df = extract(file)
        dataframes.append(df)

    if len(dataframes) >= 1:
        merged_df = pd.concat(dataframes, ignore_index=True, join="outer")

    remove_duplicates = st.selectbox("Remove duplicate values?", ["No", "Yes"])
    remove_nulls = st.selectbox("Remove null values in the dataset?", ["Yes", "No"])

    if remove_duplicates == "Yes":
        merged_df.drop_duplicates(inplace=True)

    if remove_nulls == "Yes":
        # how="all" drops only the rows in which every column is null.
        merged_df.dropna(how="all", inplace=True)

    show_result = st.checkbox("Show Result", value=True)

    if show_result:
        st.write(merged_df)

    # index=False keeps the pandas index out of the downloaded file.
    csv = merged_df.to_csv(index=False).encode("utf-8")

    st.download_button(label="Download cleaned data as CSV",
                       data=csv,
                       file_name="cleaned_data.csv",
                       mime="text/csv")

Here you can find the breakdown of the above code:

Check if Files Were Uploaded:

  • With the if uploaded_files: statement we check whether any files were uploaded. If so, the code proceeds to process them.

Iterate Over Uploaded Files:

  • If there are uploaded files, the code iterates over each one with a for loop. It sets the file cursor to the beginning with the seek function, then calls our extract function to extract the data from the file. The extracted data is stored in a data frame called df, which is appended to the empty dataframes list we created above.

Merge DataFrames (if applicable):

  • After processing all uploaded files, the code checks whether there is at least one data frame in the dataframes list. If so, it merges them into a single data frame called merged_df.
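As a side note, here is a small illustration, with made-up data frames, of what join='outer' does when the uploaded files have different columns:

import pandas as pd

df1 = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})
df2 = pd.DataFrame({"id": [3], "city": ["Paris"]})

# Columns missing from one file are kept and filled with NaN:
#    id name   city
# 0   1    a    NaN
# 1   2    b    NaN
# 2   3  NaN  Paris
merged = pd.concat([df1, df2], ignore_index=True, join="outer")
print(merged)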

User Interaction:

  • Here we apply our transformation logic. The code asks users whether they want to remove duplicate and null values; we collect this input with Streamlit's selectbox function. You could use another Streamlit input widget instead, depending on your requirements; see the sketch after this list.
  • If the user chooses to remove duplicates and null values, the code drops them from the merged_df data frame.
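For instance, here is a sketch of the same choices using st.checkbox instead of st.selectbox; checkboxes return True/False directly, so no string comparison is needed:

# Alternative input widgets for the transformation choices.
remove_duplicates = st.checkbox("Remove duplicate values?", value=False)
remove_nulls = st.checkbox("Remove null values in the dataset?", value=True)

if remove_duplicates:
    merged_df.drop_duplicates(inplace=True)

if remove_nulls:
    merged_df.dropna(how="all", inplace=True)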

Display Result (if requested):

  • This part may seem unnecessary, but I wanted to give users the option of whether or not to see the result. The Streamlit checkbox function creates a checkbox labeled "Show Result"; if the user selects it, the app displays the final content of merged_df.

Export Cleaned Data as CSV:

  • The last part of the code handles the Load step of our ETL process. It converts merged_df to CSV format and encodes it as UTF-8. With Streamlit's download_button function we create a button that lets the user download the cleaned data as a CSV file named "cleaned_data.csv".

Here we could add a try/except block to catch read errors and warn the user when something goes wrong; however, we prefer to keep the code simple, and I believe it is enough to convey the logic of the concept.
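For reference, a minimal sketch of what that defensive version of the extraction loop could look like (the wording of the error message is my own assumption):

# Extraction loop with defensive error handling.
if uploaded_files:
    for file in uploaded_files:
        file.seek(0)
        try:
            df = extract(file)
            dataframes.append(df)
        except Exception as e:
            # Warn the user instead of crashing the whole app.
            st.error(f"Could not read {file.name}: {e}")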

Conclusion

The Streamlit-powered ETL tool shows a nice way to create apps for different use cases. By combining different Python capabilities, what you can create is limited only by your imagination.

Here you can find the final code block:

import streamlit as st
import pandas as pd

st.title("The Streamlit ETL App")

st.caption("""
         With this app you will be able to Extract, Transform and Load the below file types:
         \n1. CSV
         \n2. JSON
         \n3. XML
         \nPS: You can upload multiple files.""")

uploaded_files = st.file_uploader("Choose a file", accept_multiple_files=True)


def extract(file_to_extract):
    file_extension = file_to_extract.name.split(".")[-1].lower()

    if file_extension == "csv":
        extracted_data = pd.read_csv(file_to_extract)

    elif file_extension == "json":
        extracted_data = pd.read_json(file_to_extract, lines=True)

    elif file_extension == "xml":
        extracted_data = pd.read_xml(file_to_extract)

    else:
        raise ValueError(f"Unsupported file type: {file_extension}")

    return extracted_data


dataframes = []


if uploaded_files:
    for file in uploaded_files:
        file.seek(0)
        df = extract(file)
        dataframes.append(df)

    if len(dataframes) >= 1:
        merged_df = pd.concat(dataframes, ignore_index=True, join="outer")

    remove_duplicates = st.selectbox("Remove duplicate values?", ["No", "Yes"])
    remove_nulls = st.selectbox("Remove null values in the dataset?", ["Yes", "No"])

    if remove_duplicates == "Yes":
        merged_df.drop_duplicates(inplace=True)

    if remove_nulls == "Yes":
        merged_df.dropna(how="all", inplace=True)

    show_result = st.checkbox("Show Result", value=True)

    if show_result:
        st.write(merged_df)

    csv = merged_df.to_csv(index=False).encode("utf-8")

    st.download_button(label="Download cleaned data as CSV",
                       data=csv,
                       file_name="cleaned_data.csv",
                       mime="text/csv")

        

Running a Demo of the App
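Assuming you saved the code above in a file called etl_app.py (the file name is just an assumption), you can start the app locally from a terminal:

streamlit run etl_app.py

Streamlit will start a local server and open the app in your browser, where you can upload files and try the transformations.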


As a final touch, you can take your app a step further by deploying it on Streamlit Cloud and sharing it publicly. Do not worry, it is easier than creating the app. Start by pushing your app to GitHub. Then, on Streamlit Cloud, you can connect directly to your repository from your GitHub account and deploy it.
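One detail worth knowing: Streamlit Cloud installs your app's dependencies from a requirements.txt file in the repository, so a minimal repo for this app could look like this (file names are illustrative):

etl_app.py
requirements.txt

# requirements.txt contents, one package per line:
streamlit
pandas
lxml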

Voilà, your app is ready for public use!

For more detailed info, you can check the Streamlit documentation, which provides a step-by-step guide to deploying your app:

Deploy Your App
