Excel Isn't Going Anywhere, So Let's Automate Parsing It

Introduction

Excel remains a significant part of our work lives. Despite modern tools and technologies, we frequently handle Excel files, particularly those filled in manually. Parsing these files by hand is both time-consuming and challenging.

In this guide, I’ll share how I built a simple function that finds and extracts tables from messy Excel files using Python and Pandas. Best of all, you can adapt this method to work with Polars or any other DataFrame library.

Why automate Excel parsing? Three words: Efficiency, Consistency, Scalability.

Excel files, especially those filled out manually, can be unpredictable:

  • Tables starting at random positions
  • Headers in unexpected locations
  • Extra data scattered throughout sheets

Automating the parsing process ensures consistent data extraction, regardless of how messy the file is.

The Challenge

When receiving Excel files, you might encounter these challenges:

  • Tables starting at random positions, not cell A1
  • Empty rows and columns scattered throughout
  • Header rows in non-obvious locations

Additionally, each file from stakeholders might have a different structure than previous ones.

Our goal is to efficiently extract table data, focusing only on the columns we REQUIRE for analysis. In my experience, there’s just one crucial requirement for stakeholders: maintain consistent column names in the header.

The Solution

To handle Excel file parsing, I built three generic Python functions:

  1. log_issue() - logs any issues or comments I have about the file I received.
  2. find_excel_table() - finds the table inside the received Excel file and returns it as a DataFrame.
  3. format_issues_for_email() - converts the issues into human-readable text. I need it to send issues back to stakeholders and let them know which Excel files were rejected for parsing, and why.

The solution steps:

  1. Read the Excel file.
  2. Search for the header row based on required column names.
  3. Extract the table starting from the header row.
  4. Handle any issues, such as missing columns, gracefully.
  5. Convert issues dictionary into human-readable text.

Here are the functions I use, and we’ll break them down step by step.

import pandas as pd
from typing import List, Tuple, Optional

# Helper function to log issues
def log_issue(issues: List[str], message: str):
    """Append an issue message to the issues list."""
    issues.append(message)

# Main function to find and extract the table
def find_excel_table(
    xls_file_path: str, required_columns: List[str], sheet_name: str = "Sheet1"
) -> Tuple[Optional[pd.DataFrame], List[str]]:
    """Identify and extract the table with required columns from an Excel sheet."""
    issues = []

    try:
        # Read the entire Excel sheet
        df = pd.read_excel(
            xls_file_path, sheet_name=sheet_name, engine="openpyxl",
            dtype=str, header=None,  # keep every row as data so the header search sees all rows
        )
        print("Excel file read successfully.")
    except Exception as e:
        log_issue(issues, f"Error reading Excel file '{xls_file_path}': {e}")
        return None, issues

    header_row_index = None
    # Iterate over each row to find the header
    for idx, row in df.iterrows():
        row_values = [
            str(x).strip() for x in row.tolist()
            if str(x).strip() != "" and str(x).strip().lower() != "nan"
        ]
        if set(required_columns).issubset(set(row_values)):
            header_row_index = idx
            print(f"Header found at row index: {header_row_index}")
            break

    if header_row_index is None:
        log_issue(issues, "Required columns not found in any row.")
        return None, issues

    # Slice the DataFrame starting from the header row
    df = df.iloc[header_row_index:].reset_index(drop=True)
    df.columns = [str(c).strip() for c in df.iloc[0]]  # Set the stripped first row as header
    df = df[1:]  # Remove the header row from data

    # Return only the required columns
    return df[required_columns], issues        

Step-by-Step Breakdown

Import Necessary Libraries

We start by importing Pandas and the typing tools for type hints. I love using type hints; they make my code more readable and easier to understand, both for developers and linters.

import pandas as pd
from typing import List, Tuple, Optional        

Helper Function to Log Issues

This function collects any issues encountered during the parsing process.

def log_issue(issues: List[str], message: str):
    issues.append(message)        

The Main Function: find_excel_table

The find_excel_table function is where the magic happens.

Function Definition:

def find_excel_table(
    xls_file_path: str, required_columns: List[str], sheet_name: str = "Sheet1"
) -> Tuple[Optional[pd.DataFrame], List[str]]:

  • xls_file_path - Path to the Excel file.
  • required_columns - A list of column names you need to extract.
  • sheet_name - The sheet in the Excel file to read; defaults to "Sheet1".

Reading the Excel File

This is the first step. We attempt to read the Excel file using pd.read_excel.

Important Note: The dtype=str parameter ensures all data is read as strings, which helps avoid issues with mixed data types. We don't want Pandas (or Polars) to infer column types for a key reason: when stakeholders manually fill Excel files, they often make formatting errors or add notes directly in the "production" template, for example, using commas instead of dots for decimal points. Passing header=None keeps Pandas from promoting the first file row to column names, so the header search below can see every row.

We want the "read" step to succeed regardless of these inconsistencies. It's better to get the data into a DataFrame first, then process it as needed, and decide what to do next.

df = pd.read_excel(
    xls_file_path, sheet_name=sheet_name, engine="openpyxl",
    dtype=str, header=None,
)
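
Once the data is in as strings, a small cleanup step can repair those decimal commas. Here is a minimal sketch; the clean_decimal_column helper and the "Amount" column are hypothetical examples, not part of the pipeline above:

import pandas as pd

def clean_decimal_column(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Convert a string column with decimal commas (e.g. '1,5') to floats."""
    df = df.copy()
    # Swap decimal commas for dots, then coerce; unparseable cells become NaN
    df[column] = pd.to_numeric(
        df[column].str.replace(",", ".", regex=False), errors="coerce"
    )
    return df

# Hypothetical usage: df = clean_decimal_column(df, "Amount")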

Efficiency Note: I tend to use the calamine engine whenever possible; in Polars, it is the default engine.
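
If you want to try calamine from Pandas as well, it is available as a read engine. A small sketch, assuming pandas 2.2+ with the python-calamine package installed:

import pandas as pd

# Requires: pip install python-calamine (pandas >= 2.2)
df = pd.read_excel(
    "path/to/your/excel_file.xlsx",
    sheet_name="Sheet1",
    engine="calamine",
    dtype=str,
    header=None,
)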

Finding the Header Row

Now comes the interesting part. We iterate over each row to find where the header starts.

header_row_index = None
for idx, row in df.iterrows():
    row_values = [
        str(x).strip() for x in row.tolist()
        if str(x).strip() != "" and str(x).strip().lower() != "nan"
    ]
    if set(required_columns).issubset(set(row_values)):
        header_row_index = idx
        break        

  • row_values - a cleaned-up list of values from the current row, with blanks and "nan" strings removed.
  • We check whether the required columns are a subset of row_values, i.e. every required column appears somewhere in the row.
  • If we find the header, we store its index in the header_row_index variable and break out of the loop. The short demo below shows the subset check in isolation.
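
To make the check concrete, here is the same test on a toy row (invented values, just for illustration):

required_columns = ["Name", "Email", "Phone"]
row = ["", "Name", "Email", "Phone", "internal note", "nan"]
row_values = [
    v.strip() for v in row
    if v.strip() != "" and v.strip().lower() != "nan"
]
print(set(required_columns).issubset(set(row_values)))  # True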

Handling Missing Headers

But what if we can’t find the header row? In that case, we log an issue and exit the function.

if header_row_index is None:
    log_issue(issues, "Required columns not found in any row.")
    return None, issues        

Extracting the Table

Once we’ve found the header row:

df = df.iloc[header_row_index:].reset_index(drop=True)
df.columns = [str(c).strip() for c in df.iloc[0]]  # Set the stripped first row as the header
df = df[1:]  # Remove the header row from the data

  • We take all rows starting from the header row.
  • We set the first row in this slice as the header, stripping whitespace so the names match the cleaned values we searched for.
  • We remove the header row from the data. The toy example below walks through these steps.
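
Here is the same slicing logic on a tiny hand-built DataFrame (a self-contained sketch with invented data):

import pandas as pd

# A messy sheet: a title row first, then the real header, then the data
raw = pd.DataFrame([
    ["Quarterly export", None, None],
    ["Name", "Email", "Phone"],
    ["Alice", "alice@example.com", "555-0100"],
])
header_row_index = 1  # what the search loop would find

table = raw.iloc[header_row_index:].reset_index(drop=True)
table.columns = [str(c).strip() for c in table.iloc[0]]  # promote the header row
table = table[1:]  # drop the header row from the data
print(table)  # one data row under the columns Name, Email, Phone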

Returning the Results

Finally, we return the DataFrame containing only the required columns and any issues encountered.

return df[required_columns], issues        

While I generally avoid returning multiple values, I find this approach makes Excel parsing pipelines easier to handle, especially when sending formatted alerts to stakeholders.
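
If the tuple return bothers you, a small result container is one alternative. This is a sketch of one possible design, not the approach used above:

from dataclasses import dataclass, field
from typing import List, Optional

import pandas as pd

@dataclass
class ParseResult:
    """Bundles the extracted table with any issues found along the way."""
    df: Optional[pd.DataFrame] = None
    issues: List[str] = field(default_factory=list)

    @property
    def ok(self) -> bool:
        return self.df is not None

The caller then checks result.ok instead of unpacking a tuple.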

Using the Function

Now, let’s see how to use this function in practice.

required_columns = ["Name", "Email", "Phone"]
xls_file_path = "path/to/your/excel_file.xlsx"
sheet_name = "Sheet1"

df, issues = find_excel_table(xls_file_path, required_columns, sheet_name)

if df is not None:
    print("Table extracted successfully!")
    print(df.head())
else:
    print("Failed to extract table.")
    for issue in issues:
        print(issue)        

Bonus: Sending Alerts to Stakeholders

Need to notify stakeholders about issues encountered during the parsing process? I often do this to help them correct and resubmit the file.

Here’s a function that converts an issues dictionary into structured text for an email body:

from typing import Dict

def format_issues_for_email(file_issues: Dict[str, List[str]]) -> str:
    """Formats the issues dictionary into a readable email body."""
    email_body = ["**File Issues Report:**\n"]

    for i, (filename, issues) in enumerate(file_issues.items(), start=1):
        if not issues:
            continue
        email_body.append(f"{i}. *{filename}*")
        for issue in issues:
            email_body.append(f"   - {issue}")
        email_body.append("")  # Adds a blank line between files

    return "\n".join(email_body)        

  • file_issues - A dictionary where each key is a filename, and the value is a list of issues associated with that file.
  • The function returns a formatted string.

Example Usage

Let’s say you processed multiple Excel files and collected issues for each one.

file_issues = {
    "excel_file1.xlsx": ["Missing columns: Email, Phone"],
    "excel_file2.xlsx": ["Error reading Excel file: File not found"],
    "excel_file3.xlsx": []
}

email_body = format_issues_for_email(file_issues)

print(email_body)        

Output:

**File Issues Report:**

1. *excel_file1.xlsx*
   - Missing columns: Email, Phone

2. *excel_file2.xlsx*
   - Error reading Excel file: File not found        

You can now include email_body in an email to stakeholders, informing them of any issues that need attention.
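
If you want to send the report straight from the same script, the standard library is enough. A minimal sketch; the SMTP host, credentials, and addresses are placeholders you would replace with your own:

import smtplib
from email.message import EmailMessage

def send_issues_email(email_body: str) -> None:
    """Send the formatted issues report via SMTP (placeholder settings)."""
    msg = EmailMessage()
    msg["Subject"] = "Excel Parsing: File Issues Report"
    msg["From"] = "pipeline@example.com"    # hypothetical sender
    msg["To"] = "stakeholders@example.com"  # hypothetical recipients
    msg.set_content(email_body)

    # Assumes an SMTP server reachable at this host and port
    with smtplib.SMTP("smtp.example.com", 587) as server:
        server.starttls()
        server.login("pipeline@example.com", "app-password")
        server.send_message(msg)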


Image by Author - Automating Excel Parsing
You can copy my Google Colab Notebook to try the above functions yourself.

Conclusion

To summarize, automating Excel file parsing can significantly improve your data extraction process. In my experience, the process typically involves these steps:

  1. Read the table from the Excel file into memory. I’ve demonstrated one approach to automating this process in this article.
  2. Apply custom business logic to transform the DataFrame into the necessary structure.
  3. Write the transformed DataFrame to the target table. In most cases, the main goal is to transfer data from Excel into a database; a minimal load sketch follows below.
  4. Optionally, send alerts to stakeholders about any parsing issues.
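
For step 3, the load can be as small as this. A sketch assuming SQLAlchemy and a database driver are installed; the connection URL and table name are placeholders:

import pandas as pd
from sqlalchemy import create_engine

def load_to_database(df: pd.DataFrame, table_name: str) -> None:
    """Append the parsed DataFrame to a database table (placeholder URL)."""
    engine = create_engine("postgresql://user:password@localhost/dbname")
    df.to_sql(table_name, engine, if_exists="append", index=False)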

I hope this guide helps with your data handling tasks. Feel free to modify the functions to suit your specific needs, and don't hesitate to reach out with questions or suggestions; I'm always open to feedback.


Enjoyed the article? Support my newsletter by following the Paypalme Link.

Author's Note:

Keep Knowledge Flowing by following me for more content on Solutions Architecture, System Design, Data Engineering, Business Analysis, and more. Your engagement is appreciated.
