Excel Isn't Going Anywhere, So Let's Automate Parsing It

Introduction

Excel remains a significant part of our work lives. Despite modern tools and technologies, we frequently handle Excel files, particularly those filled in manually. Parsing these files by hand is both time-consuming and challenging.

In this guide, I’ll share how I built a simple function that finds and extracts tables from messy Excel files using Python and Pandas. Best of all, you can adapt this method to work with Polars or any other DataFrame library.

Why automate Excel parsing? Three words: Efficiency, Consistency, Scalability.

Excel files, especially those filled out manually, can be unpredictable:

  • Tables starting at random positions
  • Headers in unexpected locations
  • Extra data scattered throughout sheets

Automating the parsing process ensures consistent data extraction, regardless of how messy the file is.

The Challenge

When receiving Excel files, you might encounter these challenges:

  • Tables starting at random positions, not cell A1
  • Empty rows and columns scattered throughout
  • Header rows in non-obvious locations

Additionally, each file from stakeholders might have a different structure than previous ones.

Our goal is to efficiently extract table data, focusing only on the columns we REQUIRE for analysis. In my experience, there’s just one crucial requirement for stakeholders: maintain consistent column names in the header.

The Solution

To handle Excel file parsing, I built three generic Python functions:

  1. log_issue() - logs any issues or comments I have about the file I received.
  2. find_excel_table() - finds the table inside the received Excel file and returns it as a DataFrame.
  3. format_issues_for_email() - converts the issues into human-readable text. I need it to send issues back to stakeholders and let them know which Excel files were rejected for parsing, and why.

The solution steps:

  1. Read the Excel file.
  2. Search for the header row based on required column names.
  3. Extract the table starting from the header row.
  4. Handle any issues, such as missing columns, gracefully.
  5. Convert issues dictionary into human-readable text.

Here are the functions I use, and we’ll break them down step by step.

import pandas as pd
from typing import List, Tuple, Optional

# Helper function to log issues
def log_issue(issues: List[str], message: str):
    """Append an issue message to the issues list."""
    issues.append(message)

# Main function to find and extract the table
def find_excel_table(
    xls_file_path: str, required_columns: List[str], sheet_name: str = "Sheet1"
) -> Tuple[Optional[pd.DataFrame], List[str]]:
    """Identify and extract the table with required columns from an Excel sheet."""
    issues = []

    try:
        # Read the entire Excel sheet
        df = pd.read_excel(
            xls_file_path, sheet_name=sheet_name, engine="openpyxl",
            dtype=str, header=None,  # keep every row as data so the header search sees all rows
        )
        print("Excel file read successfully.")
    except Exception as e:
        log_issue(issues, f"Error reading Excel file '{xls_file_path}': {e}")
        return None, issues

    header_row_index = None
    # Iterate over each row to find the header
    for idx, row in df.iterrows():
        row_values = [
            str(x).strip() for x in row.tolist()
            if str(x).strip() != "" and str(x).strip().lower() != "nan"
        ]
        if set(required_columns).issubset(set(row_values)):
            header_row_index = idx
            print(f"Header found at row index: {header_row_index}")
            break

    if header_row_index is None:
        log_issue(issues, "Required columns not found in any row.")
        return None, issues

    # Slice the DataFrame starting from the header row
    df = df.iloc[header_row_index:].reset_index(drop=True)
    df.columns = [str(c).strip() for c in df.iloc[0]]  # Set the stripped first row as header
    df = df[1:]  # Remove the header row from data

    # Return only the required columns
    return df[required_columns], issues        

Step-by-Step Breakdown

Import Necessary Libraries

We start by importing Pandas and the typing tools for type hints. I love using type hints; they make my code more readable and easier to understand, both for developers and linters.

import pandas as pd
from typing import List, Tuple, Optional        

Helper Function to Log Issues

This function collects any issues encountered during the parsing process.

def log_issue(issues: List[str], message: str):
    issues.append(message)        

The Main Function: find_excel_table

The find_excel_table function is where the magic happens.

Function Definition:

def find_excel_table(
    xls_file_path: str, required_columns: List[str], sheet_name: str = "Sheet1"
) -> Tuple[Optional[pd.DataFrame], List[str]]:

  • xls_file_path - Path to the Excel file.
  • required_columns - A list of column names you need to extract.
  • sheet_name - The sheet in the Excel file to read; defaults to "Sheet1".

Reading the Excel File

This is the first step. We attempt to read the Excel file using pd.read_excel.

Important Note: The dtype=str parameter ensures all data is read as strings, which helps avoid issues with mixed data types. We don't want Pandas (or Polars) to infer column types for a key reason: when stakeholders manually fill Excel files, they often make formatting errors or add notes directly in the "production" template, for example, using commas instead of dots for decimal points. Passing header=None keeps Pandas from promoting the first file row to column names, so the header search below can see every row.

We want the "read" step to succeed regardless of these inconsistencies. It's better to get the data into a DataFrame first, then process it as needed, and decide what to do next.

df = pd.read_excel(
    xls_file_path, sheet_name=sheet_name, engine="openpyxl",
    dtype=str, header=None,
)
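
Once the data is in as strings, a small cleanup step can repair those decimal commas. Here is a minimal sketch; the clean_decimal_column helper and the "Amount" column are hypothetical examples, not part of the pipeline above:

import pandas as pd

def clean_decimal_column(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Convert a string column with decimal commas (e.g. '1,5') to floats."""
    df = df.copy()
    # Swap decimal commas for dots, then coerce; unparseable cells become NaN
    df[column] = pd.to_numeric(
        df[column].str.replace(",", ".", regex=False), errors="coerce"
    )
    return df

# Hypothetical usage: df = clean_decimal_column(df, "Amount")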

Efficiency Note: I tend to use the calamine engine whenever possible; in Polars, it is the default engine.
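
If you want to try calamine from Pandas as well, it is available as a read engine. A small sketch, assuming pandas 2.2+ with the python-calamine package installed:

import pandas as pd

# Requires: pip install python-calamine (pandas >= 2.2)
df = pd.read_excel(
    "path/to/your/excel_file.xlsx",
    sheet_name="Sheet1",
    engine="calamine",
    dtype=str,
    header=None,
)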

Finding the Header Row

Now comes the interesting part. We iterate over each row to find where the header starts.

header_row_index = None
for idx, row in df.iterrows():
    row_values = [
        str(x).strip() for x in row.tolist()
        if str(x).strip() != "" and str(x).strip().lower() != "nan"
    ]
    if set(required_columns).issubset(set(row_values)):
        header_row_index = idx
        break        

  • row_values - a cleaned-up list of values from the current row, with blanks and "nan" strings removed.
  • We check whether the required columns are a subset of row_values, i.e. every required column appears somewhere in the row.
  • If we find the header, we store its index in the header_row_index variable and break out of the loop. The short demo below shows the subset check in isolation.
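
To make the check concrete, here is the same test on a toy row (invented values, just for illustration):

required_columns = ["Name", "Email", "Phone"]
row = ["", "Name", "Email", "Phone", "internal note", "nan"]
row_values = [
    v.strip() for v in row
    if v.strip() != "" and v.strip().lower() != "nan"
]
print(set(required_columns).issubset(set(row_values)))  # True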

Handling Missing Headers

But what if we can’t find the header row? In that case, we log an issue and exit the function.

if header_row_index is None:
    log_issue(issues, "Required columns not found in any row.")
    return None, issues        

Extracting the Table

Once we’ve found the header row:

df = df.iloc[header_row_index:].reset_index(drop=True)
df.columns = [str(c).strip() for c in df.iloc[0]]  # Set the stripped first row as the header
df = df[1:]  # Remove the header row from the data

  • We take all rows starting from the header row.
  • We set the first row in this slice as the header, stripping whitespace so the names match the cleaned values we searched for.
  • We remove the header row from the data. The toy example below walks through these steps.
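
Here is the same slicing logic on a tiny hand-built DataFrame (a self-contained sketch with invented data):

import pandas as pd

# A messy sheet: a title row first, then the real header, then the data
raw = pd.DataFrame([
    ["Quarterly export", None, None],
    ["Name", "Email", "Phone"],
    ["Alice", "alice@example.com", "555-0100"],
])
header_row_index = 1  # what the search loop would find

table = raw.iloc[header_row_index:].reset_index(drop=True)
table.columns = [str(c).strip() for c in table.iloc[0]]  # promote the header row
table = table[1:]  # drop the header row from the data
print(table)  # one data row under the columns Name, Email, Phone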

Returning the Results

Finally, we return the DataFrame containing only the required columns and any issues encountered.

return df[required_columns], issues        

While I generally avoid returning multiple values, I find this approach makes Excel parsing pipelines easier to handle, especially when sending formatted alerts to stakeholders.
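
If the tuple return bothers you, a small result container is one alternative. This is a sketch of one possible design, not the approach used above:

from dataclasses import dataclass, field
from typing import List, Optional

import pandas as pd

@dataclass
class ParseResult:
    """Bundles the extracted table with any issues found along the way."""
    df: Optional[pd.DataFrame] = None
    issues: List[str] = field(default_factory=list)

    @property
    def ok(self) -> bool:
        return self.df is not None

The caller then checks result.ok instead of unpacking a tuple.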

Using the Function

Now, let’s see how to use this function in practice.

required_columns = ["Name", "Email", "Phone"]
xls_file_path = "path/to/your/excel_file.xlsx"
sheet_name = "Sheet1"

df, issues = find_excel_table(xls_file_path, required_columns, sheet_name)

if df is not None:
    print("Table extracted successfully!")
    print(df.head())
else:
    print("Failed to extract table.")
    for issue in issues:
        print(issue)        

Bonus: Sending Alerts to Stakeholders

Need to notify stakeholders about issues encountered during the parsing process? I often do this to help them correct and resubmit the file.

Here’s a function that converts an issues dictionary into structured text for an email body:

from typing import Dict

def format_issues_for_email(file_issues: Dict[str, List[str]]) -> str:
    """Formats the issues dictionary into a readable email body."""
    email_body = ["**File Issues Report:**\n"]

    for i, (filename, issues) in enumerate(file_issues.items(), start=1):
        if not issues:
            continue
        email_body.append(f"{i}. *{filename}*")
        for issue in issues:
            email_body.append(f"   - {issue}")
        email_body.append("")  # Adds a blank line between files

    return "\n".join(email_body)        

  • file_issues - A dictionary where each key is a filename, and the value is a list of issues associated with that file.
  • The function returns a formatted string.

Example Usage

Let’s say you processed multiple Excel files and collected issues for each one.

file_issues = {
    "excel_file1.xlsx": ["Missing columns: Email, Phone"],
    "excel_file2.xlsx": ["Error reading Excel file: File not found"],
    "excel_file3.xlsx": []
}

email_body = format_issues_for_email(file_issues)

print(email_body)        

Output:

**File Issues Report:**

1. *excel_file1.xlsx*
   - Missing columns: Email, Phone

2. *excel_file2.xlsx*
   - Error reading Excel file: File not found        

You can now include email_body in an email to stakeholders, informing them of any issues that need attention.
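
If you want to send the report straight from the same script, the standard library is enough. A minimal sketch; the SMTP host, credentials, and addresses are placeholders you would replace with your own:

import smtplib
from email.message import EmailMessage

def send_issues_email(email_body: str) -> None:
    """Send the formatted issues report via SMTP (placeholder settings)."""
    msg = EmailMessage()
    msg["Subject"] = "Excel Parsing: File Issues Report"
    msg["From"] = "pipeline@example.com"    # hypothetical sender
    msg["To"] = "stakeholders@example.com"  # hypothetical recipients
    msg.set_content(email_body)

    # Assumes an SMTP server reachable at this host and port
    with smtplib.SMTP("smtp.example.com", 587) as server:
        server.starttls()
        server.login("pipeline@example.com", "app-password")
        server.send_message(msg)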


Image by Author - Automating Excel Parsing
You can copy my Google Colab Notebook to try the above functions yourself.

Conclusion

To summarize, automating Excel file parsing can significantly improve your data extraction process. In my experience, the process typically involves these steps:

  1. Read the table from the Excel file into memory. I’ve demonstrated one approach to automating this process in this article.
  2. Apply custom business logic to transform the DataFrame into the necessary structure.
  3. Write the transformed DataFrame to the target table. In most cases, the main goal is to transfer data from Excel into a database; a minimal load sketch follows below.
  4. Optionally, send alerts to stakeholders about any parsing issues.
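
For step 3, the load can be as small as this. A sketch assuming SQLAlchemy and a database driver are installed; the connection URL and table name are placeholders:

import pandas as pd
from sqlalchemy import create_engine

def load_to_database(df: pd.DataFrame, table_name: str) -> None:
    """Append the parsed DataFrame to a database table (placeholder URL)."""
    engine = create_engine("postgresql://user:password@localhost/dbname")
    df.to_sql(table_name, engine, if_exists="append", index=False)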

I hope this guide helps with your data handling tasks. Feel free to modify the functions to suit your specific needs, and don't hesitate to reach out with questions or suggestions; I'm always open to feedback.


Enjoyed the article? Support my newsletter by following the Paypalme Link.

Author's Note:

Keep Knowledge Flowing by following me for more content on Solutions Architecture, System Design, Data Engineering, Business Analysis, and more. Your engagement is appreciated.
