Reading XLSX Files Quickly in Python

This article dives into efficient methods for reading XLSX files into Python DataFrames. While Pandas offers a powerful set of tools, reading large files can become a bottleneck. We'll explore faster alternatives to streamline your data analysis process.

Introduction

XLSX files, the cornerstone of Microsoft Excel, play a vital role in data analysis. They store and organize information, making them a go-to format for anyone working with numbers. However, reading large XLSX files into a Pandas DataFrame can become a performance bottleneck.

We'll compare libraries based on reading time using the get_read_time function below, which measures execution time. To ensure a fair comparison, we read the file 10 times with each method and take the average run time.

import time
from typing import Callable


def get_read_time(func: Callable, *args, **kwargs) -> float:
    """Run func once with the given arguments and return the elapsed time in seconds."""
    start_time = time.time()
    func(*args, **kwargs)
    period = time.time() - start_time
    return period

Approaches

Openpyxl

  • Suitable for smaller datasets.
  • Slower performance compared to other methods.
  • Requires more lines of code than the alternatives (less elegant).

from pathlib import Path

import pandas as pd
import openpyxl


def read_openpyxl(f: Path | str) -> pd.DataFrame:
    wb = openpyxl.load_workbook(f)
    ws = wb.active
    data = ws.values
    cols = next(data)  # the first row holds the column names
    df = pd.DataFrame(data, columns=cols)
    wb.close()
    return df


excel_file = "Coffee Shop Sales.xlsx"  # the workbook used throughout the benchmarks

openpyxl_excel = []
for _ in range(10):
    openpyxl_excel.append(get_read_time(read_openpyxl, excel_file))

Calamine

  • A fast XLSX reader written in Rust.
  • Relatively new and offers significant speed improvements.

pip install python-calamine        
from python_calamine import CalamineWorkbook


def read_calamine(f: str | Path):
    data = CalamineWorkbook.from_path(f)
    data = data.get_sheet_by_index(0).to_python()
    df = pd.DataFrame(data[1:], columns=data[0])
    return df        
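To benchmark python-calamine the same way as the other approaches, the same timing loop can be reused. A minimal sketch, assuming the same "Coffee Shop Sales.xlsx" workbook and the get_read_time helper defined above (the calamine_excel list name is just illustrative):

calamine_excel = []
for _ in range(10):
    # time a direct read through python-calamine
    calamine_excel.append(get_read_time(read_calamine, "Coffee Shop Sales.xlsx"))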

Pandas

  • One of the most popular libraries among Data Engineers and Data Scientists.
  • Offers a simple API for reading XLSX files using different engines.
  • Pandas introduced the calamine engine in version 2.2.

pip install -U pandas        
openpyxl_engine = []
calamine_engine = []

for _ in range(10):
    openpyxl_engine.append(get_read_time(pd.read_excel, "Coffee Shop Sales.xlsx", engine="openpyxl"))

for _ in range(10):
    calamine_engine.append(get_read_time(pd.read_excel, "Coffee Shop Sales.xlsx", engine="calamine"))          

Polars

  • High-performance DataFrame library written in Rust with multi-threading capabilities.
  • Its default Excel reading engine is xlsx2csv, which converts the XLSX into a CSV in memory and then calls pl.read_csv.
  • It also offers the calamine engine from Polars 0.20.6.

pip install -U polars        
import polars as pl

polar_excel = []
polar_calamine = []

for _ in range(10):
    polar_excel.append(get_read_time(pl.read_excel, "Coffee Shop Sales.xlsx"))

for _ in range(10):
    polar_calamine.append(get_read_time(pl.read_excel, "Coffee Shop Sales.xlsx", engine="calamine"))        

CSV Conversion (for Frequent Reads)

While Calamine and Polars offer high speed for reading XLSX files, they are a good fit when we read the file only once or twice. Now imagine a scenario where the same XLSX file has to be read multiple times by different scripts (which is actually a real-world challenge I face in my daily work). In that case, it pays to store the data as a CSV or Parquet file once, saving those few seconds on every read and keeping our scripts fast.
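The conversion itself is a one-off step. A minimal sketch of what it could look like, using Pandas with the calamine engine to read the workbook once and write it out as CSV (the Parquet line is an optional alternative and assumes a Parquet engine such as pyarrow is installed):

# one-off conversion: read the workbook once, then reuse the flat file
df = pd.read_excel("Coffee Shop Sales.xlsx", engine="calamine")
df.to_csv("Coffee Shop Sales.csv", index=False)
# df.to_parquet("Coffee Shop Sales.parquet")  # Parquet also preserves column dtypes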

After converting the file to CSV, we can simply call pd.read_csv to read the data into a DataFrame, using either the default Python engine or the pyarrow engine, which is faster and infers the column data types more accurately:

pandas_csv = []
pyarrow_csv = []

for _ in range(10):
    pandas_csv.append(get_read_time(pd.read_csv, "Coffee Shop Sales.csv"))

for _ in range(10):
    pyarrow_csv.append(get_read_time(pd.read_csv, "Coffee Shop Sales.csv", engine="pyarrow"))        

Results

Average Reading time

Slow

As the results show, native openpyxl is the slowest way to read the XLSX file, at 33.83 seconds. Polars' default engine takes 21.5 seconds, around 4 seconds faster than using Pandas. That might not be a big deal for our data, but scaled up to a million rows, it could save several additional minutes.

Fast

The calamine engine is where things move from slow to fast. We observe roughly a 6x improvement when using Pandas with calamine instead of the default openpyxl engine, and about an 8x speed-up for Polars over its default reader. Polars is also 65% faster than Pandas when both use the calamine engine.

Fastest

Of course, converting the Excel file to a CSV on disk and then reading the CSV is faster than all of the previous methods for reading XLSX, but this approach only makes sense if we plan to read the data multiple times from different automations or on a routine schedule (daily or even hourly).

What to use

The optimal method hinges on your specific use case. Here's a practical decision-making guide:

  1. One-Time or Infrequent Reads: If you only need to read the Excel file once or twice, consider either Pandas with the Calamine engine or Polars with the Calamine engine. The choice often depends on your preferred DataFrame library.
  2. Multiple Scripts or Scheduled Reads: When dealing with repeated reads across multiple scripts or scheduled tasks, pre-converting the Excel file to CSV is often the most efficient strategy. This assumes the Excel file remains relatively static. If the file updates frequently, consider a script that automates the conversion before each read (see the sketch after this list).
  3. CSV Engine Selection: For CSV reads, leverage the PyArrow engine whenever possible. PyArrow excels in reading speed and accurately infers column data types, streamlining your analysis process.
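As a rough illustration of point 2, the sketch below wraps the conversion in a small helper that rebuilds the CSV cache only when it is missing or older than the XLSX, and otherwise reads the cached CSV with the PyArrow engine. The function name and caching logic are my own for the example, not part of any library:

from pathlib import Path

import pandas as pd


def read_excel_cached(xlsx_path: str | Path) -> pd.DataFrame:
    """Read an XLSX file via a cached CSV copy, refreshing the cache when the XLSX is newer."""
    # requires pandas>=2.2 with python-calamine and pyarrow installed
    xlsx_path = Path(xlsx_path)
    csv_path = xlsx_path.with_suffix(".csv")

    # Rebuild the CSV cache if it is missing or stale
    if not csv_path.exists() or csv_path.stat().st_mtime < xlsx_path.stat().st_mtime:
        df = pd.read_excel(xlsx_path, engine="calamine")
        df.to_csv(csv_path, index=False)
        return df

    # Fast path: read the cached CSV with the PyArrow engine
    return pd.read_csv(csv_path, engine="pyarrow")


df = read_excel_cached("Coffee Shop Sales.xlsx")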


