Reading XLSX Files Quickly in Python
This article dives into efficient methods for reading XLSX files into Python DataFrames. While Pandas offers a powerful set of tools, reading large files can become a bottleneck. We'll explore faster alternatives to streamline your data analysis process.
Introduction
XLSX files, the native format of Microsoft Excel, play a vital role in data analysis. They store and organize information, making them a go-to tool for anyone working with numbers. However, reading these files into a Pandas DataFrame can be slow, especially for large files.
We'll compare libraries based on their reading times using a get_read_time helper that measures execution time. To ensure a fair comparison, we will read the same file 10 times with each method and take the average run time.
import time
from typing import Callable

def get_read_time(func: Callable, *args, **kwargs) -> float:
    # Call func with the given arguments and return the elapsed time in seconds.
    start_time = time.time()
    func(*args, **kwargs)
    period = time.time() - start_time
    return period
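As a quick usage illustration (dummy_workload is a placeholder of my own, not part of the benchmark), the callable and its arguments are passed separately and the helper returns the elapsed seconds:

import time

def dummy_workload(seconds: float) -> None:
    # Stand-in for an expensive read: just sleep for the given duration.
    time.sleep(seconds)

print(get_read_time(dummy_workload, 0.1))  # roughly 0.1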
Approaches
Openpyxl
from pathlib import Path
import pandas as pd
import openpyxl

def read_openpyxl(f: Path | str) -> pd.DataFrame:
    # Load the workbook, grab the active sheet, and build a DataFrame
    # using the first row as the column headers.
    wb = openpyxl.load_workbook(f)
    ws = wb.active
    data = ws.values
    cols = next(data)
    df = pd.DataFrame(data, columns=cols)
    wb.close()
    return df
excel_file = "Coffee Shop Sales.xlsx"  # sample file used throughout this article

openpyxl_excel = []
for _ in range(10):
    openpyxl_excel.append(get_read_time(read_openpyxl, excel_file))
Calamine
pip install python-calamine
from python_calamine import CalamineWorkbook

def read_calamine(f: str | Path) -> pd.DataFrame:
    # Load the workbook with calamine, convert the first sheet to plain
    # Python lists, and use the first row as the column headers.
    data = CalamineWorkbook.from_path(f)
    data = data.get_sheet_by_index(0).to_python()
    df = pd.DataFrame(data[1:], columns=data[0])
    return df
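The timing loop for this approach isn't shown here; a sketch that mirrors the openpyxl benchmark above could look like this (calamine_excel is my own list name, and excel_file is the path defined earlier):

calamine_excel = []
for _ in range(10):
    calamine_excel.append(get_read_time(read_calamine, excel_file))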
Pandas
pip install -U pandas
openpyxl_engine = []
calamine_engine = []

for _ in range(10):
    openpyxl_engine.append(get_read_time(pd.read_excel, "Coffee Shop Sales.xlsx", engine="openpyxl"))

for _ in range(10):
    calamine_engine.append(get_read_time(pd.read_excel, "Coffee Shop Sales.xlsx", engine="calamine"))
Polars
pip install -U polars
import polars as pl

polar_excel = []
polar_calamine = []

for _ in range(10):
    polar_excel.append(get_read_time(pl.read_excel, "Coffee Shop Sales.xlsx"))

for _ in range(10):
    polar_calamine.append(get_read_time(pl.read_excel, "Coffee Shop Sales.xlsx", engine="calamine"))
CSV Conversion (for Frequent Reads)
While Calamine and Polars offer high speed for reading XLSX files, they are best suited to reading a file once or twice. Imagine a scenario where the same XLSX file has to be read many times by different scripts (which is actually a real-world challenge I face in my daily work). In that case, it pays to convert the XLSX file to a CSV or Parquet file once, saving those few seconds on every read and keeping our scripts fast.
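A minimal one-off conversion sketch, assuming the same "Coffee Shop Sales.xlsx" file used throughout (the Parquet file name and the choice of the calamine engine here are my own; to_parquet requires pyarrow or fastparquet):

import pandas as pd

# Read the workbook once, then persist it in formats that are faster to re-read.
df = pd.read_excel("Coffee Shop Sales.xlsx", engine="calamine")
df.to_csv("Coffee Shop Sales.csv", index=False)
df.to_parquet("Coffee Shop Sales.parquet")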
After converting the file to CSV, we can simply call pd.read_csv to read the data into a DataFrame, using either the default parser or the multithreaded pyarrow engine, which also handles inferring the column data types:
pandas_csv = []
pyarrow_csv = []

for _ in range(10):
    pandas_csv.append(get_read_time(pd.read_csv, "Coffee Shop Sales.csv"))

for _ in range(10):
    pyarrow_csv.append(get_read_time(pd.read_csv, "Coffee Shop Sales.csv", engine="pyarrow"))
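To produce the averages discussed in the next section, the collected timings can be summarized along these lines (a sketch; the exact reporting code isn't shown here):

from statistics import mean

results = {
    "openpyxl (native)": openpyxl_excel,
    "pandas + openpyxl": openpyxl_engine,
    "pandas + calamine": calamine_engine,
    "polars (default)": polar_excel,
    "polars + calamine": polar_calamine,
    "pandas read_csv": pandas_csv,
    "pandas read_csv + pyarrow": pyarrow_csv,
}
for name, timings in results.items():
    print(f"{name}: {mean(timings):.2f} s on average")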
Results
Slow
As the results show, native openpyxl is the slowest way to read the XLSX file, at 33.83 seconds. Polars' default engine takes 21.5 seconds, about 4 seconds faster than Pandas. That may not matter much for our dataset, but if we scaled the read up to a million rows, the gap could grow to several minutes.
Fast
Using the calamine engine is where things move from slow to fast. We see roughly a 6x improvement when Pandas uses calamine instead of the default openpyxl engine, and roughly an 8x improvement for Polars. Polars is also 65% faster than Pandas when both use the calamine engine.
Fastest
Of course, converting the Excel file to a CSV on disk and then reading the CSV is faster than all of the previous methods for reading XLSX, but this approach only pays off if we plan to read the data multiple times from different automations, or if we have a routine run (maybe daily or even hourly).
What to use
The optimal method hinges on your specific use case. Here's a practical decision-making guide: