Reading XLSX Files Quickly in Python

This article dives into efficient methods for reading XLSX files into Python DataFrames. While Pandas offers a powerful set of tools, reading large files can become a bottleneck. We'll explore faster alternatives to streamline your data analysis process.

Introduction

XLSX files, the cornerstone of Microsoft Excel, play a vital role in data analysis. They store and organize information, making them a go-to format for anyone working with numbers. However, reading large XLSX files into a Pandas DataFrame can become a performance bottleneck.

We'll compare libraries based on reading time using the get_read_time function below, which measures execution time. To ensure a fair comparison, we read the file 10 times with each method and take the average run time.

import time
from typing import Callable


def get_read_time(func: Callable, *args, **kwargs) -> float:
    """Run func once with the given arguments and return the elapsed time in seconds."""
    start_time = time.time()
    func(*args, **kwargs)
    period = time.time() - start_time
    return period

Approaches

Openpyxl

  • Suitable for smaller datasets.
  • Slower performance compared to other methods.
  • Requires more lines of code than the alternatives (less elegant).

from pathlib import Path

import pandas as pd
import openpyxl


def read_openpyxl(f: Path | str) -> pd.DataFrame:
    wb = openpyxl.load_workbook(f)
    ws = wb.active
    data = ws.values
    cols = next(data)  # the first row holds the column names
    df = pd.DataFrame(data, columns=cols)
    wb.close()
    return df


excel_file = "Coffee Shop Sales.xlsx"  # the workbook used throughout the benchmarks

openpyxl_excel = []
for _ in range(10):
    openpyxl_excel.append(get_read_time(read_openpyxl, excel_file))

Calamine

  • A fast XLSX reader written in Rust.
  • Relatively new and offers significant speed improvements.

pip install python-calamine        
from python_calamine import CalamineWorkbook


def read_calamine(f: str | Path):
    data = CalamineWorkbook.from_path(f)
    data = data.get_sheet_by_index(0).to_python()
    df = pd.DataFrame(data[1:], columns=data[0])
    return df        
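To benchmark python-calamine the same way as the other approaches, the same timing loop can be reused. A minimal sketch, assuming the same "Coffee Shop Sales.xlsx" workbook and the get_read_time helper defined above (the calamine_excel list name is just illustrative):

calamine_excel = []
for _ in range(10):
    # time a direct read through python-calamine
    calamine_excel.append(get_read_time(read_calamine, "Coffee Shop Sales.xlsx"))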

Pandas

  • One of the most popular libraries among Data Engineers and Data Scientists.
  • Offers a simple API for reading XLSX files using different engines.
  • Pandas introduced the calamine engine in version 2.2.

pip install -U pandas        
openpyxl_engine = []
calamine_engine = []

for _ in range(10):
    openpyxl_engine.append(get_read_time(pd.read_excel, "Coffee Shop Sales.xlsx", engine="openpyxl"))

for _ in range(10):
    calamine_engine.append(get_read_time(pd.read_excel, "Coffee Shop Sales.xlsx", engine="calamine"))          

Polars

  • High-performance DataFrame library written in Rust with multi-threading capabilities.
  • Its default Excel reading engine is xlsx2csv, which converts the XLSX into a CSV in memory and then calls pl.read_csv.
  • It also offers the calamine engine from Polars 0.20.6.

pip install -U polars        
import polars as pl

polar_excel = []
polar_calamine = []

for _ in range(10):
    polar_excel.append(get_read_time(pl.read_excel, "Coffee Shop Sales.xlsx"))

for _ in range(10):
    polar_calamine.append(get_read_time(pl.read_excel, "Coffee Shop Sales.xlsx", engine="calamine"))        

CSV Conversion (for Frequent Reads)

While Calamine and Polars offer high speed for reading XLSX files, they are a good fit when we read the file only once or twice. Now imagine a scenario where the same XLSX file has to be read multiple times by different scripts (which is actually a real-world challenge I face in my daily work). In that case, it pays to store the data as a CSV or Parquet file once, saving those few seconds on every read and keeping our scripts fast.
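The conversion itself is a one-off step. A minimal sketch of what it could look like, using Pandas with the calamine engine to read the workbook once and write it out as CSV (the Parquet line is an optional alternative and assumes a Parquet engine such as pyarrow is installed):

# one-off conversion: read the workbook once, then reuse the flat file
df = pd.read_excel("Coffee Shop Sales.xlsx", engine="calamine")
df.to_csv("Coffee Shop Sales.csv", index=False)
# df.to_parquet("Coffee Shop Sales.parquet")  # Parquet also preserves column dtypes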

After converting the file to CSV, we can simply call pd.read_csv to read the data into a DataFrame, using either the default Python engine or the pyarrow engine, which is faster and infers the column data types more accurately:

pandas_csv = []
pyarrow_csv = []

for _ in range(10):
    pandas_csv.append(get_read_time(pd.read_csv, "Coffee Shop Sales.csv"))

for _ in range(10):
    pyarrow_csv.append(get_read_time(pd.read_csv, "Coffee Shop Sales.csv", engine="pyarrow"))        

Results

Average Reading time

Slow

As the results show, native openpyxl is the slowest way to read the XLSX file, at 33.83 seconds. Polars' default engine takes 21.5 seconds, around 4 seconds faster than using Pandas. That might not be a big deal for our data, but scaled up to a million rows, it could save several additional minutes.

Fast

The calamine engine is where things move from slow to fast. We observe roughly a 6x improvement when using Pandas with calamine instead of the default openpyxl engine, and about an 8x speed-up for Polars over its default reader. Polars is also 65% faster than Pandas when both use the calamine engine.

Fastest

Of course, converting the Excel file to a CSV on disk and then reading the CSV is faster than all of the previous methods for reading XLSX, but this approach only makes sense if we plan to read the data multiple times from different automations or on a routine schedule (daily or even hourly).

What to use

The optimal method hinges on your specific use case. Here's a practical decision-making guide:

  1. One-Time or Infrequent Reads: If you only need to read the Excel file once or twice, consider either Pandas with the Calamine engine or Polars with the Calamine engine. The choice often depends on your preferred DataFrame library.
  2. Multiple Scripts or Scheduled Reads: When dealing with repeated reads across multiple scripts or scheduled tasks, pre-converting the Excel file to CSV is often the most efficient strategy. This assumes the Excel file remains relatively static. If the file updates frequently, consider a script that automates the conversion before each read (see the sketch after this list).
  3. CSV Engine Selection: For CSV reads, leverage the PyArrow engine whenever possible. PyArrow excels in reading speed and accurately infers column data types, streamlining your analysis process.
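As a rough illustration of point 2, the sketch below wraps the conversion in a small helper that rebuilds the CSV cache only when it is missing or older than the XLSX, and otherwise reads the cached CSV with the PyArrow engine. The function name and caching logic are my own for the example, not part of any library:

from pathlib import Path

import pandas as pd


def read_excel_cached(xlsx_path: str | Path) -> pd.DataFrame:
    """Read an XLSX file via a cached CSV copy, refreshing the cache when the XLSX is newer."""
    # requires pandas>=2.2 with python-calamine and pyarrow installed
    xlsx_path = Path(xlsx_path)
    csv_path = xlsx_path.with_suffix(".csv")

    # Rebuild the CSV cache if it is missing or stale
    if not csv_path.exists() or csv_path.stat().st_mtime < xlsx_path.stat().st_mtime:
        df = pd.read_excel(xlsx_path, engine="calamine")
        df.to_csv(csv_path, index=False)
        return df

    # Fast path: read the cached CSV with the PyArrow engine
    return pd.read_csv(csv_path, engine="pyarrow")


df = read_excel_cached("Coffee Shop Sales.xlsx")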


