Do You Read Excel Files with Python? There is a 1000x Faster Way

As a Python user, I frequently interact with Excel files to manage data because business professionals often prefer sharing information in Excel or CSV formats. However, Python can be notably slow when handling Excel files.

In this article, I will introduce five methods for loading data in Python. By the end of this discussion, we will have achieved a speed increase of three orders of magnitude, making the process incredibly fast.

Experimental Setup

Suppose we need to load 10 Excel files, each containing 20,000 rows and 25 columns, totaling approximately 70MB. This scenario is typical when you’re transferring transactional data from an ERP system like SAP into Python for analysis.

First, we’ll generate this dummy data and set up our environment by importing the necessary libraries (we’ll delve into the specifics of using libraries like pickle and joblib later in the article).

import pandas as pd
import numpy as np
from joblib import Parallel, delayed
import time

# Generate 10 dummy files (20,000 rows x 25 columns each) in all three formats
for file_number in range(10):
    values = np.random.uniform(size=(20000, 25))
    frame = pd.DataFrame(values)
    frame.to_csv(f"Dummy {file_number}.csv")
    frame.to_excel(f"Dummy {file_number}.xlsx")   # writing .xlsx requires openpyxl
    frame.to_pickle(f"Dummy {file_number}.pickle")

5 Ways to Load Data in Python

1: Loading Excel Files Using Pandas

We’ll begin with the simplest method for loading Excel files. We’ll initialize a Pandas DataFrame and sequentially append each Excel file to it. This approach provides a clear and direct way to compile data from multiple sources into a single structure for analysis.

start = time.time()
df = pd.read_excel("Dummy 0.xlsx")
for file_number in range(1, 10):
    # DataFrame.append returns a new DataFrame (and was removed in pandas 2.0),
    # so the result has to be assigned back
    df = df.append(pd.read_excel(f"Dummy {file_number}.xlsx"))
end = time.time()
print("Excel:", end - start)

It takes around 50 seconds to run. Pretty slow.

2: Opt for CSVs Instead of Excel Files

Consider having the data exported from our ERP system (such as SAP) as .csv files instead of .xlsx before importing it into Python. This change can significantly improve the speed and efficiency of data loading.

start = time.time()
df = pd.read_csv("Dummy 0.csv")
for file_number in range(1, 10):
    df = df.append(pd.read_csv(f"Dummy {file_number}.csv"))
end = time.time()
print("CSV:", end - start)

After converting our Excel files to CSV format, the loading time drops to just 0.63 seconds, nearly 80 times faster than before. Python typically processes CSV files much quicker than Excel files, often up to 100 times faster. Thus, using CSVs can be a highly efficient strategy for handling large datasets.

However, a notable drawback is that CSV files usually have larger file sizes compared to .xlsx files. For instance, in our example, the CSV files are 9.5MB each, whereas the .xlsx files are only 6.4MB.
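
If you want to verify this trade-off on your own machine, a quick size check is enough. A minimal sketch, assuming the dummy file names generated in the setup section above:

import os

# Compare the on-disk size of the same data in each format
for ext in ("xlsx", "csv", "pickle"):
    size_mb = os.path.getsize(f"Dummy 0.{ext}") / 1e6
    print(f"Dummy 0.{ext}: {size_mb:.1f} MB")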

3: Smarter Creation of Pandas DataFrames

To enhance our data loading process further, we can optimize how we create our Pandas DataFrames. Instead of appending each file directly to an existing DataFrame, which can be time-consuming:

1. We load each Excel or CSV file into a separate DataFrame.
2. We store these DataFrames in a list.
3. Finally, we concatenate all the DataFrames in the list into a single DataFrame.

This method is typically faster than appending each file incrementally because it reduces the overhead associated with repeatedly expanding the DataFrame.

start = time.time()
dfs = []
for file_number in range(10):
    dfs.append(pd.read_csv(f"Dummy {file_number}.csv"))
# Concatenate once at the end instead of growing a DataFrame inside the loop
df = pd.concat(dfs, ignore_index=True)
end = time.time()
print("CSV2:", end - start)

We achieved a slight reduction in load time. From my experience, this technique proves especially beneficial when handling larger DataFrames, typically those well over 100MB in size.

4: Parallelize CSV Imports with Joblib

To further expedite the loading process, consider parallelizing the import of multiple files. Instead of loading each of the 10 files one after another, we can load them all at once, in parallel.

This approach leverages the capabilities of the joblib library, which simplifies parallel processing in Python. By using joblib, we can distribute the file loading tasks across multiple cores or threads, significantly reducing the overall load time.

start = time.time()
def loop(file_number):
    return pd.read_csv(f"Dummy {file_number}.csv")
df = Parallel(n_jobs=-1, verbose=10)(delayed(loop)(file_number) for file_number in range(10))
df = pd.concat(df, ignore_index=True)
end = time.time()
print("CSV//:", end - start)        

We managed to double the speed compared to the single-core version. However, it’s important to note that adding more cores does not scale performance linearly. For example, on a MacBook Air with an M1 chip (8 cores), I observed a twofold speed increase, not an eightfold one.
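
If you want to see how this scales on your own hardware, you can time the same parallel import with different worker counts. A minimal sketch (load_csv simply mirrors the loop function above; timings will vary with CPU, disk, and file size):

def load_csv(file_number):
    return pd.read_csv(f"Dummy {file_number}.csv")

# Time the parallel CSV import for several worker counts
for n_jobs in (1, 2, 4, 8):
    start = time.time()
    frames = Parallel(n_jobs=n_jobs)(delayed(load_csv)(i) for i in range(10))
    print(f"n_jobs={n_jobs}: {time.time() - start:.2f} s")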

Simple Parallelization in Python with Joblib

Joblib is a straightforward Python library designed for parallel processing. It works much like a list comprehension, with one crucial difference: each iteration is dispatched to a separate worker (a process or thread, depending on the backend), so the tasks run simultaneously. Here’s how you can implement it:

def loop(file_number):
    return pd.read_csv(f"Dummy {file_number}.csv")
df = Parallel(n_jobs=-1, verbose=10)(delayed(loop)(file_number) for file_number in range(10))

# equivalent to the sequential list comprehension
df = [loop(file_number) for file_number in range(10)]
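
One detail worth knowing: by default joblib runs tasks in separate worker processes (its loky backend). For I/O-bound work such as reading files, you can hint that threads are acceptable via the prefer argument, which avoids the cost of spawning processes. A minimal sketch; whether it ends up faster depends on your workload:

# Ask joblib to prefer threads over processes for this I/O-bound task
df = Parallel(n_jobs=-1, prefer="threads")(
    delayed(loop)(file_number) for file_number in range(10)
)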

5: Utilize Pickle Files

For an even faster data storage and retrieval process, consider using pickle files. Pickle is a Python-specific format designed for serializing and deserializing objects, allowing for quicker data loading compared to .csv files.

However, a significant downside is that pickle files are not human-readable. Unlike .csv files, you cannot open a pickle file in a text editor or spreadsheet program to view or edit the contents directly.

start = time.time()
def loop(file_number):
    return pd.read_pickle(f"Dummy {file_number}.pickle")
df = Parallel(n_jobs=-1, verbose=10)(delayed(loop)(file_number) for file_number in range(10))
df = pd.concat(df, ignore_index=True)
end = time.time()
print("Pickle//:", end - start)        

We’ve successfully reduced the processing time by 80%!

Overall, working with pickle files is much faster than using?.csv files. However, pickle files typically occupy more storage space on your hard drive (though not in this particular example).

In reality, systems do not usually export data directly in pickle format. I recommend using pickle files in the following scenarios:

1. For Internal Python Use: If you’re saving data from a Python process and do not need to open it in Excel or other non-Python environments, store your DataFrames as pickle files. This is ideal for data that you intend to reuse within Python scripts or applications.

2. For Frequent File Access: If you repeatedly load the same file(s), it’s efficient to save it as a pickle after the initial load. Future processes can load directly from the pickle file, bypassing the slower .csv load process (see the sketch after the example below).

Example: For managing transactional data that updates monthly, you can convert each month’s data from .csv to .pickle after the first load. This allows you to quickly access historical data in pickle format, streamlining your workflow each month as new data arrives.
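
A minimal sketch of this caching pattern, reusing the file names from the benchmark above (load_cached is an illustrative helper, not part of the original code):

import os

def load_cached(file_number):
    # Load from the pickle cache if it exists; otherwise read the CSV once
    # and write the pickle so the next run takes the fast path
    pickle_path = f"Dummy {file_number}.pickle"
    if os.path.exists(pickle_path):
        return pd.read_pickle(pickle_path)
    temp = pd.read_csv(f"Dummy {file_number}.csv")
    temp.to_pickle(pickle_path)
    return temp

df = pd.concat(
    Parallel(n_jobs=-1)(delayed(load_cached)(i) for i in range(10)),
    ignore_index=True,
)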

Bonus: Loading Excel Files in Parallel

If you find yourself in a situation where you’ve received Excel files and must work with them directly, parallel processing can still be applied to improve efficiency. You can use the joblib library to load these files in parallel, just as you would with other formats.

To implement this, you’ll need to adjust the function within your loop to handle Excel files specifically. This modification involves using a function designed to load Excel files, and then distributing these tasks across multiple processors using joblib. This approach can significantly cut down the time it takes to load multiple Excel files simultaneously.

start = time.time()
def loop(file_number):
    return pd.read_excel(f"Dummy {file_number}.xlsx")
df = Parallel(n_jobs=-1, verbose=10)(delayed(loop)(file_number) for file_number in range(10))
df = pd.concat(df, ignore_index=True)
end = time.time()
print("Excel//:", end - start)        

We managed to reduce the loading time by 70%, cutting it down from 50 seconds to just 13 seconds.

Additionally, you can utilize this parallel loading process to create pickle files on the fly. By doing so, the next time you need to load these files, you can leverage the significantly faster loading times offered by pickle files, ensuring near-instantaneous access to your data. This method not only optimizes the initial loading process but also streamlines any future interactions with the same datasets.
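
A minimal sketch of that idea, again using the benchmark file names (load_and_cache is an illustrative helper):

def load_and_cache(file_number):
    # Read the slow .xlsx once and persist it as a pickle so the next
    # run can skip the Excel parsing entirely
    temp = pd.read_excel(f"Dummy {file_number}.xlsx")
    temp.to_pickle(f"Dummy {file_number}.pickle")
    return temp

df = pd.concat(
    Parallel(n_jobs=-1, verbose=10)(
        delayed(load_and_cache)(i) for i in range(10)
    ),
    ignore_index=True,
)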

Recap

Through various methods of data loading, we’ve significantly improved efficiency, decreasing the time it takes to handle large datasets:

  • Excel Files: Initially took 50 seconds to load.
  • CSV Files: Improved to 0.63 seconds.
  • Smarter CSV Loading: Further refined to 0.62 seconds.
  • Parallel CSV Loading: Reduced to 0.34 seconds.
  • Parallel Pickle Loading: Drastically cut down to 0.07 seconds, less than a tenth of a second.
  • Parallel Excel Loading: Brought down to 13.5 seconds.

These results highlight the benefits of optimizing file formats and leveraging parallel processing to enhance the performance of data loading operations in Python.
