Using Python to Handle Large Subsurface Datasets (10 GB+)
Speed Comparison between Libraries


A non-technical attempt at solving a problem that arises when handling datasets larger than memory, about 10 GB in size or more...

Borehole Image data, Wikipedia

Subsurface data comes in many forms; one of them is borehole image data, which carries dense measurements per foot. It primarily helps identify sedimentary features or secondary features such as fractures, which may exist at scales too small for normal measurements to see. That said, it also makes the files incredibly large (as they should be), in the realm of gigabytes, sometimes tens of gigabytes.

Recently, I came across a challenge where we needed to plot borehole image measurements into multiple plots using a Python script, which sounds like your everyday task - but there is a catch.

The files are in CSV format, and their size is in the tens of gigabytes (10 GB+).

The first script used pandas to handle all sorts of dataframe computation, from parsing the CSV file to data wrangling. Not ideal, but a start nonetheless.

However, it was clear from the beginning that the code was not suitable for large computations involving gigabytes of data, especially in such an inefficient format as CSV. One of the downsides was that the first codebase did not optimize the parsing using the chunks parameter.

In this article, I will show different ways of handling such large datasets in Python, and how to optimize the runtime, particularly the parsing time, as it takes the longest.

You will also learn that using Polars (often cited as the fastest dataframe library in Python) may not always be the best option in this situation.

This article is a short capture of my full article at Parsing 10+ GB CSV file in Python - Larger than Memory (adtarie.net)


Intro

We will have to create a dummy dataset to replicate the process of parsing the CSV files. In general, we want three columns: one for datetime and two for arrays.

If we recall the first paragraph about image data, we now know that an image is an array dataset. The problem with using CSV is that the arrays will be stored as strings, and CSV is a very taxing file type that takes enormous storage compared to other formats.

You may see the code in the full article, but below is a snapshot of the final CSV files we will be working with:

Dummy Datasets (image by author)
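For illustration, below is a minimal sketch of how a dummy CSV like this could be generated. The column names, array length, and row count are assumptions for the example, not the exact script from the full article:

```python
import numpy as np
import pandas as pd

# Assumed layout: one datetime column and two array columns stored as strings.
n_rows = 1_000_000       # increase this to push the file into the multi-GB range
array_len = 100          # values per "image" row

rng = np.random.default_rng(42)
timestamps = pd.date_range("2024-01-01", periods=n_rows, freq="s")

def random_array_string() -> str:
    # In CSV the arrays end up as long strings, which is what makes parsing slow.
    return ",".join(f"{v:.4f}" for v in rng.random(array_len))

df = pd.DataFrame({
    "datetime": timestamps,
    "image_array_1": [random_array_string() for _ in range(n_rows)],
    "image_array_2": [random_array_string() for _ in range(n_rows)],
})
df.to_csv("dummy_dataset.csv", index=False)
```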

Another dataset (nyc) was also used in the test, as it too has a large file size (around 10 GB) - similar in size, but with different columns, data types, etc. It would be nice to see whether file size is the sole indicator of how fast the parsing will be --- and if you read along, you will find out that it is not.

1- Pandas using Chunks

When parsing with a de facto library like pandas, there is a nice way to optimize the parsing process. Imagine having a birthday cake: eating the entire cake in one go would not be the easiest way to do it.

However, if you eat one slice every 10 minutes, you will have a much easier time.

It is the same concept with chunks. Basically, it means parsing only part of the file at a time - not the entire CSV in one go. Remember, we are talking about a 10+ GB CSV file here.

Codes for Pandas with Chunk
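The original code is shown as an image, so here is a rough approximation of the chunked approach; the file path, chunk size, and the processing step are placeholders, not the exact code from the screenshot:

```python
import pandas as pd

CSV_PATH = "dummy_dataset.csv"   # hypothetical path
CHUNK_SIZE = 1_000_000           # rows per chunk; tune this to fit memory

def process(chunk: pd.DataFrame) -> None:
    # Placeholder for the real work, e.g. slicing the arrays and plotting.
    pass

# read_csv with chunksize returns an iterator of DataFrames,
# so only one chunk is held in memory at a time.
with pd.read_csv(CSV_PATH, chunksize=CHUNK_SIZE) as reader:
    for chunk in reader:
        process(chunk)
```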

It finished the 1 GB CSV in 10 s and the 10+ GB file in 105 s, but it failed when trying to parse the nyc dataset.

Pandas with Chunk Runtime

2- Duckdb

DuckDB is one of the new kids in town (alongside Polars), but it takes a different approach from Polars: it is based on SQL, a language most database users are familiar with. It also has one of the fastest CSV readers in my test - it reads the nyc dataset in just 0.26 s, which is super impressive.

Duckdb code
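The original code here is also an image; a minimal sketch of querying a CSV directly with DuckDB could look like this (the file path is an assumption):

```python
import duckdb

# DuckDB can query a CSV file in place with plain SQL;
# read_csv_auto infers the schema from the file itself.
con = duckdb.connect()
df = con.execute(
    "SELECT * FROM read_csv_auto('dummy_dataset.csv')"
).df()
```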
Duckdb runtime

However, if we look at the full test, it is SLOWER than pandas on the two dummy datasets, especially for the 10+ GB CSV file.

  • DuckDB on the 10+ GB CSV file -> 128 s, whereas pandas takes 105 s -> pandas is roughly 20% faster
  • DuckDB on the 1 GB CSV file -> 53 s, whereas pandas takes 10 s -> pandas is roughly 5x (500%) faster

Now, why is that? My uneducated guess is that the dummy dataset has arrays stored as strings, which slows down parsing, whereas the nyc dataset, despite its large file size, has well-formatted columns, which makes it easier for DuckDB to parse.

Pandas with chunks is simply better optimized for this case than DuckDB, hence faster.

3- Polars

Here comes my favorite dataframe library, Polars. There is only one caveat: there is a setting you need to adjust when dealing with large datasets, as shown below.

Error
Polars code
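The exact setting from the screenshot is not reproduced here, but a memory-friendly Polars read might look roughly like the sketch below (the path and parameter values are assumptions):

```python
import polars as pl

# Lazily scan the CSV instead of loading it eagerly, and widen schema
# inference so the long array-as-string columns are typed correctly.
lf = pl.scan_csv(
    "dummy_dataset.csv",         # hypothetical path
    infer_schema_length=10_000,  # look at more rows before fixing dtypes
    low_memory=True,             # reduce memory pressure at some speed cost
)
df = lf.collect()
```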

Polars parses the 1 GB CSV in 7 s, and the 10 GB file in around 116 s. That is not the fastest - still slower than pandas, especially for the 10 GB file.

The 1 GB file is slightly faster than pandas (10 s), by 3 s. I am not sure that is significant enough to change your codebase to Polars.

Polars runtime

However, the nyc dataset was successfully parsed by Polars, while pandas failed. So for more complex CSV files, it is better to run Polars, as opposed to pandas with chunks.

4- Dask

Dask is another "pandas on steroids"; it uses parallel, multi-threaded processing to speed things up. TL;DR, it is also fast, but (IMO) there are too many downsides in getting the settings right (such as dtypes), as shown below.
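As a rough illustration of that dtype wrangling, a minimal Dask sketch could look like this (the column names, dtypes, and block size are assumptions):

```python
import dask.dataframe as dd

# Dask splits the CSV into partitions and reads them in parallel,
# but it often needs explicit dtypes to avoid inference mismatches
# between partitions.
ddf = dd.read_csv(
    "dummy_dataset.csv",       # hypothetical path
    dtype={
        "image_array_1": "object",
        "image_array_2": "object",
    },
    blocksize="256MB",         # size of each partition
)
df = ddf.compute()             # triggers the actual parallel read
```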

Dask runtime

Summary

In layman's terms, you do not really need Polars or DuckDB to handle a large dataset (a 10+ GB file), especially if pandas can be optimized with chunks, and especially for non-complex datasets.

PS: If you can use DuckDB, use it - it works really well, especially for well-formatted files.

For everything else, use Polars.

Diagram Tree for Library selection
Runtime benchmarking


Thanks for reading the article. I hope you got something out of this - as much as I did.


Till next time...

Cheers

A.A.W.

