Using Python to Handle Large Subsurface Datasets (10 GB+)
Speed Comparison between Libraries


A non-technical attempt at solving a problem that arises when handling datasets larger than memory, about 10 GB in size or more...

Borehole Image data, Wikipedia

Subsurface data comes in many forms; one of them is borehole image data, which carries dense measurements per foot. It primarily helps identify sedimentary features or secondary features such as fractures, which may exist at scales too small for normal measurements to see. That said, it also makes the files incredibly large (as they should be), in the realm of gigabytes, sometimes tens of gigabytes.

Recently, I came across a challenge where we needed to plot borehole image measurements into multiple plots using a Python script, which sounds like your everyday task - but there is a catch.

The files are in CSV format, and their size is in the tens of gigabytes (10 GB+).

The first script used pandas to handle all sorts of dataframe computation, from parsing the CSV file to data wrangling. Not ideal, but a start nonetheless.

However, it was clear from the beginning that the code was not suitable for large computations involving gigabytes of data, especially in such an inefficient format as CSV. One of the downsides was that the first codebase did not optimize the parsing using the chunks parameter.

In this article, I will show different ways of handling such large datasets in Python, and how to optimize the runtime, particularly the parsing time, as it takes the longest.

You will also learn that using Polars (often cited as the fastest dataframe library in Python) may not always be the best option in this situation.

This article is a short capture of my full article at Parsing 10+ GB CSV file in Python - Larger than Memory (adtarie.net)


Intro

We will have to create a dummy dataset to replicate the process of parsing the CSV files. In general, we want three columns: one for datetime and two for arrays.

If we recall the first paragraph about image data, we now know that an image is an array dataset. The problem with using CSV is that the arrays will be stored as strings, and CSV is a very taxing file type that takes enormous storage compared to other formats.

You may see the code in the full article, but below is a snapshot of the final CSV files we will be working with:

Dummy Datasets (image by author)
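For illustration, below is a minimal sketch of how a dummy CSV like this could be generated. The column names, array length, and row count are assumptions for the example, not the exact script from the full article:

```python
import numpy as np
import pandas as pd

# Assumed layout: one datetime column and two array columns stored as strings.
n_rows = 1_000_000       # increase this to push the file into the multi-GB range
array_len = 100          # values per "image" row

rng = np.random.default_rng(42)
timestamps = pd.date_range("2024-01-01", periods=n_rows, freq="s")

def random_array_string() -> str:
    # In CSV the arrays end up as long strings, which is what makes parsing slow.
    return ",".join(f"{v:.4f}" for v in rng.random(array_len))

df = pd.DataFrame({
    "datetime": timestamps,
    "image_array_1": [random_array_string() for _ in range(n_rows)],
    "image_array_2": [random_array_string() for _ in range(n_rows)],
})
df.to_csv("dummy_dataset.csv", index=False)
```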

Another dataset (nyc) was also used in the test, as it too has a large file size (around 10 GB) - similar in size, but with different columns, data types, etc. It would be nice to see whether file size is the sole indicator of how fast the parsing will be --- and if you read along, you will find out that it is not.

1- Pandas using Chunks

When parsing with a de facto library like pandas, there is a nice way to optimize the parsing process. Imagine having a birthday cake: eating the entire cake in one go would not be the easiest way to do it.

However, if you eat one slice every 10 minutes, you will have a much easier time.

It is the same concept with chunks. Basically, it means parsing only part of the file at a time - not the entire CSV in one go. Remember, we are talking about a 10+ GB CSV file here.

Codes for Pandas with Chunk
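The original code is shown as an image, so here is a rough approximation of the chunked approach; the file path, chunk size, and the processing step are placeholders, not the exact code from the screenshot:

```python
import pandas as pd

CSV_PATH = "dummy_dataset.csv"   # hypothetical path
CHUNK_SIZE = 1_000_000           # rows per chunk; tune this to fit memory

def process(chunk: pd.DataFrame) -> None:
    # Placeholder for the real work, e.g. slicing the arrays and plotting.
    pass

# read_csv with chunksize returns an iterator of DataFrames,
# so only one chunk is held in memory at a time.
with pd.read_csv(CSV_PATH, chunksize=CHUNK_SIZE) as reader:
    for chunk in reader:
        process(chunk)
```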

It finished the 1 GB CSV in 10 s and the 10+ GB file in 105 s, but it failed when trying to parse the nyc dataset.

Pandas with Chunk Runtime

2- Duckdb

DuckDB is one of the new kids in town (alongside Polars), but it takes a different approach from Polars: it is based on SQL, a language most database users are familiar with. It also has one of the fastest CSV readers in my test - it reads the nyc dataset in just 0.26 s, which is super impressive.

Duckdb code
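The original code here is also an image; a minimal sketch of querying a CSV directly with DuckDB could look like this (the file path is an assumption):

```python
import duckdb

# DuckDB can query a CSV file in place with plain SQL;
# read_csv_auto infers the schema from the file itself.
con = duckdb.connect()
df = con.execute(
    "SELECT * FROM read_csv_auto('dummy_dataset.csv')"
).df()
```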
Duckdb runtime

However, if we look at the full test, it is SLOWER than pandas on the two dummy datasets, especially for the 10+ GB CSV file.

  • DuckDB on the 10+ GB CSV file -> 128 s, whereas pandas takes 105 s -> pandas is roughly 20% faster
  • DuckDB on the 1 GB CSV file -> 53 s, whereas pandas takes 10 s -> pandas is roughly 5x (500%) faster

Now, why is that? My uneducated guess is that the dummy dataset has arrays stored as strings, which slows down parsing, whereas the nyc dataset, despite its large file size, has well-formatted columns, which makes it easier for DuckDB to parse.

Pandas with chunks is simply better optimized for this case than DuckDB, hence faster.

3- Polars

Here comes my favorite dataframe library, Polars. There is only one caveat: there is a setting you need to adjust when dealing with large datasets, as shown below.

Error
Polars code
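The exact setting from the screenshot is not reproduced here, but a memory-friendly Polars read might look roughly like the sketch below (the path and parameter values are assumptions):

```python
import polars as pl

# Lazily scan the CSV instead of loading it eagerly, and widen schema
# inference so the long array-as-string columns are typed correctly.
lf = pl.scan_csv(
    "dummy_dataset.csv",         # hypothetical path
    infer_schema_length=10_000,  # look at more rows before fixing dtypes
    low_memory=True,             # reduce memory pressure at some speed cost
)
df = lf.collect()
```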

Polars parses the 1 GB CSV in 7 s, and the 10 GB file in around 116 s. That is not the fastest - still slower than pandas, especially for the 10 GB file.

The 1 GB file is slightly faster than pandas (10 s), by 3 s. I am not sure that is significant enough to change your codebase to Polars.

Polars runtime

However, the nyc dataset was successfully parsed by Polars, while pandas failed. So for more complex CSV files, it is better to run Polars, as opposed to pandas with chunks.

4- Dask

Dask is another "pandas on steroids"; it uses parallel, multi-threaded processing to speed things up. TL;DR, it is also fast, but (IMO) there are too many downsides in getting the settings right (such as dtypes), as shown below.
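As a rough illustration of that dtype wrangling, a minimal Dask sketch could look like this (the column names, dtypes, and block size are assumptions):

```python
import dask.dataframe as dd

# Dask splits the CSV into partitions and reads them in parallel,
# but it often needs explicit dtypes to avoid inference mismatches
# between partitions.
ddf = dd.read_csv(
    "dummy_dataset.csv",       # hypothetical path
    dtype={
        "image_array_1": "object",
        "image_array_2": "object",
    },
    blocksize="256MB",         # size of each partition
)
df = ddf.compute()             # triggers the actual parallel read
```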

Dask runtime

Summary

In layman's terms, you do not really need Polars or DuckDB to handle a large dataset (a 10+ GB file), especially if pandas can be optimized with chunks, and especially for non-complex datasets.

PS: If you can use DuckDB, use it - it works really well, especially for well-formatted files.

For everything else, use Polars.

Diagram Tree for Library selection
Runtime benchmarking


Thanks for reading the article. I hope you got something out of this - as much as I did.


Till next time...

Cheers

A.A.W.

