DataScience: Handling Big Data

Jonathan Kinlay

Head of Consulting Practice at Intelligent Technologies

Published: September 3, 2021

Handling Large Files in CSV format with NumPy and Pandas

One of the major challenges that users face when trying to do data science is how to handle big data. Leaving aside the important topics of database connectivity and of data too large to fit in memory, my concern here is with large data files, often in csv format, that do still fit into available memory.

It is well known that, due to their generality, Mathematica's Import and Export functions are horribly slow when handling large csv files. For example, writing out a list of 10 million 64-bit reals takes almost 5 minutes, and reading the file back in is also unacceptably slow.

Performance results like these create the impression that Mathematica is suitable for handling only "toy" problems, rather than the kind of large and complex data challenges faced by data scientists in the real world.

Sure, you can speed this up with ReadLine, but not by much, after doing all the string processing. And while the mx binary file format speeds up data handling enormously, it doesn't address the issue of how to get the data into the requisite file format, other than via the WL DumpSave function - in other words, the data already has to be in a Mathematica notebook in order to write an mx file.

With purely numerical data, one way to address this is by using non-proprietary binary file formats. For example, in Python we create a NumPy array and use the tofile() method to output the data in real64 binary format, in less than 2 seconds.
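The write step might look something like the sketch below. The file name, the random test data and the helper write_binary are my own illustrative choices rather than the original listing, and the array is smaller than the 10 million values discussed above so that the demo stays quick. The byte order is pinned to little-endian float64 so that the reader on the other side knows exactly what to expect.

```python
import os
import tempfile
import time

import numpy as np


def write_binary(values: np.ndarray, path: str) -> float:
    """Write an array as raw little-endian float64 and return the elapsed seconds."""
    start = time.perf_counter()
    # tofile() writes the raw bytes only: no header, no shape, no dtype information.
    values.astype("<f8").tofile(path)
    return time.perf_counter() - start


n = 1_000_000  # assumption: smaller than the article's 10 million, to keep the demo quick
data = np.random.default_rng(0).random(n)

path = os.path.join(tempfile.mkdtemp(), "data.bin")  # hypothetical file name
elapsed = write_binary(data, path)
print(f"wrote {n} float64 values ({os.path.getsize(path)} bytes) in {elapsed:.3f} s")
```

Because the file contains nothing but the numbers themselves, its size is exactly 8 bytes per value, and any tool that can read raw 64-bit reals can consume it.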

Then in Mathematica the read process is equally fast when processing a file of numerical data in binary format, around 50x faster than the time taken to process the same file in csv format.

The procedure is just as fast in the reverse direction, with binary data exports from Mathematica taking a fraction of the time required to process the same data in csv format (around 200x faster!).

And the data can be read back extremely quickly in Python using the NumPy fromfile method.
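A minimal sketch of the read step, assuming the file holds little-endian 64-bit reals and nothing else. The file name and the test array are hypothetical, and the block writes its own file first so that it can be run on its own.

```python
import os
import tempfile

import numpy as np

path = os.path.join(tempfile.mkdtemp(), "data.bin")  # hypothetical file name
original = np.linspace(0.0, 1.0, 1_000_000)
original.astype("<f8").tofile(path)

# The file carries no header, so the reader must supply the dtype
# (and reshape afterwards if the data was not one-dimensional).
restored = np.fromfile(path, dtype="<f8")

print(restored.shape, np.array_equal(original, restored))
```

The one thing to keep in mind is that the dtype passed to fromfile must match what was written; a mismatch will not raise an error, it will simply produce meaningless numbers.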

This procedure is robust enough to accommodate missing data. For instance, let's replace some of the values in our data array with np.nan values and export the file once again in binary format:
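As a rough illustration of that step, the sketch below blanks out a random 5% of a small array and round-trips it through a binary file. The fraction, the array size and the file name are arbitrary choices of mine, not the values used in the original test.

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(1)
data = rng.random(1_000)

# Replace a random 5% of the entries with NaN (the fraction is arbitrary).
missing = rng.choice(data.size, size=data.size // 20, replace=False)
data[missing] = np.nan

path = os.path.join(tempfile.mkdtemp(), "data_nan.bin")  # hypothetical file name
data.astype("<f8").tofile(path)

restored = np.fromfile(path, dtype="<f8")

# assert_array_equal treats NaNs in matching positions as equal.
np.testing.assert_array_equal(restored, data)
print(int(np.isnan(restored).sum()), "missing values survived the round trip")
```

A NaN occupies the same 8 bytes as any other float64, so the file size and layout are unchanged by the presence of missing values.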

Reading the binary file into Mathematica, we find no reduction in speed, as the np.nan values are stored as ordinary 64-bit floating-point values (IEEE 754 NaNs), which are replaced by the value Indeterminate in the imported Mathematica array.

So, for purely numerical data we have a fast and reliable procedure for transferring data between Python, R and Mathematica using binary format. This means that we can load very large csv files in Python, do some pre-processing in pandas and export the massaged data in binary format for further analysis in Mathematica, if required.
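The Python half of that workflow could be sketched as follows. The tiny csv file, the column names and the dropna() step are stand-ins I have made up for a real dataset and a real pre-processing stage.

```python
import os
import tempfile

import numpy as np
import pandas as pd

workdir = tempfile.mkdtemp()
csv_path = os.path.join(workdir, "raw.csv")    # hypothetical input file
bin_path = os.path.join(workdir, "clean.bin")  # hypothetical output file

# Build a small stand-in for a large csv file, with one incomplete row.
pd.DataFrame(
    {"x": [1.0, 2.0, np.nan, 4.0], "y": [10.0, 20.0, 30.0, 40.0]}
).to_csv(csv_path, index=False)

frame = pd.read_csv(csv_path)
clean = frame.dropna()                # example pre-processing step
matrix = clean.to_numpy(dtype="<f8")  # 2-D array, one row per record
matrix.tofile(bin_path)               # always written in row-major (C) order

# The shape is not stored in the file, so it has to travel separately.
print(matrix.shape, os.path.getsize(bin_path))
```

Since the raw file records neither column names nor dimensions, it is worth passing the shape along with the file so that the receiving side can partition the flat list of reals back into rows.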

More Complex Data Structures: the HDF5 Format

A major step in the right direction has been achieved through the significant effort that Wolfram Research has put into implementing the HDF5 binary file format standard in the Wolfram Language. This serves two purposes: firstly, it can speed up the storage and retrieval of large datasets by orders of magnitude (depending on the data type); secondly, unlike Wolfram's proprietary mx file format, HDF5 is an open-source format that can store large, complex and hierarchical datasets that are accessible via Python, R and MATLAB, as well as other languages and platforms, including Mathematica. So, working with the same dataset as before, but using HDF5 format, we get a speed-up of around 500x on the file write and around 270x on the file read.
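On the Python side, an HDF5 file of this kind can be written and read with the h5py package, which I am assuming is installed. The sketch below is only meant to show the hierarchical layout; the group name, dataset name and attribute are hypothetical.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "data.h5")  # hypothetical file name
prices = np.arange(12, dtype=np.float64).reshape(4, 3)

with h5py.File(path, "w") as f:
    group = f.create_group("market")  # hypothetical group name
    dset = group.create_dataset("prices", data=prices)
    dset.attrs["units"] = "USD"       # metadata is stored alongside the data

with h5py.File(path, "r") as f:
    restored = f["market/prices"][:]  # read the whole dataset into memory
    units = f["market/prices"].attrs["units"]

print(restored.shape, restored.dtype, units)
```

Unlike a raw binary dump, the HDF5 file records the shape and data type of each dataset, so the reader does not need to be told them in advance.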

Another major benefit of working in binary format is the enormous saving in disk storage, compared to csv:
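The size difference is easy to check for yourself. The sketch below writes the same random array once as text and once as raw float64 and compares the two files. The file names are hypothetical, and the exact ratio will depend on how many digits the text format keeps (here NumPy's savetxt default).

```python
import os
import tempfile

import numpy as np

workdir = tempfile.mkdtemp()
csv_path = os.path.join(workdir, "data.csv")  # hypothetical file names
bin_path = os.path.join(workdir, "data.bin")

data = np.random.default_rng(2).random(100_000)

np.savetxt(csv_path, data)           # text, one value per line, default format "%.18e"
data.astype("<f8").tofile(bin_path)  # raw binary, 8 bytes per value

csv_size = os.path.getsize(csv_path)
bin_size = os.path.getsize(bin_path)
print(f"csv: {csv_size} bytes, binary: {bin_size} bytes, ratio: {csv_size / bin_size:.1f}x")
```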

So it becomes feasible to envisage a workflow in which some pre-processing of a very large dataset in csv format takes place initially in, e.g., pandas in Python, the results of which are exported to an HDF5 or binary format file for further processing in Mathematica.

This advance does a great deal to address some of the major concerns about using Mathematica for large data science projects. And I am not sure that users are necessarily aware of its significance, given all the hoopla over more glamorous features that tend to get all the attention in new version releases.
