Data Science: Handling Big Data

Handling Large CSV Files with NumPy and Pandas

One of the major challenges that users face when trying to do data science is how to handle big data. Leaving aside the important topics of database connectivity and the handling of data too large to fit in memory, my concern here is with large data files, often in CSV format, that are nonetheless small enough to fit into available memory.

It is well known that, due to their generality, Mathematica's Import and Export functions are horribly slow when handling large CSV files. For example, writing out a list of 10 million 64-bit reals takes almost 5 minutes:

[Screenshot: Mathematica Export timing for 10 million 64-bit reals in CSV format]

and reading is also unacceptably slow:

[Screenshot: Mathematica Import timing for the same CSV file]

Performance results like these create the impression that Mathematica is suitable for handling only "toy" problems, rather than the kind of large and complex data challenges faced by data scientists in the real world.

Sure, you can speed this up with ReadLine, but not by much once all the necessary string processing is taken into account. And while the mx binary file format speeds up data handling enormously, it doesn't address the question of how to get the data into that format in the first place, other than via the WL DumpSave function - in other words, the data already has to be in a Mathematica session in order to write an mx file.

With purely numerical data, one way to address this is to use non-proprietary binary file formats. For example, in Python we can create a NumPy array and use the tofile() method to output the data in real64 binary format, in less than 2 seconds:

[Screenshot: Python/NumPy tofile() export of the array in real64 binary format]
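The Python side of this step can be sketched as follows (the file name and the use of a random array are illustrative, not taken from the original screenshots):

```python
import numpy as np

# Illustrative: an array of 10 million 64-bit reals.
data = np.random.default_rng(0).random(10_000_000)  # dtype is float64

# tofile() dumps the raw buffer with no header or metadata,
# so the reader must know the dtype (and shape) in advance.
data.tofile("data.bin")
```

On the Mathematica side, the corresponding read is BinaryReadList["data.bin", "Real64"].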

Then in Mathematica the read process is equally fast: a file of numerical data in binary format is processed around 50x faster than the same file in CSV format:

[Screenshot: Mathematica binary read timing, around 50x faster than CSV]

The procedure is just as fast in the reverse direction, with binary data exports from Mathematica taking a fraction of the time required to process the same data in CSV format (around 200x faster!):

[Screenshot: Mathematica binary export timing, around 200x faster than CSV]

And the data can be read back extremely fast in Python using the NumPy fromfile method:

[Screenshot: NumPy fromfile() read timing]
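A minimal round-trip sketch of the read side (the file name and array contents are illustrative):

```python
import numpy as np

# Write a small illustrative array in raw real64 format...
original = np.random.default_rng(1).random(1_000_000)
original.tofile("roundtrip.bin")

# ...and read it straight back into a NumPy array.
# fromfile() needs the dtype because the raw file carries no type info.
restored = np.fromfile("roundtrip.bin", dtype=np.float64)
```

Because no parsing or type conversion is involved, the read is essentially a single memory copy from disk.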

This procedure is robust enough to accommodate missing data. For instance, let's replace some of the values in our data array with np.nan values and export the file once again in binary format:

[Screenshot: replacing values in the array with np.nan and re-exporting in binary format]
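In NumPy terms, the missing-data step might look like this (the fraction of values replaced is arbitrary):

```python
import numpy as np

data = np.random.default_rng(2).random(1_000_000)

# Mark every 1000th value as missing.
data[::1000] = np.nan

# np.nan is an ordinary IEEE 754 double, so the binary export
# works exactly as before - no special handling is required.
data.tofile("data_nan.bin")
```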

Reading the binary file into Mathematica, we find no reduction in speed: the np.nan values are stored as ordinary IEEE 754 NaN bit patterns, and appear as the value Indeterminate in the imported Mathematica array:

[Screenshot: Mathematica import of the binary file, with np.nan values shown as Indeterminate]

So, for purely numerical data we have a fast and reliable procedure for transferring data between Python, R and Mathematica using binary format. This means that we can load very large CSV files in Python, do some pre-processing in pandas, and export the massaged data in binary format for further analysis in Mathematica, if required.

More Complex Data Structures: the HDF5 Format

A major step in the right direction has been achieved through the significant effort that WR has put into implementing the HDF5 binary file format standard in the Wolfram Language. This serves two purposes. Firstly, it can speed up the storage and retrieval of large datasets by orders of magnitude (depending on the data type). Secondly, unlike Wolfram's proprietary mx file format, HDF5 is an open format that can store large, complex, hierarchical datasets accessible from Python, R and MATLAB, as well as other languages and platforms, including Mathematica. So, working with the same dataset as before, but using HDF5 format, we get a speed-up of around 500x on the file write and around 270x on the file read:

[Screenshot: HDF5 write and read timings in Mathematica]
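The timings above are for Mathematica's built-in HDF5 Import/Export. The equivalent Python side might use the h5py library, sketched here under the assumption that h5py is installed; the file and dataset names are illustrative:

```python
import numpy as np
import h5py  # standard Python bindings for the HDF5 format

data = np.random.default_rng(3).random((1_000, 1_000))

# Write the array as a named dataset inside an HDF5 container.
with h5py.File("data.h5", "w") as f:
    f.create_dataset("values", data=data)

# Read it back; Mathematica can open the same file with
# Import["data.h5", {"Datasets", "/values"}].
with h5py.File("data.h5", "r") as f:
    restored = f["values"][:]
```

Unlike a raw binary dump, the HDF5 file records the dtype and shape itself, so the reader needs no prior knowledge of the layout.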

Another major benefit of working in binary format is the enormous saving in disk storage, compared to CSV:

[Screenshot: file size comparison, binary vs. CSV]
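The storage saving is easy to verify directly. A rough sketch (the exact ratio depends on the text format used for the CSV side; the file names are illustrative):

```python
import os
import numpy as np

data = np.random.default_rng(4).random(100_000)

data.tofile("sample.bin")       # exactly 8 bytes per value
np.savetxt("sample.csv", data)  # roughly 25 bytes per value as text

bin_size = os.path.getsize("sample.bin")
csv_size = os.path.getsize("sample.csv")
# For full-precision reals the text version is typically around 3x larger.
```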

So it becomes feasible to envisage a workflow in which some pre-processing of a very large CSV dataset takes place initially in, for example, Python's pandas, with the results exported to an HDF5 or raw binary file for further processing in Mathematica.

This advance does a great deal to address some of the major concerns about using Mathematica for large data science projects. And I am not sure that users are necessarily aware of its significance, given all the hoopla over more glamorous features that tend to get all the attention in new version releases.
