Apache Arrow: The future of in-memory columnar data (think dataframes)
Paras Nigam
VP, Engineering | AI & Cybersecurity | Generative AI Expert | GCC & GVC | IIM Calcutta | MIT USA | 3AI Thought Leader | Entrepreneur |
I mentioned Parquet briefly in my last article; if you have not read it, do take a look here.
Parquet provides efficient columnar storage and processing. It was born out of the need for an on-disk columnar format that improves compression and scan performance.
Today I want to talk about Apache Arrow, which originated from the need for an in-memory columnar format for fast processing and cross-language data transfer.
What is Apache Arrow?
As the website says, it is a software development platform for building high-performance applications that process and transfer data; in other words, it is built for in-memory analytics.
Before I go further, it is important to know what a dataframe is. I will assume not everyone reading this is aware of dataframes; sometimes they are confused with some other data structure or data API. To me, a dataframe is a tabular dataset used for data manipulation in programming languages, and the most commonly used are pandas dataframes.
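To make that concrete, here is a minimal sketch of a pandas dataframe and its Arrow counterpart. The column names and values are made up for illustration:

```python
import pandas as pd
import pyarrow as pa

# A dataframe: a tabular, column-oriented dataset.
df = pd.DataFrame({
    "city": ["Mumbai", "Bengaluru", "Kolkata"],
    "population_mn": [20.7, 13.6, 15.1],
})

# The same data as an in-memory Arrow Table; the conversion is cheap
# because both sides store data column by column.
table = pa.Table.from_pandas(df)
print(table.schema)
```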
Apache Arrow can be a game changer for OLAP systems, moving them towards real-time behaviour where queries are answered in milliseconds.
It promotes: DATA IS THE NEW API. Because the in-memory format is standardized, any system that speaks Arrow can share data directly instead of converting it back and forth.
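Here is a minimal sketch of that idea: writing a table in Arrow's language-agnostic IPC file format, which any Arrow implementation (Python, Java, Rust, C++, Go, ...) can read back without a format-specific parser. The file name and values are arbitrary choices for the example:

```python
import pyarrow as pa

table = pa.table({"id": [1, 2, 3], "score": [0.9, 0.7, 0.8]})

# Write the table in Arrow's IPC file format.
with pa.OSFile("shared.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Any Arrow-capable process or language can now read the same bytes back;
# memory-mapping means the data is not even copied into the process.
with pa.memory_map("shared.arrow") as source:
    shared = pa.ipc.open_file(source).read_all()
print(shared.num_rows)
```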
Apache Arrow itself is not a storage or processing engine; it enables high-performance processing and transfer of columnar data. People have posted results on the internet showing up to 1000x improvement over reading from CSV. I have personally not yet found a use case with that level of improvement, but I have seen a 30x improvement while playing with a small dataset.
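If you want to try this yourself, here is a self-contained benchmark sketch that times a CSV read against a read of the same data in the Arrow-based Feather format. The file names and dataset size are arbitrary, and your numbers will vary with data shape and hardware:

```python
import time

import numpy as np
import pandas as pd
import pyarrow.feather as feather

# Build a throwaway dataset: one million rows, a few numeric columns.
df = pd.DataFrame({
    "a": np.random.rand(1_000_000),
    "b": np.random.randint(0, 100, 1_000_000),
    "c": np.random.rand(1_000_000),
})
df.to_csv("demo.csv", index=False)         # text format, parsed row by row
feather.write_feather(df, "demo.feather")  # Arrow columnar format on disk

start = time.perf_counter()
pd.read_csv("demo.csv")
csv_s = time.perf_counter() - start

start = time.perf_counter()
feather.read_feather("demo.feather")
arrow_s = time.perf_counter() - start

print(f"CSV: {csv_s:.3f}s, Feather: {arrow_s:.3f}s, "
      f"speedup: {csv_s / arrow_s:.0f}x")
```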
BUT HOW?
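The overview linked below covers this in depth, but the short version is that Arrow arrays are flat, contiguous buffers laid out for vectorized, cache-friendly processing, and many operations are zero-copy. A quick sketch of that, with made-up values:

```python
import pyarrow as pa

arr = pa.array([10, 20, 30, 40, 50])

# The array is backed by raw buffers (a validity bitmap plus the data),
# ready for vectorized processing.
print(arr.buffers())

# Slicing returns a view over the same buffers; no data is copied.
view = arr.slice(1, 3)
print(view)
```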
Useful Links:
Apache Arrow Overview: https://arrow.apache.org/overview/
Apache Arrow Python Support: https://arrow.apache.org/docs/python/index.html#
Apache Arrow Pandas Integration: https://arrow.apache.org/docs/python/pandas.html